Overview
This step-by-step guide shows how to deploy Ollama and Open WebUI on Ubuntu with NVIDIA GPU acceleration using Docker. You will run large language models locally, manage them in a user-friendly web interface, and expose an OpenAI-compatible API for your apps. The tutorial is designed for Ubuntu 22.04 or 24.04 and focuses on a secure, reproducible, and easily maintainable setup.
What You Will Build
You will run two containers: Ollama (the local LLM runtime and API) and Open WebUI (a modern web UI for chat, prompts, and model management). The stack runs on Docker with NVIDIA GPU acceleration via the NVIDIA Container Toolkit, giving you faster inference and the ability to run larger models locally.
Prerequisites
- Ubuntu 22.04 or 24.04 with sudo access
- An NVIDIA GPU with recent drivers (Turing or newer recommended)
- At least 16 GB of system RAM for 7–8B models; larger models need proportionally more RAM and VRAM
- Internet access to pull Docker images and models
Step 1 — Install NVIDIA Drivers and Container Toolkit
If you have not installed NVIDIA drivers, use Ubuntu’s recommended driver installer:
sudo apt update && sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot
After reboot, verify the GPU:
nvidia-smi
Install the NVIDIA Container Toolkit so Docker can access the GPU. The configure step below requires Docker; if Docker is not installed yet, complete Step 2 first and then return here:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list > /dev/null
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
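Once Docker is installed (see Step 2 if you need it), you can verify that containers see the GPU by running nvidia-smi in a throwaway CUDA container; the image tag here is just one example and may need adjusting to a tag available for your setup:
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
If this prints the same GPU table you saw on the host, the toolkit is wired up correctly.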
Step 2 — Install Docker Engine and Compose Plugin
If Docker is not installed, install it from the official repository:
sudo apt update && sudo apt install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
Verify:
docker --version && docker compose version
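Optionally, add your user to the docker group so you can run docker without sudo. Log out and back in for it to take effect, and note that docker group membership is effectively root-equivalent:
sudo usermod -aG docker $USER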
Step 3 — Start Ollama with GPU Acceleration
Create a dedicated Docker network and volume, then run Ollama:
docker network create ai || true
docker volume create ollama
docker run -d --name ollama --restart unless-stopped --gpus all \
-p 11434:11434 -v ollama:/root/.ollama \
-e OLLAMA_ORIGINS="http://localhost:3000,http://127.0.0.1:3000" \
--network ai ollama/ollama:latest
Confirm it is running:
docker logs -f ollama
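You can also query the API directly from the host; a healthy instance answers on port 11434:
curl http://localhost:11434/api/version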
Step 4 — Launch Open WebUI
Run the web UI, give it a named volume so chats and settings persist, and point it at the Ollama container:
docker run -d --name open-webui --restart unless-stopped \
  -p 3000:8080 --network ai \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://ollama:11434 \
  ghcr.io/open-webui/open-webui:main
Open your browser to http://localhost:3000 (or the server’s IP:3000). Create an admin account and adjust settings as needed.
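If the page does not load, check the container logs for errors:
docker logs -f open-webui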
Step 5 — Pull a Model and Test
Use the Ollama CLI inside the container to download a model, for example Llama 3.1 8B:
docker exec -it ollama ollama pull llama3.1:8b
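To see which models are already downloaded:
docker exec -it ollama ollama list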
Generate text via the API to verify everything works:
curl http://localhost:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"Write a haiku about GPUs."}'
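By default, /api/generate streams the response as newline-delimited JSON. To get a single JSON object instead, disable streaming:
curl http://localhost:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"Write a haiku about GPUs.","stream":false}'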
In Open WebUI, choose the model from the dropdown and start chatting. You can download multiple models and switch between them.
Step 6 — Use the OpenAI-Compatible API
Ollama exposes an OpenAI-style API under /v1. Point your clients at the local endpoint; most SDKs require a key, but Ollama ignores its value:
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=not-needed
Python example with the OpenAI SDK (chat completions):
pip install openai
python - <<'PY'
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")
resp = client.chat.completions.create(model="llama3.1:8b", messages=[{"role":"user","content":"Explain vector databases in one paragraph."}])
print(resp.choices[0].message.content)
PY
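The same endpoint works from any HTTP client. For example, with curl against the chat completions route:
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","messages":[{"role":"user","content":"Say hello in five words."}]}'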
Troubleshooting
- If nvidia-smi fails inside containers, re-run sudo nvidia-ctk runtime configure --runtime=docker and restart Docker.
- If the UI cannot see models, confirm OLLAMA_BASE_URL is correct and that both containers are on the same Docker network (see the check below).
- For model download failures, check disk space and retry: docker exec -it ollama ollama pull MODEL_NAME.
- If ports are in use, change the host ports (for example, -p 11435:11434 and -p 3001:8080).
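For the network check mentioned above, this prints the names of all containers attached to the ai network; both ollama and open-webui should appear:
docker network inspect ai --format '{{range .Containers}}{{.Name}} {{end}}'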
Security and Best Practices
- Do not expose port 11434 to the internet without a reverse proxy and auth; bind to localhost or your private network only.
- In Open WebUI, enable authentication and restrict sign-ups in the admin settings.
- Keep images updated: docker pull ollama/ollama:latest && docker pull ghcr.io/open-webui/open-webui:main, then recreate the containers.
- Back up volumes regularly, for example: docker run --rm -v ollama:/data -v $(pwd):/backup busybox tar czf /backup/ollama-backup.tgz -C / data (a matching restore is shown below).
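To restore that backup onto a fresh volume, reverse the operation; this sketch assumes the ollama-backup.tgz from the command above sits in your current directory:
docker run --rm -v ollama:/data -v $(pwd):/backup busybox tar xzf /backup/ollama-backup.tgz -C /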
Optional: docker compose
Prefer a single-file deployment? Create compose.yaml:
services:
  ollama:
    image: ollama/ollama:latest
    restart: unless-stopped
    ports: ["11434:11434"]
    volumes: ["ollama:/root/.ollama"]
    environment:
      - OLLAMA_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    restart: unless-stopped
    ports: ["3000:8080"]
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes: ["open-webui:/app/backend/data"]
    depends_on: ["ollama"]
volumes:
  ollama:
  open-webui:
Start with docker compose up -d. This file is easy to version-control and redeploy on another machine.
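Upgrading later takes two commands from the same directory:
docker compose pull
docker compose up -d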
Wrap-Up
You now have a fast, private, and flexible local AI stack running on Ubuntu with GPU support. Ollama handles model execution and exposes an OpenAI-compatible API; Open WebUI provides a polished interface for daily use. With Docker, upgrades and backups are simple, and you can iterate quickly as new models and features arrive.