Local large language models (LLMs) have matured rapidly, and running them with GPU acceleration on your own server is now simple. In this step-by-step tutorial, you will deploy Ollama (model runtime) and Open WebUI (a friendly chat interface) on Ubuntu 22.04/24.04 using Docker Compose and the NVIDIA Container Toolkit.
Prerequisites
- An Ubuntu 22.04 or 24.04 machine with an NVIDIA GPU (Turing or newer recommended) and internet access.
- Administrative (sudo) access.
- Basic familiarity with the terminal and Docker.
1) Install NVIDIA Driver
First, make sure your system is up to date, then install the recommended NVIDIA driver. If you are already on the correct proprietary driver, you can skip this step.
sudo apt update && sudo apt upgrade -y
sudo ubuntu-drivers autoinstall
sudo reboot
After reboot, confirm the GPU is available:
nvidia-smi
You should see a table with your GPU model and driver version.
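If autoinstall did not pick the driver you wanted (or you prefer to choose one explicitly), you can list Ubuntu's recommendations and install a specific package instead; the package name below is only an example and will vary by GPU and release:
ubuntu-drivers devices
sudo apt install -y nvidia-driver-550   # example only; pick the package marked "recommended" above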
2) Install Docker Engine
Use Docker's convenience script to install the latest Docker Engine quickly. Alternatively, follow the official apt repository instructions if you prefer to pin a specific version.
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker
The newgrp command applies the group change to the current shell only; log out and back in for it to take effect everywhere.
Verify Docker is working:
docker run --rm hello-world
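It is also worth confirming that the Compose v2 plugin was installed alongside the engine, since the rest of this guide uses the docker compose subcommand:
docker --version
docker compose version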
3) Install NVIDIA Container Toolkit
The NVIDIA Container Toolkit lets containers access your GPU. Install it and configure Docker to use the NVIDIA runtime.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
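Before testing a container, you can confirm that Docker registered the NVIDIA runtime:
docker info | grep -i runtimes
The output should list nvidia alongside the default runc runtime.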
Test GPU access inside a container:
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
If you see your GPU, you are ready for Ollama.
4) Create a Docker Compose file
Make a project directory and create a docker-compose.yml that runs both services with persistent volumes and GPU support.
mkdir -p ~/ollama-webui && cd ~/ollama-webui
nano docker-compose.yml
version: "3.9"
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
ports:
- "11434:11434"
environment:
- OLLAMA_KEEP_ALIVE=24h
volumes:
- ollama:/root/.ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
restart: unless-stopped
depends_on:
- ollama
environment:
- OLLAMA_API_BASE_URL=http://ollama:11434
- WEBUI_AUTH=true
ports:
- "3000:8080"
volumes:
- openwebui:/app/backend/data
volumes:
ollama:
openwebui:
Note: The deploy.resources.reservations.devices block requests GPU access through the NVIDIA driver. Docker Compose v2 honors it outside of Swarm as long as the NVIDIA Container Toolkit is installed. If your Compose version ignores it, see the troubleshooting section for alternatives.
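Before starting anything, it is worth asking Compose to validate the file; with --quiet it prints nothing on success and reports YAML or schema errors otherwise:
docker compose config --quiet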
5) Start the stack
Bring up both containers in the background:
docker compose up -d
docker compose ps
Open your browser to http://localhost:3000 (or the server’s IP on port 3000). With authentication enabled, the web interface prompts you to create an account; the first account created becomes the administrator.
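If the page does not load, check the container logs and confirm the Ollama API answers on the host; /api/version returns a small JSON payload with the server version:
docker compose logs open-webui
curl http://localhost:11434/api/version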
6) Pull and run a model
You can pull models from the UI (Models section), or via CLI inside the Ollama container. Example with Llama 3.1 (8B):
docker exec -it ollama ollama pull llama3.1:8b
Once the model is downloaded, select it in Open WebUI and start chatting. GPU memory matters: 8B models typically need ~6–8 GB VRAM; 70B needs much more. If you are low on VRAM, try smaller or quantized variants (e.g., Q4_K_M builds).
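You can also verify the model responds outside the UI by calling Ollama’s generate endpoint directly from the host; this quick sanity check assumes you pulled llama3.1:8b as above:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Say hello in one sentence.",
  "stream": false
}'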
7) Persistence, updates, and backups
- Your models and settings live in the named Docker volumes “ollama” and “openwebui.” Compose prefixes them with the project directory name, so on disk they appear under /var/lib/docker/volumes/ as, e.g., ollama-webui_ollama. To back them up, stop the stack and archive the volume data; see the sketch after this list.
- To update images: docker compose pull && docker compose up -d.
- To move the setup to another host, copy the Compose file and restore the volumes.
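A minimal backup sketch, assuming the project directory is ~/ollama-webui so Compose named the volumes ollama-webui_ollama and ollama-webui_openwebui (confirm with docker volume ls):
docker compose down
docker run --rm -v ollama-webui_ollama:/data -v "$PWD":/backup busybox tar czf /backup/ollama-volume.tgz -C /data .
docker run --rm -v ollama-webui_openwebui:/data -v "$PWD":/backup busybox tar czf /backup/openwebui-volume.tgz -C /data .
docker compose up -d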
8) Secure access (optional)
For internet exposure, place Open WebUI behind a reverse proxy with TLS (e.g., Caddy or Nginx) and keep WEBUI_AUTH=true. Consider network ACLs or a VPN like Tailscale/WireGuard for private, zero-trust access.
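As a minimal sketch of the reverse-proxy approach, assuming Caddy is installed on the host and a DNS record such as chat.example.com (a placeholder domain) points at the server:
sudo caddy reverse-proxy --from chat.example.com --to localhost:3000
Caddy obtains and renews the TLS certificate automatically; for a permanent setup, put the equivalent site block in /etc/caddy/Caddyfile rather than running the command ad hoc.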
Troubleshooting
- GPU not detected in containers: ensure nvidia-smi works on the host; re-run sudo nvidia-ctk runtime configure --runtime=docker; restart Docker; then recreate the stack with docker compose down && docker compose up -d.
- If your Compose version ignores the deploy.resources block, either set runtime: nvidia on the ollama service (see the YAML sketch at the end of this section) or launch Ollama separately with explicit CLI flags:
docker run -d --name ollama --gpus all -p 11434:11434 \
-v ollama:/root/.ollama --restart unless-stopped ollama/ollama:latest
- Performance tips: lower the model context length in Open WebUI’s model settings, avoid running multiple models at once, and monitor VRAM usage with nvidia-smi.
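As referenced above, a sketch of the runtime-based alternative, assuming step 3 registered the nvidia runtime with Docker: replace the ollama service’s deploy block with a runtime key.
  ollama:
    image: ollama/ollama:latest
    runtime: nvidia
    environment:
      # expose all GPUs to the container (harmless if the image already sets this)
      - NVIDIA_VISIBLE_DEVICES=all
      - OLLAMA_KEEP_ALIVE=24h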
With this setup, you get a modern, GPU-accelerated local LLM stack that is fast, private, and easy to maintain using Docker Compose. Enjoy building AI workflows without sending your data to the cloud.