How to Self‑Host Ollama + Open WebUI with NVIDIA GPU in Docker on Ubuntu (2025 Guide)

Overview

This step-by-step guide shows how to self-host Ollama with Open WebUI using Docker on Ubuntu, with optional NVIDIA GPU acceleration. You will get a modern local AI stack that can run LLMs such as Llama 3.1 or Mistral privately, with a clean web interface, persistent storage, and an easy update path. The tutorial targets Ubuntu 22.04/24.04 and works on servers, workstations, and homelabs.

Prerequisites

- Ubuntu 22.04/24.04, sudo access, and basic command-line knowledge.
- For GPU acceleration: an NVIDIA GPU with up-to-date drivers (CUDA-compatible). CPU-only mode also works; you can skip the GPU steps.

1) Install Docker and Compose

Run the following commands to install Docker Engine and the Compose plugin (official repository):

sudo apt update && sudo apt install -y ca-certificates curl gnupg lsb-release
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER && newgrp docker
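
A quick way to confirm the installation before moving on (log out and back in, or use newgrp as above, so the docker group change applies):

docker --version
docker compose version
docker run --rm hello-world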

2) Enable NVIDIA GPU in Containers (optional)

If your system has a supported NVIDIA GPU, install the driver from Ubuntu’s repo or NVIDIA’s site (e.g., sudo apt install nvidia-driver-535), reboot, and verify nvidia-smi works. Then install the NVIDIA Container Toolkit so Docker can pass GPUs into containers:

curl -fsSL https://nvidia.github.io/nvidia-container-toolkit/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit.gpg
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
echo "deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit.gpg] https://nvidia.github.io/libnvidia-container/$distribution/$(uname -m) /" | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Test GPU access in Docker:

docker run --rm --gpus all nvidia/cuda:12.3.1-base-ubuntu22.04 nvidia-smi

3) Create a Docker Compose file

We will run two services: Ollama (the model server) and Open WebUI (the web interface). Create a working folder such as ~/ollama-stack and inside it create docker-compose.yml with the following content:

version: "3.9"

services:
ollama:
image: ollama/ollama:latest
container_name: ollama
ports:
- "127.0.0.1:11434:11434"
volumes:
- ollama:/root/.ollama
environment:
- OLLAMA_NUM_PARALLEL=2
- OLLAMA_MAX_LOADED_MODELS=2
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
restart: unless-stopped

open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
depends_on:
- ollama
ports:
- "127.0.0.1:3000:8080"
environment:
- OLLAMA_BASE_URL=http://ollama:11434
volumes:
- openwebui:/app/backend/data
restart: unless-stopped

volumes:
ollama:
openwebui:

Notes: The loopback bindings (127.0.0.1) keep both services off the public network; put a reverse proxy in front if you need remote access (see step 5). If you are running CPU-only, or if your Docker Compose version rejects the deploy GPU section, remove that block; alternatively, start Ollama directly with docker run --gpus all ... (a sketch follows below) or pass the GPU through a Compose override file.
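
For reference, a minimal sketch of that docker run alternative; the name, port, and volume mirror the compose file above, so adjust them to your setup:

docker run -d --gpus all --name ollama \
  -p 127.0.0.1:11434:11434 \
  -v ollama:/root/.ollama \
  --restart unless-stopped \
  ollama/ollama:latest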

4) Start the stack and pull a model

Start the services:

docker compose up -d
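
Optionally confirm both containers came up cleanly before pulling models:

docker compose ps
docker compose logs -f ollama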

Pull a model into Ollama (examples for CPU/GPU-capable LLMs):

docker exec -it ollama ollama pull llama3.1:8b
docker exec -it ollama ollama pull mistral:7b

Open your browser to http://localhost:3000, create the first admin user, and select your Ollama model in Open WebUI. You can now chat, run prompts, and manage models from the interface.
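
A couple of optional sanity checks from the host (the model name assumes you pulled llama3.1:8b above):

curl http://127.0.0.1:11434/api/tags
docker exec -it ollama ollama list
docker exec -it ollama ollama run llama3.1:8b "Say hello in one short sentence."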

5) Optional: Reverse proxy and HTTPS

For secure remote access, put Nginx or Caddy in front of Open WebUI with HTTPS and basic auth or OAuth. Example: expose Open WebUI only on 127.0.0.1:3000 and publish a domain via the proxy to terminate TLS with Let’s Encrypt. Always restrict access; these services should not be open on the public internet without authentication.
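
As one possible sketch, a minimal Caddy setup could look like the following; it assumes Caddy is installed on the host, chat.example.com is a placeholder domain pointing at your server, and ports 80/443 are reachable so Let's Encrypt can issue a certificate:

sudo tee /etc/caddy/Caddyfile > /dev/null <<'EOF'
chat.example.com {
    reverse_proxy 127.0.0.1:3000
}
EOF
sudo systemctl reload caddy

Caddy obtains and renews the TLS certificate automatically; with Nginx you would pair a proxy_pass block with certbot instead.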

6) Performance tips

- Use GPU if available: it accelerates inference dramatically, especially for 13B+ models.
- Tune concurrency with OLLAMA_NUM_PARALLEL and limit memory pressure using OLLAMA_MAX_LOADED_MODELS.
- Choose model sizes that match your VRAM/RAM. For 8 GB VRAM, 7B/8B models work well; for 12–24 GB, 13B–30B models are more comfortable (see the quick checks after this list).
- Keep images updated: docker compose pull && docker compose up -d.
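
A few optional commands to check whether a model actually fits and runs on the GPU (a sketch; container names match the compose file above):

docker exec -it ollama ollama ps   # loaded models and whether they run on GPU or CPU
watch -n 2 nvidia-smi              # live VRAM usage on the host
docker stats ollama open-webui     # container CPU/RAM usage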

7) Backup and restore

Your data lives in the named Docker volumes. Note that Compose prefixes volume names with the project name taken from the folder (for example, ollama-stack_ollama); run docker volume ls to confirm the exact names and substitute them in the commands below. To back up:

docker run --rm -v ollama:/data -v $(pwd):/backup alpine tar czf /backup/ollama-vol.tar.gz -C /data .
docker run --rm -v openwebui:/data -v $(pwd):/backup alpine tar czf /backup/openwebui-vol.tar.gz -C /data .

Restore by creating empty volumes and extracting the archives back into them using the same pattern.
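
A minimal restore sketch, assuming the archive sits in the current directory and the volume has just been created (again, substitute the actual volume name from docker volume ls):

docker volume create ollama
docker run --rm -v ollama:/data -v $(pwd):/backup alpine tar xzf /backup/ollama-vol.tar.gz -C /data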

8) Troubleshooting

- No GPU inside container: verify nvidia-smi works on the host, ensure NVIDIA Container Toolkit is installed, and try docker run --gpus all to validate. On WSL2, enable GPU support and CUDA toolkit for WSL.
- High RAM usage: reduce parallel requests and use smaller quantizations (e.g., q4_K_M variants).
- Slow downloads: Ollama model downloads depend on upstream mirrors; retry or prefetch models during off-peak hours.
- Port already in use: change 11434 or 3000 in the compose file; the quick checks after this list show which process holds the port.
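
When something misbehaves, a handful of standard checks usually narrows it down:

nvidia-smi                               # does the host see the GPU?
docker logs --tail 50 ollama             # Ollama startup and GPU detection messages
docker logs --tail 50 open-webui         # Open WebUI errors
sudo ss -ltnp | grep -E ':3000|:11434'   # which process holds a conflicting port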

Conclusion

You now have a modern, private AI stack running locally with Docker: Ollama for efficient model serving and Open WebUI for a friendly interface. This setup is easy to update, secure behind a proxy, and flexible for both CPU-only and GPU-accelerated systems. Add or swap models as your needs grow, and keep your data under your control.
