Overview
This tutorial shows how to self-host large language models locally by deploying Ollama and Open WebUI on Ubuntu 24.04 LTS with NVIDIA GPU acceleration using Docker Compose. Ollama handles model runtimes and downloads, while Open WebUI provides a friendly web interface, prompt management, and multi-user features. By the end, you will have a reproducible, GPU-enabled AI stack reachable from a browser on your LAN.
Prerequisites
Before you start, confirm: (1) Ubuntu 24.04 LTS installed and updated, (2) An NVIDIA GPU supported by recent drivers, (3) Administrative (sudo) access, and (4) Internet connectivity. If Secure Boot is enabled, you may need to enroll the NVIDIA kernel module signing key during driver installation.
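To see whether Secure Boot is active before installing the driver, you can check with mokutil (a small package you may need to install first):
sudo apt -y install mokutil
mokutil --sb-state   # prints "SecureBoot enabled" or "SecureBoot disabled"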
Step 1: Prepare Ubuntu
sudo apt update && sudo apt -y upgrade
sudo apt -y install curl git ca-certificates gnupg jq   # jq is used later to pretty-print API responses
Step 2: Install the NVIDIA Driver
Use Ubuntu’s built-in tool to select the correct, current driver:
ubuntu-drivers list
sudo ubuntu-drivers autoinstall
sudo reboot
After reboot, verify the GPU is recognized:
nvidia-smi
Step 3: Install Docker Engine and Compose
Add the official Docker repository and install the engine plus the Compose plugin:
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt -y install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
newgrp docker
Step 4: Install NVIDIA Container Toolkit
This toolkit lets Docker containers access the GPU:
curl -fsSL https://nvidia.github.io/nvidia-container-toolkit/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit.gpg
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/nvidia-container-toolkit/$distribution/nvidia-container-toolkit.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt -y install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Test GPU access in a container:
docker run --rm --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi
Step 5: Create the Docker Compose project
Make a working directory and create your Compose file:
mkdir -p ~/ai-stack && cd ~/ai-stack
Create a file named docker-compose.yaml with the following content (indentation matters):
version: "3.9"

services:
  ollama:
    image: ollama/ollama:latest
    restart: unless-stopped
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0
    volumes:
      - ollama:/root/.ollama
    gpus: all

  openwebui:
    image: ghcr.io/open-webui/open-webui:latest
    restart: unless-stopped
    depends_on:
      - ollama
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_NAME=MyLocalAI
    volumes:
      - openwebui:/app/backend/data

volumes:
  ollama:
  openwebui:
Step 6: Launch the stack
docker compose up -d
Wait a few seconds. Visit http://localhost:3000 (or http://SERVER_IP:3000) to open Open WebUI. The first user usually becomes the admin. You can manage models from the UI or the CLI.
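If the page does not load, a quick way to check that both containers started and that the Ollama API is answering:
docker compose ps                           # both services should be listed as running
curl -s http://localhost:11434/api/version  # should return Ollama's version as JSON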
Step 7: Pull a model and test
Pull a model into the Ollama volume (example: Meta’s Llama 3.1 8B):
docker compose exec ollama ollama pull llama3.1
Generate a quick response from the API:
curl -s http://localhost:11434/api/generate -d '{"model":"llama3.1","prompt":"Say hello from a local GPU!","stream":false}' | jq .response
In Open WebUI, choose the model from the dropdown and start chatting.
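If the model does not show up in the dropdown, list what Ollama has actually downloaded and refresh the page:
docker compose exec ollama ollama list   # shows model names, sizes, and modification times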
Step 8: Use the OpenAI-compatible API
Ollama exposes an OpenAI-style API under /v1. Example with Python’s OpenAI SDK:
sudo apt -y install python3-venv                      # the venv module is packaged separately on Ubuntu
python3 -m venv ~/venv && source ~/venv/bin/activate  # avoids the "externally managed environment" pip error
pip install openai
python3 - <<'PY'
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Give me a one-line fun fact."}],
)
print(resp.choices[0].message.content)
PY
Maintenance and updates
To update images without losing your data or models stored in volumes, run:
cd ~/ai-stack
docker compose pull
docker compose up -d
To update or add models, use the UI or the CLI, for example: docker compose exec ollama ollama pull mistral.
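To reclaim disk space from models you no longer use, the same CLI can list and remove them (the model name below is just an example):
docker compose exec ollama ollama list        # show installed models and their sizes
docker compose exec ollama ollama rm mistral  # delete a model from the ollama volume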
Troubleshooting
nvidia-smi fails inside containers: Ensure the NVIDIA driver is installed and matches your GPU. Reboot after installation. Confirm Docker sees the GPU with docker run --rm --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi.
Compose “gpus” not recognized: Check your Compose plugin version with docker compose version. Update Docker packages if outdated. As a fallback, configure the NVIDIA runtime as default and remove the gpus key: sudo nvidia-ctk runtime configure --runtime=docker --set-as-default && sudo systemctl restart docker.
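Another option is the long-form GPU reservation from the Compose specification in place of the gpus key; a sketch of what that looks like under the ollama service in docker-compose.yaml:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]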
Slow downloads or OOM: Models are large; use a fast, stable network and ensure enough VRAM and RAM. If a model does not fit your GPU, choose a smaller variant (e.g., llama3.1:8b or a quantized build like q4_K_M).
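For example, pull the 8B variant explicitly, or a 4-bit quantized tag (exact tag names vary per model, so verify them on the Ollama library page first):
docker compose exec ollama ollama pull llama3.1:8b                    # smaller parameter count
docker compose exec ollama ollama pull llama3.1:8b-instruct-q4_K_M   # example quantized tag; confirm it exists in the library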
Security tips
Bind the services to your LAN or localhost by default and place them behind a reverse proxy with TLS if exposing over the internet. In Open WebUI, enable authentication and restrict new user registration if not needed. Consider a firewall rule to limit access to ports 11434 and 3000.
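A simple way to enforce that is to bind the published ports to specific addresses in docker-compose.yaml instead of all interfaces. In the sketch below only the ports entries change (the rest of each service stays as in Step 5), and 192.168.1.10 is a placeholder for your server's LAN address:
  ollama:
    ports:
      - "127.0.0.1:11434:11434"   # Ollama API reachable only from the host itself
  openwebui:
    ports:
      - "192.168.1.10:3000:8080"  # Open WebUI reachable only via the LAN interface
Open WebUI still reaches Ollama over the internal Docker network, so this does not break the UI. A UFW rule can add another layer, but ports published by Docker are opened directly through iptables and can bypass UFW, so the address bindings above are the more dependable control.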
Remove the stack
To stop the containers while preserving data: docker compose down. To remove everything, including downloaded models and chat history: docker compose down -v.
You now have a modern, GPU-accelerated local AI stack that is easy to manage and update. The same approach works for additional services like vector databases or reverse proxies, making it a flexible foundation for on-prem AI experiments and production prototypes.