Running large language models locally is now practical for many teams and keeps your data on hardware you control. In this guide, you will deploy a production-ready stack on Ubuntu using Ollama (the model runtime) and Open WebUI (a clean, chat-style interface). The tutorial covers both CPU-only and NVIDIA GPU acceleration with the NVIDIA Container Toolkit, plus tips for updates, security, and backups.
Prerequisites
You need an Ubuntu 22.04 or 24.04 machine, at least 16 GB of RAM for smooth performance, and, optionally, an NVIDIA GPU (Turing or newer recommended). You also need root or sudo access, and a public DNS name if you plan to expose the UI over HTTPS.
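A few quick checks up front can save time later; the nvidia-smi line only applies if you plan to use a GPU:
lsb_release -ds        # Ubuntu release
free -h                # installed RAM
df -h /                # free disk space for model files
nvidia-smi             # driver and GPU visibility (GPU setups only)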
Step 1: Update Ubuntu
Start by updating your system packages to ensure compatibility with recent Docker and NVIDIA components.
sudo apt update && sudo apt -y upgrade
sudo reboot
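If you would rather reboot only when the upgrade actually requires it, Ubuntu records that in a flag file you can check first:
[ -f /var/run/reboot-required ] && echo "Reboot required" || echo "No reboot needed"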
Step 2: Install Docker Engine and Compose Plugin
Install Docker Engine and the Compose plugin from Docker's official repository. This method ensures you receive timely security fixes and new features.
sudo apt -y install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo $VERSION_CODENAME) stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt -y install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
newgrp docker
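The group change takes effect in new login sessions; newgrp docker applies it to the current shell only. Verify the installation before moving on:
docker --version
docker compose version
docker run --rm hello-world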
Step 3 (Optional but Recommended): NVIDIA GPU Acceleration
If you have an NVIDIA GPU, install the NVIDIA Container Toolkit to enable GPU pass-through for containers. Make sure you already have the proprietary NVIDIA driver installed (check with nvidia-smi).
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt -y install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify that Docker can see your GPU:
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
Step 4: Create a Docker Compose File
We will run two services: ollama (the LLM runtime and model manager) and open-webui (a modern web UI that connects to Ollama). Save the file as docker-compose.yml in an empty directory; the directory name becomes the Compose project name, which Compose uses to prefix volume names later. Note that the legacy top-level version field is no longer needed with the Compose v2 plugin, so it is omitted here.
version: "3.8"
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
ports:
- "11434:11434"
volumes:
- ollama:/root/.ollama
# Uncomment the next line if you have an NVIDIA GPU:
# gpus: all
open-webui:
image: ghcr.io/open-webui/open-webui:latest
container_name: open-webui
restart: unless-stopped
ports:
- "3000:8080"
environment:
- OLLAMA_API_BASE_URL=http://ollama:11434
depends_on:
- ollama
volumes:
- openwebui:/app/backend/data
volumes:
ollama:
openwebui:
If you are running on CPU only, keep the file as is. If you have an NVIDIA GPU, uncomment the gpus: all line under the ollama service.
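The service-level gpus attribute requires a fairly recent Compose release. If your docker compose version rejects it, an equivalent sketch of the ollama service using the older device-reservation syntax looks like this:
  ollama:
    image: ollama/ollama:latest
    # ...same settings as above...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]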
Step 5: Launch the Stack
Start both containers in detached mode:
docker compose up -d
Open WebUI should now be available at http://<your-server-ip>:3000. The first user to sign up becomes the admin account. Leave the browser open; we will add a model next.
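To confirm that both services came up cleanly, check the container status and ask Ollama for its version over the API:
docker compose ps
curl http://localhost:11434/api/version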
Step 6: Pull and Test a Model
Pull a model using Ollama. You can choose from many OSS models; Llama 3.1 8B is a balanced starter option:
docker exec -it ollama ollama pull llama3.1:8b
Confirm that the model is available (jq is optional and only pretty-prints the output; install it with sudo apt -y install jq):
curl http://localhost:11434/api/tags | jq
Back in Open WebUI, select this model in the top bar and start chatting. If GPU is enabled, generation should be significantly faster.
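You can also exercise the model directly against the Ollama API, which is handy for scripting; this assumes the llama3.1:8b pull above has finished:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain containers in one sentence.",
  "stream": false
}'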
Optional: Secure Public Access with Caddy
If you want to access Open WebUI over HTTPS on a domain (for example, ai.example.com), a simple approach is to put Caddy in front. Caddy obtains and renews certificates automatically via Let’s Encrypt.
sudo apt -y install debian-keyring debian-archive-keyring apt-transport-https
curl -fsSL https://dl.cloudsmith.io/public/caddy/stable/gpg.key | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/caddy-stable-archive-keyring.gpg] \
https://dl.cloudsmith.io/public/caddy/stable/deb/ubuntu all main" | \
sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update && sudo apt -y install caddy
Create a simple Caddyfile that proxies traffic to Open WebUI on port 3000:
sudo bash -c 'cat >/etc/caddy/Caddyfile' << "EOF"
ai.example.com {
reverse_proxy 127.0.0.1:3000
}
EOF
sudo systemctl reload caddy
Replace ai.example.com with your real domain, make sure DNS A/AAAA records point to your server’s public IP, and keep ports 80 and 443 reachable so certificates can be issued and renewed.
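Once Caddy fronts the UI, you may not want the container ports published on every network interface. Assuming Caddy runs on the same host, bind them to loopback in docker-compose.yml so only local processes (including the proxy) can reach them, then re-run docker compose up -d:
ports:
  - "127.0.0.1:3000:8080"       # under the open-webui service
ports:
  - "127.0.0.1:11434:11434"     # under the ollama service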
Operations: Updates, Backups, and Cleanup
Update containers regularly for new features and security patches:
docker compose pull
docker compose up -d
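After an update, the superseded images are left behind untagged; you can reclaim the disk space with:
docker image prune -f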
Back up the volumes to keep models and chat history safe. Compose prefixes volume names with the project name (by default, the directory that holds docker-compose.yml), so run docker volume ls to find the exact names first. Stop the containers briefly, archive the volumes, then restart:
docker compose down
docker volume ls
docker run --rm -v <project>_ollama:/data -v "$PWD":/backup alpine tar czf /backup/ollama-vol.tar.gz -C / data
docker run --rm -v <project>_openwebui:/data -v "$PWD":/backup alpine tar czf /backup/openwebui-vol.tar.gz -C / data
docker compose up -d
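To automate this on a schedule, one approach is to wrap the same commands in a small script and call it nightly from cron. The sketch below makes a few assumptions you should adjust: /opt/llm-stack as the directory holding docker-compose.yml, /var/backups/llm-stack as the archive destination, and backup-llm-stack.sh as the script name.
#!/usr/bin/env bash
# Minimal backup sketch: stop the stack, archive both volumes, restart.
set -euo pipefail
cd /opt/llm-stack                  # directory containing docker-compose.yml (adjust to yours)
BACKUP_DIR=/var/backups/llm-stack  # where archives are written (adjust to yours)
mkdir -p "$BACKUP_DIR"
docker compose down
for vol in $(docker volume ls -q | grep -E "(ollama|openwebui)"); do
  docker run --rm -v "$vol":/data -v "$BACKUP_DIR":/backup alpine \
    tar czf "/backup/${vol}-$(date +%F).tar.gz" -C / data
done
docker compose up -d
Save it as /usr/local/bin/backup-llm-stack.sh, make it executable with sudo chmod +x, and schedule it:
# Run nightly at 03:30 (add with: sudo crontab -e)
30 3 * * * /usr/local/bin/backup-llm-stack.sh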
Remove everything if you want to reclaim space later (this deletes models and chat data):
docker compose down
docker volume rm $(docker volume ls -q | grep -E "(ollama|openwebui)")
Troubleshooting
GPU not detected: Ensure the NVIDIA driver is installed on the host, the toolkit is configured, and your Compose service includes gpus: all. Validate with docker run --rm --gpus all nvidia/cuda:... nvidia-smi.
Permission denied: If you cannot run Docker without sudo, confirm your user is in the docker group (use id) and re-log in or run newgrp docker.
Port conflicts: If ports 3000 or 11434 are in use, change them in the Compose file and update your reverse proxy accordingly.
Low VRAM or OOM: Prefer 4–8B parameter models or more aggressively quantized variants; the default llama3.1:8b tag already ships a 4-bit (Q4_K_M) build, and lower-bit quantizations are published under explicit tags for many models (see the example below, after the log commands).
Logs: Review service logs for errors and performance clues:
docker logs -f ollama
docker logs -f open-webui
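If you are tight on VRAM, pulling a smaller model is often the quickest fix, and many models also publish explicitly quantized tags. Tag names vary per model, so check the Ollama library page before relying on the second example:
docker exec -it ollama ollama pull llama3.2:3b
docker exec -it ollama ollama pull llama3.1:8b-instruct-q4_K_M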
Why This Stack?
Ollama offers a consistent way to pull and run many open-source models locally, while Open WebUI gives you a friendly chat experience, prompt presets, file uploads, and team features. Everything stays on your hardware, which improves privacy and often reduces cost. With Docker and a reverse proxy, this setup scales from a single developer laptop to a small team server with SSL and authentication.
You now have a modern, local LLM environment with a clear upgrade path. Add more models with ollama pull, automate backups on a cron schedule, and secure public access with Caddy or another reverse proxy. For most use cases, this stack is fast, reliable, and easy to maintain.