Overview
This step-by-step guide shows you how to deploy a fast, private, and GPU-accelerated AI chat on Ubuntu using two popular open-source tools: Ollama (model runner) and Open WebUI (user interface). We will use Docker Compose and the NVIDIA Container Toolkit so your NVIDIA GPU can accelerate large language models (LLMs) locally. By the end, you will have a browser-based chat UI running on top of a local model with persistent storage and easy updates.
Prerequisites
- A 64-bit Ubuntu 22.04 or 24.04 machine with an NVIDIA GPU (6–8 GB VRAM minimum recommended for smaller models, more for larger ones).
- SSH or terminal access with sudo privileges.
- Internet connectivity and at least 20 GB of free disk space.
- Basic familiarity with Docker.
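If you want to confirm these prerequisites before starting, a quick check from the terminal (output will differ per machine):
lsb_release -ds            # should report Ubuntu 22.04 or 24.04
lspci | grep -i nvidia     # confirms an NVIDIA GPU is detected on the PCI bus
df -h /                    # verify roughly 20 GB or more of free disk space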
1) Install NVIDIA Driver and Verify GPU
First, install the recommended NVIDIA driver. If you already have a working proprietary NVIDIA driver and the nvidia-smi command runs, you can skip to the next step.
sudo apt update
sudo ubuntu-drivers autoinstall
sudo reboot
After the reboot, confirm the driver:
nvidia-smi
You should see your GPU listed along with driver and CUDA versions. If not, fix the driver before continuing.
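If autoinstall picks a driver branch you do not want, you can list the detected GPU and candidate drivers first and install a specific package instead; the package name below is only an example, use whatever the list recommends for your card:
ubuntu-drivers devices                 # shows the GPU and the recommended driver package
sudo apt install -y nvidia-driver-550  # example package name - pick the one recommended above
sudo reboot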
2) Install Docker Engine and Compose
Install Docker from the official repository to ensure you get the latest stable version.
sudo apt update
sudo apt install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
newgrp docker
Validate Docker:
docker run --rm hello-world
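It is also worth confirming that the Compose plugin is in place, since the rest of this guide relies on the docker compose subcommand:
docker --version
docker compose version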
3) Enable GPU in Containers (NVIDIA Container Toolkit)
Install the NVIDIA Container Toolkit to allow Docker containers to access your GPU.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Test GPU access inside a container:
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
If the output shows your GPU, you are ready to proceed.
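If the test fails, check that nvidia-ctk actually registered the runtime in Docker's daemon configuration (the default path is shown below):
cat /etc/docker/daemon.json     # should contain an "nvidia" entry under "runtimes"
docker info | grep -i runtimes  # Docker should report nvidia among its runtimes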
4) Create a Docker Compose Stack for Ollama + Open WebUI
Create a project folder and a docker-compose.yml file. This configuration runs Ollama (the model server) and Open WebUI (the frontend), persists data in named volumes, and enables GPU acceleration for Ollama via the service-level gpus: attribute (supported by recent versions of the Compose plugin, which the Docker repository install above provides).
mkdir -p ~/ai-stack && cd ~/ai-stack
nano docker-compose.yml
Paste the following Compose file:
version: "3.8"
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
ports:
- "11434:11434"
volumes:
- ollama:/root/.ollama
environment:
- OLLAMA_HOST=0.0.0.0
gpus: all
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
restart: unless-stopped
depends_on:
- ollama
ports:
- "3000:8080"
environment:
- OLLAMA_BASE_URL=http://ollama:11434
volumes:
- open-webui:/app/backend/data
volumes:
ollama:
open-webui:
Bring the stack online:
docker compose up -d
docker compose logs -f
Wait until both containers show as healthy or running without errors.
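Two quick checks that both services actually came up, assuming the default ports from the Compose file:
docker compose ps                     # both containers should be in the "running" state
curl http://localhost:11434/api/tags  # Ollama's API answers with the (initially empty) model list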
5) Pull a Model and Run Your First Prompt
Ollama downloads models on demand. Pull a popular, instruction-tuned model. Smaller or quantized models are best for GPUs with less VRAM.
# Example: Llama 3.1 8B Instruct
docker exec -it ollama ollama pull llama3.1:8b
# Lower VRAM option (quantized):
docker exec -it ollama ollama pull llama3.1:8b-instruct-q4_K_M
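To see which models are now stored locally and how much disk space they occupy:
docker exec -it ollama ollama list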
Test generation via API to confirm everything is working:
curl http://localhost:11434/api/generate \
-d '{"model":"llama3.1:8b","prompt":"Say hello from a local GPU-accelerated LLM."}'
Open your browser to http://<server-ip>:3000, create an account when prompted, select the model you pulled, and start chatting.
6) Performance, Updates, and Autostart
- For best performance, use GPUs with higher VRAM and prefer models that match your hardware capacity. Quantized variants (e.g., q4_K_M) drastically reduce VRAM usage with only a small quality trade-off.
- The Compose file uses restart: unless-stopped, so your stack will auto-start after reboots.
- To update images safely, run: docker compose pull && docker compose up -d (the full sequence is shown below). Your models and settings persist in the named volumes.
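A typical update session looks like this; the prune step is optional and only clears out superseded image layers:
cd ~/ai-stack
docker compose pull
docker compose up -d
docker image prune -f   # optional cleanup of old image layers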
7) Troubleshooting
No GPU in container: Re-check nvidia-smi on the host, verify the NVIDIA Container Toolkit installation, and confirm the gpus: all setting in Compose. Retest with the CUDA container command above. Ensure Secure Boot is disabled if your driver fails to load.
Model fails to load: Choose a smaller or quantized build. For example, use llama3.1:8b-instruct-q4_K_M instead of a full precision model when VRAM is tight.
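To see whether a loaded model actually fits in VRAM or has partially spilled to system RAM, check from both sides (ollama ps is available in recent Ollama releases):
docker exec -it ollama ollama ps   # shows loaded models and their CPU/GPU split
nvidia-smi                         # shows current VRAM usage on the host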
Port conflicts: Change the mapped ports in docker-compose.yml (for example, 3001:8080 for the UI or 11435:11434 for Ollama) and run docker compose up -d again.
Slow downloads: Models can be large (several GB). Ensure good bandwidth and enough disk space in Docker’s data root and volumes.
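To check how much space Docker's images and volumes are consuming (the ollama volume is where models are stored):
docker system df     # summary of image, container, and volume usage
docker system df -v  # per-volume breakdown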
8) Security and Remote Access
By default, this setup is intended for local access. If you expose ports to the internet, secure them behind a reverse proxy with TLS (e.g., Caddy, Nginx, or Traefik), enable authentication in Open WebUI, and restrict access with a firewall or a VPN like WireGuard or Tailscale. Keep Docker and base images updated to benefit from security patches.
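As one sketch of LAN-only access, you could use ufw to allow the UI port only from your local subnet; the subnet below is a placeholder, substitute your own:
sudo ufw allow OpenSSH
sudo ufw allow from 192.168.1.0/24 to any port 3000 proto tcp   # placeholder subnet - adjust to your LAN
sudo ufw enable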
Wrap-up
You now have a modern, GPU-accelerated local AI stack on Ubuntu with a clean web interface, powered by Ollama and Open WebUI. This setup is easy to maintain, performs well on consumer GPUs, and keeps your data on your own hardware. Add or switch models as needed, tune quantization levels for your GPU, and enjoy private, fast AI inference without relying on external cloud services.