Overview
Running large language models locally is easier than ever. In this guide, you will deploy a private, GPU-accelerated AI stack with Ollama and Open WebUI on Ubuntu 24.04 using Docker. Ollama handles model downloads and inference, while Open WebUI provides a friendly chat interface, prompt library, RAG features, and multi-user management. By the end, you will have a fast, on-prem AI assistant accessible via a browser, without sending data to third parties.
Prerequisites
System: Ubuntu 24.04 (Noble), an NVIDIA GPU (e.g., RTX 3060+), NVIDIA driver 535+ (or latest), at least 16 GB RAM, and reliable internet. You should have sudo access. This tutorial uses Docker; no prior Kubernetes skills required.
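Before you begin, a quick sanity check of the OS release, the GPU, and available memory is worthwhile (the driver itself is installed in Step 1, so skip nvidia-smi for now):
lsb_release -d                  # should report Ubuntu 24.04 (Noble)
lspci | grep -i nvidia          # the NVIDIA GPU should be listed
free -h                         # confirm at least 16 GB of RAM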
Step 1 — Install NVIDIA Driver and Reboot
If you have not installed the proprietary driver, do it now and reboot:
sudo ubuntu-drivers autoinstall
sudo reboot
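After the reboot, confirm the driver loaded and the GPU is visible:
nvidia-smi                      # should list your GPU and the driver version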
Step 2 — Install Docker Engine
Add Docker’s official repository and install the engine:
sudo apt update
sudo apt install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu noble stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
Optional: allow your user to run Docker without sudo and start a new shell:
sudo usermod -aG docker $USER
newgrp docker
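A quick smoke test confirms the engine and the Compose plugin are working:
docker --version
docker compose version
docker run --rm hello-world     # pulls a tiny image and prints a success message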
Step 3 — Enable GPU inside Containers
Install NVIDIA Container Toolkit so Docker can use your GPU:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify GPU access from Docker:
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
Step 4 — Create Network and Volumes
Create an isolated Docker network and persistent volumes for data:
docker network create ai
docker volume create ollama
docker volume create openwebui
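You can verify that the network and volumes exist before moving on:
docker network ls --filter name=ai
docker volume ls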
Step 5 — Run Ollama (GPU-accelerated)
Start the Ollama server and keep it local-only on port 11434. The volume stores models and caches:
docker run -d --name ollama --gpus all --restart unless-stopped --network ai -p 127.0.0.1:11434:11434 -v ollama:/root/.ollama ollama/ollama:latest
Pull a starter model (choose one that fits your GPU memory):
docker exec -it ollama ollama pull llama3.1:8b
docker exec -it ollama ollama pull qwen2.5:7b-instruct
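Once the pulls finish, confirm the models are available and the server responds:
docker exec -it ollama ollama list          # lists downloaded models
curl http://127.0.0.1:11434/api/tags        # same information via the local API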
Step 6 — Run Open WebUI
Open WebUI will use Ollama via the internal Docker network. Expose the web interface on port 3000:
docker run -d --name open-webui --restart unless-stopped --network ai -e OLLAMA_BASE_URL=http://ollama:11434 -p 3000:8080 -v openwebui:/app/backend/data ghcr.io/open-webui/open-webui:latest
Open a browser and visit http://SERVER_IP:3000/. Create the first admin user. In the model selector, choose the model you pulled (e.g., llama3.1:8b) and start chatting.
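If the page does not load, check that the container is running and inspect its logs:
docker ps --filter name=open-webui
docker logs --tail 50 open-webui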
Step 7 — Test the API
You can also use the local API directly. From the host:
curl http://127.0.0.1:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"Say hello in one sentence."}'
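The generate endpoint streams JSON objects by default. If you prefer a single response object, disable streaming; the sketch below also pipes the result through jq (install it with sudo apt install -y jq) to print only the generated text:
curl -s http://127.0.0.1:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"Say hello in one sentence.","stream":false}' | jq -r '.response'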
Developers can point tools like LangChain or LlamaIndex at http://127.0.0.1:11434 for private inference.
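Recent Ollama releases also expose an OpenAI-compatible API under /v1, so many SDKs work by simply changing their base URL. A quick check against the chat completions endpoint, assuming the llama3.1:8b model pulled earlier:
curl http://127.0.0.1:11434/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"llama3.1:8b","messages":[{"role":"user","content":"Say hello in one sentence."}]}'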
Optimization Tips
If you see out-of-memory errors, switch to smaller or more aggressively quantized models (e.g., q4_K_M) when pulling: ollama pull llama3.1:8b-instruct-q4_K_M. Keep prompts concise and lower the context length in Open WebUI settings. On multi-GPU systems, Ollama auto-detects devices; if needed, restrict which GPUs it uses with CUDA_VISIBLE_DEVICES (passed to the container with -e) and limit how many layers are offloaded with the num_gpu model option. Always use persistent volumes to avoid re-downloading models after updates.
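To see which models are currently loaded and how much GPU memory is in use during a session, run these on the host (ollama ps is available in recent Ollama versions):
docker exec -it ollama ollama ps     # loaded models and whether they run on GPU or CPU
nvidia-smi                           # overall VRAM usage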
Security Hardening
Because Ollama is bound to 127.0.0.1, its API is not reachable from other machines. Expose Open WebUI only to trusted networks. If you must publish it on the internet, put it behind a reverse proxy with TLS and authentication. For example, with UFW, allow only your LAN:
sudo ufw allow from 192.168.0.0/16 to any port 3000 proto tcp
For Nginx or Caddy, enable HTTPS and basic auth or OIDC. Never expose port 11434 directly without protection.
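If Open WebUI will only ever be reached through a reverse proxy running on the same host, you can bind it to loopback as well so nothing is published on the LAN. A sketch that re-creates the container from Step 6 with a loopback bind (your proxy then forwards to 127.0.0.1:3000):
docker rm -f open-webui
docker run -d --name open-webui --restart unless-stopped --network ai -e OLLAMA_BASE_URL=http://ollama:11434 -p 127.0.0.1:3000:8080 -v openwebui:/app/backend/data ghcr.io/open-webui/open-webui:latest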
Troubleshooting
GPU not detected in containers: Re-run sudo nvidia-ctk runtime configure --runtime=docker, then sudo systemctl restart docker. Confirm host drivers with nvidia-smi and container access with the CUDA test image.
Permission errors with Docker: Add your user to the docker group (sudo usermod -aG docker $USER) and re-login.
Slow responses or model crashes: Try a smaller/quantized model, reduce context window, and verify VRAM usage. Ensure swap is enabled for stability when RAM is tight.
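If the host has no swap yet, a swap file is quick to add; a minimal sketch assuming an 8 GB file suits your disk (adjust the size as needed):
swapon --show                        # empty output means no swap is active
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # persist across reboots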
Update containers: pull the new images, then remove and re-create the containers (a plain docker restart would keep running the old image); the named volumes preserve your models and chat data. See the sequence below.
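A minimal update sequence:
docker pull ollama/ollama:latest
docker pull ghcr.io/open-webui/open-webui:latest
docker rm -f ollama open-webui
Then re-run the docker run commands from Steps 5 and 6; models and chat history persist in the named volumes.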
Cleanup (Optional)
To stop and remove everything:
docker rm -f open-webui ollama
docker volume rm openwebui ollama
docker network rm ai
What You Achieved
You now have a modern, private AI stack with GPU acceleration on Ubuntu 24.04. Ollama simplifies model management and inference, while Open WebUI offers a polished interface ready for daily use, prototyping, and team collaboration. With careful model choice, proper security, and regular updates, this setup can replace many cloud-based assistants—keeping your data on your hardware.