Deploy Ollama and Open WebUI on Ubuntu 24.04 with NVIDIA GPU Acceleration (Step-by-Step)

Overview

Running large language models locally is easier than ever with Ollama (a lightweight LLM runtime) and Open WebUI (a modern chat interface). This step-by-step guide shows how to install Ollama on Ubuntu 24.04 (or 22.04), enable NVIDIA GPU acceleration, and connect Open WebUI via Docker. By the end, you will have a fast, private, and browser-based AI chat system that can run models like Llama 3, Mistral, and Phi on your own hardware.

Prerequisites

• Ubuntu 24.04 LTS (or 22.04)
• An NVIDIA GPU with recent drivers (Turing/RTX or newer recommended)
• At least 16 GB RAM and 20+ GB free disk (models are large)
• sudo access and an internet connection

Step 1 — Update Ubuntu and install essentials

sudo apt update && sudo apt -y upgrade
sudo apt -y install curl ca-certificates gnupg git

Step 2 — Install the NVIDIA driver

If you do not already have the proprietary driver installed, use Ubuntu’s driver manager. Reboot after installation.

sudo ubuntu-drivers autoinstall
sudo reboot

After reboot, verify the driver is working with nvidia-smi. You should see your GPU listed and the driver version.
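For example:
nvidia-smi
If the command reports that it could not communicate with the NVIDIA driver, the kernel module is not loaded; reinstall the driver and reboot before continuing.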

Step 3 — Install Docker and the NVIDIA Container Toolkit

Install Docker (the docker.io package from Ubuntu's repository is sufficient for this use case), then add NVIDIA's container runtime so GPU workloads can run inside containers when needed.

Install Docker and the Compose plugin:
sudo apt -y install docker.io docker-compose-v2
sudo usermod -aG docker $USER
newgrp docker
Alternatively, log out and back in so the group change takes effect in all sessions.

Add NVIDIA Container Toolkit repo and install:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt -y install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Although Open WebUI does not need GPU access, having the NVIDIA runtime ensures any future GPU-enabled containers work smoothly.
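
To confirm the runtime is wired up, you can run nvidia-smi inside a throwaway container. The CUDA image tag below is only an example; any recent nvidia/cuda base tag works:
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi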

Step 4 — Install Ollama (GPU-enabled)

Ollama provides pre-optimized builds and automatically uses your NVIDIA GPU when available.

curl -fsSL https://ollama.com/install.sh | sh

Enable and start the service (on recent Ubuntu, the installer sets this up automatically):
sudo systemctl enable --now ollama

Verify installation:
ollama --version

Test pulling a model (downloads may be several GB):
ollama pull llama3.1:8b
Run a quick prompt (pass it as an argument, or omit it to start an interactive session and type /bye to exit):
ollama run llama3.1:8b "Say hello in one short sentence."
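
You can also confirm the Ollama API is listening on port 11434 (Open WebUI talks to this endpoint in the next step); the prompt is arbitrary:
curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b", "prompt": "Why is the sky blue?", "stream": false}'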

Step 5 — Deploy Open WebUI with Docker

Open WebUI offers a clean multi-user chat interface for local models. We will point it at the Ollama API on the host. Create a project folder, then a basic Compose file.

mkdir -p ~/open-webui && cd ~/open-webui

Create docker-compose.yml with the following content (one service):

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    extra_hosts:
      - "host.docker.internal:host-gateway"
    restart: unless-stopped
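
Optionally, persist chats, accounts, and settings across container re-creation by mounting a named volume at /app/backend/data, the directory Open WebUI stores its data in (the volume name open-webui is arbitrary):

services:
  open-webui:
    # ...same service definition as above, plus:
    volumes:
      - open-webui:/app/backend/data

volumes:
  open-webui: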

Start the container:
docker compose up -d

Open your browser and go to http://<your-server-ip>:3000. Create the first admin account when prompted. Open WebUI should auto-detect the Ollama endpoint via OLLAMA_BASE_URL. If not, set it under Settings → Connections.
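
If the page does not load, confirm the container is running and check its startup logs:
docker ps --filter name=open-webui
docker logs -f open-webui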

Step 6 — Verify GPU acceleration

While generating a response in Open WebUI, run nvidia-smi in a terminal. You should see GPU utilization and memory usage spike. If usage is zero during generation, check driver versions and restart the Ollama service.
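
Two quick ways to watch this from the terminal (ollama ps lists each loaded model and shows whether it is running on the GPU or the CPU):
watch -n 1 nvidia-smi
ollama ps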

Step 7 — Manage and optimize models

List installed models:
ollama list

Pull additional models:
ollama pull mistral:7b
ollama pull phi3:mini

Set a default model in Open WebUI under Settings → Models, or choose one per chat. For better performance on limited VRAM, prefer quantized variants (e.g., q4_0 or q6_K tags where available). You can also tune how many requests Ollama serves in parallel via the OLLAMA_NUM_PARALLEL environment variable; note that exporting it in your shell has no effect on the systemd-managed service, so set it through a service override as sketched below.
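
A minimal override (the value 2 is only an example; size it to your VRAM and workload):
sudo systemctl edit ollama
Add the following in the editor and save:
[Service]
Environment="OLLAMA_NUM_PARALLEL=2"
Then reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama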

Advanced users can build a custom Modelfile to add system prompts or LoRA adapters. Example skeleton:
FROM llama3.1:8b
SYSTEM You are a concise technical assistant.

Apply it with:
ollama create my-tech -f Modelfile
Then select my-tech in Open WebUI.
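
The Modelfile also accepts PARAMETER lines if you want to bake in sampling defaults; the values here are only examples:
PARAMETER temperature 0.7
PARAMETER num_ctx 4096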

Troubleshooting

No GPU usage: Ensure nvidia-smi shows the driver. If Ollama still uses CPU, update to the latest Ollama, confirm you installed the proprietary NVIDIA driver, and reboot. Some headless servers require sudo apt -y install nvidia-driver-### and a kernel headers update.

Docker cannot reach Ollama: The Compose file maps host.docker.internal to the Linux host via host-gateway, which requires a reasonably recent Docker; if yours is old, update Docker or replace the endpoint with your host IP, for example http://192.168.1.50:11434. Also note that the Ollama service listens only on 127.0.0.1 by default, so requests arriving from the Docker bridge may be refused; in that case set Environment="OLLAMA_HOST=0.0.0.0" on the service (see the port-conflict entry below for the systemctl edit procedure) and restart it.

Port conflicts (3000 or 11434): Change 3000:8080 in the Compose file to another host port (e.g., 8081:8080). For Ollama, change the listen port by editing its service environment and restarting: sudo systemctl edit ollama then set Environment="OLLAMA_HOST=0.0.0.0:11435", save, sudo systemctl daemon-reload && sudo systemctl restart ollama.

Disk space: Models can exceed 10 GB each. Remove models you no longer need with ollama rm modelname; ollama list shows the size of each installed model.
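
To see where the space is going, check the model store (this path is the default when Ollama runs as the systemd service; when run as your own user it is ~/.ollama/models instead):
du -sh /usr/share/ollama/.ollama/models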

Security and maintenance tips

• Do not expose the Ollama API publicly without a reverse proxy and authentication.
• Use a firewall such as ufw, but note that ports published by Docker bypass ufw rules; to keep Open WebUI reachable only from the local machine, bind the published port to loopback in the Compose file (e.g., "127.0.0.1:3000:8080").
• Keep components updated: sudo apt update && sudo apt -y upgrade for the OS, re-run the Ollama install script (curl -fsSL https://ollama.com/install.sh | sh) to update Ollama, and docker compose pull && docker compose up -d for Open WebUI.

Conclusion

You now have a GPU-accelerated local LLM stack with Ollama and Open WebUI on Ubuntu. This setup is fast, private, and flexible: switch models on demand, create custom prompts, and iterate safely on your own hardware. With regular updates and a few optimizations, it can rival many hosted AI experiences without sending data to the cloud.
