Overview
This guide shows how to deploy Ollama and Open WebUI on Ubuntu so you can run large language models (LLMs) locally with NVIDIA GPU acceleration. You will install Docker and the NVIDIA Container Toolkit, run the Ollama API, connect Open WebUI as a front end, and pull a model such as Llama 3.1. This setup is fast, private, and easy to maintain.
Prerequisites
Before you start, make sure you have: Ubuntu 22.04 or later, an NVIDIA GPU with a recent driver (525+), sudo access, internet connectivity, and open ports 11434 (Ollama) and 3000 (Open WebUI). If you have an existing Docker installation, ensure it is up to date.
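A quick way to confirm the basics before you begin (the grep prints nothing if both ports are free):
lsb_release -ds
# check that ports 11434 and 3000 are not already in use
ss -tlnp | grep -E ':11434|:3000'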
1) Install NVIDIA driver and verify GPU
Install a stable NVIDIA driver from Ubuntu’s repository, reboot, and verify the GPU is visible:
sudo apt update
sudo apt install -y nvidia-driver-535
sudo reboot
# After reboot:
nvidia-smi
If nvidia-smi prints your GPU details, the driver is working. If not, check whether Secure Boot is enabled: it blocks unsigned kernel modules, so either disable it or enroll a key and sign the NVIDIA module.
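If you suspect Secure Boot, you can check its state with mokutil (install it first if needed):
sudo apt install -y mokutil
mokutil --sb-state
# "SecureBoot enabled" means unsigned kernel modules will be rejected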
2) Install Docker and enable GPU in containers
Install Docker using the official convenience script, add your user to the docker group, then install the NVIDIA Container Toolkit so containers can access the GPU.
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
# newgrp applies the group change in the current shell; alternatively, log out and back in
newgrp docker
# NVIDIA Container Toolkit
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify GPU works inside Docker:
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
If the last command shows GPU output, you are ready to run GPU-enabled containers.
3) Deploy Ollama (LLM runtime)
Ollama serves models locally via an HTTP API. Create a volume for persistent model storage and run the container with GPU support:
docker volume create ollama
docker run -d --name ollama --gpus all \
-p 11434:11434 \
-e OLLAMA_HOST=0.0.0.0:11434 \
-v ollama:/root/.ollama \
ollama/ollama:latest
Pull a model to test. Quantized models use less VRAM; llama3.1:8b is a good starting point on 8–12 GB GPUs.
docker exec -it ollama ollama pull llama3.1:8b
# Quick test (CLI in the container):
docker exec -it ollama ollama run llama3.1:8b
If the model loads and you can send a prompt, Ollama is ready.
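You can also exercise the HTTP API directly from the host. This uses Ollama's /api/generate endpoint with streaming disabled, so the response arrives as a single JSON object:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain what a Docker volume is in one sentence.",
  "stream": false
}'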
4) Deploy Open WebUI (front end)
Open WebUI provides a user-friendly chat interface with features such as prompt presets and document uploads. Create an isolated network, connect Ollama to it, and run Open WebUI:
docker network create ai
docker network connect ai ollama
docker volume create openwebui
docker run -d --name open-webui --network ai \
-p 3000:8080 \
-e OLLAMA_BASE_URL=http://ollama:11434 \
-v openwebui:/app/backend/data \
ghcr.io/open-webui/open-webui:latest
Open a browser and go to http://<server-ip>:3000. The first user to sign up becomes the administrator. After creating the admin account, open the admin settings and disable new sign-ups if you want to restrict access.
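If the page does not load, a few checks from the server itself usually narrow things down (a 200 status code assumes the container started cleanly):
docker ps --filter name=open-webui
# expect HTTP 200 once the app has finished starting
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:3000
docker logs open-webui --tail 50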
5) Use your local AI
In Open WebUI, pick the model you pulled (e.g., llama3.1:8b) and start chatting. You can pull more models from the “Models” area or via:
docker exec -it ollama ollama pull mistral:7b
docker exec -it ollama ollama pull neural-chat:7b
Tip: If a model fails to load due to VRAM limits, choose a smaller or more aggressively quantized variant (e.g., Q4 or 4-bit builds).
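Most models in the Ollama library publish several quantization tags. The tag below is only an example and may change, so verify it on the model's library page before pulling:
# list installed models and their sizes
docker exec -it ollama ollama list
# pull a more aggressively quantized build (example tag; check the library for current names)
docker exec -it ollama ollama pull llama3.1:8b-instruct-q4_K_M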
6) Update and maintenance
To update to the latest images while keeping your data, pull and recreate the containers with the same volumes:
# Update Ollama
docker pull ollama/ollama:latest
docker stop ollama && docker rm ollama
docker run -d --name ollama --gpus all \
-p 11434:11434 -e OLLAMA_HOST=0.0.0.0:11434 \
-v ollama:/root/.ollama --network ai \
ollama/ollama:latest
# Update Open WebUI
docker pull ghcr.io/open-webui/open-webui:latest
docker stop open-webui && docker rm open-webui
docker run -d --name open-webui --network ai \
-p 3000:8080 \
-e OLLAMA_BASE_URL=http://ollama:11434 \
-v openwebui:/app/backend/data \
ghcr.io/open-webui/open-webui:latest
Models and settings persist in the Docker volumes. Back up these volumes regularly with your usual server backup process.
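One simple way to back them up is to tar each volume from a throwaway container; this sketch assumes a local ./backups directory and the volume names used above:
mkdir -p backups
docker run --rm -v ollama:/data -v "$(pwd)/backups":/backup alpine \
  tar czf /backup/ollama-volume.tar.gz -C /data .
docker run --rm -v openwebui:/data -v "$(pwd)/backups":/backup alpine \
  tar czf /backup/openwebui-volume.tar.gz -C /data .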
7) Troubleshooting
If Open WebUI cannot talk to Ollama, confirm both containers share the same network and that OLLAMA_BASE_URL points to http://ollama:11434. Use docker logs open-webui to check errors.
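To confirm both containers are attached to the ai network, inspect it and look for both names:
docker network inspect ai --format '{{range .Containers}}{{.Name}} {{end}}'
# expected output includes both "ollama" and "open-webui"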
If the GPU is not used, verify nvidia-smi inside a container works and the Docker daemon has the NVIDIA runtime configured. Also confirm you started Ollama with --gpus all. For small VRAM, prefer smaller models (e.g., 7–8B) and quantized builds.
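Two quick checks from inside the Ollama container show whether the GPU is visible and whether a loaded model is actually using it:
docker exec -it ollama nvidia-smi
# the PROCESSOR column should report GPU rather than CPU for a loaded model
docker exec -it ollama ollama ps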
If generation is slow, check CPU and GPU utilization with top and nvidia-smi. Keeping models on SSD storage and avoiding swap also reduces latency, and it is worth restarting long-running containers after driver updates.
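For live monitoring while a prompt is running, refreshing nvidia-smi is usually enough to spot the bottleneck:
watch -n 2 nvidia-smi
# sustained high CPU in top with low GPU utilization suggests the model fell back to CPU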
What you get
With Ollama and Open WebUI on Ubuntu, you have a private, GPU-accelerated local AI stack. You can chat, summarize, and prototype apps against the Ollama API at http://<server-ip>:11434, while Open WebUI provides a polished interface for everyday use.
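As a starting point for prototyping, the /api/chat endpoint accepts a list of messages; replace <server-ip> with your host's address:
curl http://<server-ip>:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "Summarize what Open WebUI does in two sentences."}],
  "stream": false
}'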