Overview
This tutorial shows you how to deploy a fast, private, local AI stack on Ubuntu using Ollama and Open WebUI with NVIDIA GPU acceleration. You will install the NVIDIA driver, Docker, and the NVIDIA Container Toolkit, then run Ollama on the host and Open WebUI in a container. By the end, you will have a browser-based interface to run powerful large language models (LLMs) like Llama 3 with CUDA acceleration on your own machine.
Prerequisites
- Ubuntu 22.04 or 24.04 (freshly updated).
- An NVIDIA GPU with at least 6 GB VRAM (more is better).
- sudo privileges and Internet access.
- Optional: a domain or reverse proxy if you plan to expose the UI externally.
Step 1 — Install NVIDIA Driver
Use Ubuntu’s built-in tools to install a compatible proprietary driver. Reboot afterward and confirm the GPU is detected.
sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot
nvidia-smi
If you see a table with your GPU and driver version (e.g., 535+), you are ready for CUDA-enabled workloads.
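For a quick scripted check, nvidia-smi can also report just the GPU name, driver version, and total memory (standard query flags; the exact output format may vary slightly between driver versions):
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv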
Step 2 — Install Docker Engine
If Docker is not installed, use the official convenience script. Add your user to the docker group so you can run containers without sudo.
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker
docker version
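As a quick sanity check that your user can talk to the Docker daemon without sudo, run the standard hello-world image:
docker run --rm hello-world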
Step 3 — Enable GPU Access in Containers
Install the NVIDIA Container Toolkit so Docker can pass the GPU into containers.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify GPU visibility inside a container:
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
Step 4 — Install Ollama (runs on the host)
Ollama simplifies downloading and running LLMs locally. It automatically uses CUDA if your NVIDIA driver is installed.
curl -fsSL https://ollama.com/install.sh | sh
Confirm the service is active and the API is reachable on port 11434:
systemctl status ollama
curl http://127.0.0.1:11434/api/tags
Pull and test a model (replace with your preferred model/quantization):
ollama pull llama3
ollama run llama3 "Write a two-line poem about GPUs."
Tip: Use smaller quantizations if VRAM is limited, for example llama3:8b-instruct-q4_0.
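For example, to pull a smaller quantized build and then list what is installed locally (exact tags depend on what the Ollama library currently publishes):
ollama pull llama3:8b-instruct-q4_0
ollama list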
Step 5 — Deploy Open WebUI in Docker
Open WebUI provides a clean, modern interface for chatting with models served by Ollama. We will run it in Docker and point it to the host’s Ollama API. On Linux, add a host-gateway entry so the container can reach the host at host.docker.internal.
docker run -d --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --gpus all \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
Open your browser at http://<server-ip>:3000. On first login, create a user; that account becomes the admin. Authentication is enabled by default; if you want a single-user setup with no sign-in screen, add -e WEBUI_AUTH=False to the run command.
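If the page does not load, check the container logs and confirm the published port answers (3000 here matches the -p mapping above):
docker logs --tail 50 open-webui
curl -I http://localhost:3000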
Alternative: If the host-gateway special value for --add-host is not supported by your Docker version, use host networking and point to 127.0.0.1:
docker run -d --name open-webui \
  --network host \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  --gpus all \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
With host networking, Open WebUI listens on http://0.0.0.0:8080 (no -p flag needed).
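A quick reachability check for the host-networking variant (assuming the default port 8080):
curl -I http://127.0.0.1:8080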
Step 6 — Use and Tune Your Local AI
From Open WebUI, select a model (e.g., Llama 3) and start chatting. You can pull additional models with Ollama CLI and they will appear in the UI. To speed up responses and reduce VRAM, try smaller or more aggressive quantizations; to maximize quality, try larger quantizations if your GPU can handle them.
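For example, to add a couple more models from the Ollama library (model names here are illustrative; check the library for current tags) and confirm they are available:
ollama pull mistral
ollama pull phi3
ollama list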
Common environment variables for Open WebUI include (a combined example follows this list):
- WEBUI_AUTH to control sign-in (enabled by default; set WEBUI_AUTH=False for a single-user setup without login).
- OLLAMA_BASE_URL to point to the Ollama server URL.
- PORT to customize the UI port if you use host networking.
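As a sketch, a host-networking run that sets all three might look like this (the values shown are examples, not requirements):
docker run -d --name open-webui \
  --network host \
  -e WEBUI_AUTH=True \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  -e PORT=8080 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main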
Troubleshooting
Open WebUI cannot reach Ollama: Ensure you used --add-host=host.docker.internal:host-gateway and OLLAMA_BASE_URL=http://host.docker.internal:11434, or use host networking. Test connectivity with docker exec -it open-webui curl -s http://host.docker.internal:11434/api/tags.
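If that curl test still fails with the add-host flag in place, a common cause is Ollama listening only on 127.0.0.1, which containers cannot reach through the Docker bridge address. Assuming the systemd install, you can make the service listen on all interfaces via the documented OLLAMA_HOST variable (and then restrict access with a firewall if the machine is reachable from outside):
sudo systemctl edit ollama
# add in the override file:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl restart ollama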
No GPU in containers: Re-check the container toolkit setup and driver. Run docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi. If it fails, reboot and ensure nvidia-smi works on the host first.
Out-of-memory errors: Use a smaller model or more compressed quantization. Close other GPU-heavy apps. You can also run with a larger system swap to reduce crashes when VRAM is exhausted, but performance will be slower.
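To see how much VRAM a model actually uses while it runs, watch GPU memory and check what Ollama has loaded (ollama ps is available in recent Ollama versions):
watch -n 1 nvidia-smi
ollama ps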
Docker permissions: If you see “permission denied,” ensure your user is in the docker group (run id to verify membership), then run newgrp docker or log out and back in.
Optional: Reverse Proxy and TLS
If exposing Open WebUI on the Internet, put it behind a reverse proxy (Caddy, Nginx, or Traefik) for HTTPS and access control. At minimum, enforce authentication and limit access to trusted IPs. Never expose Ollama’s port 11434 directly without protection.
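As a minimal sketch for Caddy v2, assuming a hypothetical domain chat.example.com pointing at this server and Open WebUI published on port 3000, a Caddyfile like the following proxies HTTPS traffic to the UI (Caddy obtains certificates automatically by default):
chat.example.com {
    reverse_proxy 127.0.0.1:3000
}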
Maintenance
- Update Ollama periodically by re-running the install script or checking the project release notes, then sudo systemctl restart ollama.
- Update Open WebUI by pulling the new image and recreating the container (docker restart alone will not pick up a new image); see the sketch after this list.
- Prune old images and volumes with docker system prune (review carefully before confirming).
- Back up the Ollama models directory (by default /usr/share/ollama/.ollama/models for the systemd install, or ~/.ollama/models when Ollama runs as your own user) and the open-webui Docker volume, which holds settings and chats.
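A sketch of the Open WebUI update and backup steps (re-run your Step 5 docker run command after removing the container; the named volume keeps your data):
docker pull ghcr.io/open-webui/open-webui:main
docker stop open-webui && docker rm open-webui
# re-run the docker run command from Step 5
docker run --rm -v open-webui:/data -v "$PWD":/backup alpine \
  tar czf /backup/open-webui-data.tar.gz -C /data .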
You now have a modern, GPU-accelerated, private AI chat environment running locally on Ubuntu. This setup is fast, secure, and fully under your control—and you can expand it with additional models, prompt libraries, and integrations as your needs grow.