Overview
This tutorial shows you how to deploy a fast, private, local AI stack on Ubuntu using Ollama and Open WebUI with NVIDIA GPU acceleration. You will install the NVIDIA driver, Docker, and the NVIDIA Container Toolkit, then run Ollama on the host and Open WebUI in a container. By the end, you will have a browser-based interface to run powerful large language models (LLMs) like Llama 3 with CUDA acceleration on your own machine.
Prerequisites
- Ubuntu 22.04 or 24.04 (freshly updated).
- An NVIDIA GPU with at least 6 GB VRAM (more is better).
- sudo privileges and Internet access.
- Optional: a domain or reverse proxy if you plan to expose the UI externally.
Step 1 — Install NVIDIA Driver
Use Ubuntu’s built-in tools to install a compatible proprietary driver. Reboot afterward and confirm the GPU is detected.
sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot
nvidia-smi
If you see a table with your GPU and driver version (e.g., 535+), you are ready for CUDA-enabled workloads.
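For a quick scripted check, nvidia-smi can also report just the GPU name, driver version, and total memory (standard query flags; the exact output format may vary slightly between driver versions):
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv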
Step 2 — Install Docker Engine
If Docker is not installed, use the official convenience script. Add your user to the docker group so you can run containers without sudo.
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker
docker version
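As a quick sanity check that your user can talk to the Docker daemon without sudo, run the standard hello-world image:
docker run --rm hello-world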
Step 3 — Enable GPU Access in Containers
Install the NVIDIA Container Toolkit so Docker can pass the GPU into containers.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify GPU visibility inside a container:
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
Step 4 — Install Ollama (runs on the host)
Ollama simplifies downloading and running LLMs locally. It automatically uses CUDA if your NVIDIA driver is installed.
curl -fsSL https://ollama.com/install.sh | sh
Confirm the service is active and the API is reachable on port 11434:
systemctl status ollama
curl http://127.0.0.1:11434/api/tags
Pull and test a model (replace with your preferred model/quantization):
ollama pull llama3
ollama run llama3 "Write a two-line poem about GPUs."
Tip: Use smaller quantizations if VRAM is limited, for example llama3:8b-instruct-q4_0.
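For example, to pull a smaller quantized build and then list what is installed locally (exact tags depend on what the Ollama library currently publishes):
ollama pull llama3:8b-instruct-q4_0
ollama list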
Step 5 — Deploy Open WebUI in Docker
Open WebUI provides a clean, modern interface for chatting with models served by Ollama. We will run it in Docker and point it to the host’s Ollama API. On Linux, add a host-gateway entry so the container can reach the host at host.docker.internal.
docker run -d --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --gpus all \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
Open your browser at http://<server-ip>:3000. On first login, create a user; that account becomes the admin. Authentication is enabled by default; if you want a single-user setup with no sign-in screen, add -e WEBUI_AUTH=False to the run command.
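If the page does not load, check the container logs and confirm the published port answers (3000 here matches the -p mapping above):
docker logs --tail 50 open-webui
curl -I http://localhost:3000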
Alternative: If the host-gateway special value for --add-host is not supported by your Docker version, use host networking and point to 127.0.0.1:
docker run -d --name open-webui \
  --network host \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  --gpus all \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
With host networking, Open WebUI listens on http://0.0.0.0:8080 (no -p flag needed).
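A quick reachability check for the host-networking variant (assuming the default port 8080):
curl -I http://127.0.0.1:8080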
Step 6 — Use and Tune Your Local AI
From Open WebUI, select a model (e.g., Llama 3) and start chatting. You can pull additional models with Ollama CLI and they will appear in the UI. To speed up responses and reduce VRAM, try smaller or more aggressive quantizations; to maximize quality, try larger quantizations if your GPU can handle them.
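For example, to add a couple more models from the Ollama library (model names here are illustrative; check the library for current tags) and confirm they are available:
ollama pull mistral
ollama pull phi3
ollama list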
Common environment variables for Open WebUI include (a combined example follows this list):
- WEBUI_AUTH to control sign-in (enabled by default; set WEBUI_AUTH=False for a single-user setup without login).
- OLLAMA_BASE_URL to point to the Ollama server URL.
- PORT to customize the UI port if you use host networking.
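As a sketch, a host-networking run that sets all three might look like this (the values shown are examples, not requirements):
docker run -d --name open-webui \
  --network host \
  -e WEBUI_AUTH=True \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  -e PORT=8080 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main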
Troubleshooting
Open WebUI cannot reach Ollama: Ensure you used --add-host=host.docker.internal:host-gateway and OLLAMA_BASE_URL=http://host.docker.internal:11434, or use host networking. Test connectivity with docker exec -it open-webui curl -s http://host.docker.internal:11434/api/tags.
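If that curl test still fails with the add-host flag in place, a common cause is Ollama listening only on 127.0.0.1, which containers cannot reach through the Docker bridge address. Assuming the systemd install, you can make the service listen on all interfaces via the documented OLLAMA_HOST variable (and then restrict access with a firewall if the machine is reachable from outside):
sudo systemctl edit ollama
# add in the override file:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl restart ollama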
No GPU in containers: Re-check the container toolkit setup and driver. Run docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi. If it fails, reboot and ensure nvidia-smi works on the host first.
Out-of-memory errors: Use a smaller model or more compressed quantization. Close other GPU-heavy apps. You can also run with a larger system swap to reduce crashes when VRAM is exhausted, but performance will be slower.
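To see how much VRAM a model actually uses while it runs, watch GPU memory and check what Ollama has loaded (ollama ps is available in recent Ollama versions):
watch -n 1 nvidia-smi
ollama ps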
Docker permissions: If you see “permission denied,” ensure your user is in the docker group (run id to verify membership), then run newgrp docker or log out and back in.
Optional: Reverse Proxy and TLS
If exposing Open WebUI on the Internet, put it behind a reverse proxy (Caddy, Nginx, or Traefik) for HTTPS and access control. At minimum, enforce authentication and limit access to trusted IPs. Never expose Ollama’s port 11434 directly without protection.
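As a minimal sketch for Caddy v2, assuming a hypothetical domain chat.example.com pointing at this server and Open WebUI published on port 3000, a Caddyfile like the following proxies HTTPS traffic to the UI (Caddy obtains certificates automatically by default):
chat.example.com {
    reverse_proxy 127.0.0.1:3000
}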
Maintenance
- Update Ollama periodically by re-running the install script or checking the project release notes, then sudo systemctl restart ollama.
- Update Open WebUI by pulling the new image and recreating the container (docker restart alone will not pick up a new image); see the sketch after this list.
- Prune old images and volumes with docker system prune (review carefully before confirming).
- Back up the Ollama models directory (by default /usr/share/ollama/.ollama/models for the systemd install, or ~/.ollama/models when Ollama runs as your own user) and the open-webui Docker volume, which holds settings and chats.
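A sketch of the Open WebUI update and backup steps (re-run your Step 5 docker run command after removing the container; the named volume keeps your data):
docker pull ghcr.io/open-webui/open-webui:main
docker stop open-webui && docker rm open-webui
# re-run the docker run command from Step 5
docker run --rm -v open-webui:/data -v "$PWD":/backup alpine \
  tar czf /backup/open-webui-data.tar.gz -C /data .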
You now have a modern, GPU-accelerated, private AI chat environment running locally on Ubuntu. This setup is fast, secure, and fully under your control—and you can expand it with additional models, prompt libraries, and integrations as your needs grow.