Overview: This step-by-step guide shows how to run local large language models with GPU acceleration using Ollama and Open WebUI. You will install Ollama to serve models like Llama 3 or Mistral, and deploy Open WebUI as an easy web interface. The tutorial covers Ubuntu 22.04/24.04 and Windows 11 via WSL2, with security, auto-start, and troubleshooting tips.
What you need: A 16 GB RAM system (32 GB recommended), an NVIDIA GPU with recent drivers (8 GB+ VRAM recommended), admin/root access, a stable internet connection, and around 20–30 GB of free disk space for models and containers.
Why this stack? Ollama provides a simple runtime and model manager for local LLMs, while Open WebUI gives a modern, browser-based chat interface, prompt management, RAG integrations, and multi-model switching. Both are lightweight and work on a single machine.
Step 1 — Prepare GPU drivers
Ubuntu (bare metal/VM with GPU passthrough):
sudo apt update && sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot
After reboot, verify:
nvidia-smi
If you see your GPU, the driver is good. Ollama downloads the CUDA user-space libs it needs automatically; you only need a working NVIDIA driver.
Windows 11 with WSL2:
Install the latest NVIDIA Game Ready/Studio driver (535+), then update WSL:
wsl --update
wsl --shutdown
Open your Ubuntu WSL distro and confirm you are inside WSL:
echo $WSL_DISTRO_NAME
Ollama will use the GPU via WSL automatically if the Windows driver supports it.
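To double-check GPU passthrough, recent Windows drivers also expose nvidia-smi inside the WSL2 distro:
nvidia-smi
If that fails, update the Windows driver and run wsl --update again.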
Step 2 — Install Docker (for Open WebUI)
Ubuntu:
sudo apt update && sudo apt install -y docker.io
sudo systemctl enable --now docker
sudo usermod -aG docker $USER
newgrp docker
The last command applies the new docker group in the current shell; alternatively, log out and back in.
Windows WSL2:
Install Docker Desktop for Windows and enable the WSL2 integration for your Ubuntu distro. Start Docker Desktop before running containers.
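Either way, confirm Docker works before moving on; hello-world is Docker's own test image:
docker run --rm hello-world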
Step 3 — Install Ollama
Ubuntu/WSL2:
curl -fsSL https://ollama.com/install.sh | sh
Enable the service on Ubuntu (bare metal):
sudo systemctl enable --now ollama
Test by pulling a small model:
ollama pull llama3.1:8b
Run a quick prompt:
ollama run llama3.1:8b "Write a haiku about GPUs."
To confirm the GPU is being used, run ollama ps while the model is loaded; the PROCESSOR column should report GPU rather than CPU. On Ubuntu you can also check the server logs (journalctl -u ollama) for CUDA messages. If the model falls back to CPU, revisit Step 1.
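It is also worth confirming that the API endpoint Open WebUI will talk to is reachable (11434 is Ollama's default port):
curl http://127.0.0.1:11434/api/tags
This returns a JSON list of the models you have pulled so far.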
Step 4 — Deploy Open WebUI
Ubuntu (recommended: host networking):
docker run -d --name open-webui --restart unless-stopped --network host \
-e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
-e WEBUI_AUTH=true \
-v open-webui:/app/backend/data \
ghcr.io/open-webui/open-webui:main
Open your browser at http://127.0.0.1:8080 (with host networking, the container’s internal port is exposed directly). The first account you create becomes the admin account, so register right away.
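If the page does not load, check that the container is running and inspect its logs:
docker ps --filter name=open-webui
docker logs open-webui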
WSL2 with Docker Desktop (no host networking):
docker run -d --name open-webui --restart unless-stopped -p 3000:8080 \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
-e WEBUI_AUTH=true \
-v open-webui:/app/backend/data \
ghcr.io/open-webui/open-webui:main
Browse to http://localhost:3000 and create the first account, which becomes the admin account. OLLAMA_BASE_URL points Open WebUI at the local Ollama API.
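If Open WebUI starts but lists no models, confirm Ollama is reachable from Windows; WSL2 forwards localhost ports by default, so this should work from PowerShell (assuming the default port):
curl.exe http://localhost:11434/api/tags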
Step 5 — Optimize and manage models
Popular models: llama3.1:8b, mistral-nemo:12b, qwen2.5:7b. You can also choose quantized variants like llama3.1:8b-instruct-q4_K_M for lower VRAM usage.
ollama list
ollama pull mistral-nemo:12b
ollama rm modelname:tag
To limit VRAM usage, use smaller models or more aggressively quantized builds. In Open WebUI, select the model per chat. For higher quality, try a 70B model on a GPU with more VRAM; to conserve memory, keep the context window small.
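If you want to pin a smaller context window or a system prompt, Ollama’s Modelfile mechanism covers that; a minimal sketch, assuming llama3.1:8b is already pulled (the derived name and num_ctx value are just examples). Save this as Modelfile:
FROM llama3.1:8b
PARAMETER num_ctx 4096
SYSTEM "You are a concise assistant."
Then build and use it like any other model:
ollama create llama3.1-short -f Modelfile
ollama run llama3.1-short "Hello"
The derived model also shows up in Open WebUI’s model picker.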
Step 6 — Secure and persist
Keep Open WebUI reachable only from localhost if the machine is shared. With host networking the container listens on all interfaces, so rely on the firewall: enabling ufw blocks incoming connections by default (if you administer the machine over SSH, run sudo ufw allow ssh first).
sudo ufw enable
sudo ufw status
Only open the web port if other machines on your LAN should reach the UI:
sudo ufw allow 8080/tcp comment 'Open WebUI (LAN)'
If you must expose it beyond your network, use a reverse proxy with TLS (Nginx, Caddy, or Traefik) and keep WEBUI_AUTH=true; a minimal Caddy example follows.
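The sketch below assumes a hypothetical domain ai.example.com pointing at this machine, Caddy installed from its official package, and Open WebUI on port 8080; Caddy obtains the TLS certificate automatically. Add this to /etc/caddy/Caddyfile:
ai.example.com {
    reverse_proxy 127.0.0.1:8080
}
Then reload with sudo systemctl reload caddy. With the proxy in place, you can keep port 8080 closed in ufw and only allow 80/443.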
The commands above mount a named Docker volume (open-webui) at /app/backend/data, so chats, accounts, and settings survive container upgrades. Ollama stores its models under ~/.ollama when run as your user, or under /usr/share/ollama/.ollama for the Ubuntu systemd service; models can always be re-pulled, but back up the Open WebUI data regularly.
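A simple way to do that, assuming the open-webui volume from this guide (the archive name is just an example), is to archive the volume from a throwaway Alpine container:
docker run --rm -v open-webui:/data -v "$PWD":/backup alpine tar czf /backup/open-webui-data.tar.gz -C /data .
Restoring is the same pattern with tar xzf instead of tar czf.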
Step 7 — Auto-start on boot
Ollama installs a systemd service on Ubuntu. Ensure Docker is enabled (already done) and containers use --restart unless-stopped so Open WebUI comes up after reboots. On Windows, set Docker Desktop to start on login.
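To verify the boot configuration in one pass, the following reads back the service state and the container restart policy set earlier:
systemctl is-enabled ollama docker
docker inspect -f '{{ .HostConfig.RestartPolicy.Name }}' open-webui
The first command should print enabled twice; the second should print unless-stopped.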
Troubleshooting
GPU not used: Update the NVIDIA driver, reboot, and confirm with nvidia-smi (Ubuntu), or update WSL and the Windows driver (Windows). Reinstall Ollama if needed. Use ollama ps or the server logs (journalctl -u ollama) to confirm the model is actually running on the GPU.
Out-of-VRAM: Use a smaller or more aggressively quantized model. Reduce max tokens or context length in Open WebUI settings. Close other GPU apps.
Port conflicts: If 11434 or 8080/3000 are already in use, change them. For example, use -p 3333:8080 for Open WebUI, or OLLAMA_HOST=127.0.0.1:11500 ollama serve for Ollama (see below for making that change persistent).
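On Ubuntu, the usual way to make an Ollama port change stick across reboots is a systemd drop-in (the port value is just an example):
sudo systemctl edit ollama
In the editor that opens, add:
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11500"
Then restart the service with sudo systemctl restart ollama, and update OLLAMA_BASE_URL in the Open WebUI container to match.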
Slow downloads: Use a reliable network, or pre-fetch models with ollama pull. You can also host a local model library if bandwidth is limited.
Clean up disk: Remove unused models and Docker images:
ollama list && ollama rm model:tag
docker image prune -f
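To see where the space is going first, check the model directory and Docker’s own usage (the du path assumes models under your home directory; the Ubuntu systemd install uses /usr/share/ollama/.ollama/models instead):
du -sh ~/.ollama/models
docker system df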
What’s next
Explore RAG in Open WebUI by attaching local documents and adding an embedding model, and experiment with function calling or tool integrations. With this setup, you have a fast, private, GPU-accelerated local AI workstation ready for coding, content, and research.
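As a starting point for the embedding side, you can pull a dedicated embedding model and exercise it through Ollama’s API (nomic-embed-text is one common choice; the payload follows the documented /api/embeddings shape):
ollama pull nomic-embed-text
curl http://127.0.0.1:11434/api/embeddings -d '{"model": "nomic-embed-text", "prompt": "local LLMs with GPU acceleration"}'
The response contains an embedding vector you can feed into your own RAG experiments.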