Overview: This step-by-step guide shows how to run local large language models with GPU acceleration using Ollama and Open WebUI. You will install Ollama to serve models like Llama 3 or Mistral, and deploy Open WebUI as an easy web interface. The tutorial covers Ubuntu 22.04/24.04 and Windows 11 via WSL2, with security, auto-start, and troubleshooting tips.
What you need: A 16 GB RAM system (32 GB recommended), an NVIDIA GPU with recent drivers (8 GB+ VRAM recommended), admin/root access, a stable internet connection, and around 20–30 GB of free disk space for models and containers.
Why this stack? Ollama provides a simple runtime and model manager for local LLMs, while Open WebUI gives a modern, browser-based chat interface, prompt management, RAG integrations, and multi-model switching. Both are lightweight and work on a single machine.
Step 1 — Prepare GPU drivers
Ubuntu (bare metal/VM with GPU passthrough):
sudo apt update && sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot
After reboot, verify:
nvidia-smi
If you see your GPU, the driver is good. Ollama downloads the CUDA user-space libs it needs automatically; you only need a working NVIDIA driver.
Windows 11 with WSL2:
Install the latest NVIDIA Game Ready/Studio driver (535+), then update WSL:
wsl --update
wsl --shutdown
Open your Ubuntu WSL distro and confirm you are inside WSL:
echo $WSL_DISTRO_NAME
Ollama will use the GPU via WSL automatically if the Windows driver supports it.
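To double-check GPU passthrough, recent Windows drivers also expose nvidia-smi inside the WSL2 distro:
nvidia-smi
If that fails, update the Windows driver and run wsl --update again.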
Step 2 — Install Docker (for Open WebUI)
Ubuntu:
sudo apt update && sudo apt install -y docker.io
sudo systemctl enable --now docker
sudo usermod -aG docker $USER
newgrp docker
The last command applies the new docker group in the current shell; alternatively, log out and back in.
Windows WSL2:
Install Docker Desktop for Windows and enable the WSL2 integration for your Ubuntu distro. Start Docker Desktop before running containers.
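Either way, confirm Docker works before moving on; hello-world is Docker's own test image:
docker run --rm hello-world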
Step 3 — Install Ollama
Ubuntu/WSL2:
curl -fsSL https://ollama.com/install.sh | sh
Enable the service on Ubuntu (bare metal):
sudo systemctl enable --now ollama
Test by pulling a small model:
ollama pull llama3.1:8b
Run a quick prompt:
ollama run llama3.1:8b "Write a haiku about GPUs."
To confirm the GPU is being used, run ollama ps while the model is loaded; the PROCESSOR column should report GPU rather than CPU. On Ubuntu you can also check the server logs (journalctl -u ollama) for CUDA messages. If the model falls back to CPU, revisit Step 1.
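It is also worth confirming that the API endpoint Open WebUI will talk to is reachable (11434 is Ollama's default port):
curl http://127.0.0.1:11434/api/tags
This returns a JSON list of the models you have pulled so far.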
Step 4 — Deploy Open WebUI
Ubuntu (recommended: host networking):
docker run -d --name open-webui --restart unless-stopped --network host \
-e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
-e WEBUI_AUTH=true \
-v open-webui:/app/backend/data \
ghcr.io/open-webui/open-webui:main
Open your browser at http://127.0.0.1:8080 (with host networking, the container’s internal port is exposed directly). The first account you create becomes the admin account, so register right away.
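If the page does not load, check that the container is running and inspect its logs:
docker ps --filter name=open-webui
docker logs open-webui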
WSL2 with Docker Desktop (no host networking):
docker run -d --name open-webui --restart unless-stopped -p 3000:8080 \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
-e WEBUI_AUTH=true \
-v open-webui:/app/backend/data \
ghcr.io/open-webui/open-webui:main
Browse to http://localhost:3000 and create the first account, which becomes the admin account. OLLAMA_BASE_URL points Open WebUI at the local Ollama API.
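If Open WebUI starts but lists no models, confirm Ollama is reachable from Windows; WSL2 forwards localhost ports by default, so this should work from PowerShell (assuming the default port):
curl.exe http://localhost:11434/api/tags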
Step 5 — Optimize and manage models
Popular models: llama3.1:8b, mistral-nemo:12b, qwen2.5:7b. You can also choose quantized variants like llama3.1:8b-instruct-q4_K_M for lower VRAM usage.
ollama list
ollama pull mistral-nemo:12b
ollama rm modelname:tag
To limit VRAM usage, use smaller models or more aggressively quantized builds. In Open WebUI, select the model per chat. For higher quality, try a 70B model on a GPU with more VRAM; to conserve memory, keep the context window small.
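If you want to pin a smaller context window or a system prompt, Ollama’s Modelfile mechanism covers that; a minimal sketch, assuming llama3.1:8b is already pulled (the derived name and num_ctx value are just examples). Save this as Modelfile:
FROM llama3.1:8b
PARAMETER num_ctx 4096
SYSTEM "You are a concise assistant."
Then build and use it like any other model:
ollama create llama3.1-short -f Modelfile
ollama run llama3.1-short "Hello"
The derived model also shows up in Open WebUI’s model picker.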
Step 6 — Secure and persist
Keep Open WebUI reachable only from localhost if the machine is shared. With host networking the container listens on all interfaces, so rely on the firewall: enabling ufw blocks incoming connections by default (if you administer the machine over SSH, run sudo ufw allow ssh first).
sudo ufw enable
sudo ufw status
Only open the web port if other machines on your LAN should reach the UI:
sudo ufw allow 8080/tcp comment 'Open WebUI (LAN)'
If you must expose it beyond your network, use a reverse proxy with TLS (Nginx, Caddy, or Traefik) and keep WEBUI_AUTH=true; a minimal Caddy example follows.
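The sketch below assumes a hypothetical domain ai.example.com pointing at this machine, Caddy installed from its official package, and Open WebUI on port 8080; Caddy obtains the TLS certificate automatically. Add this to /etc/caddy/Caddyfile:
ai.example.com {
    reverse_proxy 127.0.0.1:8080
}
Then reload with sudo systemctl reload caddy. With the proxy in place, you can keep port 8080 closed in ufw and only allow 80/443.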
The commands above mount a named Docker volume (open-webui) at /app/backend/data, so chats, accounts, and settings survive container upgrades. Ollama stores its models under ~/.ollama when run as your user, or under /usr/share/ollama/.ollama for the Ubuntu systemd service; models can always be re-pulled, but back up the Open WebUI data regularly.
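A simple way to do that, assuming the open-webui volume from this guide (the archive name is just an example), is to archive the volume from a throwaway Alpine container:
docker run --rm -v open-webui:/data -v "$PWD":/backup alpine tar czf /backup/open-webui-data.tar.gz -C /data .
Restoring is the same pattern with tar xzf instead of tar czf.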
Step 7 — Auto-start on boot
Ollama installs a systemd service on Ubuntu. Ensure Docker is enabled (already done) and containers use --restart unless-stopped so Open WebUI comes up after reboots. On Windows, set Docker Desktop to start on login.
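To verify the boot configuration in one pass, the following reads back the service state and the container restart policy set earlier:
systemctl is-enabled ollama docker
docker inspect -f '{{ .HostConfig.RestartPolicy.Name }}' open-webui
The first command should print enabled twice; the second should print unless-stopped.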
Troubleshooting
GPU not used: Update the NVIDIA driver, reboot, and confirm with nvidia-smi (Ubuntu), or update WSL and the Windows driver (Windows). Reinstall Ollama if needed. Use ollama ps or the server logs (journalctl -u ollama) to confirm the model is actually running on the GPU.
Out-of-VRAM: Use a smaller or more aggressively quantized model. Reduce max tokens or context length in Open WebUI settings. Close other GPU apps.
Port conflicts: If 11434 or 8080/3000 are already in use, change them. For example, use -p 3333:8080 for Open WebUI, or OLLAMA_HOST=127.0.0.1:11500 ollama serve for Ollama (see below for making that change persistent).
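On Ubuntu, the usual way to make an Ollama port change stick across reboots is a systemd drop-in (the port value is just an example):
sudo systemctl edit ollama
In the editor that opens, add:
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11500"
Then restart the service with sudo systemctl restart ollama, and update OLLAMA_BASE_URL in the Open WebUI container to match.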
Slow downloads: Use a reliable network, or pre-fetch models with ollama pull. You can also host a local model library if bandwidth is limited.
Clean up disk: Remove unused models and Docker images:
ollama list && ollama rm model:tag
docker image prune -f
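To see where the space is going first, check the model directory and Docker’s own usage (the du path assumes models under your home directory; the Ubuntu systemd install uses /usr/share/ollama/.ollama/models instead):
du -sh ~/.ollama/models
docker system df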
What’s next
Explore RAG in Open WebUI by attaching local documents and adding an embedding model, and experiment with function calling or tool integrations. With this setup, you have a fast, private, GPU-accelerated local AI workstation ready for coding, content, and research.
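As a starting point for the embedding side, you can pull a dedicated embedding model and exercise it through Ollama’s API (nomic-embed-text is one common choice; the payload follows the documented /api/embeddings shape):
ollama pull nomic-embed-text
curl http://127.0.0.1:11434/api/embeddings -d '{"model": "nomic-embed-text", "prompt": "local LLMs with GPU acceleration"}'
The response contains an embedding vector you can feed into your own RAG experiments.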