Overview
Running large language models locally is easier than ever with Ollama (a lightweight LLM runtime) and Open WebUI (a modern chat interface). This step-by-step guide shows how to install Ollama on Ubuntu 24.04 (or 22.04), enable NVIDIA GPU acceleration, and connect Open WebUI via Docker. By the end, you will have a fast, private, and browser-based AI chat system that can run models like Llama 3, Mistral, and Phi on your own hardware.
Prerequisites
• Ubuntu 24.04 LTS (or 22.04)
• An NVIDIA GPU with recent drivers (Turing/RTX or newer recommended)
• At least 16 GB RAM and 20+ GB free disk (models are large)
• sudo access and an internet connection
Step 1 — Update Ubuntu and install essentials
sudo apt update && sudo apt -y upgrade
sudo apt -y install curl ca-certificates gnupg git
Step 2 — Install the NVIDIA driver
If you do not already have the proprietary driver installed, use Ubuntu’s driver manager. Reboot after installation.
sudo ubuntu-drivers autoinstall
sudo reboot
After reboot, verify the driver is working with nvidia-smi. You should see your GPU listed and the driver version.
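If you prefer a compact, scriptable check instead of the full table, nvidia-smi can print just the essentials, for example:
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv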
Step 3 — Install Docker and the NVIDIA Container Toolkit
Install Docker (the docker.io and docker-compose-v2 packages from Ubuntu's repository are fine for this use case; on older releases you may need Docker's official repository for the Compose plugin), then add NVIDIA's container runtime so GPU workloads can run inside containers when needed.
Install Docker:
sudo apt -y install docker.io docker-compose-v2
sudo usermod -aG docker $USER
newgrp docker
Add NVIDIA Container Toolkit repo and install:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt -y install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Although Open WebUI does not need GPU access, having the NVIDIA runtime ensures any future GPU-enabled containers work smoothly.
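To confirm the toolkit is wired up, you can run nvidia-smi inside a throwaway CUDA container; the image tag below is just an example, any recent CUDA base image will do:
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi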
Step 4 — Install Ollama (GPU-enabled)
Ollama provides pre-optimized builds and automatically uses your NVIDIA GPU when available.
curl -fsSL https://ollama.com/install.sh | sh
Enable and start the service (on recent Ubuntu, the installer sets this up automatically):
sudo systemctl enable --now ollama
Verify installation:
ollama --version
Test pulling a model (downloads may be several GB):
ollama pull llama3.1:8b
Run a quick prompt (type /bye to leave the interactive session):
ollama run llama3.1:8b
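If you also want to confirm the HTTP API, which Open WebUI will talk to in the next step, a minimal generate request looks like this; the prompt is just an example:
curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b", "prompt": "Say hello in one short sentence.", "stream": false}'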
Step 5 — Deploy Open WebUI with Docker
Open WebUI offers a clean multi-user chat interface for local models. We will point it at the Ollama API on the host. Create a project folder, then a basic Compose file.
mkdir -p ~/open-webui && cd ~/open-webui
Create docker-compose.yml with the following content (one service):
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    extra_hosts:
      - "host.docker.internal:host-gateway"
    restart: unless-stopped
Start the container:
docker compose up -d
Open your browser and go to http://<your-server-ip>:3000. Create the first admin account when prompted. Open WebUI should auto-detect the Ollama endpoint via OLLAMA_BASE_URL. If not, set it under Settings → Connections.
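Two quick checks if the UI comes up but lists no models, both assuming the defaults used above:
docker compose logs -f open-webui        # watch startup logs for connection errors
curl -s http://localhost:11434/api/tags  # lists the models Ollama exposes; these should appear in the model picker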
Step 6 — Verify GPU acceleration
While generating a response in Open WebUI, run nvidia-smi in a terminal. You should see GPU utilization and memory usage spike. If usage is zero during generation, check driver versions and restart the Ollama service.
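For a live view during generation, leave this running in a second terminal:
watch -n 1 nvidia-smi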
Step 7 — Manage and optimize models
List installed models:
ollama list
Pull additional models:
ollama pull mistral:7b
ollama pull phi3:mini
Set a default model in Open WebUI under Settings → Models, or choose one per chat. For better performance on limited VRAM, prefer quantized variants (e.g., tags ending in q4_0 or q4_K_M where available). You can also tune how many requests Ollama serves in parallel via the OLLAMA_NUM_PARALLEL environment variable; because Ollama runs as a systemd service, set it with a service override (sudo systemctl edit ollama) rather than a plain export in your shell.
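The override created by sudo systemctl edit ollama is a small drop-in file. A minimal sketch, assuming two parallel requests suit your hardware (the second variable is optional and shown only as another common tunable):
[Service]
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=1"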
Then restart the service:
sudo systemctl restart ollama
Advanced users can build a custom Modelfile to add system prompts or LoRA adapters. Example skeleton:
FROM llama3.1:8b
SYSTEM You are a concise technical assistant.
Apply it with:
ollama create my-tech -f Modelfile
Then select my-tech in Open WebUI.
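A slightly fuller Modelfile can also pin sampling parameters; the values below are illustrative, not recommendations:
FROM llama3.1:8b
SYSTEM You are a concise technical assistant.
PARAMETER temperature 0.3
PARAMETER num_ctx 8192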
Troubleshooting
No GPU usage: Ensure nvidia-smi shows the driver. If Ollama still uses CPU, update to the latest Ollama, confirm you installed the proprietary NVIDIA driver, and reboot. Some headless servers require sudo apt -y install nvidia-driver-### and a kernel headers update.
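One quick way to see whether the service found the GPU is to scan its logs; the exact wording varies between Ollama versions, so treat this as a rough filter:
journalctl -u ollama -b --no-pager | grep -iE 'cuda|gpu'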
Docker cannot reach Ollama: The Compose file maps host.docker.internal to the Linux host via host-gateway. Note that Ollama listens only on 127.0.0.1 by default, so a container may still be refused; setting Environment="OLLAMA_HOST=0.0.0.0:11434" on the Ollama service (using the same systemctl edit procedure described under port conflicts below) usually fixes this. If your Docker version is old, update Docker or replace the endpoint with your host IP, for example http://192.168.1.50:11434.
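A couple of host-side checks, assuming the defaults from this guide:
ss -ltnp | grep 11434                        # shows which address Ollama is bound to
curl -s http://localhost:11434/api/version   # should return a small JSON object with the version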
Port conflicts (3000 or 11434): Change 3000:8080 in the Compose file to another host port (e.g., 8081:8080). For Ollama, change the listen port by editing its service environment and restarting: run sudo systemctl edit ollama, add Environment="OLLAMA_HOST=0.0.0.0:11435" under the [Service] section, save, then run sudo systemctl daemon-reload && sudo systemctl restart ollama.
Disk space: Models can exceed 10 GB each. Remove unused models with ollama rm modelname to reclaim space; there is no separate prune command, so review ollama list periodically and delete what you no longer use.
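To see where the space is going, assuming the default model store used by the systemd service (per-user installs keep models under ~/.ollama instead):
ollama list                                    # per-model sizes
sudo du -sh /usr/share/ollama/.ollama/models   # default store for the system service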
Security and maintenance tips
• Do not expose the Ollama API publicly without a reverse proxy and authentication.
• Use a firewall and restrict who can reach the web UI, for example sudo ufw allow from 192.168.1.0/24 to any port 3000 proto tcp to permit only your LAN while the port stays blocked externally.
• Keep components updated: sudo apt update && sudo apt -y upgrade for the OS, re-running the Ollama install script for Ollama (there is no ollama update subcommand), and docker compose pull && docker compose up -d for Open WebUI; a combined sketch follows this list.
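Putting those together, a routine update pass might look like this (paths match the steps above):
sudo apt update && sudo apt -y upgrade
curl -fsSL https://ollama.com/install.sh | sh    # re-running the installer upgrades Ollama in place
cd ~/open-webui && docker compose pull && docker compose up -d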
Conclusion
You now have a GPU-accelerated local LLM stack with Ollama and Open WebUI on Ubuntu. This setup is fast, private, and flexible: switch models on demand, create custom prompts, and iterate safely on your own hardware. With regular updates and a few optimizations, it can rival many hosted AI experiences without sending data to the cloud.