How to Self-Host a Local AI Chat with Ollama and Open WebUI on Ubuntu (GPU Ready)

Overview

In this step-by-step guide, you will learn how to self-host a local AI chat environment on Ubuntu using Ollama and Open WebUI. Ollama runs large language models (LLMs) locally and exposes a simple API, while Open WebUI provides a modern browser interface, chat history, and prompt management. This tutorial targets Ubuntu 22.04/24.04 and shows how to enable NVIDIA GPU acceleration, secure the service, and test the API.

Prerequisites

You need an Ubuntu 22.04/24.04 machine with at least 16 GB RAM for comfortable use and an NVIDIA GPU with recent drivers (525+ recommended) if you want hardware acceleration. For CPU-only usage, Ollama still works but will be slower. You also need a user with sudo privileges and internet access.

Step 1: Update the system and install basics

Start by refreshing your package lists and installing useful tools such as curl and ufw. Run: sudo apt update && sudo apt -y upgrade and then sudo apt -y install curl ca-certificates ufw. This ensures you have the latest security updates and a firewall ready to configure later.
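
For convenience, the commands for this step are collected below. They mirror the ones above; skip ufw if you do not plan to configure a firewall.

# Refresh package lists and apply pending updates
sudo apt update && sudo apt -y upgrade
# Install curl (needed for the Ollama installer), CA certificates, and the UFW firewall
sudo apt -y install curl ca-certificates ufw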

Step 2: Verify NVIDIA GPU (optional but recommended)

If you intend to use GPU acceleration, confirm your NVIDIA driver installation. Run nvidia-smi. If the command shows your GPU and driver version, you are ready. If not, install a recommended driver with sudo ubuntu-drivers autoinstall, reboot using sudo reboot, and check again with nvidia-smi. Ollama includes the runtime pieces it needs and will automatically use your GPU when supported.
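
A minimal check-and-install sequence, assuming the stock ubuntu-drivers tooling is present on your system:

# Show the detected GPU and driver version; if this prints a table, you are ready
nvidia-smi
# Otherwise, install the recommended driver and reboot
sudo ubuntu-drivers autoinstall
sudo reboot
# After the reboot, confirm the driver is loaded
nvidia-smi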

Step 3: Install Ollama

Ollama provides a one-line installer for Linux. Run: curl -fsSL https://ollama.com/install.sh | sh. This installs the ollama binary and sets up a systemd service called ollama. After the script completes, verify the service with systemctl status ollama. If it is not running, start it using sudo systemctl start ollama and enable it at boot with sudo systemctl enable ollama.
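
Putting the installation and verification together in one place:

# Download and run the official installer (installs the binary and a systemd unit)
curl -fsSL https://ollama.com/install.sh | sh
# Check the service; enable and start it in one step if it is not running
systemctl status ollama
sudo systemctl enable --now ollama
# Confirm the CLI is available
ollama --version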

Step 4: Pull your first model

Ollama hosts many popular models. To start, pull a reasonably fast, high-quality base model like Meta’s Llama 3.1: run ollama pull llama3.1 (the default tag fetches the 8B variant). Other good options include mistral and qwen2.5; if VRAM is limited, pick a smaller model or a more heavily quantized tag (for example, mistral:7b-instruct). Use ollama list to see installed models.
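
The commands below pull the default llama3.1 tag and list what is installed. Available tags and their download sizes change over time, so check the Ollama model library if a pull fails:

# Pull the default Llama 3.1 tag
ollama pull llama3.1
# Optionally pull a smaller alternative for limited VRAM
ollama pull mistral:7b-instruct
# List installed models and their sizes
ollama list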

Step 5: Test the local API

Ollama listens on http://127.0.0.1:11434 by default. You can chat in the terminal with ollama run llama3.1. To test via API, run: curl http://127.0.0.1:11434/api/generate -d '{"model":"llama3.1","prompt":"Say hello from a local model."}'. You should see a streamed JSON response. Press Ctrl+C to stop streaming if needed.
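
If you prefer a single JSON response instead of a stream while testing, the generate endpoint accepts a stream flag. This assumes the default port 11434 and the model pulled in Step 4:

# One-shot API call; "stream": false returns a single JSON object instead of a stream
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Say hello from a local model.",
  "stream": false
}'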

Step 6: Install Docker (for Open WebUI)

Open WebUI is easiest to deploy with Docker. Install Docker and its prerequisites. First run: sudo apt -y install apt-transport-https gnupg lsb-release. Then add Docker’s repository key: curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker.gpg. Add the repo: echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null. Install Docker: sudo apt update && sudo apt -y install docker-ce docker-ce-cli containerd.io. Optionally add your user to the Docker group: sudo usermod -aG docker $USER and re-login.
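
The same installation, condensed into a copy-paste block. It follows Docker’s standard Ubuntu repository setup; adjust the codename handling if your distribution differs:

# Prerequisites for adding the Docker repository
sudo apt -y install apt-transport-https gnupg lsb-release
# Add Docker's GPG key and apt repository
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Install the Docker engine and CLI
sudo apt update && sudo apt -y install docker-ce docker-ce-cli containerd.io
# Optional: run docker without sudo (log out and back in afterwards)
sudo usermod -aG docker $USER
docker --version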

Step 7: Deploy Open WebUI

Open WebUI connects to Ollama’s API and provides a rich chat interface. Run the container with a named volume for persistent data and point it at the local Ollama endpoint; recent Open WebUI releases read this from OLLAMA_BASE_URL (older releases used OLLAMA_API_BASE_URL). On Linux, host.docker.internal is not defined by default, so map it to the host gateway with --add-host: docker run -d --name open-webui -p 3000:8080 --add-host=host.docker.internal:host-gateway -e OLLAMA_BASE_URL=http://host.docker.internal:11434 -v openwebui-data:/app/backend/data --restart unless-stopped ghcr.io/open-webui/open-webui:latest. Alternatively, use host networking, in which case the container reaches Ollama directly on 127.0.0.1 and the UI is served on its default port 8080 instead of the -p mapping: docker run -d --name open-webui --network host -e OLLAMA_BASE_URL=http://127.0.0.1:11434 -v openwebui-data:/app/backend/data --restart unless-stopped ghcr.io/open-webui/open-webui:latest.
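
One caveat with the port-mapped variant: Ollama binds to 127.0.0.1 by default, so a container on the bridge network may not reach it even with host.docker.internal mapped. A common fix, sketched below as a systemd override, is to make Ollama listen on all interfaces via its OLLAMA_HOST setting; if you do this, keep the firewall from Step 8 in mind, because port 11434 then becomes reachable from your network:

# Make the ollama service listen on all interfaces (the default is 127.0.0.1 only)
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF
# Reload systemd and restart Ollama with the new binding
sudo systemctl daemon-reload
sudo systemctl restart ollama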

Step 8: Secure optional access and firewall

If you will only use the services locally, keep them bound to localhost and do not expose ports publicly. For remote access on a trusted LAN, allow Open WebUI’s port through UFW with sudo ufw allow 3000/tcp (or 8080/tcp if you used --network host with the default port). Enable the firewall with sudo ufw enable, then verify the rules with sudo ufw status. For public access, place Open WebUI behind a reverse proxy (Nginx, Caddy, or Traefik) with HTTPS and authentication.
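
A typical rule set for trusted-LAN access, assuming the 3000:8080 port mapping from Step 7 and that you administer the machine over SSH:

# Allow SSH first so enabling the firewall does not cut off your session
sudo ufw allow OpenSSH
# Allow the Open WebUI port (use 8080/tcp instead if you chose host networking)
sudo ufw allow 3000/tcp
# Turn the firewall on and review the rules
sudo ufw enable
sudo ufw status verbose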

Step 9: Use the interface

Open a browser to http://SERVER_IP:3000 (or http://localhost:3000; use port 8080 if you chose host networking). On first load, create an account; the first account registered becomes the administrator. From there you can choose your default model (e.g., llama3.1), manage prompts, and run chats. You can switch models per conversation and configure system prompts for specific tasks like coding, summarization, or Q&A.

Troubleshooting and tips

If a model fails to load because of GPU memory limits, pull a smaller model or a more heavily quantized variant (for example, a q4_K_M quant). If CPU usage is too high, reduce the context window or batch size in Open WebUI’s model settings. If the Open WebUI container cannot reach Ollama, double-check the OLLAMA_BASE_URL value, the networking mode, and whether the ollama service is running. For best performance on NVIDIA GPUs, close other GPU-heavy applications and monitor usage with nvidia-smi. To update Ollama, run sudo systemctl stop ollama && curl -fsSL https://ollama.com/install.sh | sh && sudo systemctl start ollama, and pull newer model versions as needed.
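
A rough update routine, assuming the container name and named volume from Step 7 (re-running docker run with the same -v flag preserves your chats and settings):

# Update Ollama: stop the service, re-run the installer, start it again
sudo systemctl stop ollama
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl start ollama
# Refresh a model to its latest published version
ollama pull llama3.1
# Update Open WebUI: pull the new image, remove the old container, then re-run the docker run command from Step 7
docker pull ghcr.io/open-webui/open-webui:latest
docker stop open-webui && docker rm open-webui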

What you have now

You have a fully local AI chat stack with GPU acceleration using Ollama and a clean, user-friendly interface via Open WebUI. It is private by default, fast on modern GPUs, and flexible with many model choices. You can integrate it with other tools via the Ollama API for scripting, automation, and offline workflows. This setup gives you control over costs, data privacy, and performance while staying current with the latest open models.
