Overview
This step-by-step guide shows you how to run local Large Language Models (LLMs) on Ubuntu using Ollama and Open WebUI. You will install Ollama, optionally enable NVIDIA GPU acceleration, and deploy Open WebUI in Docker to get a fast, friendly chat interface. By the end, you will have a private AI assistant running on your own hardware with secure access options and practical troubleshooting tips.
Prerequisites
Use Ubuntu 22.04 or 24.04 with at least 8 GB of RAM (16 GB recommended). For GPU acceleration, an NVIDIA GPU with 8 GB or more of VRAM is ideal. You need sudo access, and ports 11434 (Ollama) and 3000 (or whichever port you choose for Open WebUI) must be free. This guide covers both CPU-only and GPU setups, so you can start even without a supported GPU.
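A quick sanity check before you start; these are read-only commands, and the thresholds are the ones listed above:
  lsb_release -ds   # should report Ubuntu 22.04 or 24.04
  free -h           # at least 8 GB of RAM, 16 GB recommended
  df -h /           # models need several GB of free disk space
  nproc             # CPU cores available for CPU-only inference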
Step 1: (Optional) Install NVIDIA Drivers and CUDA
If you plan to use a GPU, first confirm your hardware with lspci | grep -i nvidia. Install the recommended driver via sudo ubuntu-drivers autoinstall, then reboot. After rebooting, verify the driver with nvidia-smi. If you will run Open WebUI (or any container) with GPU access in Docker, also install the NVIDIA Container Toolkit with sudo apt-get install -y nvidia-container-toolkit (you may need to add NVIDIA's container toolkit apt repository first), then configure Docker with sudo nvidia-ctk runtime configure --runtime=docker followed by sudo systemctl restart docker.
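The GPU setup from this step, condensed into a sketch; it assumes the NVIDIA container toolkit apt repository is already configured on your system:
  lspci | grep -i nvidia                      # confirm an NVIDIA GPU is present
  sudo ubuntu-drivers autoinstall             # install the recommended driver
  sudo reboot
  nvidia-smi                                  # after reboot: driver loaded, GPU visible
  sudo apt-get install -y nvidia-container-toolkit
  sudo nvidia-ctk runtime configure --runtime=docker
  sudo systemctl restart docker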
Step 2: Install Ollama on Ubuntu
Install Ollama with a single command: curl -fsSL https://ollama.com/install.sh | sh. This creates a system service and exposes the local API on http://127.0.0.1:11434. Check the version with ollama -v and verify the service using systemctl status ollama. If you need remote access on your LAN, set the host binding by creating an override file. Run sudo systemctl edit ollama, add [Service] and Environment="OLLAMA_HOST=0.0.0.0:11434", then save, sudo systemctl daemon-reload, and sudo systemctl restart ollama. Only expose Ollama on trusted networks or behind a reverse proxy with authentication.
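The install and the optional LAN binding collected in one sketch; skip the override unless you actually need remote access:
  curl -fsSL https://ollama.com/install.sh | sh   # installs Ollama and its systemd service
  ollama -v
  systemctl status ollama

  # Optional: listen on all interfaces instead of 127.0.0.1
  sudo systemctl edit ollama
  # In the override that opens, add:
  #   [Service]
  #   Environment="OLLAMA_HOST=0.0.0.0:11434"
  sudo systemctl restart ollama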
Step 3: Pull and Run Models with Ollama
Pull a small, fast model to test your setup. For general chat, use ollama pull llama3.2:3b. For coding tasks, try ollama pull qwen2.5-coder:7b or a quantized variant (for example a tag ending in q4_0) for lower memory usage. Run an interactive session with ollama run llama3.2 and type your prompt. To generate from the shell, try echo "Explain RAID levels simply" | ollama run llama3.2. Ollama uses the GPU automatically if supported; otherwise it falls back to CPU. Tune performance with environment variables such as OLLAMA_NUM_PARALLEL=1 to reduce memory pressure, and raise the context window via the num_ctx model parameter (set in a Modelfile, the interactive /set parameter command, or the API options) when your memory allows.
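Pulling and running the models mentioned above, plus two commands for checking what is installed and whether the GPU is being used:
  ollama pull llama3.2:3b          # small general-purpose chat model
  ollama pull qwen2.5-coder:7b     # coding-focused model
  ollama run llama3.2              # interactive chat; type /bye to exit
  echo "Explain RAID levels simply" | ollama run llama3.2
  ollama list                      # downloaded models and their sizes
  ollama ps                        # loaded models and CPU/GPU placement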
Step 4: Deploy Open WebUI with Docker
Open WebUI provides a clean web interface and multi-model support. If Docker is not installed, add it with sudo apt-get update && sudo apt-get install -y docker.io and ensure it runs at startup with sudo systemctl enable --now docker. Launch Open WebUI connected to Ollama using docker run -d --name open-webui -p 3000:8080 --add-host=host.docker.internal:host-gateway -e OLLAMA_BASE_URL=http://host.docker.internal:11434 -v open-webui:/app/backend/data --restart unless-stopped ghcr.io/open-webui/open-webui:latest. Note that localhost inside a container refers to the container itself, so the host-gateway alias is what lets Open WebUI reach Ollama running on the host. If Open WebUI runs on a different host from Ollama, set OLLAMA_BASE_URL to the Ollama server’s IP, for example http://192.168.1.50:11434, and make sure Ollama is bound to that interface as described in Step 2. GPU access inside the container (--gpus all plus the configured NVIDIA container toolkit) is only needed if you run the bundled-Ollama image or GPU-accelerated features such as local speech or embeddings; the standard image simply talks to your host Ollama over HTTP.
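The launch command from this step broken across lines for readability; the port mapping 3000:8080 and the volume name open-webui match the text above:
  sudo docker run -d \
    --name open-webui \
    -p 3000:8080 \
    --add-host=host.docker.internal:host-gateway \
    -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
    -v open-webui:/app/backend/data \
    --restart unless-stopped \
    ghcr.io/open-webui/open-webui:latest
  # Then browse to http://<server-ip>:3000 and create the first (admin) account.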
Step 5: Secure Access with a Reverse Proxy and HTTPS
If you plan to reach the interface over the internet, place Open WebUI behind a reverse proxy with TLS and authentication. A simple option is Caddy, which can obtain and renew certificates automatically. For example, you can point a domain to your server and configure Caddy to proxy yourdomain.com to localhost:3000 and enable basic auth. With Nginx, use an SSL server block, set proxy_pass http://127.0.0.1:3000, and enable rate limiting and headers like X-Frame-Options and Content-Security-Policy. Always avoid exposing the raw Ollama port unless you fully trust the network.
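A minimal Caddyfile sketch for the Caddy option; chat.example.com is a placeholder domain, the password hash must be generated with caddy hash-password, and on Caddy releases older than 2.8 the directive is spelled basicauth:
  # /etc/caddy/Caddyfile
  chat.example.com {
      # Optional extra layer in front of Open WebUI's own login
      basic_auth {
          admin <bcrypt-hash-from-caddy-hash-password>
      }
      reverse_proxy localhost:3000
  }
Reload with sudo systemctl reload caddy; Caddy then obtains and renews the TLS certificate for the domain automatically.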
Step 6: Updates, Backups, and Autostart
Update Ollama by rerunning the installer or using your package manager if you installed via a repo. To update Open WebUI, pull the latest image with docker pull ghcr.io/open-webui/open-webui:latest, then remove and recreate the container (a plain restart keeps running the old image). Persist your data by backing up the Ollama model directory (by default /usr/share/ollama/.ollama for the systemd service) and the Docker volume open-webui. Both the Ollama service and the container start automatically on boot; confirm with systemctl is-enabled ollama and the container’s --restart unless-stopped policy.
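An update-and-backup sketch, assuming the container was started as in Step 4 and models live in the default /usr/share/ollama/.ollama directory:
  # Update Ollama in place
  curl -fsSL https://ollama.com/install.sh | sh

  # Update Open WebUI: pull, remove, recreate
  sudo docker pull ghcr.io/open-webui/open-webui:latest
  sudo docker stop open-webui && sudo docker rm open-webui
  # ...then repeat the docker run command from Step 4

  # Back up models and Open WebUI data
  sudo tar czf ollama-models-backup.tar.gz /usr/share/ollama/.ollama
  sudo docker run --rm -v open-webui:/data -v "$PWD":/backup alpine \
    tar czf /backup/open-webui-backup.tar.gz -C /data .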
API Quick Test
You can call Ollama’s local API directly. After pulling a model, try curl http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":"Give me three bullet points about containers"}'. This is useful for integrating local LLMs into scripts, chatbots, or development tools without sending data to third parties.
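The same request with streaming disabled so the reply arrives as a single JSON object, plus the chat-style endpoint most integrations use:
  curl http://localhost:11434/api/generate -d '{
    "model": "llama3.2",
    "prompt": "Give me three bullet points about containers",
    "stream": false
  }'

  curl http://localhost:11434/api/chat -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "What is a container image?"}],
    "stream": false
  }'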
Troubleshooting
If model loading fails with “no space left on device,” check disk usage with df -h, then free space by removing unused Docker images with docker system prune -a or deleting models you no longer need with ollama rm (they live under /usr/share/ollama/.ollama by default). If nvidia-smi returns an error, reinstall the driver and ensure Secure Boot is either disabled or configured with signed modules. If port 11434 or 3000 is already in use, change the binding (for example OLLAMA_HOST=0.0.0.0:11435) or stop the conflicting process. On low-memory hosts, choose smaller or more heavily quantized models (for example q4_0 variants), reduce parallel requests with OLLAMA_NUM_PARALLEL=1, and close other memory-hungry services.
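A few diagnostics that match the issues above; replace <model> with a name from ollama list:
  df -h                                   # which filesystem is actually full?
  sudo docker system prune -a             # reclaim space from unused images
  ollama list
  ollama rm <model>                       # remove a model you no longer need
  sudo ss -ltnp | grep -E '11434|3000'    # what is holding the ports?
  journalctl -u ollama -e                 # recent Ollama service logs
  sudo docker logs --tail 100 open-webui  # recent Open WebUI logs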
What You Achieved
You now have a private, production-ready local AI stack on Ubuntu. Ollama runs the model backend with optional GPU acceleration, while Open WebUI delivers a modern chat interface. With a reverse proxy and backups in place, you can confidently use local LLMs for coding assistance, content drafting, documentation, and experimentation without sending your data to the cloud.