Overview
This tutorial shows how to run local large language models (LLMs) on Ubuntu 22.04/24.04 using Ollama and Open WebUI. You will enable optional NVIDIA GPU acceleration, pull models, expose an API, and add a clean web interface. The steps are simple, secure by default, and work well on laptops, workstations, or lab servers.
Prerequisites
- Ubuntu 22.04 or 24.04 with at least 8 GB RAM (16 GB+ recommended for larger models).
- Optional: An NVIDIA GPU with recent drivers (550+ recommended) for faster inference.
- A user with sudo privileges and basic terminal skills.
Step 1 — (Optional) Install NVIDIA drivers for GPU acceleration
If you have an NVIDIA GPU, install the proprietary driver. On Ubuntu, the graphics-drivers PPA is usually unnecessary; the drivers in the standard repositories work well.
sudo ubuntu-drivers autoinstall
sudo reboot
After reboot, verify the driver and CUDA runtime:
nvidia-smi
If you see your GPU listed without errors, you are ready for GPU-backed inference in Ollama. If not, reinstall the driver or check Secure Boot status (it can block kernel modules).
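To check whether Secure Boot is enabled, you can query the firmware state with mokutil (install it with sudo apt-get install mokutil if it is missing):
mokutil --sb-state
If it reports SecureBoot enabled, either disable Secure Boot in your firmware settings or enroll a Machine Owner Key (MOK) when the driver installer prompts you.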
Step 2 — Install Ollama
Ollama is a lightweight runtime and API server for local LLMs. Install it with the official script:
curl -fsSL https://ollama.com/install.sh | sh
Once installed, start and enable the service:
sudo systemctl enable --now ollama
sudo systemctl status ollama
By default, Ollama listens on 127.0.0.1:11434, so the API is reachable only from the local machine, which is a sensible security default. You can verify the API is up:
curl http://127.0.0.1:11434/api/tags
Ollama detects and uses an NVIDIA GPU automatically when the driver is installed; no extra environment variable is needed. To confirm GPU support was picked up, check the service logs for GPU-related messages:
sudo journalctl -u ollama --no-pager | grep -i gpu
After you load a model in the next step, ollama ps will also show whether it is running on the GPU or the CPU.
Step 3 — Pull a model and run your first prompt
Choose a model based on your hardware. Smaller, quantized models run on most CPUs; larger ones benefit from GPUs. Popular starters: llama3.1, mistral, qwen2, phi3.
ollama pull llama3.1
Send a quick prompt:
ollama run llama3.1 "List three ways to optimize Python code."
You can also use the HTTP API. The example below streams tokens:
curl http://127.0.0.1:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Explain the difference between concurrency and parallelism.",
"stream": true
}'
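For multi-turn conversations there is also a chat endpoint that takes a list of role-tagged messages; setting "stream": false returns a single JSON object instead of a token stream:
curl http://127.0.0.1:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize what quantization does to an LLM."}
  ],
  "stream": false
}'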
Manage models as you experiment:
ollama list
ollama show llama3.1
ollama rm <model-name>
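You can also bake a system prompt and default parameters into your own variant with a Modelfile. A minimal sketch; the model name coder-helper and the parameter values are only examples:
cat > Modelfile <<'EOF'
FROM llama3.1
SYSTEM You are a terse assistant that answers with short, practical Linux commands.
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
EOF
ollama create coder-helper -f Modelfile
ollama run coder-helper "How do I find the largest files in /var?"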
Step 4 — Install Open WebUI (friendly web interface)
Open WebUI provides a polished browser interface on top of Ollama. The easiest method is Docker. Install Docker if you don’t have it:
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo $VERSION_CODENAME) stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
sudo usermod -aG docker $USER
newgrp docker
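Confirm Docker runs without sudo before continuing (this pulls a tiny test image and removes the container afterwards):
docker run --rm hello-world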
Run Open WebUI and point it at your local Ollama. On Linux, add a host-gateway entry so the container can resolve host.docker.internal. One caveat: the container reaches the host over the Docker bridge, not loopback, so Ollama must listen on an address the bridge can reach; the simplest fix is to set OLLAMA_HOST as shown in the LAN section below and firewall the port. The named volume keeps chats and settings across container upgrades:
docker run -d --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:latest
Open your browser and visit http://localhost:3000 to chat with your models, create prompts, and manage settings. The first login will create an admin account.
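When a new Open WebUI release ships, upgrading is a pull and a recreate; the named volume preserves your chats and settings:
docker pull ghcr.io/open-webui/open-webui:latest
docker stop open-webui
docker rm open-webui
Then re-run the docker run command above to start the updated version.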
Optional — LAN access and security
If you want other devices on your network to use your Ollama API, bind it to all interfaces. This exposes an unauthenticated API, so protect it with a firewall or a reverse proxy with auth.
The Ollama service takes its settings from systemd, so set OLLAMA_HOST with a drop-in override (equivalent to running sudo systemctl edit ollama):
sudo mkdir -p /etc/systemd/system/ollama.service.d
printf '[Service]\nEnvironment="OLLAMA_HOST=0.0.0.0:11434"\n' | sudo tee /etc/systemd/system/ollama.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp
For production, put Nginx/Caddy in front with HTTPS and basic auth or OIDC. Keep models and data on an encrypted disk where possible.
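As a sketch of that pattern, a minimal Caddyfile could terminate HTTPS and add basic auth in front of the API; llm.example.com is a placeholder domain you control, and the bcrypt hash comes from caddy hash-password:
llm.example.com {
    basicauth {
        admin <bcrypt-hash-from-caddy-hash-password>
    }
    reverse_proxy 127.0.0.1:11434
}
The same block with reverse_proxy 127.0.0.1:3000 works for Open WebUI; if the proxy is the only entry point, you can keep Ollama bound to 127.0.0.1.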
Troubleshooting tips
- GPU not used: Run nvidia-smi to confirm the driver is loaded, check the Ollama logs (sudo journalctl -u ollama) for GPU detection messages, and run ollama ps to see whether a loaded model reports the GPU. Ensure the ollama service user can access the /dev/nvidia* devices. Disable Secure Boot or enroll a MOK if the kernel modules won’t load.
- Out-of-memory errors: Pull a smaller or more heavily quantized model (e.g., Q4_K_M). Close apps to free VRAM/RAM.
- Slow responses: Use a faster model family (Mistral/Qwen variants), reduce the context length, or enable the GPU. On CPU-only systems, lower the thread count if the machine thermally throttles; both settings can be passed per request, as shown after this list.
- API connectivity from Docker: Use --add-host=host.docker.internal:host-gateway and make sure Ollama listens on an address the container can reach; a server bound only to 127.0.0.1 is not reachable through the bridge gateway. If you run Ollama in a container as well, connect both containers to a user-defined Docker network.
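For the context-length and thread hints above, both num_ctx and num_thread can be passed per request via the API's options field; the values here are only examples to tune for your hardware:
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Give one tip for writing readable Bash scripts.",
  "stream": false,
  "options": { "num_ctx": 2048, "num_thread": 4 }
}'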
What to try next
- Add function calling or RAG: Pair Ollama with a local vector database (e.g., Chroma) and a small document loader (LlamaIndex/LangChain).
- Schedule model updates: run ollama pull <model> from cron (see the example after this list).
- Multi-user setup: Run Open WebUI behind a reverse proxy with SSO and per-user workspaces.
- Automate with Ansible: Create a role to install drivers, Ollama, Docker, and Open WebUI in one run.
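For the cron idea above, a root crontab entry (sudo crontab -e) could refresh a model weekly; the schedule, model, and log path are only examples, and you may need to adjust the binary path if which ollama reports a different location:
0 3 * * 1 /usr/local/bin/ollama pull llama3.1 >> /var/log/ollama-pull.log 2>&1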
Wrap-up
You now have a complete local AI stack: Ollama for fast, private LLM inference and Open WebUI for a clean chat experience. With GPU acceleration, a few careful security steps, and the built-in API, this setup is powerful for prototyping, offline work, and privacy-first deployments.