Overview
Running large language models locally is easier than ever. In this guide, you will deploy a private, GPU-accelerated AI stack with Ollama and Open WebUI on Ubuntu 24.04 using Docker. Ollama handles model downloads and inference, while Open WebUI provides a friendly chat interface, prompt library, RAG features, and multi-user management. By the end, you will have a fast, on-prem AI assistant accessible via a browser, without sending data to third parties.
Prerequisites
System: Ubuntu 24.04 (Noble), an NVIDIA GPU (e.g., RTX 3060+), NVIDIA driver 535+ (or latest), at least 16 GB RAM, and reliable internet. You should have sudo access. This tutorial uses Docker; no prior Kubernetes skills required.
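Before you begin, a quick sanity check of the OS release, the GPU, and available memory is worthwhile (the driver itself is installed in Step 1, so skip nvidia-smi for now):
lsb_release -d                  # should report Ubuntu 24.04 (Noble)
lspci | grep -i nvidia          # the NVIDIA GPU should be listed
free -h                         # confirm at least 16 GB of RAM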
Step 1 — Install NVIDIA Driver and Reboot
If you have not installed the proprietary driver, do it now and reboot:
sudo ubuntu-drivers autoinstall
sudo reboot
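After the reboot, confirm the driver loaded and the GPU is visible:
nvidia-smi                      # should list your GPU and the driver version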
Step 2 — Install Docker Engine
Add Docker’s official repository and install the engine:
sudo apt update
sudo apt install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu noble stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
Optional: allow your user to run Docker without sudo and start a new shell:
sudo usermod -aG docker $USER
newgrp docker
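A quick smoke test confirms the engine and the Compose plugin are working:
docker --version
docker compose version
docker run --rm hello-world     # pulls a tiny image and prints a success message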
Step 3 — Enable GPU inside Containers
Install NVIDIA Container Toolkit so Docker can use your GPU:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify GPU access from Docker:
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
Step 4 — Create Network and Volumes
Create an isolated Docker network and persistent volumes for data:
docker network create ai
docker volume create ollama
docker volume create openwebui
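You can verify that the network and volumes exist before moving on:
docker network ls --filter name=ai
docker volume ls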
Step 5 — Run Ollama (GPU-accelerated)
Start the Ollama server and keep it local-only on port 11434. The volume stores models and caches:
docker run -d --name ollama --gpus all --restart unless-stopped --network ai -p 127.0.0.1:11434:11434 -v ollama:/root/.ollama ollama/ollama:latest
Pull a starter model (choose one that fits your GPU memory):
docker exec -it ollama ollama pull llama3.1:8b
docker exec -it ollama ollama pull qwen2.5:7b-instruct
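Once the pulls finish, confirm the models are available and the server responds:
docker exec -it ollama ollama list          # lists downloaded models
curl http://127.0.0.1:11434/api/tags        # same information via the local API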
Step 6 — Run Open WebUI
Open WebUI will use Ollama via the internal Docker network. Expose the web interface on port 3000:
docker run -d --name open-webui --restart unless-stopped --network ai -e OLLAMA_BASE_URL=http://ollama:11434 -p 3000:8080 -v openwebui:/app/backend/data ghcr.io/open-webui/open-webui:latest
Open a browser and visit http://SERVER_IP:3000/. Create the first admin user. In the model selector, choose the model you pulled (e.g., llama3.1:8b) and start chatting.
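If the page does not load, check that the container is running and inspect its logs:
docker ps --filter name=open-webui
docker logs --tail 50 open-webui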
Step 7 — Test the API
You can also use the local API directly. From the host:
curl http://127.0.0.1:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"Say hello in one sentence."}'
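The generate endpoint streams JSON objects by default. If you prefer a single response object, disable streaming; the sketch below also pipes the result through jq (install it with sudo apt install -y jq) to print only the generated text:
curl -s http://127.0.0.1:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"Say hello in one sentence.","stream":false}' | jq -r '.response'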
Developers can point tools like LangChain or LlamaIndex at http://127.0.0.1:11434 for private inference.
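Recent Ollama releases also expose an OpenAI-compatible API under /v1, so many SDKs work by simply changing their base URL. A quick check against the chat completions endpoint, assuming the llama3.1:8b model pulled earlier:
curl http://127.0.0.1:11434/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"llama3.1:8b","messages":[{"role":"user","content":"Say hello in one sentence."}]}'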
Optimization Tips
If you see out-of-memory errors, switch to smaller or more aggressively quantized models (e.g., q4_K_M) when pulling: ollama pull llama3.1:8b-instruct-q4_K_M. Keep prompts concise and lower the context length in Open WebUI settings. On multi-GPU systems, Ollama auto-detects devices; if needed, restrict which GPUs it uses with CUDA_VISIBLE_DEVICES (passed to the container with -e) and limit how many layers are offloaded with the num_gpu model option. Always use persistent volumes to avoid re-downloading models after updates.
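To see which models are currently loaded and how much GPU memory is in use during a session, run these on the host (ollama ps is available in recent Ollama versions):
docker exec -it ollama ollama ps     # loaded models and whether they run on GPU or CPU
nvidia-smi                           # overall VRAM usage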
Security Hardening
Because Ollama is bound to 127.0.0.1, its API is not reachable from other machines. Expose Open WebUI only to trusted networks. If you must publish it on the internet, put it behind a reverse proxy with TLS and authentication. For example, with UFW, allow only your LAN:
sudo ufw allow from 192.168.0.0/16 to any port 3000 proto tcp
For Nginx or Caddy, enable HTTPS and basic auth or OIDC. Never expose port 11434 directly without protection.
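If Open WebUI will only ever be reached through a reverse proxy running on the same host, you can bind it to loopback as well so nothing is published on the LAN. A sketch that re-creates the container from Step 6 with a loopback bind (your proxy then forwards to 127.0.0.1:3000):
docker rm -f open-webui
docker run -d --name open-webui --restart unless-stopped --network ai -e OLLAMA_BASE_URL=http://ollama:11434 -p 127.0.0.1:3000:8080 -v openwebui:/app/backend/data ghcr.io/open-webui/open-webui:latest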
Troubleshooting
GPU not detected in containers: Re-run sudo nvidia-ctk runtime configure --runtime=docker, then sudo systemctl restart docker. Confirm host drivers with nvidia-smi and container access with the CUDA test image.
Permission errors with Docker: Add your user to the docker group (sudo usermod -aG docker $USER) and re-login.
Slow responses or model crashes: Try a smaller/quantized model, reduce context window, and verify VRAM usage. Ensure swap is enabled for stability when RAM is tight.
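If the host has no swap yet, a swap file is quick to add; a minimal sketch assuming an 8 GB file suits your disk (adjust the size as needed):
swapon --show                        # empty output means no swap is active
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # persist across reboots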
Update containers: pull the new images, then remove and re-create the containers (a plain docker restart would keep running the old image); the named volumes preserve your models and chat data. See the sequence below.
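A minimal update sequence:
docker pull ollama/ollama:latest
docker pull ghcr.io/open-webui/open-webui:latest
docker rm -f ollama open-webui
Then re-run the docker run commands from Steps 5 and 6; models and chat history persist in the named volumes.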
Cleanup (Optional)
To stop and remove everything:
docker rm -f open-webui ollama
docker volume rm openwebui ollama
docker network rm ai
What You Achieved
You now have a modern, private AI stack with GPU acceleration on Ubuntu 24.04. Ollama simplifies model management and inference, while Open WebUI offers a polished interface ready for daily use, prototyping, and team collaboration. With careful model choice, proper security, and regular updates, this setup can replace many cloud-based assistants—keeping your data on your hardware.