Overview
Want a private, fast, and customizable AI chatbot without sending your data to the cloud? In this guide you will deploy Ollama (which runs large language models locally) together with Open WebUI (a modern chat interface) using Docker. The setup works on Linux, Windows, and macOS, and can use your NVIDIA GPU for acceleration. You will get a production‑style layout with data volumes, secure defaults, update steps, and troubleshooting tips.
What You Will Need
- A machine with at least 8 GB RAM (16 GB+ recommended for larger models). CPU‑only works; GPU is optional.
- Docker Engine (Linux) or Docker Desktop (Windows/macOS). Ensure Docker Compose is available (Docker Desktop includes it).
- Optional GPU acceleration: NVIDIA GPU, recent NVIDIA drivers, and NVIDIA Container Toolkit on Linux; on Windows, Docker Desktop with WSL2 backend and CUDA‑capable drivers.
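A quick way to confirm the basics are in place before you start (the nvidia-smi check only matters if you plan to use the optional GPU path):

docker --version
docker compose version
# Only needed if you intend to enable GPU acceleration later:
nvidia-smi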
Step 1 — Create the Docker Compose file
Create a working folder (for example, ai-stack) and add a file named docker-compose.yml with the following baseline (CPU‑only, safe defaults that bind to localhost):
version: "3.9"
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
volumes:
- ollama:/root/.ollama
ports:
- "127.0.0.1:11434:11434"
open-webui:
image: ghcr.io/open-webui/open-webui:latest
container_name: open-webui
restart: unless-stopped
environment:
- OLLAMA_API_BASE=http://ollama:11434
depends_on:
- ollama
volumes:
- open-webui:/app/backend/data
ports:
- "127.0.0.1:3000:8080"
volumes:
ollama:
open-webui:
Binding to 127.0.0.1 keeps services private on the host. You can later expose them behind a reverse proxy with HTTPS if you need remote access.
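If you want to sanity-check the file before starting anything, Docker Compose can validate it for you; this is optional, but it catches indentation mistakes early:

docker compose config --quiet && echo "compose file OK"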
Step 2 — Start the stack
From the folder with your compose file, run:
docker compose pull
docker compose up -d
Wait a few seconds for containers to initialize. You can watch logs with docker compose logs -f.
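To confirm both containers are up and Ollama is answering on its port, a quick check like the following should do (the exact output can vary by version; the root endpoint normally replies with "Ollama is running"):

docker compose ps
curl http://127.0.0.1:11434/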
Step 3 — Download your first model
Ollama manages models on demand. Pull a small model to test quickly (Llama 3.2 3B is a good start):
docker exec -it ollama ollama pull llama3.2:3b
You can list models later with docker exec -it ollama ollama list. For better quality, try llama3.1:8b or a reasoning model when your hardware allows it.
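Once the pull finishes, you can smoke-test the model from the command line or through Ollama's REST API; the prompt below is just an example:

# One-off prompt through the CLI inside the container
docker exec -it ollama ollama run llama3.2:3b "Explain Docker volumes in one sentence."

# The same model through the HTTP API published on localhost
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Explain Docker volumes in one sentence.",
  "stream": false
}'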
Step 4 — Open the chat UI
Visit http://localhost:3000. In the Open WebUI interface, choose the model you pulled (e.g., llama3.2:3b) and start chatting. Responses run entirely on your machine through Ollama at http://localhost:11434.
Optional: Enable NVIDIA GPU acceleration
GPU support can dramatically speed up responses. Ensure your system is ready first:
- Linux: Install the proprietary NVIDIA driver and the NVIDIA Container Toolkit (nvidia-container-toolkit). Verify nvidia-smi works on the host.
- Windows: Install NVIDIA drivers with CUDA, enable WSL2 and GPU support in Docker Desktop, and ensure WSL2 integration is turned on for your Linux distro.
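Before touching the compose file, it is worth confirming that Docker itself can see the GPU. This follows the NVIDIA Container Toolkit's standard sample workload and assumes the toolkit is installed and Docker has been restarted:

# The plain ubuntu image is enough; the toolkit injects nvidia-smi and the driver libraries
docker run --rm --gpus all ubuntu nvidia-smi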
Then, choose one of the following methods for the ollama service:
A) Compose with GPU (the gpus service attribute is supported in recent Docker Compose v2 releases; older releases need the deploy.resources.reservations.devices syntax from the Compose specification instead):
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    volumes:
      - ollama:/root/.ollama
    ports:
      - "127.0.0.1:11434:11434"
    gpus: all
B) Run Ollama with a direct docker run command (this replaces the compose-managed service; to keep models you already pulled, use the volume name Compose created, shown by docker volume ls, typically <folder>_ollama, in the -v flag):
docker stop ollama && docker rm ollama
docker run -d --name ollama --gpus all \
-p 127.0.0.1:11434:11434 \
-v ollama:/root/.ollama \
ollama/ollama:latest
After enabling GPU, restart the stack; you do not need to re-pull models, as Ollama offloads model layers to the GPU automatically the next time a model is loaded. Use docker logs ollama -f to confirm CUDA is detected and the GPU is being used.
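To double-check that the container sees the GPU and that a loaded model is actually offloaded to it, one approach is the sketch below; after the model has handled a prompt, the PROCESSOR column of ollama ps should report GPU usage:

docker exec -it ollama nvidia-smi
docker exec -it ollama ollama ps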
Security Hardening (Recommended)
- Keep services bound to localhost as shown. For remote access, place a reverse proxy (Caddy, Nginx, Traefik) in front with HTTPS and authentication.
- In Open WebUI, create an admin account first and limit signups from Settings. You can also run it behind SSO or a VPN.
- Do not expose port 11434 publicly; Ollama has no built‑in auth. If you must, secure the path via a proxy and firewall rules.
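One simple check that nothing is exposed beyond the host is to look at which addresses the published ports are bound to; on Linux, ss (or netstat) shows this, and both lines should list 127.0.0.1 rather than 0.0.0.0:

ss -tlnp | grep -E ':(3000|11434)'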
Updating and Backups
To update to the latest images:
docker compose pull
docker compose up -d
Your models (Ollama) and chat data (Open WebUI) live in the Docker volumes declared in the compose file. Note that Compose prefixes volume names with the project (folder) name, so they may appear as, for example, ai-stack_ollama and ai-stack_open-webui; run docker volume ls to confirm and substitute the exact names in the commands below. Back them up with:
docker run --rm -v ollama:/data -v $(pwd):/backup busybox tar czf /backup/ollama-vol.tgz -C / data
docker run --rm -v open-webui:/data -v $(pwd):/backup busybox tar czf /backup/open-webui-vol.tgz -C / data
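Restoring works the same way in reverse. The sketch below assumes the stack is stopped and that the volume names match your setup (again, adjust them to what docker volume ls reports):

docker compose down
docker run --rm -v ollama:/data -v $(pwd):/backup busybox tar xzf /backup/ollama-vol.tgz -C /
docker run --rm -v open-webui:/data -v $(pwd):/backup busybox tar xzf /backup/open-webui-vol.tgz -C /
docker compose up -d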
Troubleshooting
- Port already in use: Change the left side of the port mapping (for example, 127.0.0.1:3001:8080) or stop the conflicting service.
- Slow or out‑of‑memory on big models: Choose a smaller model (3B–8B). On GPU, ensure sufficient VRAM; quantized variants (e.g., Q4_K_M) reduce memory needs.
- GPU not detected: Confirm nvidia-smi works on the host, restart Docker, and verify you used gpus: all or --gpus all. On Windows, ensure WSL2 integration is enabled in Docker Desktop.
- Open WebUI cannot reach Ollama: Check that OLLAMA_BASE_URL is set to http://ollama:11434 in the compose file and that both services share the same default network (they do by default).
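When something misbehaves, a few generic commands usually narrow it down quickly; these are standard Docker tooling rather than anything specific to this stack:

docker compose ps                    # are both containers running?
docker compose logs -f ollama        # model loading and GPU/CUDA messages
docker compose logs -f open-webui    # UI startup and connection errors
docker exec -it ollama ollama list   # which models are actually installed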
Remove Everything (Optional)
To stop and remove containers but keep volumes: docker compose down.
To also delete all data volumes, including models and chat history (irreversible): docker compose down -v.
What You Get
You now have a private AI chatbot that runs fully on your machine, with a clean Docker layout, optional GPU acceleration, and safe defaults. Expand by adding more models (e.g., CodeLlama for coding, Phi‑3 for low‑resource devices), enabling RAG with document uploads in Open WebUI, or placing the stack behind a reverse proxy for secure remote access. This approach keeps your data local, reduces latency, and gives you full control over updates and performance.