Local large language models are now practical on a single PC. In this tutorial, you will deploy Ollama (model runtime) and Open WebUI (a friendly chat interface) using Docker on Windows or Linux. We will enable NVIDIA GPU acceleration, persist models on disk, and cover secure access and troubleshooting. By the end, you will be chatting with a local LLM like llama3.1 in your browser, no cloud required.
What You Will Need
- A 64-bit PC with at least 16 GB RAM. For GPU acceleration, an NVIDIA GPU with 8 GB+ VRAM is recommended.
- Docker Engine or Docker Desktop (Compose v2 included).
- Free disk space (15–30 GB per model is common).
- Optional but recommended: an up-to-date NVIDIA driver plus the NVIDIA Container Toolkit so containers can use the GPU (both are installed in Step 1). A quick way to check these requirements on Linux follows this list.
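A few standard commands cover the checklist on a Linux host (the GPU check only works once the NVIDIA driver is installed):
free -h        # total RAM
df -h /        # free disk space on the root filesystem
nvidia-smi     # driver version and VRAM, if an NVIDIA GPU and driver are present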
Step 1: Prepare Your System (GPU Optional)
Linux (Ubuntu/Debian)
1) Install Docker Engine and the Compose plugin from the official Docker repo.
2) Install NVIDIA GPU drivers from your distro or NVIDIA site.
3) Install the NVIDIA Container Toolkit:
sudo apt-get install -y nvidia-container-toolkit
Then configure the Docker runtime and restart Docker:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify GPU visibility in containers:
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
Windows 10/11
1) Install the latest NVIDIA GPU driver (Studio or Game Ready).
2) Install Docker Desktop and enable WSL 2 backend during setup.
3) In Docker Desktop > Settings > Resources > WSL integration, enable your default distro.
4) Ensure the GPU is exposed to containers by running the test container:
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
If the output lists your GPU, you are ready.
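If that test fails, a useful intermediate check from PowerShell is to confirm the driver is visible inside WSL 2 at all (this assumes your default WSL distro is a GPU-enabled Linux distro such as Ubuntu):
wsl --status     # confirms WSL 2 is the default version
wsl nvidia-smi   # should list your GPU from inside the default distro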
Step 2: Create a Docker Compose File
We will run two containers: ollama (the LLM runtime API) and open-webui (the web front-end). The services will share a network and persistent volumes. Create a folder like ollama-openwebui and a file compose.yaml with the following content:
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=24h
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: ["gpu"]
  openwebui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    restart: unless-stopped
    depends_on:
      - ollama
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=True
      - DEFAULT_MODELS=llama3.1:8b
    volumes:
      - openwebui_data:/app/backend/data
volumes:
  ollama_data:
  openwebui_data:
Notes:
- The deploy.resources.reservations.devices section tells Compose to request a GPU for the ollama container. If you start containers manually with docker run on Linux, pass --gpus all instead.
- Ports: Ollama API is 11434, Open WebUI is exposed on 3000 (mapped to container 8080).
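Before starting anything, it is worth asking Compose to parse and print the merged configuration; this catches YAML indentation mistakes early:
docker compose config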
Step 3: Start the Stack
In the folder containing compose.yaml, run:
docker compose up -d
Wait for both containers to start. You can watch the logs with:
docker compose logs -f
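Once both containers report as running, a quick sanity check against the published ports (the responses below assume the default mappings from the compose file above):
curl http://localhost:11434/
# the Ollama server answers with "Ollama is running"
curl http://localhost:11434/api/tags
# lists installed models as JSON; the list is empty until you pull one
curl -sI http://localhost:3000 | head -n 1
# Open WebUI should return an HTTP 200 once it has finished starting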
Step 4: Pull a Model and Run Your First Chat
Open a terminal and pull a model into Ollama. For a good balance of quality and speed, try Meta’s 8B model:
docker exec -it ollama ollama pull llama3.1:8b
You can test it from the CLI:
docker exec -it ollama ollama run llama3.1:8b "Write a haiku about local AI."
If the response appears, the model is working.
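The same model is also reachable over Ollama's HTTP API, which is what Open WebUI talks to behind the scenes. A minimal request from a Linux shell or WSL (the prompt is arbitrary; streaming is disabled so the answer arrives as a single JSON object):
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain in one sentence what a context window is.",
  "stream": false
}'
# the reply is a JSON object whose "response" field holds the generated text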
Now open your browser and visit http://localhost:3000. Create an admin account (since we set WEBUI_AUTH=True). In Settings > Models, you should see llama3.1:8b. Create a new chat and start prompting.
GPU Acceleration Checks
- If you have an NVIDIA GPU, Ollama should use it automatically. Confirm via the logs with docker logs ollama (look for CUDA initialization), or with the commands after this list.
- If you do not have a GPU, Ollama will use CPU. Expect slower generation but it will work.
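Two further checks from the host, assuming the container name ollama from the compose file (the NVIDIA runtime injects nvidia-smi into the container when GPU passthrough is working):
docker exec -it ollama nvidia-smi
# the GPU should be visible from inside the container
docker exec -it ollama ollama ps
# after a chat, the PROCESSOR column reports how much of the loaded model sits on the GPU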
Useful Options and Performance Tips
- Try smaller variants for low VRAM: llama3.2:3b or phi3:mini.
- Ollama offloads as many model layers as fit in VRAM to the GPU and runs the rest from system RAM; low-VRAM systems still work, just more slowly. The OLLAMA_KEEP_ALIVE=24h setting in the compose file keeps a loaded model in memory between chats.
- DEFAULT_MODELS in the Open WebUI service (as set above) chooses which model new chats select by default; the model itself still has to be pulled with ollama pull or from the Open WebUI model settings.
- For multilingual or coding tasks, add models like qwen2.5:7b or codestral, pulled the same way (see the example after this list).
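Any of the models mentioned above is pulled the same way as the first one, and ollama list shows what is already stored in the ollama_data volume:
docker exec -it ollama ollama pull llama3.2:3b
docker exec -it ollama ollama pull qwen2.5:7b
docker exec -it ollama ollama list
# name, size and modification time for every downloaded model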
Security and Remote Access
- Keep WEBUI_AUTH=True to require sign-in. The first account created in Open WebUI becomes the admin; once it exists, you can set ENABLE_SIGNUP=False in the Open WebUI service environment to block further registrations.
- If exposing Open WebUI to the internet, place it behind a reverse proxy (Nginx, Caddy, or Traefik) with HTTPS and strong passwords.
- The Ollama API on port 11434 should remain private unless you need remote access. The simplest way to lock it down is to publish it only on loopback by changing the mapping to "127.0.0.1:11434:11434" in compose.yaml; note that host firewalls such as ufw do not always apply to Docker-published ports.
Troubleshooting
- GPU not detected: On Linux, reinstall nvidia-container-toolkit and verify nvidia-smi works both on the host and in a container. On Windows, ensure WSL 2 is enabled and Docker Desktop is up to date.
- “No space left on device”: Increase disk space or prune unused model blobs: docker exec -it ollama ollama rm <model>. You can also clear unused images with docker system prune (caution).
- Slow or out-of-memory: Use a smaller model, reduce context length in Open WebUI, close other GPU-intensive apps, or increase swap on Linux.
- Port in use: Change the published ports in compose.yaml (e.g., "3001:8080") and redeploy.
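A few commands that help narrow down the disk and port issues above (the ss invocation assumes a Linux host):
docker compose ps                   # state and port mappings of both containers
docker system df                    # how much space images, containers and volumes use
docker exec -it ollama ollama list  # which models are taking up the ollama_data volume
ss -tlnp | grep -E '3000|11434'     # which process already holds a conflicting port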
Updating and Maintenance
To update to the latest versions, run:
docker compose pull
docker compose up -d
Your models are safe in the ollama_data volume, and your chat history lives in openwebui_data. Always back up these volumes before major upgrades.
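One way to back them up is to mount each named volume into a throwaway container and archive it to the current directory. Compose prefixes volume names with the project (folder) name, so check docker volume ls for the exact names; the commands below assume the folder is called ollama-openwebui:
docker volume ls
docker run --rm -v ollama-openwebui_ollama_data:/data -v "$PWD":/backup alpine \
  tar czf /backup/ollama_data.tar.gz -C /data .
docker run --rm -v ollama-openwebui_openwebui_data:/data -v "$PWD":/backup alpine \
  tar czf /backup/openwebui_data.tar.gz -C /data .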
What’s Next
You now have a privacy-friendly, GPU-accelerated local AI stack. Explore function calling, RAG connectors in Open WebUI, or run multiple models side by side. With Docker and Ollama, swapping models and keeping performance high is only a pull away.