How to Self‑Host a Private AI Chatbot with Ollama and Open WebUI (Docker, GPU‑Ready)

Overview

Want a private, fast, and customizable AI chatbot without sending your data to the cloud? In this guide you will deploy Ollama (which runs large language models locally) together with Open WebUI (a modern chat interface) using Docker. The setup works on Linux, Windows, and macOS, and can use your NVIDIA GPU for acceleration. You will get a production‑style layout with data volumes, secure defaults, update steps, and troubleshooting tips.

What You Will Need

- A machine with at least 8 GB RAM (16 GB+ recommended for larger models). CPU‑only works; GPU is optional.

- Docker Engine (Linux) or Docker Desktop (Windows/macOS). Ensure Docker Compose is available (Docker Desktop includes it).

- Optional GPU acceleration: NVIDIA GPU, recent NVIDIA drivers, and NVIDIA Container Toolkit on Linux; on Windows, Docker Desktop with WSL2 backend and CUDA‑capable drivers.
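Before moving on, a quick way to verify these prerequisites from a terminal (the last line only applies if you plan to use a GPU):

docker --version             # Docker Engine or Docker Desktop is installed
docker compose version       # Compose v2 is available
docker run --rm hello-world  # Docker can actually run containers
nvidia-smi                   # GPU path only: the NVIDIA driver is visible on the host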

Step 1 — Create the Docker Compose file

Create a working folder (for example, ai-stack) and add a file named docker-compose.yml with the following baseline (CPU‑only, safe defaults that bind to localhost):

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    volumes:
      - ollama:/root/.ollama
    ports:
      - "127.0.0.1:11434:11434"
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    volumes:
      - open-webui:/app/backend/data
    ports:
      - "127.0.0.1:3000:8080"
volumes:
  ollama:
  open-webui:

Binding to 127.0.0.1 keeps services private on the host. You can later expose them behind a reverse proxy with HTTPS if you need remote access.
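Before starting anything, you can let Compose validate the file; docker compose config prints the fully resolved configuration, or an error pointing at the problem:

docker compose config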

Step 2 — Start the stack

From the folder with your compose file, run:

docker compose pull
docker compose up -d

Wait a few seconds for containers to initialize. You can watch logs with docker compose logs -f.
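To confirm the stack is actually up before pulling any models, check the container status and query Ollama's version endpoint on the host:

docker compose ps                        # both containers should be listed as running
curl http://127.0.0.1:11434/api/version  # should return a small JSON version string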

Step 3 — Download your first model

Ollama manages models on demand. Pull a small model to test quickly (Llama 3.2 3B is a good start):

docker exec -it ollama ollama pull llama3.2:3b

You can list models later with docker exec -it ollama ollama list. For better quality, try llama3.1:8b or a reasoning model when your hardware allows it.
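If you want to sanity-check the model from the command line before opening the UI, you can run a one-off prompt inside the container; this is plain ollama run and streams the answer to your terminal:

docker exec -it ollama ollama run llama3.2:3b "Explain what a Docker volume is in one sentence."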

Step 4 — Open the chat UI

Visit http://localhost:3000. On first launch, Open WebUI asks you to create an account; the first account becomes the administrator. Choose the model you pulled (e.g., llama3.2:3b) and start chatting. Responses run entirely on your machine through Ollama at http://localhost:11434.
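If you prefer to script against the same backend, the Ollama HTTP API answers on localhost as well; a minimal non-streaming example with the model pulled above:

curl http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Say hello in one short sentence.",
  "stream": false
}'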

Optional: Enable NVIDIA GPU acceleration

GPU support can dramatically speed up responses. Ensure your system is ready first:

- Linux: Install the proprietary NVIDIA driver and the NVIDIA Container Toolkit (nvidia-container-toolkit). Verify nvidia-smi works on the host.

- Windows: Install NVIDIA drivers with CUDA, enable WSL2 and GPU support in Docker Desktop, and ensure WSL2 integration is turned on for your Linux distro.
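Before touching the stack, a common smoke test confirms that Docker can pass the GPU into a container at all (it should print the same table as nvidia-smi on the host):

docker run --rm --gpus all ubuntu nvidia-smi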

Then, choose one of the following methods for the ollama service:

A) Compose with GPU (supported in recent Docker Compose versions):

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    volumes:
      - ollama:/root/.ollama
    ports:
      - "127.0.0.1:11434:11434"
    gpus: all
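If your Compose version does not accept the gpus attribute, the longer device-reservation form from the Compose specification does the same thing; this sketch goes under the ollama service in place of gpus: all:

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]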

B) Run Ollama with a direct docker run command (replaces the Compose service). The container must join the Compose project's network so Open WebUI can still reach it by the hostname ollama; with the ai-stack folder suggested above, that network is named ai-stack_default (adjust if your folder name differs):

docker stop ollama && docker rm ollama
docker run -d --name ollama --gpus all \
  --network ai-stack_default \
  -p 127.0.0.1:11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama:latest

After enabling GPU support, recreate the Ollama container (docker compose up -d applies the change). Models do not need to be re-pulled; Ollama detects the GPU at load time and runs inference on it automatically. Use docker logs ollama -f to confirm the GPU is detected.

Security Hardening (Recommended)

- Keep services bound to localhost as shown. For remote access, place a reverse proxy (Caddy, Nginx, Traefik) in front with HTTPS and authentication.

- In Open WebUI, the first account you create becomes the administrator; after that, disable open sign-ups in the admin settings. You can also run it behind SSO or a VPN.

- Do not expose port 11434 publicly; Ollama has no built‑in auth. If you must, secure the path via a proxy and firewall rules.
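As a sketch of the reverse-proxy approach: the service below adds Caddy to the same Compose file and uses its built-in reverse-proxy command for automatic HTTPS. The domain chat.example.com is a placeholder (point its DNS at your server first), ports 80/443 must be reachable from the internet, and authentication is still handled by Open WebUI's own login:

  caddy:
    image: caddy:latest
    container_name: caddy
    restart: unless-stopped
    # chat.example.com is a placeholder domain; replace it with your own
    command: caddy reverse-proxy --from chat.example.com --to open-webui:8080
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - caddy_data:/data   # persists issued TLS certificates
    depends_on:
      - open-webui

Remember to declare caddy_data under the top-level volumes: section alongside ollama and open-webui.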

Updating and Backups

To update to the latest images:

docker compose pull
docker compose up -d
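Updated images leave the old layers behind; once the new containers look healthy, you can optionally reclaim that disk space with Docker's standard prune command:

docker image prune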

Your models (Ollama) and chat data (Open WebUI) live in Docker volumes named ollama and open-webui. For a consistent snapshot, stop the stack first (docker compose stop), back the volumes up with the commands below, then start it again (docker compose up -d):

docker run --rm -v ollama:/data -v $(pwd):/backup busybox tar czf /backup/ollama-vol.tgz -C / data
docker run --rm -v open-webui:/data -v $(pwd):/backup busybox tar czf /backup/open-webui-vol.tgz -C / data
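To restore a backup later, stop the stack and unpack the archives back into the volumes (this overwrites their current contents, so use with care):

docker compose down
docker run --rm -v ollama:/data -v $(pwd):/backup busybox tar xzf /backup/ollama-vol.tgz -C /
docker run --rm -v open-webui:/data -v $(pwd):/backup busybox tar xzf /backup/open-webui-vol.tgz -C /
docker compose up -d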

Troubleshooting

- Port already in use: Change the left side of the port mapping (for example, 127.0.0.1:3001:8080) or stop the conflicting service.

- Slow or out‑of‑memory on big models: Choose a smaller model (3B–8B). On GPU, ensure sufficient VRAM; quantized variants (e.g., Q4_K_M) reduce memory needs.

- GPU not detected: Confirm nvidia-smi works on the host, restart Docker, and verify you used gpus: all or --gpus all. On Windows, ensure WSL2 integration is enabled in Docker Desktop.

- Open WebUI cannot reach Ollama: Check that OLLAMA_BASE_URL is set to http://ollama:11434 in the Compose file and that both services share the same default network (they do by default). The commands below help narrow this down.
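A short diagnostic sequence covers most of these cases; the last command assumes env and grep exist inside the Open WebUI image (typical for Debian-based images):

curl http://127.0.0.1:11434/api/version       # is Ollama answering on the host?
docker compose ps                             # are both containers up?
docker logs --tail 50 open-webui              # recent errors from the UI backend
docker exec open-webui env | grep -i ollama   # is OLLAMA_BASE_URL set as expected?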

Remove Everything (Optional)

To stop and remove containers but keep volumes: docker compose down.

To also delete all data volumes (irreversible; removes downloaded models and chat history): docker compose down -v.

What You Get

You now have a private AI chatbot that runs fully on your machine, with a clean Docker layout, optional GPU acceleration, and safe defaults. Expand by adding more models (e.g., CodeLlama for coding, Phi‑3 for low‑resource devices), enabling RAG with document uploads in Open WebUI, or placing the stack behind a reverse proxy for secure remote access. This approach keeps your data local, reduces latency, and gives you full control over updates and performance.
