Self-Host Private AI Chat: Deploy Ollama + Open WebUI on Docker (GPU Ready)

If you want a private, fast, and customizable AI chat without sending data to third-party clouds, hosting Ollama with Open WebUI on Docker is a great choice. Ollama runs large language models (LLMs) locally, and Open WebUI adds a clean chat interface, prompt management, and model switching on top. This guide shows you how to deploy both with Docker on Linux or Windows (WSL2), including optional GPU acceleration for NVIDIA or AMD.

Prerequisites

- A 64-bit machine with at least 16 GB RAM for smooth inference (8 GB can work for smaller models).
- Docker Engine and Docker Compose v2 installed.
- For GPU acceleration (optional):
• NVIDIA: Install the latest NVIDIA driver and the NVIDIA Container Toolkit (a quick install sketch follows this list).
• AMD: Recent ROCm-supported GPU and drivers. On Linux, make sure /dev/kfd and /dev/dri are present.
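
For NVIDIA on a Debian/Ubuntu host, a minimal Container Toolkit install sketch looks roughly like this (assuming NVIDIA's apt repository is already configured; the exact package setup varies by distribution, so check NVIDIA's documentation for yours):

# Install the toolkit, register it with Docker, and restart the daemon
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker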

Step 1: Prepare folders

Create a working folder to store persistent data. This keeps your models and chat history safe across updates.

mkdir -p ~/ai-stack && cd ~/ai-stack

Step 2: Create a Docker Compose file

The following docker-compose.yml launches two services: ollama (the model runtime) and open-webui (the web interface). It maps volumes for persistence, exposes ports, and connects the web UI to Ollama. GPU support can be enabled with a single line if you use NVIDIA.

version: "3.9"
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=24h
    restart: unless-stopped
    # Enable this line if you have an NVIDIA GPU and the container toolkit installed:
    # gpus: all

  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    depends_on:
      - ollama
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      # Optional: disable public sign-ups after you create the first admin
      # - ENABLE_SIGNUP=false
    ports:
      - "3000:8080"
    volumes:
      - open-webui:/app/backend/data
    restart: unless-stopped

volumes:
  ollama:
  open-webui:
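
Notes for NVIDIA: if your Docker Compose version does not recognize the service-level gpus attribute, the deploy-based reservation below should achieve the same result; add it under services.ollama:

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]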

Notes for AMD/ROCm on Linux: The standard image does not bundle ROCm, so switch the Ollama service to the ROCm build (image: ollama/ollama:rocm) and pass the GPU devices through to the container. If models still aren't using your GPU, add the devices and group_add keys under services.ollama and extend its existing environment list as shown:

    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    group_add:
      - "video"
    environment:
      - OLLAMA_KEEP_ALIVE=24h
      # Only needed for some GPUs; the value depends on your GPU family (11.0.0 targets RDNA3)
      - HSA_OVERRIDE_GFX_VERSION=11.0.0

If you can't use a GPU yet, you can run entirely on the CPU by leaving the GPU lines out. Start with small models first and scale up as resources allow.

Step 3: Start the stack

Run the following to download images and start containers in the background:

docker compose up -d

Verify both containers are healthy:

docker compose ps
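
As an extra check, you can confirm the Ollama API is reachable from the host; it listens on port 11434:

# Should return a small JSON blob with the Ollama version
curl http://localhost:11434/api/version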

Step 4: Access the web interface and pull a model

Open your browser to http://localhost:3000 (or the server’s IP on port 3000). On first load, Open WebUI asks you to create an admin account. After that, it connects to Ollama automatically via the configured OLLAMA_BASE_URL.

You need to download at least one model. You can pull models either from the WebUI’s Models section or via CLI. For example, from the host:

docker exec -it ollama ollama pull llama3.1:8b

Once downloaded, open a new chat in Open WebUI and select the model (e.g., llama3.1:8b). You can then chat, create system prompts, and save conversations.
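
Because Ollama also exposes an HTTP API on port 11434, you can script against the same model outside the WebUI. A minimal sketch using the generate endpoint (assuming the llama3.1:8b pull above completed; the prompt is just an example):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain Docker volumes in one sentence.",
  "stream": false
}'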

Step 5: Confirm GPU acceleration (optional)

To check whether the GPU is being used, follow the Ollama logs while a response is generating:

docker logs -f ollama

You should see messages indicating that GPU layers were offloaded if acceleration is active. On NVIDIA, you can also run nvidia-smi on the host while generating text to confirm utilization.
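
Another quick check is Ollama's own process listing, which reports whether a loaded model is running on the CPU or GPU:

# The PROCESSOR column shows e.g. "100% GPU" when acceleration is working
docker exec -it ollama ollama ps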

Step 6: Secure and harden

- Accounts: After creating your admin user, consider disabling public sign-ups by uncommenting ENABLE_SIGNUP=false in the compose file and restarting.
- Network: Run behind a reverse proxy such as Caddy, Nginx, or Traefik to add HTTPS; a minimal example follows this list. If you expose it to the internet, restrict access with firewall rules and strong authentication.
- Data: Store volumes on disks with sufficient space. Models can be several gigabytes each.
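
As an illustration, a minimal Caddyfile that fronts Open WebUI with automatic HTTPS might look like this (chat.example.com is a placeholder; point your own domain at the server and open ports 80/443 first):

# chat.example.com is a placeholder domain
chat.example.com {
    reverse_proxy localhost:3000
}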

Step 7: Update, backup, and migrate

- Update images:

docker compose pull
docker compose up -d
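
Optionally, reclaim disk space from superseded image layers afterwards:

docker image prune -f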

- Back up volumes (on the host; ideally run docker compose stop first so files aren’t changing mid-archive):

docker run --rm -v ollama:/data -v $PWD:/backup alpine tar czf /backup/ollama-backup.tgz -C / data
docker run --rm -v open-webui:/data -v $PWD:/backup alpine tar czf /backup/open-webui-backup.tgz -C / data

- Migrate to another server by restoring these archives into fresh volumes with the reverse tar process, sketched below.
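
A restore sketch for the same archives (run on the target server from the folder holding the backups, with the stack stopped; the volume names must match those in the compose file):

docker run --rm -v ollama:/data -v $PWD:/backup alpine tar xzf /backup/ollama-backup.tgz -C /
docker run --rm -v open-webui:/data -v $PWD:/backup alpine tar xzf /backup/open-webui-backup.tgz -C /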

Troubleshooting

- Port conflicts: If ports 3000 or 11434 are in use, change the left side of the port mappings in the compose file (e.g., "8081:8080").
- GPU not detected (NVIDIA): Ensure the host driver matches your GPU, the NVIDIA Container Toolkit is installed, and the compose service has gpus: all (or the deploy reservation shown in Step 2). Restart Docker after installing the toolkit.
- GPU not detected (AMD): Confirm ROCm support for your GPU and kernel, expose /dev/kfd and /dev/dri, and add your user to the video group on the host.
- Out of memory: Choose a smaller model (e.g., 3B/7B variants), reduce the context length in the WebUI, or add swap space. On WSL2, limit or raise memory in .wslconfig; a sketch follows this list.
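
For the WSL2 case, a .wslconfig sketch placed at %UserProfile%\.wslconfig on Windows might look like this (the values are examples; size them to your machine, then run wsl --shutdown to apply):

# Example limits; adjust to your hardware
[wsl2]
memory=12GB
swap=16GB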

Why this stack?

Ollama provides a simple, consistent way to run many open models locally, with one command per model and automatic quantized formats for laptops and servers. Open WebUI adds a polished interface with multi-model selection, prompt templates, knowledge features, and API compatibility for tools. Together, they give you a private, portable AI chat solution that you control end to end.

With this setup, you can iterate quickly, test new models, and keep your data on your hardware. When you need more speed, enable GPU acceleration or move the same stack to a more powerful server with minimal changes.
