How to Run a Private Local AI Assistant with Ollama and Open WebUI on Windows, macOS, and Linux

Overview

Running a private AI assistant on your own computer is now practical, fast, and secure. With Ollama providing an easy local model runtime and Open WebUI offering a clean chat interface, you can chat with modern large language models (LLMs) without sending data to the cloud. This tutorial shows how to install Ollama and Open WebUI on Windows, macOS, and Linux, enable GPU acceleration, manage models, expose the API, and troubleshoot common issues.

Prerequisites and Hardware

You need a 64-bit system with at least 8 GB RAM (16 GB recommended) and enough free disk space for models, which typically take several gigabytes each. GPU acceleration greatly improves speed: NVIDIA GPUs (Windows/Linux) via CUDA, AMD GPUs (Linux) via ROCm, and Apple Silicon (macOS) via Metal are supported. Ensure your graphics drivers are up to date before enabling GPU features.

Step 1: Install Ollama

Windows (PowerShell): winget install Ollama.Ollama. After installation, Ollama starts automatically and runs in the background with a system tray icon. If it is not running, launch Ollama from the Start menu or quit and reopen it from the tray.

macOS (Apple Silicon or Intel): download the Ollama app from https://ollama.com/download and move it to Applications, or install the CLI with Homebrew: brew install ollama (the curl install script is Linux-only). Launch the app, or start the service with brew services start ollama, then verify with: ollama --version.

Linux (systemd-based): curl -fsSL https://ollama.com/install.sh | sh. The script installs Ollama and sets up a systemd service. If it is not already running, enable and start it with: sudo systemctl enable --now ollama. Check status with systemctl status ollama.
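
To confirm the runtime is reachable on any platform, check the version and query the local endpoint on its default port 11434; the root path normally answers with a short "Ollama is running" message:

ollama --version
curl http://localhost:11434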

Step 2: Pull and Run a Model

Ollama downloads models on first use. Good general-purpose choices are Llama 3.1 (8B) and Mistral. Smaller models run on CPUs and modest GPUs, while larger models need more VRAM.

Examples: ollama run llama3.1:8b or ollama run mistral. To download without starting a session: ollama pull llama3.1:8b. To list installed models: ollama list. To remove a model and free space: ollama rm llama3.1:8b.
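
For quick one-off use, ollama run also accepts a prompt on the command line and exits after printing the answer (the prompt below is just an example):

ollama run llama3.1:8b "Explain the difference between RAM and VRAM in two sentences."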

Step 3: Install Open WebUI (Docker)

Open WebUI is a modern web interface that connects to Ollama at http://localhost:11434. The easiest way to run it is with Docker.

Windows/macOS (host.docker.internal works): docker run -d --name open-webui -p 3000:8080 -e OLLAMA_BASE_URL=http://host.docker.internal:11434 -v openwebui-data:/app/backend/data ghcr.io/open-webui/open-webui:main

Linux (use host networking for simplicity): docker run -d --name open-webui --network=host -e OLLAMA_BASE_URL=http://127.0.0.1:11434 -v openwebui-data:/app/backend/data ghcr.io/open-webui/open-webui:main

Open your browser to http://localhost:3000 (or http://localhost:8080 if you used host networking on Linux, since the container then serves directly on its internal port), create an admin account, and select your default model. You can set a system prompt, temperature, and context length in the settings for each model.
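
If the page does not load or no models appear, the container logs usually show why, for example a failed connection to the OLLAMA_BASE_URL you configured:

docker logs -f open-webui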

Step 4: Enable GPU Acceleration

Windows (NVIDIA): Install the latest NVIDIA driver and CUDA runtime. Ollama detects CUDA automatically. If you have multiple GPUs, you can control which ones Ollama uses with CUDA_VISIBLE_DEVICES. If you receive out-of-memory errors, switch to a smaller model (e.g., 7B/8B) or lower the context length.

Linux (NVIDIA): Install the proprietary NVIDIA driver and CUDA toolkit from your distribution. Restart the Ollama service after installation: sudo systemctl restart ollama.

Linux (AMD): Install ROCm compatible with your GPU and kernel. Ollama uses ROCm when available. If ROCm is not detected, Ollama will fall back to CPU.

macOS (Apple Silicon): Ollama uses Metal by default. You do not need to install extra drivers.
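
Whichever backend you use, you can confirm that a loaded model is actually offloaded to the GPU: ollama ps reports how the loaded model is split between GPU and CPU, and on NVIDIA systems nvidia-smi shows the VRAM in use. For example:

ollama run llama3.1:8b "hello"
ollama ps
nvidia-smi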

Step 5: Use the Local API (Optional)

Ollama exposes a simple HTTP API at http://localhost:11434. Common endpoints include /api/generate (single-turn) and /api/chat (multi-turn). If you want to access Ollama from other devices on your LAN, set OLLAMA_HOST=0.0.0.0:11434 before starting the service, and open the firewall port cautiously. For example on Linux: sudo systemctl edit ollama and add the environment variable, then sudo systemctl daemon-reload && sudo systemctl restart ollama.
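
As a minimal sketch, a request against /api/chat looks like this (the model name and question are just examples; set "stream": false to get a single JSON response instead of streamed chunks):

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}],
  "stream": false
}'

For the LAN case on Linux, the drop-in created by systemctl edit only needs two lines:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"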

Step 6: Model Tips and Performance

Choose models that match your hardware and tasks. For laptops or CPUs, use 3–8B models for snappy responses. For workstations with 12–24 GB VRAM, try 13B and above. Use quantized variants (the default in Ollama) to reduce memory and disk usage. In Open WebUI, you can set a higher context length for coding and chat history, but that uses more RAM/VRAM.
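
If you want a fixed system prompt and a larger context without adjusting settings in every chat, one option is a small Modelfile layered on an existing model; the name my-coder and the values below are only illustrative:

FROM llama3.1:8b
PARAMETER num_ctx 8192
SYSTEM """You are a concise assistant for coding questions."""

Save this as Modelfile, build it with ollama create my-coder -f Modelfile, and my-coder then appears in ollama list and in Open WebUI like any other model.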

Updating and Maintenance

Update Ollama: Windows: winget upgrade Ollama.Ollama. macOS: update the app when it offers a new version, or run brew upgrade ollama if you installed via Homebrew. Linux: rerun the install script or use your package manager if you installed from a repo. Restart the service after updating.
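
Open WebUI is updated by pulling a newer image and recreating the container; the named volume from Step 3 keeps your chats and settings. Assuming the container name and image used above:

docker pull ghcr.io/open-webui/open-webui:main
docker stop open-webui
docker rm open-webui

Then repeat the docker run command from Step 3.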

Update models: ollama pull llama3.1:8b fetches newer revisions. You can pin tags (e.g., :8b) to stay consistent across machines.

Move model storage: By default models are stored under ~/.ollama/models (on Linux installs that run the service as the ollama user, under /usr/share/ollama/.ollama/models). To store models on another drive, set OLLAMA_MODELS to a new path and restart the service, then either move the existing models directory or re-pull the models you need.
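
On Linux this uses the same drop-in mechanism as in Step 5; the path below is just an example, and the ollama user needs write access to it:

[Service]
Environment="OLLAMA_MODELS=/mnt/models"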

Troubleshooting

Port conflict on 11434: Stop the conflicting service or change the Ollama port with OLLAMA_HOST=127.0.0.1:11500 and restart. Update OLLAMA_BASE_URL in Open WebUI to match.

Disk space issues: Large models take multiple gigabytes. Remove unused models with ollama rm <model>, and periodically check ~/.ollama.

GPU out-of-memory: Switch to a smaller model, lower context length, or disable image features if enabled. Ensure no other GPU-heavy apps are running.

Docker cannot reach Ollama: On Linux, prefer --network=host, or add --add-host=host.docker.internal:host-gateway and use http://host.docker.internal:11434 for OLLAMA_BASE_URL.

Security and Best Practices

Keep Ollama bound to localhost unless you truly need remote access. If exposing to the network, place it behind a reverse proxy with TLS and authentication. Regularly update Ollama and Open WebUI, test new models in a separate profile, and back up your Open WebUI data volume if you rely on saved chats or prompts.

You Are Ready

With Ollama running locally and Open WebUI providing a friendly interface, you have a fast, private AI assistant for writing, coding, note-taking, and research. Start small with an 8B model, tune your settings, and upgrade models as your hardware allows. Most tasks will feel instant on a modest GPU, and everything stays on your machine.