Why a local RAG chatbot?
If you work in IT, you probably have internal documents that never belong in a public cloud: runbooks, SOPs, incident postmortems, customer notes, firewall rules, or server inventories. A local chatbot can answer questions from those files without uploading anything outside your network. The modern approach is RAG (Retrieval-Augmented Generation): the system searches your documents for relevant passages and then asks the language model to respond using that context.
In this tutorial you will deploy a practical, self-hosted setup on Linux using Ollama (to run local LLMs) and Open WebUI (a friendly web interface). By the end you will have a browser-based chat that can be extended with document ingestion and can run fully offline.
What you will build
Ollama will run the language model on your Linux host. Open WebUI will provide the web UI and manage connections to Ollama. This combination is popular because it’s simple to update, works well with Docker, and supports a “private by default” workflow.
Prerequisites
You need a Linux server or workstation (Ubuntu/Debian/Fedora are fine). Recommended: 16 GB RAM or more, and SSD storage. A GPU is optional; CPU-only works, but responses will be slower. You also need root or sudo access and an internet connection for the initial downloads (you can later run offline).
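If you want a quick sanity check before starting, these standard Linux utilities show available RAM, free disk space on the root filesystem, and CPU core count:

free -h
df -h /
nproc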
Step 1: Install Ollama
On most Linux distributions, the quickest method is the official install script. Run the following command:
curl -fsSL https://ollama.com/install.sh | sh
After installation, verify that the service is running:
ollama --version
systemctl status ollama
If your firewall is strict, note that Ollama typically listens on 127.0.0.1:11434 by default (local-only). That’s good for security. You can keep it that way when Open WebUI runs on the same machine.
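To confirm that binding, you can check which address port 11434 is listening on (ss ships with iproute2 on most distributions):

ss -tlnp | grep 11434

You should see 127.0.0.1:11434 rather than 0.0.0.0:11434. Ollama only listens on all interfaces if you explicitly set OLLAMA_HOST, for example in the systemd service environment.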
Step 2: Pull a model with Ollama
Choose a model that matches your hardware. A solid general-purpose starting point is a smaller Llama-family model. Pull a model like this:
ollama pull llama3.1
Then test it quickly:
ollama run llama3.1
Type a short prompt (for example: “Summarize the purpose of RAG in one paragraph.”) and confirm you get a response. Exit with /bye.
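You can also confirm the HTTP API responds, since that is what Open WebUI will talk to. A minimal check against Ollama's generate endpoint (adjust the model name if you pulled something else):

curl http://127.0.0.1:11434/api/generate -d '{"model": "llama3.1", "prompt": "Say hello in one sentence.", "stream": false}'

A JSON reply containing a "response" field means the API side is healthy.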
Step 3: Install Docker (if needed)
Open WebUI is commonly deployed with Docker. If Docker is not installed, install it using your distro’s recommended method. On Ubuntu, this is typically:
sudo apt update
sudo apt install -y docker.io
sudo systemctl enable --now docker
To avoid running Docker commands with sudo, you can add your user to the docker group (log out and back in afterward):
sudo usermod -aG docker $USER
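Verify that Docker works before moving on:

docker --version
docker run --rm hello-world

If the hello-world container prints its welcome message, Docker is ready.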
Step 4: Run Open WebUI connected to Ollama
Start Open WebUI as a container and point it to Ollama. If Ollama is running on the same host, the container can reach it using host networking (simple on Linux):
docker run -d --name open-webui --restart=unless-stopped --network=host -e OLLAMA_BASE_URL=http://127.0.0.1:11434 -v open-webui:/app/backend/data ghcr.io/open-webui/open-webui:main
Open your browser and go to:
http://localhost:8080
Create the first admin account when prompted. Once logged in, confirm that your Ollama model appears in the model list. If it doesn’t, re-check that Ollama is running and that the URL is correct.
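A quick way to see what Ollama itself reports is to query its model listing endpoint from the host:

curl http://127.0.0.1:11434/api/tags

If your model shows up there but not in the UI, the problem is the connection between Open WebUI and Ollama rather than the model itself.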
Step 5: Basic security hardening (recommended)
If this is more than a lab setup, don't expose port 8080 directly to the internet. Instead, put it behind a reverse proxy (Nginx/Traefik/Caddy) with HTTPS and authentication. At minimum, restrict access to your LAN via firewall rules. If multiple users will access it, create separate accounts for them and disable open sign-ups in the admin settings.
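As one example of LAN-only access with ufw (the subnet below is a placeholder; replace it with your own, and note this assumes ufw's default incoming policy is deny):

sudo ufw allow from 192.168.1.0/24 to any port 8080 proto tcp
sudo ufw status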
Step 6: Add documents for RAG (practical approach)
RAG requires two pieces: (1) a place to store your documents and (2) an index/search layer that can retrieve relevant chunks. Many teams start with a controlled folder of PDFs/Markdown/TXT and progressively add ingestion and indexing tools as needs grow.
A simple, safe workflow is:
1) Put sanitized internal docs in a dedicated directory (example: /srv/knowledgebase).
2) Convert “messy” formats to text where possible (Markdown and text files work best); a conversion sketch follows this list.
3) In Open WebUI, look for knowledge or document features (often called “Knowledge,” “Documents,” or “RAG” depending on version) and import your files.
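For step 2, a converter such as pandoc handles many office formats, and pdftotext (from poppler-utils) covers PDFs. A minimal sketch, assuming your files sit in /srv/knowledgebase:

# Convert Word documents to Markdown next to the originals
for f in /srv/knowledgebase/*.docx; do
    pandoc "$f" -o "${f%.docx}.md"
done

# Extract plain text from PDFs
for f in /srv/knowledgebase/*.pdf; do
    pdftotext "$f" "${f%.pdf}.txt"
done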
If you do not see document ingestion in your build, treat this deployment as the base LLM layer and add a dedicated RAG service later (for example, a vector database plus an ingestion pipeline). The key advantage is that you already have the model hosting and UI stable and local.
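If and when you build that pipeline, Ollama can also supply the embedding side. As a rough sketch using Ollama's embeddings endpoint (nomic-embed-text is one embedding model you would pull separately with ollama pull nomic-embed-text):

curl http://127.0.0.1:11434/api/embeddings -d '{"model": "nomic-embed-text", "prompt": "Text of one document chunk"}'

The returned vector is what an ingestion pipeline would store in a vector database and compare against embedded user questions at query time.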
Troubleshooting common issues
Open WebUI can’t see Ollama models: Verify Ollama is running (systemctl status ollama) and confirm the base URL. If you didn’t use --network=host, use Docker’s host gateway options or run both services in the same Docker network and reference Ollama by container name.
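For reference, a commonly used alternative to host networking publishes the UI on port 3000 and reaches the host through Docker's host gateway. Note that Ollama must then listen on an address the container can actually reach (for example by setting OLLAMA_HOST so it binds beyond 127.0.0.1, and firewalling port 11434 accordingly), because 127.0.0.1 inside the container is not the host:

docker run -d --name open-webui --restart=unless-stopped -p 3000:8080 --add-host=host.docker.internal:host-gateway -e OLLAMA_BASE_URL=http://host.docker.internal:11434 -v open-webui:/app/backend/data ghcr.io/open-webui/open-webui:main

The UI is then reachable at http://localhost:3000.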
Slow responses: Try a smaller model, close other heavy workloads, and ensure you have enough RAM. CPU-only inference is normal but slower. If you have a supported GPU, check Ollama’s documentation for acceleration support on your platform.
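Recent Ollama releases also include a command that lists loaded models and shows whether they are running on CPU or GPU; if your version has it, it is a quick way to confirm acceleration is actually in use:

ollama ps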
High disk usage: Models can be several GB each. Remove unused models with ollama rm <model> and keep only what you use.
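To see each installed model and its size on disk before deciding what to prune:

ollama list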
Next steps
Once the local chatbot is working, you can improve accuracy and trust by tightening your knowledge base: keep documents current, remove duplicates, and structure key procedures in Markdown. If you expand into a full RAG stack, define clear ingestion rules and access controls so the chatbot only retrieves what each user is allowed to see.