Deploy a Local AI Assistant with Ollama on Ubuntu (GPU Optional) and Connect It to a Simple Chat API
Why run a local AI assistant?
Cloud AI tools are convenient, but they are not always the best fit for technical teams. Running an AI model locally can help when you need to keep data on-premises, reduce recurring API costs, work offline, or experiment with custom prompts without sending logs to third-party services. In this tutorial, you will set up Ollama on Ubuntu and run a modern large language model locally. You will also expose a small HTTP endpoint so your helpdesk tools, scripts, or internal apps can query the model in a controlled way.
Prerequisites
You will need an Ubuntu machine (22.04 or newer is ideal), at least 8 GB RAM (16 GB recommended), and enough disk space for models (many are 4–10 GB each). CPU-only is fine for testing; a supported NVIDIA GPU will improve speed significantly. You should also have sudo access and basic terminal familiarity.
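To check whether a machine meets these requirements before you start, a few standard commands are enough:
free -h                  # total and available RAM
df -h /                  # free disk space on the root filesystem
nproc                    # CPU core count
lspci | grep -i nvidia   # lists an NVIDIA GPU if one is present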
Step 1: Update Ubuntu and install basic tools
Start by updating packages and installing common utilities. This keeps the system clean and avoids dependency issues during installation.
Run:
sudo apt update && sudo apt -y upgrade
sudo apt -y install curl ca-certificates jq
Step 2: Install Ollama
Ollama provides a simple way to download and run models locally with a consistent CLI and a built-in service. Install it using the official script.
Run:
curl -fsSL https://ollama.com/install.sh | sudo sh
After installation, confirm that the service is available and the CLI responds.
ollama --version
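The install script normally registers a systemd service named ollama and starts the API on port 11434. Assuming a standard install, you can verify both:
systemctl status ollama --no-pager   # service should be active (running)
curl -s http://localhost:11434/      # should print "Ollama is running"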
Step 3 (Optional): Enable NVIDIA GPU acceleration
If you have an NVIDIA GPU, install the recommended driver and verify it is working. GPU support can drastically reduce response time for larger prompts.
Run:
sudo ubuntu-drivers autoinstall
sudo reboot
After reboot:
nvidia-smi
If you see GPU details, you are ready. If not, confirm Secure Boot settings and driver installation. Ollama typically detects available acceleration automatically.
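If nvidia-smi reports no devices, Secure Boot is a common culprit because it blocks unsigned kernel modules. Assuming mokutil is installed, you can check its state and look for driver errors:
mokutil --sb-state            # reports whether Secure Boot is enabled
sudo dmesg | grep -i nvidia   # look for module load or signing errors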
Step 4: Download and run a model
Now pull a model. A common starting point is a lightweight instruct model that performs well on general tasks. Example:
ollama pull llama3.2
Run an interactive session:
ollama run llama3.2
Ask something practical, such as: “Draft a troubleshooting checklist for DNS resolution issues on Ubuntu.” If you get coherent output, the base setup is complete. Type /bye to exit the interactive session.
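To confirm the download and skip the interactive session, you can list installed models and pass a one-shot prompt directly:
ollama list   # shows pulled models and their sizes
ollama run llama3.2 "Draft a troubleshooting checklist for DNS resolution issues on Ubuntu."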
Step 5: Use the local HTTP API
Ollama includes an API endpoint on the local machine. This makes it easy to integrate with scripts, ticketing systems, or internal tools. From the same server, test a generation request.
Run:
curl -s http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Write a 6-step incident response note for a failed backup job.",
"stream": false
}' | jq -r '.response'
If you see a clear response, your local model is ready for automation.
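Ollama also exposes a chat-style endpoint that takes a list of messages, which is more natural when you need to carry conversation context. A minimal non-streaming example:
curl -s http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "List three common causes of failed cron jobs."}
  ],
  "stream": false
}' | jq -r '.message.content'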
Step 6: Create a simple “chat gateway” service (safe internal use)
For internal teams, it is often helpful to provide a tiny wrapper service that your tools can call without exposing the full Ollama interface. The example below uses Python and forwards requests to Ollama while keeping the endpoint minimal.
Install Python tools:
sudo apt -y install python3 python3-venv
Create a small app:
mkdir -p ~/ollama-gateway && cd ~/ollama-gateway
python3 -m venv .venv
. .venv/bin/activate
pip install flask requests
Create app.py with the following content:
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

@app.post("/chat")
def chat():
    # Accept a JSON body with a required "prompt" and an optional "model"
    data = request.get_json(force=True)
    prompt = data.get("prompt", "").strip()
    if not prompt:
        return jsonify({"error": "Missing prompt"}), 400
    payload = {
        "model": data.get("model", "llama3.2"),
        "prompt": prompt,
        "stream": False
    }
    # Forward the request to the local Ollama API and return only the generated text
    r = requests.post("http://127.0.0.1:11434/api/generate", json=payload, timeout=120)
    r.raise_for_status()
    return jsonify({"response": r.json().get("response", "")})

if __name__ == "__main__":
    # Bind to localhost only; see the security notes below before exposing this further
    app.run(host="127.0.0.1", port=8088)
Run it:
python app.py
Test in another terminal:
curl -s http://127.0.0.1:8088/chat -H "Content-Type: application/json" -d '{"prompt":"Summarize the key steps to fix a full disk on Linux."}' | jq
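Because the gateway passes an optional model field through to Ollama, the same endpoint can target any model you have pulled, for example:
curl -s http://127.0.0.1:8088/chat -H "Content-Type: application/json" \
  -d '{"model":"llama3.2","prompt":"Write a short change-management note for a kernel update."}' | jq -r '.response'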
Step 7: Secure and operationalize the setup
Keep the API bound to 127.0.0.1 unless you have a clear need to expose it. If you must allow LAN access, put it behind a reverse proxy with authentication and strict firewall rules. Also consider separating prompts from sensitive data: even local tools can leak secrets via logs or copied outputs. For reliability, you can convert the gateway into a systemd service later, but even a basic local-only endpoint is useful for automation and experimentation.
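As a sketch of the systemd approach (the user name and paths below are examples; adjust them to wherever you created the virtual environment):
sudo tee /etc/systemd/system/ollama-gateway.service >/dev/null <<'EOF'
[Unit]
Description=Local Ollama chat gateway
After=network-online.target ollama.service

[Service]
# Example account and paths; replace with your own user and install location
User=youruser
WorkingDirectory=/home/youruser/ollama-gateway
ExecStart=/home/youruser/ollama-gateway/.venv/bin/python app.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now ollama-gateway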
Common troubleshooting tips
Slow responses: use a smaller model, reduce prompt length, or enable GPU acceleration.
Out of memory: close other apps, add swap as a temporary fix (see the example below), or choose a smaller model variant.
Port issues: confirm Ollama listens on 11434 locally and that your gateway points to 127.0.0.1.
Model not found: run ollama list and verify the model name matches what you pulled.
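If you need temporary swap to get past out-of-memory errors, a minimal example (remove it again once you are done testing):
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Later, to remove it:
sudo swapoff /swapfile && sudo rm /swapfile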
What you can build next
With a local model running, you can create internal knowledge assistants for helpdesk teams, draft incident updates, summarize logs, or generate standard operating procedures. The biggest win is control: you decide where prompts go, how access works, and what gets stored. Once this is stable, try adding model-specific system prompts for consistent tone, or integrate the gateway into a chatbot UI used by your team.
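As one way to bake in a consistent tone, Ollama lets you derive a variant with a built-in system prompt via a Modelfile. A minimal sketch (the name helpdesk-assistant is just an example):
cat > Modelfile <<'EOF'
FROM llama3.2
SYSTEM "You are a concise internal helpdesk assistant. Answer in numbered steps and flag anything that needs escalation."
EOF
ollama create helpdesk-assistant -f Modelfile
ollama run helpdesk-assistant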