Self-hosted ChatGPT - Jacob P Evans

The homelab runs its own ChatGPT: a 14-billion-parameter model on a dedicated GPU, reachable from anywhere on the LAN. No tokens, no metering, no prompt ever leaving the network. “Hermes” is not an agent or a daemon — it is the model plus the serving stack. The model is NousResearch’s Hermes-4-14B; the stack is Ollama doing GPU inference, Open WebUI for chat, and a separate CPU Ollama + Qdrant for retrieval. There are two Ollama instances, not one.

How you reach it

In the diagram, teal is your machine, ink is the DNS and reverse-proxy edge, and coral is the LLM stack. Both DNS names (the chat UI and the raw API) terminate at the same GPU Ollama. Every name resolves through Technitium and is fronted by Traefik with a wildcard certificate, so it is HTTPS end to end.
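A rough way to confirm that path from your own machine is to resolve each name, complete a verified TLS handshake, and check that a single-label wildcard would cover the hostname. This is a sketch under assumptions: `example.net` is the placeholder domain, `covered_by_wildcard` and `check` are hypothetical helper names, and the system trust store is assumed to trust whichever CA issued the wildcard certificate.

```python
import socket
import ssl

# Placeholder domain; the two names are the chat UI and the raw API.
DOMAIN = "example.net"
NAMES = [f"llm.{DOMAIN}", f"ollama.{DOMAIN}"]


def covered_by_wildcard(host: str, pattern: str) -> bool:
    """True if `host` matches `pattern`, where `pattern` may be a wildcard like *.example.net."""
    host, pattern = host.lower(), pattern.lower()
    if not pattern.startswith("*."):
        return host == pattern
    suffix = pattern[1:]  # ".example.net"
    if not host.endswith(suffix) or len(host) <= len(suffix):
        return False
    # A wildcard covers exactly one label: llm.example.net, not a.b.example.net.
    return "." not in host[: -len(suffix)]


def check(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Resolve the name, do a verified TLS handshake, and report what was negotiated."""
    address = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)[0][4][0]
    context = ssl.create_default_context()  # verifies the chain and the hostname
    with socket.create_connection((address, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return f"{host} -> {address} ({tls.version()})"
```

Calling `check("llm.example.net")` and `check("ollama.example.net")` from a machine on the LAN should return one line each when DNS and the certificate are in order. If the certificate comes from a private CA that is not in the system trust store, the handshake raises `ssl.SSLCertVerificationError` instead.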

What’s in the stack

| Container | Does | Reached at |
| --- | --- | --- |
| `hermes-infer` | Ollama on the RX 6800 (ROCm), serves `hermes4` | `https://ollama.<domain>` |
| `hermes-chat` | Open WebUI chat front-end | `https://llm.<domain>` |
| `llamaindex` | CPU Ollama (`nomic-embed-text`) for embeddings | internal (RAG) |
| `qdrant` | Vector store for retrieval | `https://qdrant.<domain>` |

All four sit on the ai VLAN. hermes-infer is a privileged LXC with the GPU passed through (/dev/kfd + /dev/dri); the model lives on a 120 GB volume. The LXCs and the Traefik ingress are provisioned by tofu-proxmox; Ollama, ROCm, the model pull, and Open WebUI are configured by ansible-proxmox-apps.
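The passthrough can be sanity-checked from inside the container by confirming that both device nodes are visible. The following is only a sketch: `missing_devices` and `report` are hypothetical helper names, and seeing the nodes does not by itself prove that ROCm can use the GPU.

```python
import os
from typing import Callable, Iterable

# Device nodes named above for ROCm passthrough.
GPU_DEVICES = ("/dev/kfd", "/dev/dri")


def missing_devices(
    paths: Iterable[str] = GPU_DEVICES,
    exists: Callable[[str], bool] = os.path.exists,
) -> list[str]:
    """Return the device paths that are not visible in this container."""
    return [path for path in paths if not exists(path)]


def report(missing: list[str]) -> str:
    """Turn the result into a one-line status message."""
    if not missing:
        return "GPU device nodes present"
    return "missing: " + ", ".join(missing)
```

Running `print(report(missing_devices()))` inside `hermes-infer` gives a quick first check before looking at Ollama's own logs.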

This is not the same “local AI” as the Apple Silicon stack. That one is the MLX server on this MacBook (also on port 11434, Ollama's default), tuned to hold one resident model. This page is the homelab GPU stack: a different machine, a bigger model, always on, shared across the LAN.

Use it from your Mac

Everything below is reachable by DNS name over HTTPS. Replace example.net with your homelab’s internal domain.

1 · Browser

Open https://llm.example.net, sign in, pick hermes4, and chat. This is the full Open WebUI — conversation history, system prompts, file uploads.

2 · `ollama` CLI

Point the CLI at the remote GPU instead of running a model locally:

brew install ollama                       # CLI only — no local model needed
export OLLAMA_HOST=https://ollama.example.net
ollama list                               # -> hermes4
ollama run hermes4 "Explain ZFS ARC in two sentences."
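The same check as `ollama list` can be made without installing the CLI, through Ollama's native `GET /api/tags` endpoint. This is a minimal sketch, assuming the placeholder domain used above. `model_names` and `list_models` are illustrative names, and the server may report the model with a tag suffix such as `hermes4:latest`.

```python
import json
import urllib.request

BASE_URL = "https://ollama.example.net"  # placeholder domain, as in the text


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]


def list_models(base_url: str = BASE_URL, timeout: float = 10.0) -> list[str]:
    """Ask the remote Ollama which models it has pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))
```

Calling `list_models()` from a machine on the LAN should return a list that includes the Hermes model.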

3 · OpenAI-compatible API

Ollama exposes an OpenAI-compatible API under /v1, so most OpenAI clients work once you change the base URL. The same base URL drops into editors, scripts, and SDKs:

curl https://ollama.example.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"hermes4","messages":[{"role":"user","content":"hello"}]}'

from openai import OpenAI

client = OpenAI(base_url="https://ollama.example.net/v1", api_key="ollama")  # key ignored
resp = client.chat.completions.create(
    model="hermes4",
    messages=[{"role": "user", "content": "hello"}],
)
print(resp.choices[0].message.content)

The raw ollama.example.net endpoint is unauthenticated; it is LAN-internal, the same posture as the other homelab dashboards. If you want an authenticated path, Open WebUI also exposes https://llm.example.net/v1: generate an API key under Settings → Account and send it as a bearer token.
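A request on the authenticated path might be built as follows. This sketch assumes that the Open WebUI `/v1` base mirrors the OpenAI layout, so that chat completions live at `/v1/chat/completions`. The helper names `build_chat_request` and `send` are hypothetical, and the key is whatever you generated in the Open WebUI settings.

```python
import json
import urllib.request

WEBUI_BASE = "https://llm.example.net/v1"  # base path as given in the text


def build_chat_request(
    prompt: str,
    api_key: str,
    base_url: str = WEBUI_BASE,
    model: str = "hermes4",
) -> urllib.request.Request:
    """Build a chat-completions POST that carries the key as a bearer token."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",  # assumed OpenAI-style path
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

With the key held in a variable, `send(build_chat_request("hello", api_key))` returns the reply text. The same base URL and key can also be passed as `base_url` and `api_key` to the OpenAI client shown earlier.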

Related

tofu-proxmox

Provisions the LXCs, GPU passthrough, and the Traefik ingress entries.

ansible-proxmox-apps

Installs Ollama + ROCm, pulls Hermes-4, configures Open WebUI.

LXC vs Docker

Why the inference stack runs as native LXC, not Docker.

AI development pipeline

The other “local AI” — MLX on the workstation, via Bifrost.
