Run the model where it makes sense. Fast and resident on the laptop, big and shared in the basement, and — soon — sharded across two Macs overnight.“Local LLM” here means two distinct stacks that answer to the same OpenAI-shaped API, plus the strategy that decides which one runs what. Neither is an agent. Both are just the model plus a serving stack — the agent layer (Claude Code, Gemini, the routines) lives above them and calls in over HTTP.
- The Apple Silicon stack —
vllm-mlxworkers behindllama-swapon the M4 Max, one resident model, tuned to coexist with a working desktop. This is the workstation’s own private model for delegated edits, drafts, and “don’t burn cloud tokens on this” tasks. - The homelab GPU stack — a bigger model on a dedicated GPU, always on, reachable from the whole LAN. See Homelab GPU.
Where it’s heading
A second 128 GB Mac arrives next. It does not turn two machines into one 256 GB pool — Apple Silicon unified memory can’t be merged across a cable. What it does unlock is combined capacity by sharding: a model too big for one 128 GB box, run across both over a fast link, unattended, by morning. That’s a capacity win, not a speed win — covered honestly in Distributed & multi-Mac.The principles that hold across both stacks
- One resident model, not a rotation. Swapping a multi-GB model evicts wired
GPU memory and reloads another — the slowest thing you can do. The workstation
holds a single resident model behind capability-role aliases
(
default,coding,quickest,tool-calling, …) that all resolve to it. - The registry is the source of truth. Which physical model is resident
lives in one place — the AI-stack registry that
nix-aiwrites at activation, read as~/.config/ai-stack/registry.json. Docs describe the strategy; the current id is a registry value, never hard-coded here. See Models & quantization. - Measure, don’t claim. No tuning change ships on a marketing number — only a measured one. Throughput and quality come from the public benchmark dataset, not vibes.
- MoE for throughput. A sparse mixture-of-experts model with a few billion active parameters decodes far faster than a dense model of the same total size — the lever that makes a big model usable on a laptop.
- Capacity, not speed, across machines. Sharding one model over two Macs is communication-bound; reserve it for models that don’t fit, and run two independent workers for everything that does.
In this section
Apple Silicon stack
The M4 Max
vllm-mlx + llama-swap stack and every non-secret tuning knob — and why each one is set the way it is.Models & quantization
One-resident posture, MoE vs dense, OptiQ / DWQ / mxfp4, and the fast-vs-overnight model tiers.
Backends & tool calling
Why
vllm-mlx, how it compares to Ollama / llama.cpp / mlx-lm / Rapid-MLX, and the tool-calling reliability problem.Distributed & multi-Mac
The honest two-Mac story: combined capacity via sharding, two-workers-vs-shard, and what’s measurement-gated.
Homelab GPU
The always-on, LAN-shared model on a dedicated GPU — a different machine, a bigger model.
Benchmarking
The reproducible harness and public dataset that every tuning decision is measured against.
How it connects
nix-ai
Packages the inference stack — the
vllm-mlx LaunchAgent, llama-swap, the MLX modules, and the AI-stack registry.AI development pipeline
Where local models sit in the bigger picture — routed alongside Claude, Gemini, and Copilot by task class.
Local AI isolation
Why a local model and the agents calling it still can’t read protected secrets.
Operational reference (private)
Host-specific values, real topology, and incident history live in the gated companion docs.