Local LLM benchmarking - Jacob P Evans

One envelope schema, every upstream eval tool, one public HF dataset.

mlx-benchmarks is the result-envelope contract and publisher for benchmarking MLX-quantized and locally-hosted LLMs on Apple Silicon. It is the thin glue between upstream evaluation tools (lm-eval, vllm benchmark_serving, agent-framework harnesses) and a single public HuggingFace dataset, with a Gradio viewer on top.

What it does

Defines envelope v1 in schema.json — the authoritative, versioned contract every published shard validates against.
Provides mlx-bench-publish, a CLI that converts raw tool output into the envelope, validates it, and uploads to the HF dataset with content-addressed filenames (data/run-<timestamp>-<git_sha>-<suite>-<model_slug>.parquet).
Owns converters for lm-eval, vllm benchmark_serving, and framework-eval (OpenAI / Qwen-Agent / smolagents / ADK).
Auto-detects runtime metadata (OS, chip, memory, Python, MLX, lm-eval versions) via detect_system() so envelopes are fully reproducible without hand-curation.
Deploys a Gradio viewer to HF Spaces on every main push touching space/.

How it fits

Feeds into	Consumes
HF dataset, HF Space viewer	`nix-ai` (vllm-mlx, llama-swap), `lm-eval`, `vllm`, agent-framework SDKs

Getting started

Bring up the inference stack

From the nix-darwin flake: darwin-rebuild switch --flake .. This starts vllm-mlx + llama-swap on localhost:11434 via nix-ai. Or run vllm-mlx serve directly if you’re not on the Nix stack.

Install and authenticate

git clone https://github.com/JacobPEvans/mlx-benchmarks && cd mlx-benchmarks && uv sync. Then export HF_TOKEN=... with write scope on the dataset namespace.

Run a smoke benchmark

Point lm-eval at the local endpoint:

BASE="http://localhost:11434/v1/chat/completions"
.venv/bin/lm_eval --model local-chat-completions \
  --model_args "base_url=$BASE,model=mlx-community/Qwen3.5-9B-MLX-4bit" \
  --tasks gsm8k_cot_zeroshot --limit 10 \
  --output_path ./run-output

Publish (dry-run first)

.venv/bin/mlx-bench-publish ./run-output/<model-dir>/results_*.json --kind lm-eval --suite reasoning --dry-run validates the envelope locally against schema.json. Drop --dry-run to push to the HF dataset.

View results

Open the HF Space viewer — it auto-loads every published shard. Or cd space && python app.py for a local copy.

nix-ai

Packages the inference stack: vllm-mlx LaunchAgent, llama-swap, MLX module derivations. Where models actually run.

nix-darwin

macOS host config. Composes nix-ai into the system flake so benchmarks have a reproducible environment.

ai-assistant-instructions

Model routing + permission policy. Tells AI clients which models to benchmark.

Local LLM

The serving stack, tuning, and model strategy these benchmarks measure.

Source on GitHub

Schema, publisher, converters, full README, docs/architecture.md.

Tools Docs automation

​What it does

​How it fits

​Getting started

​Related repos