Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.jacobpevans.com/llms.txt

Use this file to discover all available pages before exploring further.

One envelope schema, every upstream eval tool, one public HF dataset.
mlx-benchmarks is the result-envelope contract and publisher for benchmarking MLX-quantized and locally-hosted LLMs on Apple Silicon. It is the thin glue between upstream evaluation tools (lm-eval, vllm benchmark_serving, agent-framework harnesses) and a single public HuggingFace dataset, with a Gradio viewer on top.

What it does

  • Defines envelope v1 in schema.json — the authoritative, versioned contract every published shard validates against.
  • Provides mlx-bench-publish, a CLI that converts raw tool output into the envelope, validates it, and uploads to the HF dataset with content-addressed filenames (data/run-<timestamp>-<git_sha>-<suite>-<model_slug>.parquet).
  • Owns converters for lm-eval, vllm benchmark_serving, and framework-eval (OpenAI / Qwen-Agent / smolagents / ADK).
  • Auto-detects runtime metadata (OS, chip, memory, Python, MLX, lm-eval versions) via detect_system() so envelopes are fully reproducible without hand-curation.
  • Deploys a Gradio viewer to HF Spaces on every main push touching space/.

How it fits

Feeds intoConsumes
HF dataset, HF Space viewernix-ai (vllm-mlx, llama-swap), lm-eval, vllm, agent-framework SDKs

Getting started

1

Bring up the inference stack

From the nix-darwin flake: darwin-rebuild switch --flake .. This starts vllm-mlx + llama-swap on localhost:11434 via nix-ai. Or run vllm-mlx serve directly if you’re not on the Nix stack.
2

Install and authenticate

git clone https://github.com/JacobPEvans/mlx-benchmarks && cd mlx-benchmarks && uv sync. Then export HF_TOKEN=... with write scope on the dataset namespace.
3

Run a smoke benchmark

Point lm-eval at the local endpoint:
BASE="http://localhost:11434/v1/chat/completions"
.venv/bin/lm_eval --model local-chat-completions \
  --model_args "base_url=$BASE,model=mlx-community/Qwen3.5-9B-MLX-4bit" \
  --tasks gsm8k_cot_zeroshot --limit 10 \
  --output_path ./run-output
4

Publish (dry-run first)

.venv/bin/mlx-bench-publish ./run-output/<model-dir>/results_*.json --kind lm-eval --suite reasoning --dry-run validates the envelope locally against schema.json. Drop --dry-run to push to the HF dataset.
5

View results

Open the HF Space viewer — it auto-loads every published shard. Or cd space && python app.py for a local copy.

nix-ai

Packages the inference stack: vllm-mlx LaunchAgent, llama-swap, MLX module derivations. Where models actually run.

nix-darwin

macOS host config. Composes nix-ai into the system flake so benchmarks have a reproducible environment.

ai-assistant-instructions

Model routing + permission policy. Tells AI clients which models to benchmark.

Source on GitHub

Schema, publisher, converters, full README, docs/architecture.md.