Open experiment — March 2026

Training GPT-2 on two Mac Minis

An open project log documenting our path to train a GPT-2 level language model natively on Apple Silicon using MLX and a Thunderbolt-connected cluster.

Framework: MLX + Metal
Architecture: nanochat
Cluster: 2× Mac Mini
Interconnect: Thunderbolt RDMA

Best val loss: 3.506 (d14 champion, reproduced 4x faster)
Current run: d20 (477M) · 20 layers, no VE, cosine LR
Throughput: 284 tok/s (d20 on 2-node cluster)
Experiments: 40+ (precision, batch, VE, LR, architecture)

Validation Loss Progression

[Chart: validation loss vs. training step (0–28k) for d10, d12, d12-long, d14, and d20 (training); best losses 4.90, 4.53, 3.80, and 3.51, with d20 still in progress]

Cluster Architecture

node-0.local
  • Hardware: Mac Mini M4 Pro
  • Memory: 64 GB unified
  • GPU: Metal (20-core)
  • Interface: en3 / rdma_en3
  • Role: Primary / rank 0

Link: Thunderbolt 5 · JACCL / RDMA · ~80 Gbps

node-1.local
  • Hardware: Mac Mini M4 Pro
  • Memory: 64 GB unified
  • GPU: Metal (20-core)
  • Interface: en4 / rdma_en4
  • Role: Worker / rank 1

DP — Data Parallel

Each node holds a full model replica

Different data shards per node. Gradients averaged via all_sum after each step. Scales batch size linearly with node count.
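The DP step can be sketched in a few lines of plain Python. The `all_sum` below is a stand-in for the collective the real trainer runs over the Thunderbolt link (MLX's distributed all-sum); everything else is illustrative:

```python
# Sketch of one data-parallel gradient step. `all_sum` here is a local
# stand-in for the distributed collective that sums gradients across nodes.

def all_sum(per_node_grads):
    """Elementwise sum of each node's gradient vector (mock collective)."""
    return [sum(vals) for vals in zip(*per_node_grads)]

def dp_step(per_node_grads, n_nodes):
    """Average gradients across replicas after each backward pass."""
    summed = all_sum(per_node_grads)
    return [g / n_nodes for g in summed]

# Two nodes computed gradients on different data shards:
node0 = [0.25, -0.5, 1.0]
node1 = [0.75, 0.5, -1.0]
avg = dp_step([node0, node1], n_nodes=2)
print(avg)  # [0.5, 0.0, 0.0]
```

Each node then applies the same averaged gradient, so the replicas stay in sync while the effective batch size doubles.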

TP — Tensor Parallel

Model sharded across both nodes

Attention heads and MLP layers split column/row-wise. Vocab embeddings partitioned. Both nodes process the same batch simultaneously.
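A toy illustration of the column/row split, in plain Python rather than MLX and with the nonlinearity omitted for brevity: sharding W1 by columns and W2 by rows lets each node compute a partial output on the same batch, and a single all-reduce recovers the exact unsharded result.

```python
# Tensor-parallel MLP sketch: W1 split column-wise, W2 row-wise across
# two "nodes". Pure-Python matmul keeps the example self-contained.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

x  = [[1.0, 2.0]]                    # one token, hidden size 2
W1 = [[1.0, 0.0, 2.0, 1.0],          # hidden 2 -> ffn 4
      [0.0, 1.0, 1.0, 2.0]]
W2 = [[1.0, 0.0],                    # ffn 4 -> hidden 2
      [0.0, 1.0],
      [1.0, 1.0],
      [2.0, 0.0]]

W1_shards = [[row[:2] for row in W1], [row[2:] for row in W1]]  # columns
W2_shards = [W2[:2], W2[2:]]                                    # rows

# Each node computes a partial output on the SAME batch...
partials = [matmul(matmul(x, w), v) for w, v in zip(W1_shards, W2_shards)]
# ...and an all_sum-style reduction combines the partials.
out  = add(partials[0], partials[1])
full = matmul(matmul(x, W1), W2)     # reference: unsharded computation
print(out == full)  # True
```

The column split also plays well with an elementwise activation between W1 and W2: each node owns a disjoint slice of the hidden vector, so the nonlinearity can be applied per-shard before the reduction.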

How This Compares

This project uses Karpathy's nanochat architecture, which the community typically trains on 8× H100 GPUs to reach GPT-2 level quality in under 2 hours. We're doing the same thing on consumer Apple Silicon — slower, but on hardware we own. Each training run starts from scratch with random weights.

Community Standard

8× H100 80GB — ~530 TFLOPS
d24 model in ~1.8 hours for ~$43
GPT-2 level (CORE > 0.257)

Our Setup

2× Mac Mini M4 Pro — ~14 TFLOPS
d14 in ~15 hours, d20 training now, $0 per run
Exploring the limits of Apple Silicon ML training

Depth | Params | 8× H100 Time | Our Cluster       | Status
d10   | ~91M   | ~5 min       | ~6 hours          | Done — val 4.90
d12   | ~125M  | ~6 min       | ~12 hours         | Done — val 3.80
d14   | ~170M  | ~10 min      | ~15 hours         | Done — val 3.506 (champion)
d20   | ~477M  | ~1 hour      | ~2.8 days         | Training now (no VE, cosine LR)
d24   | ~640M  | ~2 hours     | ~1-2 weeks (est.) | GPT-2 target

Training Timeline

March 25, 2026 — now
d20 Training (no VE, cosine LR)
Training a 20-layer model (~477M params) at 284 tok/s with a cosine LR schedule. Removed value embeddings after discovering they added 419M params (47% of the model) for negligible convergence benefit; dropping them yielded a 42% throughput improvement. Targeting 70k steps over ~2.8 days.
training now — 284 tok/s
March 25, 2026
Discovery: Value Embeddings are 47% of Model
Performance investigation revealed that value embeddings add 419M params (10 full-vocab embedding tables) to the d20 model for near-zero convergence benefit. Local benchmarks confirmed: removing VE gives 28% faster throughput with identical loss. The model is memory-bandwidth-bound, and the VE tables are pure bandwidth cost with essentially no compute. Also confirmed that the 200 tok/s "regression" was actually the correct rate for an 897M-param model — the earlier 800 tok/s was an artifact of the near-zero optimizer work during cosine warmup.
47% param reduction
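A quick sanity check of this entry's numbers, using only figures quoted in the log (419M VE params, 897M total with VE):

```python
# Back-of-the-envelope check of the VE param fraction from quoted figures.
ve_params  = 419_000_000
with_ve    = 897_000_000
without_ve = with_ve - ve_params       # ~478M, close to the ~477M quoted
ve_fraction = ve_params / with_ve
print(f"{ve_fraction:.0%}")  # 47%
```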
March 24, 2026
d14 Champion Reproduced at 4x Speed
d14 v2 matched the original champion exactly — val loss 3.506 at step 29,750, identical to the original run. Completed in ~15 hours at 2,065 tok/s vs the original's ~60 hours at 500 tok/s. The 4x speedup comes from the reconfigured RDMA setup after the cluster reboots. Proven: the training pipeline is fully reproducible.
val_loss: 3.506 — champion matched, 4x faster
March 23, 2026
Investigation: Why Can't We Reproduce 3.506?
Both optimized and original model.py converge to identical val loss (4.170) — confirming code changes are innocent. The original champion trained to 32k steps with early stop at ~30k. Our re-runs early-stopped at 16.5k due to the --early-stop-vs-champion-ratio 1.25 gate killing the run before convergence. The model needs more steps.
root cause: premature early stop
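The gate's logic amounts to a one-line comparison; this is a minimal sketch of what `--early-stop-vs-champion-ratio 1.25` does, not the trainer's actual code:

```python
# Minimal sketch of the early-stop gate: kill the run if validation loss
# is still above ratio x the champion's best.

def should_stop(val_loss, champion_loss, ratio=1.25):
    """True when the run is worse than ratio * champion and gets killed."""
    return val_loss > ratio * champion_loss

champion = 3.506                       # d14 champion's best val loss
print(should_stop(4.50, champion))     # True  (4.50 > 4.3825, run killed)
print(should_stop(4.17, champion))     # False (within the gate)
```

With the champion at 3.506, the gate fires at val loss 4.3825 — so a run still descending through the mid-4s at 16.5k steps gets killed before it can converge.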
March 23, 2026
Key Finding: seq_len 1024 Diverges at val 3.63
Both f32 and mixed precision with seq_len=1024 converge to identical val loss of 3.634 then diverge rapidly (val spikes to 4.87). This is not a precision issue — it's a seq_len/LR interaction. The original d14 with seq_len=512 reached 3.506 without divergence. Fix: use LR scheduling for longer sequences, or stay with seq_len=512 for proven convergence.
val_loss: 3.634 ceiling at seq_len 1024
March 22-23, 2026
Precision Experiments: bf16, Mixed, f32
Three precision modes tested on d14-long (seq_len 1024). Pure bf16: 4,530 tok/s but plateaued at val 5.01. Mixed precision: 3,830 tok/s, reached val 3.634. Optimized f32: 3,000 tok/s, also reached val 3.634. Same ceiling confirms the issue is seq_len, not precision.
3 precision modes tested
March 22, 2026
Optimization Deep Dive: 29 Experiments
Ran automated optimization suite testing bf16, batch sizes (2-16), cache limits, fused QK scale, cached RoPE, Muon optimizer, LR schedules, gradient clipping. Key wins: bf16 1.45x, batch=16 1.71x, precomputed masks+RoPE 1.63x. Also investigated flash-moe's Metal optimizations — confirmed no memory compressor thrashing on our cluster.
29 experiments
March 18, 2026
d14 Beats d12-long
14-layer model achieved best validation loss of 3.506 at step 29,750, decisively beating the d12-long champion (3.797) by 7.7%. Early stopped at step 31,625 after a late-run improvement surge. Ran on 2-node JACCL cluster at ~490 tokens/sec.
val_loss: 3.506
March 17, 2026
d12-long Baseline Locked
Long confirmation run completed 28,000 steps with early stopping. Achieved best validation loss of 3.797, beating the shorter d12 by a wide margin. Exported cleanly and validated over 2-node JACCL. Now the benchmark to beat.
val_loss: 3.797
March 2026
d12 Proving Run
First 12-layer experiment beat the d10 baseline convincingly. Validation loss 4.530 showed that deeper models were the right direction. Led directly to the extended d12-long confirmation.
val_loss: 4.530
March 2026
d10 Baseline Established
24-hour unattended optimization sweep over learning rates and weight decay. Champion tracking with early stopping. Locked at validation loss 4.904 as the first stable baseline.
val_loss: 4.904
March 2026
MLX-Native Trainer Built
Replaced PyTorch conversion path with a native MLX training pipeline. Full distributed support with DP and TP modes, checkpoint resume, gradient accumulation, and early stopping. Pure Python + MLX.
March 2026
Cluster Bootstrapped
Two Mac Minis connected via Thunderbolt cable. RDMA/JACCL configured. Ring and JACCL backends verified with distributed smoke tests. First successful multi-node forward pass.
March 2026
Project Started
Forked Karpathy's nanochat architecture. Built MLX bridge for model conversion and inference on Apple Silicon. First local smoke test passed.

Test Sentence Across Runs

prompt: "The capital of France is"
d10 — val_loss: 4.904
The capital of France is the most important of the country. The capital of the country is the most important of the country.
d12-long — val_loss: 3.797
The capital of France is a great place to visit the city of Paris. The city is home to a wide range of people, including the city of Paris, the city of Paris,
d14 — val_loss: 3.506
The capital of France is a small, small, and small town. It is a small town, and is located in the middle of the city. It is a small town, and
d14 v2 (champion) — val_loss: 3.506 (step 27,000)
The capital of France is a large city in the city of France. The city of France is a large city in the city of France.
d20 (training) — early steps
The capital of France is the first of the world's most popular culture in the world. It is the first of the world's most popular culture in the world.

d20 Training in Progress (no VE)

477M params · cosine LR · seq_len 1024
Throughput
284 tok/s
Params
477M (no VE)
Target
70k steps
ETA
~2.8 days
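The ETA follows from the run stats above, assuming roughly one seq_len-1024 sequence per optimizer step (an assumption for illustration, not a figure from the log):

```python
# ETA sanity check from the quoted run stats. tokens-per-step is an
# assumed value (one 1024-token sequence per step), not from the log.
steps        = 70_000
tok_per_step = 1024       # assumed
throughput   = 284        # tok/s, from the run stats

seconds = steps * tok_per_step / throughput
days    = seconds / 86_400
print(f"{days:.1f} days")  # ~2.9, consistent with the ~2.8-day ETA
```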

What's Next

Running now

d20 without VE (477M params)

Training now at 284 tok/s with cosine LR. Removed value embeddings (419M params, 47% of model) for 42% throughput boost with no quality loss. ETA ~2.8 days.

Planned

Warmup + Cosine LR Schedule

Local experiments showed 5% better convergence. For d20 with seq_len 2048, LR scheduling is likely essential to avoid the divergence we saw at seq_len 1024.
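A warmup + cosine schedule is simple to state in code; this is a generic sketch where `warmup_steps` and `min_lr` are illustrative defaults, not values from this project:

```python
import math

def lr_at(step, max_steps, base_lr, warmup_steps=1000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Peak LR at the end of warmup, decayed to min_lr at the end of training:
print(lr_at(999, 70_000, 3e-4))     # 0.0003
print(lr_at(70_000, 70_000, 3e-4))  # 0.0
```

The warmup phase keeps the LR small while optimizer statistics settle, which is exactly the regime where the seq_len-1024 divergence appeared in the earlier runs.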

Exploring

3rd Mac Mini Node

Adding a 3rd M4 Pro gives ~50% more throughput in DP and unlocks d24 (730M params) in TP mode. Makes GPT-2 quality in under 2 weeks feasible.

Exploring

d24 — GPT-2 Target

~730M parameters. The nanochat community's benchmark target. With 3 nodes in TP mode, estimated 1-2 weeks. The end goal of this project.