Open experiment — March 2026

Training GPT-2 on two Mac Minis

An open project log documenting our path to train a GPT-2 level language model natively on Apple Silicon using MLX and a Thunderbolt-connected cluster.

Framework: MLX + Metal
Architecture: nanochat
Cluster: 2× Mac Mini
Interconnect: Thunderbolt RDMA

Best val loss: 3.506 (d14 champion, reproduced 4x faster)
Current run: d20 (477M) · 20 layers, no VE, cosine LR
Throughput: 284 tok/s (d20 on 2-node cluster)
Experiments: 40+ (precision, batch, VE, LR, architecture)

Validation Loss Progression

[Chart: validation loss vs. training step (0–28k) for d10, d12, d12-long, d14, and d20 (training); best losses 4.90, 4.53, 3.80, and 3.51, with d20 still in progress]

Cluster Architecture

node-0.local
  • Hardware: Mac Mini M4 Pro
  • Memory: 64 GB unified
  • GPU: Metal (20-core)
  • Interface: en3 / rdma_en3
  • Role: Primary / rank 0

Link: Thunderbolt 5 · JACCL / RDMA · ~80 Gbps

node-1.local
  • Hardware: Mac Mini M4 Pro
  • Memory: 64 GB unified
  • GPU: Metal (20-core)
  • Interface: en4 / rdma_en4
  • Role: Worker / rank 1

DP — Data Parallel

Each node holds a full model replica

Different data shards per node. Gradients averaged via all_sum after each step. Scales batch size linearly with node count.
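The DP step can be sketched in a few lines of plain Python. The `all_sum` below is a stand-in for the collective the real trainer runs over the Thunderbolt link (MLX's distributed all-sum); everything else is illustrative:

```python
# Sketch of one data-parallel gradient step. `all_sum` here is a local
# stand-in for the distributed collective that sums gradients across nodes.

def all_sum(per_node_grads):
    """Elementwise sum of each node's gradient vector (mock collective)."""
    return [sum(vals) for vals in zip(*per_node_grads)]

def dp_step(per_node_grads, n_nodes):
    """Average gradients across replicas after each backward pass."""
    summed = all_sum(per_node_grads)
    return [g / n_nodes for g in summed]

# Two nodes computed gradients on different data shards:
node0 = [0.25, -0.5, 1.0]
node1 = [0.75, 0.5, -1.0]
avg = dp_step([node0, node1], n_nodes=2)
print(avg)  # [0.5, 0.0, 0.0]
```

Each node then applies the same averaged gradient, so the replicas stay in sync while the effective batch size doubles.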

TP — Tensor Parallel

Model sharded across both nodes

Attention heads and MLP layers split column/row-wise. Vocab embeddings partitioned. Both nodes process the same batch simultaneously.
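A toy illustration of the column/row split, in plain Python rather than MLX and with the nonlinearity omitted for brevity: sharding W1 by columns and W2 by rows lets each node compute a partial output on the same batch, and a single all-reduce recovers the exact unsharded result.

```python
# Tensor-parallel MLP sketch: W1 split column-wise, W2 row-wise across
# two "nodes". Pure-Python matmul keeps the example self-contained.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

x  = [[1.0, 2.0]]                    # one token, hidden size 2
W1 = [[1.0, 0.0, 2.0, 1.0],          # hidden 2 -> ffn 4
      [0.0, 1.0, 1.0, 2.0]]
W2 = [[1.0, 0.0],                    # ffn 4 -> hidden 2
      [0.0, 1.0],
      [1.0, 1.0],
      [2.0, 0.0]]

W1_shards = [[row[:2] for row in W1], [row[2:] for row in W1]]  # columns
W2_shards = [W2[:2], W2[2:]]                                    # rows

# Each node computes a partial output on the SAME batch...
partials = [matmul(matmul(x, w), v) for w, v in zip(W1_shards, W2_shards)]
# ...and an all_sum-style reduction combines the partials.
out  = add(partials[0], partials[1])
full = matmul(matmul(x, W1), W2)     # reference: unsharded computation
print(out == full)  # True
```

The column split also plays well with an elementwise activation between W1 and W2: each node owns a disjoint slice of the hidden vector, so the nonlinearity can be applied per-shard before the reduction.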

How This Compares

This project uses Karpathy's nanochat architecture, which the community typically trains on 8× H100 GPUs to reach GPT-2 level quality in under 2 hours. We're doing the same thing on consumer Apple Silicon — slower, but on hardware we own. Each training run starts from scratch with random weights.

Community Standard

8× H100 80GB — ~530 TFLOPS
d24 model in ~1.8 hours for ~$43
GPT-2 level (CORE > 0.257)

Our Setup

2× Mac Mini M4 Pro — ~14 TFLOPS
d14 in ~15 hours, d20 training now, $0 per run
Exploring the limits of Apple Silicon ML training

Depth | Params | 8× H100 Time | Our Cluster       | Status
d10   | ~91M   | ~5 min       | ~6 hours          | Done — val 4.90
d12   | ~125M  | ~6 min       | ~12 hours         | Done — val 3.80
d14   | ~170M  | ~10 min      | ~15 hours         | Done — val 3.506 (champion)
d20   | ~477M  | ~1 hour      | ~2.8 days         | Training now (no VE, cosine LR)
d24   | ~640M  | ~2 hours     | ~1-2 weeks (est.) | GPT-2 target

Training Timeline

March 25, 2026 — now
d20 Training (no VE, cosine LR)
Training a 20-layer model (~477M params) at 284 tok/s with a cosine LR schedule. Removed value embeddings after discovering they added 419M params (47% of the model) for negligible convergence benefit; dropping them yielded a 42% throughput improvement. Targeting 70k steps over ~2.8 days.
training now — 284 tok/s
March 25, 2026
Discovery: Value Embeddings are 47% of Model
Performance investigation revealed that value embeddings add 419M params (10 full-vocab embedding tables) to the d20 model for near-zero convergence benefit. Local benchmarks confirmed: removing VE gives 28% faster throughput with identical loss. The model is memory-bandwidth-bound, and the VE tables are pure bandwidth cost with essentially no compute. Also confirmed that the 200 tok/s "regression" was actually the correct rate for an 897M-param model — the earlier 800 tok/s was an artifact of the near-zero optimizer work during cosine warmup.
47% param reduction
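A quick sanity check of this entry's numbers, using only figures quoted in the log (419M VE params, 897M total with VE):

```python
# Back-of-the-envelope check of the VE param fraction from quoted figures.
ve_params  = 419_000_000
with_ve    = 897_000_000
without_ve = with_ve - ve_params       # ~478M, close to the ~477M quoted
ve_fraction = ve_params / with_ve
print(f"{ve_fraction:.0%}")  # 47%
```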
March 24, 2026
d14 Champion Reproduced at 4x Speed
d14 v2 matched the original champion exactly — val loss 3.506 at step 29,750, identical to the original run. Completed in ~15 hours at 2,065 tok/s vs the original's ~60 hours at 500 tok/s. The 4x speedup comes from the reconfigured RDMA setup after the cluster reboots. Proven: the training pipeline is fully reproducible.
val_loss: 3.506 — champion matched, 4x faster
March 23, 2026
Investigation: Why Can't We Reproduce 3.506?
Both optimized and original model.py converge to identical val loss (4.170) — confirming code changes are innocent. The original champion trained to 32k steps with early stop at ~30k. Our re-runs early-stopped at 16.5k due to the --early-stop-vs-champion-ratio 1.25 gate killing the run before convergence. The model needs more steps.
root cause: premature early stop
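The gate's logic amounts to a one-line comparison; this is a minimal sketch of what `--early-stop-vs-champion-ratio 1.25` does, not the trainer's actual code:

```python
# Minimal sketch of the early-stop gate: kill the run if validation loss
# is still above ratio x the champion's best.

def should_stop(val_loss, champion_loss, ratio=1.25):
    """True when the run is worse than ratio * champion and gets killed."""
    return val_loss > ratio * champion_loss

champion = 3.506                       # d14 champion's best val loss
print(should_stop(4.50, champion))     # True  (4.50 > 4.3825, run killed)
print(should_stop(4.17, champion))     # False (within the gate)
```

With the champion at 3.506, the gate fires at val loss 4.3825 — so a run still descending through the mid-4s at 16.5k steps gets killed before it can converge.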
March 23, 2026
Key Finding: seq_len 1024 Diverges at val 3.63
Both f32 and mixed precision with seq_len=1024 converge to identical val loss of 3.634 then diverge rapidly (val spikes to 4.87). This is not a precision issue — it's a seq_len/LR interaction. The original d14 with seq_len=512 reached 3.506 without divergence. Fix: use LR scheduling for longer sequences, or stay with seq_len=512 for proven convergence.
val_loss: 3.634 ceiling at seq_len 1024
March 22-23, 2026
Precision Experiments: bf16, Mixed, f32
Three precision modes tested on d14-long (seq_len 1024). Pure bf16: 4,530 tok/s but plateaued at val 5.01. Mixed precision: 3,830 tok/s, reached val 3.634. Optimized f32: 3,000 tok/s, also reached val 3.634. Same ceiling confirms the issue is seq_len, not precision.
3 precision modes tested
March 22, 2026
Optimization Deep Dive: 29 Experiments
Ran automated optimization suite testing bf16, batch sizes (2-16), cache limits, fused QK scale, cached RoPE, Muon optimizer, LR schedules, gradient clipping. Key wins: bf16 1.45x, batch=16 1.71x, precomputed masks+RoPE 1.63x. Also investigated flash-moe's Metal optimizations — confirmed no memory compressor thrashing on our cluster.
29 experiments
March 18, 2026
d14 Beats d12-long
14-layer model achieved best validation loss of 3.506 at step 29,750, decisively beating the d12-long champion (3.797) by 7.7%. Early stopped at step 31,625 after a late-run improvement surge. Ran on 2-node JACCL cluster at ~490 tokens/sec.
val_loss: 3.506
March 17, 2026
d12-long Baseline Locked
Long confirmation run completed 28,000 steps with early stopping. Achieved best validation loss of 3.797, beating the shorter d12 by a wide margin. Exported cleanly and validated over 2-node JACCL. Now the benchmark to beat.
val_loss: 3.797
March 2026
d12 Proving Run
First 12-layer experiment beat the d10 baseline convincingly. Validation loss 4.530 showed that deeper models were the right direction. Led directly to the extended d12-long confirmation.
val_loss: 4.530
March 2026
d10 Baseline Established
24-hour unattended optimization sweep over learning rates and weight decay. Champion tracking with early stopping. Locked at validation loss 4.904 as the first stable baseline.
val_loss: 4.904
March 2026
MLX-Native Trainer Built
Replaced PyTorch conversion path with a native MLX training pipeline. Full distributed support with DP and TP modes, checkpoint resume, gradient accumulation, and early stopping. Pure Python + MLX.
March 2026
Cluster Bootstrapped
Two Mac Minis connected via Thunderbolt cable. RDMA/JACCL configured. Ring and JACCL backends verified with distributed smoke tests. First successful multi-node forward pass.
March 2026
Project Started
Forked Karpathy's nanochat architecture. Built MLX bridge for model conversion and inference on Apple Silicon. First local smoke test passed.

Test Sentence Across Runs

prompt: "The capital of France is"
d10 — val_loss: 4.904
The capital of France is the most important of the country. The capital of the country is the most important of the country.
d12-long — val_loss: 3.797
The capital of France is a great place to visit the city of Paris. The city is home to a wide range of people, including the city of Paris, the city of Paris,
d14 — val_loss: 3.506
The capital of France is a small, small, and small town. It is a small town, and is located in the middle of the city. It is a small town, and
d14 v2 (champion) — val_loss: 3.506 (step 27,000)
The capital of France is a large city in the city of France. The city of France is a large city in the city of France.
d20 (training) — early steps
The capital of France is the first of the world's most popular culture in the world. It is the first of the world's most popular culture in the world.

d20 Training in Progress (no VE)

477M params · cosine LR · seq_len 1024
Throughput
284 tok/s
Params
477M (no VE)
Target
70k steps
ETA
~2.8 days
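The ETA follows from the run stats above, assuming roughly one seq_len-1024 sequence per optimizer step (an assumption for illustration, not a figure from the log):

```python
# ETA sanity check from the quoted run stats. tokens-per-step is an
# assumed value (one 1024-token sequence per step), not from the log.
steps        = 70_000
tok_per_step = 1024       # assumed
throughput   = 284        # tok/s, from the run stats

seconds = steps * tok_per_step / throughput
days    = seconds / 86_400
print(f"{days:.1f} days")  # ~2.9, consistent with the ~2.8-day ETA
```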

What's Next

Running now

d20 without VE (477M params)

Training now at 284 tok/s with cosine LR. Removed value embeddings (419M params, 47% of model) for 42% throughput boost with no quality loss. ETA ~2.8 days.

Planned

Warmup + Cosine LR Schedule

Local experiments showed 5% better convergence. For d20 with seq_len 2048, LR scheduling is likely essential to avoid the divergence we saw at seq_len 1024.
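A warmup + cosine schedule is simple to state in code; this is a generic sketch where `warmup_steps` and `min_lr` are illustrative defaults, not values from this project:

```python
import math

def lr_at(step, max_steps, base_lr, warmup_steps=1000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Peak LR at the end of warmup, decayed to min_lr at the end of training:
print(lr_at(999, 70_000, 3e-4))     # 0.0003
print(lr_at(70_000, 70_000, 3e-4))  # 0.0
```

The warmup phase keeps the LR small while optimizer statistics settle, which is exactly the regime where the seq_len-1024 divergence appeared in the earlier runs.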

Exploring

3rd Mac Mini Node

Adding a 3rd M4 Pro gives ~50% more throughput in DP and unlocks d24 (730M params) in TP mode. Makes GPT-2 quality in under 2 weeks feasible.

Exploring

d24 — GPT-2 Target

~730M parameters. The nanochat community's benchmark target. With 3 nodes in TP mode, estimated 1-2 weeks. The end goal of this project.