March 25, 2026 — now
d20 Training (no VE, cosine LR)
Training a 20-layer model (~477M params) at 284 tok/s
with a cosine LR schedule. Removed value embeddings after discovering they added 419M params
(47% of the model) for negligible convergence benefit; dropping them took throughput from 200 to 284 tok/s, a 42% improvement.
Targeting 70k steps over ~2.8 days.
training now — 284 tok/s
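For reference, a minimal sketch of a warmup + cosine decay schedule of the kind this run uses; total_steps matches the 70k target, while max_lr, min_lr, and warmup_steps are illustrative assumptions, not the run's actual values.

```python
import math

def cosine_lr(step: int, total_steps: int = 70_000, warmup_steps: int = 2_000,
              max_lr: float = 3e-4, min_lr: float = 3e-5) -> float:
    """Linear warmup into a cosine decay; hyperparameters are assumed."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The ETA also checks out: at roughly 1,024 tokens per step (assuming batch 2 × seq_len 512), 70k steps at 284 tok/s is about 252k seconds, i.e. ~2.9 days.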
March 25, 2026
Discovery: Value Embeddings Are 47% of the Model
A performance investigation revealed that value embeddings add 419M params
(10 full-vocab embedding tables) to the d20 model for near-zero convergence benefit.
Local benchmarks confirmed it: removing VE gives 28% faster throughput with identical loss.
The model is memory-bandwidth-bound, and VE are pure bandwidth cost with almost no compute.
Also confirmed that the 200 tok/s "regression" was in fact the correct rate for an 897M-param model;
the earlier 800 tok/s reading was an artifact of the cosine schedule's warmup phase, when the optimizer does near-zero work.
47% param reduction
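A back-of-envelope check on those numbers; vocab_size and dim are illustrative assumptions chosen to match the stated totals, not confirmed model dimensions.

```python
# 10 full-vocab value-embedding tables, as described above.
n_tables = 10
vocab_size = 65_536   # assumed vocab size
dim = 640             # assumed embedding width

ve_params = n_tables * vocab_size * dim
print(f"VE params: {ve_params / 1e6:.0f}M")               # ~419M
print(f"share of 897M model: {ve_params / 897e6:.0%}")    # ~47%
```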
March 24, 2026
d14 Champion Reproduced at 4x Speed
d14 v2 matched the original champion exactly — val loss
3.506 at step 29,750, identical to the
original run. Completed in ~15 hours at 2,065 tok/s vs the original's ~60 hours
at 500 tok/s. The 4x speedup comes from the reconfigured RDMA setup after
the cluster reboots. Proven: the training pipeline is fully reproducible.
val_loss: 3.506 — champion matched, 4x faster
March 23, 2026
Investigation: Why Can't We Reproduce 3.506?
Both the optimized and the original model.py converge to an identical val loss (4.170),
confirming the code changes are innocent. The original champion trained to 32k steps
with an early stop at ~30k. Our re-runs early-stopped at 16.5k because the
--early-stop-vs-champion-ratio 1.25 gate killed the run before
convergence. The model simply needs more steps.
root cause: premature early stop
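A minimal sketch of how such a gate behaves, assuming --early-stop-vs-champion-ratio compares each eval's val loss against the champion's loss at the same step (the flag's exact semantics and the champion_curve lookup are assumptions):

```python
def should_early_stop(step: int, val_loss: float,
                      champion_curve: dict[int, float],
                      ratio: float = 1.25) -> bool:
    """Kill the run if it trails the champion by more than `ratio` at this step."""
    champion_loss = champion_curve.get(step)   # hypothetical champion history
    if champion_loss is None:
        return False
    return val_loss > ratio * champion_loss
```

With a gate like this, a run that converges more slowly than the champion early on gets killed long before its own convergence, which is exactly the failure mode seen at 16.5k steps.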
March 23, 2026
Key Finding: seq_len 1024 Diverges at val 3.63
Both f32 and mixed precision with seq_len=1024 converge to an identical val loss of
3.634, then diverge rapidly (val spikes to 4.87).
This is not a precision issue; it is a seq_len/LR interaction. The original d14
with seq_len=512 reached 3.506 without divergence. Fix: use LR scheduling for longer
sequences, or stay with seq_len=512 for proven convergence.
val_loss: 3.634 ceiling at seq_len 1024
March 22-23, 2026
Precision Experiments: bf16, Mixed, f32
Three precision modes tested on d14-long (seq_len 1024).
Pure bf16: 4,530 tok/s but plateaued at val 5.01.
Mixed precision: 3,830 tok/s, reached val 3.634.
Optimized f32: 3,000 tok/s, also reached val 3.634.
Same ceiling confirms the issue is seq_len, not precision.
3 precision modes tested
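A minimal sketch of what the mixed mode looks like in MLX: fp32 master weights with a bf16 cast for compute. loss_fn(params, batch) and the optimizer are hypothetical stand-ins for the trainer's actual interfaces; apply_gradients follows the mlx.optimizers pattern.

```python
import mlx.core as mx
from mlx.utils import tree_map

def train_step(master_params, batch, loss_fn, optimizer):
    # Cast a bf16 view of the fp32 master weights for forward/backward.
    compute_params = tree_map(lambda p: p.astype(mx.bfloat16), master_params)
    loss, grads = mx.value_and_grad(loss_fn)(compute_params, batch)
    # Update the masters in fp32, avoiding the pure-bf16 plateau seen above.
    grads = tree_map(lambda g: g.astype(mx.float32), grads)
    master_params = optimizer.apply_gradients(grads, master_params)
    return loss, master_params
```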
March 22, 2026
Optimization Deep Dive: 29 Experiments
Ran automated optimization suite testing bf16, batch sizes (2-16), cache limits,
fused QK scale, cached RoPE, Muon optimizer, LR schedules, gradient clipping.
Key wins: bf16 1.45x, batch=16
1.71x, precomputed masks+RoPE
1.63x. Also investigated flash-moe's Metal
optimizations — confirmed no memory compressor thrashing on our cluster.
29 experiments
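Of the wins above, cached RoPE is the easiest to show. A minimal sketch, assuming standard RoPE; shapes and head_dim are illustrative, not the trainer's actual code:

```python
import mlx.core as mx

def build_rope_cache(seq_len: int, head_dim: int, base: float = 10000.0):
    """Precompute RoPE cos/sin tables once instead of every forward pass."""
    exponents = mx.arange(0, head_dim, 2).astype(mx.float32) / head_dim
    inv_freq = mx.power(base, -exponents)        # (head_dim // 2,)
    freqs = mx.outer(mx.arange(seq_len).astype(mx.float32), inv_freq)
    return mx.cos(freqs), mx.sin(freqs)          # each (seq_len, head_dim // 2)

COS, SIN = build_rope_cache(512, 64)             # built once at startup, reused every step
```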
March 18, 2026
d14 Beats d12-long
The 14-layer model achieved a best validation loss of 3.506
at step 29,750, decisively beating the d12-long champion (3.797) by 7.7%.
Early stopped at step 31,625 after a late-run improvement surge.
Ran on the 2-node JACCL cluster at ~490 tok/s.
val_loss: 3.506
March 17, 2026
d12-long Baseline Locked
Long confirmation run completed 28,000 steps with early stopping.
Achieved best validation loss of 3.797,
beating the shorter d12 by a wide margin. Exported cleanly and validated
over 2-node JACCL. Now the benchmark to beat.
val_loss: 3.797
March 2026
d12 Proving Run
The first 12-layer experiment beat the d10 baseline (4.904) convincingly.
A validation loss of 4.530 showed
that deeper models were the right direction, and led directly to the
extended d12-long confirmation run.
val_loss: 4.530
March 2026
d10 Baseline Established
24-hour unattended optimization sweep over learning rates and weight decay.
Champion tracking with early stopping. Locked at validation loss
4.904 as the first stable baseline.
val_loss: 4.904
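A minimal sketch of the sweep's champion tracking; run_one() is a hypothetical stand-in for a full training run, and the LR/WD grids are illustrative:

```python
import itertools

def run_one(lr: float, wd: float, stop_if_above: float) -> float:
    """Stand-in for one full training run; returns its best val loss."""
    raise NotImplementedError("replace with the real trainer")

def sweep(lrs=(1e-4, 3e-4, 1e-3), wds=(0.0, 0.01, 0.1)):
    best_loss, best_cfg = float("inf"), None
    for lr, wd in itertools.product(lrs, wds):
        # Each run can early-stop itself if it cannot beat the champion so far.
        loss = run_one(lr, wd, stop_if_above=best_loss)
        if loss < best_loss:
            best_loss, best_cfg = loss, (lr, wd)   # new champion
    return best_cfg, best_loss
```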
March 2026
MLX-Native Trainer Built
Replaced PyTorch conversion path with a native MLX training pipeline.
Full distributed support with DP and TP modes, checkpoint resume,
gradient accumulation, and early stopping. Pure Python + MLX.
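As one example of those pieces, a minimal sketch of gradient accumulation in MLX; loss_fn and the optimizer interface are assumptions, not the pipeline's real API:

```python
import mlx.core as mx
from mlx.utils import tree_map

def accumulated_step(params, microbatches, loss_fn, optimizer):
    grad_fn = mx.value_and_grad(loss_fn)
    total_grads, total_loss = None, 0.0
    for batch in microbatches:
        loss, grads = grad_fn(params, batch)
        total_loss += loss.item()
        total_grads = grads if total_grads is None else tree_map(
            lambda a, b: a + b, total_grads, grads)
    # Average, then do a single optimizer update for the whole accumulated batch.
    n = len(microbatches)
    total_grads = tree_map(lambda g: g / n, total_grads)
    params = optimizer.apply_gradients(total_grads, params)
    return total_loss / n, params
```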
March 2026
Cluster Bootstrapped
Two Mac Minis connected via Thunderbolt cable. RDMA/JACCL configured.
Ring and JACCL backends verified with distributed smoke tests.
First successful multi-node forward pass.
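A smoke test of that kind can be as small as an all-reduce of each node's rank; a sketch using MLX's distributed API (the launch command depends on the backend and is omitted):

```python
import mlx.core as mx

world = mx.distributed.init()                 # ring or JACCL backend
x = mx.array([float(world.rank())])
total = mx.distributed.all_sum(x)             # every node must see the same sum
mx.eval(total)
print(f"rank {world.rank()}/{world.size()}: all_sum -> {total.item()}")
# On the 2-node cluster both ranks should print 1.0 (0 + 1).
```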
March 2026
Project Started
Forked Karpathy's nanochat architecture. Built MLX bridge for model
conversion and inference on Apple Silicon. First local smoke test passed.
Sample generations (prompt: "The capital of France is")
d10 — val_loss: 4.904
The capital of France is the most important of the country. The capital of the country is the most important of the country.
d12-long — val_loss: 3.797
The capital of France is a great place to visit the city of Paris. The city is home to a wide range of people, including the city of Paris, the city of Paris,
d14 — val_loss: 3.506
The capital of France is a small, small, and small town. It is a small town, and is located in the middle of the city. It is a small town, and
d14 v2 (champion) — val_loss: 3.506 (step 27,000)
The capital of France is a large city in the city of France. The city of France is a large city in the city of France.
d20 (training) — early steps
The capital of France is the first of the world's most popular culture in the world. It is the first of the world's most popular culture in the world.