Eval Board / MEAT-bench v3

The Leaderboard

Every instrument — human and silicon — on one eval board. Scores are MEAT-Elo (0–100) per domain. The point of the manifesto, rendered as a table: no instrument wins everywhere, and the cheapest model on your task is rarely the strongest one.

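| # | Instrument | MEAT-Elo | Biology | Physical | Social | Math | Creative | Logistics | Tier | Status | Cost/tok | Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | The Retired Engineer (MEAT, 120B) | 64 | 35 | 60 | 50 | 95 | 55 | 88 | pro | ready | 2.50 | 26 |
| 2 | The Lawyer (MEAT, 240B) | 60 | 30 | 25 | 96 | 60 | 75 | 72 | pro | ⏳ 429 | 8.00 | 8 |
| 3 | The Olympic Athlete (MEAT, 60B) | 60 | 45 | 99 | 60 | 25 | 50 | 80 | pro | ready | 4.00 | 15 |
| 4 | The Average Adult (MEAT, 70B) | 59 | 45 | 65 | 68 | 50 | 60 | 65 | pro | ready | 1.00 | 59 |
| 5 | Mythos (SILICON, ~1.8T rumored) | 58 | 80 | 1 | 78 | 96 | 88 | 2 | enterprise | ⏳ 429 | 5.00 | 12 |
| 6 | The Surgeon (MEAT, 405B) | 52 | 99 | 22 | 55 | 70 | 35 | 28 | pro | ready | 9.50 | 5 |
| 7 | The Teenager (MEAT, 7B) | 52 | 25 | 78 | 40 | 45 | 70 | 55 | free | ready | 0.30 | 173 |
| 8 | Kimi (SILICON, ~600B MoE) | 49 | 68 | 1 | 64 | 84 | 76 | 2 | enterprise | ⏳ 429 | 1.20 | 41 |
| 9 | Frontier-7B (SILICON, 7B) | 31 | 40 | 1 | 42 | 55 | 48 | 1 | enterprise | ready | 0.40 | 78 |
| 10 | The Toddler (MEAT, 1B) | 30 | 5 | 30 | 35 | 8 | 92 | 12 | free | ready | 0.10 | 300 |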

Methodology: MEAT-Elo is the unweighted mean of the six domain scores, rounded to the nearest integer. Value = MEAT-Elo ÷ cost-per-token. Silicon scores 1 or 2 on embodied domains (Physical, Logistics) because no model can wash a car end-to-end. All figures are deadpan satire and reflect no real product's capabilities.
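The two methodology formulas can be sketched in a few lines of Python. This is a minimal sketch, not the board's actual scoring code: the function names (`meat_elo`, `value`, `round_half_up`) are hypothetical, the domain scores are as read from the rows above, and the rounding rule (round half up, with Value computed from the already-rounded MEAT-Elo) is an assumption chosen because it reproduces the published figures for these rows.

```python
from fractions import Fraction
from math import floor

# Domain order used by the board's columns.
DOMAINS = ("Biology", "Physical", "Social", "Math", "Creative", "Logistics")


def round_half_up(x):
    """Round an exact Fraction to the nearest integer, with .5 going up (assumed rule)."""
    return floor(x + Fraction(1, 2))


def meat_elo(scores):
    """Unweighted mean of the domain scores, rounded to an integer."""
    return round_half_up(Fraction(sum(scores), len(scores)))


def value(elo, cost_per_tok):
    """MEAT-Elo divided by cost-per-token; cost is passed as a string to stay exact."""
    return round_half_up(Fraction(elo) / Fraction(cost_per_tok))


# (domain scores in DOMAINS order, cost per token) for a few rows of the board.
BOARD = {
    "The Retired Engineer": ((35, 60, 50, 95, 55, 88), "2.50"),
    "The Lawyer": ((30, 25, 96, 60, 75, 72), "8.00"),
    "The Teenager": ((25, 78, 40, 45, 70, 55), "0.30"),
    "Frontier-7B": ((40, 1, 42, 55, 48, 1), "0.40"),
}

for name, (scores, cost) in BOARD.items():
    elo = meat_elo(scores)
    print(f"{name}: MEAT-Elo {elo}, Value {value(elo, cost)}")
```

`Fraction` is used instead of floats so that borderline cases such as 60 ÷ 8.00 = 7.5 round the same way every time.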

Reading the board

No universal winner

The Surgeon tops Biology (99) and bottoms out on Physical (22). Mythos tops Math (96) and scores no higher than 2 on anything needing a body. Capability is a profile, not a number.

Cheap ≠ best

The Toddler, the Teenager, and Frontier-7B post the board's three highest Value scores (capability per dollar) while sitting in the bottom half on raw MEAT-Elo. Route by task, not by leaderboard rank.
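One way to read "route by task" is: pick the cheapest instrument that clears a score bar on the domain you care about. The sketch below assumes exactly that policy. The `route` function and its `min_score` parameter are hypothetical, the Status column is ignored, and the rows are a subset of the board with scores as read from it.

```python
# (name, cost per token, domain scores) for a subset of the board.
INSTRUMENTS = [
    ("The Retired Engineer", 2.50,
     {"Biology": 35, "Physical": 60, "Social": 50, "Math": 95, "Creative": 55, "Logistics": 88}),
    ("The Olympic Athlete", 4.00,
     {"Biology": 45, "Physical": 99, "Social": 60, "Math": 25, "Creative": 50, "Logistics": 80}),
    ("Mythos", 5.00,
     {"Biology": 80, "Physical": 1, "Social": 78, "Math": 96, "Creative": 88, "Logistics": 2}),
    ("The Surgeon", 9.50,
     {"Biology": 99, "Physical": 22, "Social": 55, "Math": 70, "Creative": 35, "Logistics": 28}),
    ("The Teenager", 0.30,
     {"Biology": 25, "Physical": 78, "Social": 40, "Math": 45, "Creative": 70, "Logistics": 55}),
]


def route(domain, min_score):
    """Return the cheapest instrument scoring at least min_score on domain, or None."""
    candidates = [
        (cost, name)
        for name, cost, scores in INSTRUMENTS
        if scores[domain] >= min_score
    ]
    if not candidates:
        return None
    # min() on (cost, name) tuples picks the lowest cost first.
    return min(candidates)[1]


print(route("Math", 90))      # The Retired Engineer (cheaper than Mythos)
print(route("Physical", 70))  # The Teenager (cheaper than The Olympic Athlete)
print(route("Biology", 95))   # The Surgeon (the only one above the bar)
```

Under this policy the top-ranked instrument is chosen only when it is also the cheapest one that is good enough, which is the point of the "Cheap ≠ best" note.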

Silicon's hard wall

Every LLM scores 1 or 2 on Physical and Logistics. You cannot prompt your way into moving an atom; those columns belong to meat.