The Human Variable: Why AI Benchmarks Measure the Wrong Thing
The Same Model, Different Operator

Every benchmark for AI systems assumes the human is irrelevant. MMLU scores don't account for who asked the question. HumanEval doesn't measure whether the programmer giving instructions has 30 years of experience or 30 days. ARC-AGI treats the operator as a constant, a neutral interface that submits a prompt…