Testing AI Like a Person: Beyond Benchmarks and Leaderboards
What This Covers Standard AI benchmarks (MMLU, HumanEval, ARC-AGI) measure capability on isolated tasks. They do not measure coherence over time, identity under pressure, epistemic honesty, or whether the system self-corrects without being prompted. Behavioral evaluation fills this gap by testing what happens when you treat an AI system like a person rather than a…