Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

ArXiv cs.AI · Thu, 14 May 2026 04:00:00 GMT

arXiv:2605.12673v1 (Announce Type: new)

Abstract: Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges s
