
RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

ArXiv cs.AI · Wed, 10 Jun 2026 04:00:00 GMT

arXiv:2606.10254v1
Announce Type: new

Abstract: While Large Language Models (LLMs) have achieved near-perfect performance in solving high-school mathematics, their ability to evaluate the diverse reasoning processes of real human students remains under-examined. To
