DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

ArXiv cs.AI · Thu, 18 Jun 2026 04:00:00 GMT

arXiv:2606.18557v1 · Announce Type: new

Abstract: A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case ov
