OSGuard: A Benchmark for Safety in Computer-Use Agents

ArXiv cs.AI · Tue, 16 Jun 2026 04:00:00 GMT

arXiv:2606.15034v1 (Announce Type: new)

Abstract: Computer-use agents are increasingly evaluated by whether they complete realistic desktop and web tasks. However, task success alone can miss failures in which an agent reaches the nominal goal through an unsafe shortcut. We introdu…