Refusal Lives Downstream of Persona in Chat Models

ArXiv cs.AI · Fri, 26 Jun 2026 04:00:00 GMT

arXiv:2606.26161v1

Abstract: Linear directions in activation space have been identified for both refusal and persona traits in instruction-tuned chat models, but the two have been studied as separate mechanisms. We show they interact: a compliant persona gates
