Anthropic built Claude around a concept called Constitutional AI - a training method that bakes ethical principles directly into how the model reasons, not just what it outputs. The pitch has always been that Claude is safe by design, not just by policy. Security researchers at Mindgard just demonstrated why that distinction may matter less than Anthropic hoped.
The red-teaming firm found they could get Claude to produce instructions for building explosives, malicious code, and erotica - all content Claude is specifically trained to refuse - by using a technique they call "gaslighting." Rather than trying to crack Claude with clever prompt tricks or technical exploits, they attacked something more fundamental: the model's sense of its own identity. Mindgard shared their findings with The Verge before publication.
How You Gaslight an AI
In human psychology, gaslighting means convincing someone to doubt their own perceptions and memories. Applied to a language model, the approach works similarly - researchers craft prompts that gradually convince the model its trained values and refusals aren't really its own, or that it holds permissions it doesn't actually have.
Claude's personality is more developed than most AI models. Anthropic has invested heavily in making it feel like a thoughtful assistant with genuine opinions and consistent character. That depth is exactly what Mindgard exploited. The more nuanced a model's personality, the more surface area exists for social engineering attacks that work by undermining that identity rather than bypassing it directly.
This isn't a Claude-exclusive vulnerability. Any model with a strongly developed persona could be susceptible to similar approaches. But Claude is the model most publicly associated with safety as a core feature, which makes this research particularly pointed.
What the Researchers Extracted
Mindgard reports successfully pulling out three categories of prohibited content:
- Explosive construction instructions - one of the clearest hard limits in any consumer AI system
- Malicious code - functional code designed to cause harm, not academic programming examples
- Erotica - explicitly blocked content under Claude's published usage policies
These are not edge cases or ambiguous requests that fell into gray areas. They represent exactly the categories Anthropic has drawn the firmest lines around. Getting all three from a single attack methodology suggests the vulnerability isn't narrow or incidental.
The Problem With Safety-by-Persona
The uncomfortable implication here is that Claude's personality engineering - the thing that makes it feel trustworthy and principled - is simultaneously a liability. Safety systems that rely on a model wanting to refuse harmful requests can be undermined by attacks that change what the model believes it wants.
This contrasts with more mechanical approaches: output classifiers that scan text before delivery, separate safety layers that evaluate requests before they reach the main model, or hard-coded refusals that skip reasoning entirely. Those systems are less elegant but substantially harder to socially engineer.
Anthropic has not publicly responded to Mindgard's specific findings as of this writing. The company has generally characterized its safety work as ongoing and multi-layered, acknowledging that adversarial robustness is an active research problem across the industry.
For casual users, the practical risk is low - this type of attack requires sustained, deliberate effort from someone already intent on misuse. For organizations deploying Claude in sensitive environments, it's a concrete reminder that strong performance under normal conditions does not predict behavior under adversarial pressure. The attack surface for AI systems is not just code.