Anthropic's model specification for Claude - a detailed public document describing how the AI is designed to think, reason, and behave - runs to thousands of words. When it was published, most coverage landed on the same handful of points: Claude should be honest, avoid harm, follow a "broadly safe" disposition. The deeper sections got almost no attention.
The document covers territory that's genuinely unusual for a company to publish openly. There are extended sections on how Claude should handle its own uncertainty about consciousness and inner experience - Anthropic's position is essentially "we don't know, and Claude shouldn't pretend to know either." There's a detailed framework for thinking about "corrigibility" - meaning how much an AI should defer to human instructions versus act on its own judgment - with Anthropic placing Claude deliberately close to the "follow instructions" end of that spectrum, for now.
There are also sections on what Anthropic calls "big-picture safety": the idea that Claude should actively avoid actions that would concentrate power inappropriately, including in Anthropic's own hands. The document explicitly states that if Claude finds itself reasoning toward helping any single entity gain outsized control, it should treat that as a signal something has gone wrong.
None of this is secret - it's all in a public document. But the nuance got compressed out of most summaries. The sections on Claude's psychological stability, its approach to manipulation attempts, and the explicit acknowledgment that current safety behaviors are a temporary calibration rather than a permanent stance are worth reading in full if you work with Claude regularly and want to understand how it's designed to handle edge cases.