Token budgets in AI agent code have always been awkward to enforce. Model providers like Anthropic and OpenAI let you set account-level spending caps, but those caps are retrospective: they tell you after a run that you burned through your budget, not while it's happening. A new open-source library called Tokencap fills that gap.
Tokencap works at the application layer. It either wraps your AI client or patches an orchestration framework such as LangChain, CrewAI, or AutoGen, then intercepts API calls as they happen and applies budget rules in real time. Setup is two lines:
import anthropic
import tokencap

# Wrap the client directly
client = tokencap.wrap(anthropic.Anthropic(), limit=50_000)

# Or patch LangChain, CrewAI, or AutoGen
tokencap.patch(limit=50_000)
When an agent approaches its limit, Tokencap triggers configurable actions at different thresholds. It can warn you, degrade (silently swap to a cheaper, smaller model to keep the run alive), or block further calls entirely. The degrade option is the most useful in practice: instead of crashing an expensive multi-step agent run, it keeps things moving by routing the remaining steps to a lower-cost model.
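The threshold behavior described above can be sketched in a few lines of plain Python. This illustrates the idea only and is not Tokencap's implementation or API: the `Budget` class, the 80% and 90% thresholds, and the model names passed to `choose_model` are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical thresholds, as fractions of the token limit (not Tokencap defaults).
WARN_AT = 0.80
DEGRADE_AT = 0.90


@dataclass
class Budget:
    """Tracks tokens used against a fixed limit."""
    limit: int
    used: int = 0

    def record(self, tokens: int) -> None:
        """Add the tokens consumed by one completed call."""
        self.used += tokens

    def action(self) -> str:
        """Return the action to apply before the next call."""
        fraction = self.used / self.limit
        if fraction >= 1.0:
            return "block"
        if fraction >= DEGRADE_AT:
            return "degrade"
        if fraction >= WARN_AT:
            return "warn"
        return "allow"


def choose_model(budget: Budget, primary: str, fallback: str) -> str:
    """Pick the model for the next call, or raise once the budget is spent."""
    action = budget.action()
    if action == "block":
        raise RuntimeError(f"token budget exhausted: {budget.used}/{budget.limit}")
    if action == "degrade":
        return fallback  # keep the run alive on the cheaper model
    if action == "warn":
        print(f"warning: {budget.used}/{budget.limit} tokens used")
    return primary
```

With a limit of 1,000 tokens, `choose_model` returns the primary model until 900 tokens have been recorded, switches to the fallback from there, and raises once 1,000 is reached.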
Anyone who's run agentic workflows in production has probably watched token counts spiral. A coding agent that spawns sub-agents or a research pipeline that expands its own search will consume tokens fast and unpredictably. Writing a custom interceptor to handle this is boilerplate that most teams skip. Tokencap does it with a clean, two-line API.
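For a sense of what that boilerplate involves, here is a minimal sketch of a hand-rolled interceptor. It is not Tokencap's code: `BudgetedCalls`, `BudgetExceeded`, and the `fake_send` stand-in are hypothetical names, and the sketch assumes each response exposes `usage.input_tokens` and `usage.output_tokens`, as non-streaming responses from Anthropic's Python client do.

```python
from types import SimpleNamespace
from typing import Any, Callable


class BudgetExceeded(RuntimeError):
    """Raised when a call is attempted after the limit has been reached."""


class BudgetedCalls:
    """Wraps a send function and counts the tokens each response reports."""

    def __init__(self, send: Callable[..., Any], limit: int) -> None:
        self._send = send
        self.limit = limit
        self.used = 0

    def __call__(self, **kwargs: Any) -> Any:
        # Check before the call so an exhausted budget never reaches the API.
        if self.used >= self.limit:
            raise BudgetExceeded(f"{self.used}/{self.limit} tokens used")
        response = self._send(**kwargs)
        # Assumes the response carries a usage object with these two fields.
        self.used += response.usage.input_tokens + response.usage.output_tokens
        return response


def fake_send(**kwargs: Any) -> SimpleNamespace:
    """Stand-in for a real API call; always reports 30 + 20 tokens."""
    return SimpleNamespace(usage=SimpleNamespace(input_tokens=30, output_tokens=20))


send = BudgetedCalls(fake_send, limit=100)
send(prompt="first")   # used: 50
send(prompt="second")  # used: 100
# A third call would raise BudgetExceeded before reaching fake_send.
```

Even this version leaves out streaming, retries, budgets shared across sub-agents, and the warn and degrade thresholds, which is where the boilerplate grows. The check also runs only before each call, so a single large call can overshoot the limit.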
The project is open source on GitHub. It currently supports Anthropic's Python client directly and patches the major Python orchestration frameworks.