Valid SQL that runs without errors but returns the wrong data is one of the nastiest bugs in AI-assisted development. The query executes, no red text appears, and you move on, not realizing your results are silently wrong.
The team at Dekart, a geospatial visualization platform, ran into exactly this problem while building a Claude Code Skill that generates BigQuery SQL from natural language prompts. Their queries would compile and execute cleanly, but coordinates got truncated, bounding box filters missed data, and row counts came back incomplete. Every bug was invisible at the syntax level.
Their solution is an open evaluation framework called the Agent Skills Eval Spec, built around a simple idea: don't check how the SQL was written, check whether the final answer is right.
For a test case querying London borough boundaries, the key assertion isn't "does this SQL use the correct JOIN syntax." It's "does the total area equal roughly 1,577 square kilometers, plus or minus 2 percent." That single numeric check catches coordinate errors, missing rows, incorrect filters, and truncation bugs all at once.
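A minimal sketch of what such a tolerance check could look like in Python. The function and constant names are hypothetical and not taken from the Agent Skills Eval Spec; only the 1,577 square kilometer target and the 2 percent tolerance come from the example above.

```python
# Hypothetical outcome assertion: names here are illustrative, not part of the spec.
EXPECTED_AREA_KM2 = 1577.0   # target total area from the London borough example
REL_TOLERANCE = 0.02         # plus or minus 2 percent


def within_tolerance(actual: float, expected: float, rel_tol: float) -> bool:
    """Return True when actual lies within rel_tol (a fraction) of expected."""
    return abs(actual - expected) <= rel_tol * abs(expected)


def check_total_area(actual_km2: float) -> bool:
    """Grade a query result by the number it produced, not by how the SQL was written."""
    return within_tolerance(actual_km2, EXPECTED_AREA_KM2, REL_TOLERANCE)
```

With these numbers the accepted band is about 1,545 to 1,609 square kilometers, so a result that silently dropped a few boroughs or truncated coordinates would likely fall outside it and fail.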
The framework runs in two phases. First, Claude generates SQL and reasoning in stream-json mode. Then the same session resumes to grade assertions against live database results. Each test case produces a full paper trail: the generated output, assertion pass/fail results with evidence, and a complete event log.
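To make the paper trail concrete, here is a rough sketch of a per-test-case record holding the three artifacts described above. This is an assumed layout for illustration only: the class and field names are hypothetical, and the actual file formats used by the framework may differ.

```python
from dataclasses import dataclass, field


@dataclass
class AssertionResult:
    """One graded assertion (hypothetical structure)."""
    description: str   # what was checked, in plain language
    passed: bool       # pass/fail outcome
    evidence: str      # the value or observation that justified the outcome


@dataclass
class CaseRecord:
    """Everything kept for one test case (hypothetical structure)."""
    case_id: str
    generated_output: str                                  # SQL and reasoning from phase one
    assertions: list[AssertionResult] = field(default_factory=list)
    events: list[dict] = field(default_factory=list)       # raw event log entries

    def summary(self) -> str:
        """Summarize how many assertions passed for this case."""
        passed = sum(1 for a in self.assertions if a.passed)
        return f"{self.case_id}: {passed}/{len(self.assertions)} assertions passed"
```

Keeping the evidence next to each pass/fail result means a failed case can be diagnosed from the record alone, without re-running the query.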
Dekart's current test suite passes 5 out of 5 assertions across its London and Paris geospatial queries. The framework is designed to be reusable beyond that specific use case and follows the open standard published at agentskills.io.
For developers using Claude or other LLMs to generate database queries, the core lesson is practical: syntax validation is not enough. If you're not checking the actual numbers that come back, you're flying blind. Outcome-based testing like this gives you direct evidence that AI-generated SQL returns the data you think it does.