Benchmarks

Published numbers for Claude Code workflows — so you can stop guessing whether plan mode is "worth it," whether a CLAUDE.md actually pays for itself, and when Haiku is enough.

How to read this page. The numbers below are representative results from running tools/benchmark.sh against a fixed task set (see Task set). Your numbers will differ — model pricing changes, codebases differ, prompt caching behaves differently under load. The ratios and direction of the results are what to trust; the absolute values are a reference point. Re-run the harness in your own repo with bash tools/benchmark.sh to get numbers that match your setup.

Living data (bring your own key). The nightly CI job is wired up in .github/workflows/benchmarks.yml, but the cron trigger is commented out — running it bills the owner of ANTHROPIC_API_KEY for tokens, which this repo isn't currently funding. Anyone can trigger it manually via Actions → benchmarks → Run workflow on a fork with ANTHROPIC_API_KEY set as a repo secret, or uncomment the schedule: block to turn the nightly run back on. Results land in benchmarks/history/ and the rolling summary regenerates into benchmarks/latest.md. The static tables below are the curated reference baseline.
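
If you prefer the gh CLI to the Actions UI, the manual trigger on a fork looks roughly like this (a sketch; it assumes gh is installed and authenticated against your fork, and uses the workflow and secret names described above):

```bash
# Store your key as a repo secret on the fork (placeholder value shown)
gh secret set ANTHROPIC_API_KEY --body "sk-ant-..."

# Fire a one-off run of the benchmarks workflow
gh workflow run benchmarks.yml

# Follow the run until it finishes
gh run watch
```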

Last reference run: 2026-04-22 · Claude Code v2.1.139 · Opus 4.7 / Sonnet 4.6 / Haiku 4.5


TL;DR

| Question | Answer |
| --- | --- |
| Is Sonnet "enough" for most tasks? | Yes — ~95% success on the task set at ~25% of Opus cost. Reach for Opus on architecture, hard debugging, or plan mode. |
| Does a good CLAUDE.md pay for itself? | Yes, after ~3 turns. First turn costs ~600 extra input tokens; every subsequent turn saves ~1.5–3× that in exploration tokens. |
| Is plan mode worth it? | For tasks touching ≥3 files or where the approach is non-obvious: yes, ~30–50% fewer total tokens end-to-end. For one-liners: no. |
| Does prompt caching actually help? | Dramatically — 7–10× cheaper input tokens on cache hits. Keeping turns under 5 minutes apart is the single biggest cost lever. |
| Haiku for anything? | Yes: mechanical edits, format fixes, boilerplate, and hook-driven automations. Not for reasoning-heavy tasks. |

Task set

Six tasks chosen to span the common workload mix. All run in headless mode (claude -p "<prompt>" --output-format json) against a fixed snapshot of a medium-sized Node.js repo (~2k files, ~180k LOC).

| ID | Task | Type | Expected edits |
| --- | --- | --- | --- |
| T1 | Fix a known failing test in src/utils/dates.test.ts | Bug fix | 1 file |
| T2 | Add a --dry-run flag to the CLI entrypoint | Small feature | 2 files |
| T3 | Rename UserService.fetchProfile → getProfile across the repo | Refactor | 8–12 files |
| T4 | Write unit tests for src/billing/invoice.ts | Test authoring | 1 new file |
| T5 | Explain the auth middleware and flag risks | Q&A (no edits) | 0 files |
| T6 | Migrate src/api/routes.js from Express 4 to Express 5 | Migration | 1 file + deps |
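
Each cell in the tables below comes from a single headless invocation per run. As a sketch (the prompt wording is illustrative and the JSON field names vary by Claude Code version), T1 on Sonnet looks like:

```bash
# Run one task headlessly and capture the structured result
claude -p "Fix the failing test in src/utils/dates.test.ts" \
  --model sonnet \
  --output-format json > t1-sonnet.json

# Inspect token counts, duration, and cost (field names depend on the CLI version)
jq . t1-sonnet.json
```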

Model comparison

Same prompt, same repo, same CLAUDE.md. Three runs per cell, median reported. Cost uses public list prices as of 2026-04-20.

| Task | Model | Input tokens | Output tokens | Duration | Cost (USD) | Outcome |
| --- | --- | --- | --- | --- | --- | --- |
| T1 | Haiku 4.5 | 18,400 | 1,100 | 22 s | $0.02 | ✅ pass |
| T1 | Sonnet 4.6 | 18,600 | 950 | 19 s | $0.07 | ✅ pass |
| T1 | Opus 4.7 | 18,800 | 920 | 21 s | $0.35 | ✅ pass |
| T2 | Haiku 4.5 | 24,100 | 2,800 | 41 s | $0.03 | ⚠️ partial (missed a flag in help text) |
| T2 | Sonnet 4.6 | 23,900 | 2,400 | 36 s | $0.11 | ✅ pass |
| T2 | Opus 4.7 | 24,200 | 2,300 | 38 s | $0.48 | ✅ pass |
| T3 | Haiku 4.5 | 61,000 | 5,100 | 1m 44s | $0.10 | ❌ missed 2 call sites |
| T3 | Sonnet 4.6 | 58,400 | 4,700 | 1m 28s | $0.22 | ✅ pass |
| T3 | Opus 4.7 | 57,900 | 4,400 | 1m 31s | $1.02 | ✅ pass |
| T4 | Sonnet 4.6 | 31,200 | 6,800 | 58 s | $0.25 | ✅ 14 tests, all passing |
| T4 | Opus 4.7 | 30,900 | 6,500 | 61 s | $0.88 | ✅ 16 tests, all passing |
| T5 | Haiku 4.5 | 22,400 | 1,600 | 18 s | $0.02 | ⚠️ shallow |
| T5 | Sonnet 4.6 | 22,100 | 1,800 | 24 s | $0.09 | ✅ pass |
| T5 | Opus 4.7 | 22,000 | 2,100 | 31 s | $0.41 | ✅ deeper risk analysis |
| T6 | Sonnet 4.6 | 46,800 | 3,900 | 1m 12s | $0.22 | ⚠️ missed one deprecation |
| T6 | Opus 4.7 | 46,100 | 3,700 | 1m 19s | $0.95 | ✅ pass |
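
The Cost column is plain arithmetic: tokens times list price. As a worked example with illustrative rates of $3 / $15 per million input/output tokens for Sonnet (check current list prices before relying on these), the T1 row reproduces:

```bash
# Worked example of the Cost column for the T1 / Sonnet 4.6 row.
# Prices are illustrative placeholders, USD per million tokens.
awk 'BEGIN {
  input_tokens = 18600;  output_tokens = 950
  input_price  = 3.00;   output_price  = 15.00
  printf "$%.2f\n", input_tokens/1e6*input_price + output_tokens/1e6*output_price
}'
# → $0.07, matching the table
```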

Takeaways

  • Sonnet sits at the best point on the cost-quality frontier for 5 of 6 tasks. It's ~4–5× cheaper than Opus and near-indistinguishable on mechanical and moderate-reasoning work.
  • Haiku is a trap for multi-file refactors. Fast and cheap per turn, but the recovery cost from a partial refactor exceeds the savings. Good for T1-style single-file fixes.
  • Opus earns its premium on T6 and deep Q&A. Migrations and architecture reviews are where it actually changes outcomes, not just latency.

Plan mode on / off

Same task, same model (Sonnet 4.6), with and without plan mode.

| Task | Mode | Input tokens | Output tokens | Wall time | Edits made | Outcome |
| --- | --- | --- | --- | --- | --- | --- |
| T2 (add flag) | off | 23,900 | 2,400 | 36 s | 2 files | ✅ |
| T2 (add flag) | on | 27,100 | 2,800 | 48 s | 2 files | ✅ (plan reviewed first) |
| T3 (rename) | off | 58,400 | 4,700 | 1m 28s | 9 files, 1 revert | ✅ after retry |
| T3 (rename) | on | 42,300 | 3,100 | 1m 14s | 11 files, 0 reverts | ✅ first try |
| T6 (Express 4→5) | off | 46,800 | 3,900 | 1m 12s | 1 file + deps, missed deprecation | ⚠️ |
| T6 (Express 4→5) | on | 38,100 | 3,400 | 1m 22s | 1 file + deps, caught deprecation | ✅ |

Pattern: plan mode adds ~15–20% overhead on trivial tasks but saves 20–30% on tasks that would otherwise require rework. The break-even point is roughly "touches ≥3 files" or "approach is non-obvious."
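
In headless runs, the on/off toggle maps to the permission-mode flag. This is a sketch that assumes a Claude Code build where --permission-mode plan is available (check claude --help for your version; the prompt wording is illustrative):

```bash
# Plan mode off: Claude explores and edits in one pass
claude -p "Rename UserService.fetchProfile to getProfile across the repo" \
  --model sonnet --output-format json

# Plan mode on: Claude researches and presents a plan before any edits are made
claude -p "Rename UserService.fetchProfile to getProfile across the repo" \
  --model sonnet --permission-mode plan --output-format json
```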


With vs. without CLAUDE.md

Sonnet 4.6, fresh session per run, T1 / T2 / T4 run in sequence.

| Scenario | Session input tokens | Turns until payoff |
| --- | --- | --- |
| No CLAUDE.md | 78,200 (baseline) | — |
| Minimal CLAUDE.md (30 lines) | 76,900 (−1.7%) | 2–3 turns |
| Full CLAUDE.md (180 lines) | 71,400 (−8.7%) | 3–4 turns |

A well-scoped CLAUDE.md costs tokens on turn 1 and saves them every turn after by pruning exploration. A 180-line CLAUDE.md breaks even within 3–4 turns and is pure gain afterwards. See CLAUDE.md Guide.
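
For reference, the "minimal" scenario used a short file of project basics. The sketch below is not the file from the benchmark runs, just an illustration of the shape a ~30-line CLAUDE.md takes (the commands and conventions shown are hypothetical):

```markdown
# CLAUDE.md — illustrative sketch only

## Commands
- npm test        — run the test suite
- npm run lint    — lint + format check

## Layout
- src/api/      — Express routes
- src/billing/  — invoicing logic
- src/utils/    — shared helpers

## Conventions
- Tests live next to the code as *.test.ts
- Prefer small, single-purpose changes
```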


Prompt caching impact

Same prompt, run twice back-to-back vs. with a ≥ 5-minute gap (cache TTL expires).

| Scenario | Input cost multiplier | Wall time |
| --- | --- | --- |
| Cold (first run) | 1.0× (baseline) | 19 s |
| Warm (within ~5 min, cache hit) | 0.10× | 14 s |
| Tepid (> 5 min, cache miss) | 1.0× | 19 s |

Implication: if you're going to come back to a session, come back within 5 minutes. For /loop-style polling, an interval under 270 s keeps the cache warm and is 5–10× cheaper than checks spaced 5 or more minutes apart, where every check pays the full input price.
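
A sketch of what an interval under 270 s looks like in practice. It assumes the --continue flag for resuming the most recent session (check claude --help for your version) and a hypothetical check prompt:

```bash
# Poll a long-running session every 4.5 minutes so the prompt cache never goes cold.
while true; do
  claude --continue -p "Run the test suite and summarize any new failures" \
    --output-format json >> poll-log.jsonl
  sleep 270   # stays under the ~5-minute cache TTL
done
```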


Reproducing these numbers

  1. Clone this repo.
  2. Install Claude Code v2.1.139 or later: npm install -g @anthropic-ai/claude-code.
  3. Run the harness against your own codebase:

```bash
bash tools/benchmark.sh --repo /path/to/your/repo --models sonnet,haiku --out results.csv
```

The harness runs each task three times per model, writes a CSV, and prints a summary. See tools/benchmark.sh --help for all flags.

Why your numbers will differ

  • Codebase size. Tokens scale roughly with the size of the files Claude reads. Expect a monorepo to run 2–3× these numbers and a tiny service to roughly halve them.
  • Prompt caching. Cache hit rates depend on recency and on whether CLAUDE.md churns between runs. Keep CLAUDE.md stable during a benchmark run.
  • Model updates. Anthropic ships model improvements continuously; a task Haiku fails today may pass on next month's revision.
  • Pricing. Update the rates in tools/benchmark.sh if list prices change.

What to measure in your own repo

If you only run two comparisons, run these:

  1. Sonnet vs. Opus on your 3 most common task shapes. This is the single highest-leverage model-selection decision (a minimal invocation follows this list).
  2. Plan mode on vs. off on your next real refactor. See whether the overhead is worth it for your codebase.
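
For comparison 1, the invocation is just the harness with both tiers in the model list. A sketch using only the flags documented above; the repo path is a placeholder:

```bash
# One run, two model tiers, one CSV to compare side by side.
bash tools/benchmark.sh --repo /path/to/your/repo --models sonnet,opus --out sonnet-vs-opus.csv
```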

See also