LLM Cost-Per-Outcome Calculator

Most AI cost calculators show $X per million tokens. This one shows what you actually pay to finish a real developer task — like writing a unit test or auditing code — across 10 popular LLMs. The cost gaps will surprise you.

Quick example:

For Write a unit test, Gemini 3 Flash is currently the most cost-effective option, completing the outcome for $0.0007. In comparison, the high-capacity Claude 4.7 Opus costs $0.2508 per success—making it 353x more expensive for this specific workload. Adjusting the task dropdown or the retry rate slider on the left will instantly re-calculate these economic tipping points in real-time.

01 / Control Panel

Calculator Parameters

1.1x
0.5x (Optimistic)1.1x (Task Default)3.0x (Pessimistic)
Calculation Formula

Base cost is calculated per million tokens. The total outcome cost =

(in_tokens × $/M + out_tokens × $/M) × retry_rate
Loading interactive visualizations...
Loading interactive visualizations...
05 / Ground Evidence

Stack-Ranked Outcomes

Sorted by cost ascending
LLM Model Estimated TokensQuality Score Outcome Cost Token Source
Gemini 3 FlashCheapest
In: 3,800
Out: 1,200
$0.0007Aider Workload
GPT-4o mini
In: 3,900
Out: 1,400
$0.0016Aider Workload
DeepSeek V4 Pro
In: 4,000
Out: 1,850
$0.0037Aider Workload
Claude 4.5 Haiku
In: 3,800
Out: 1,300
$0.0091Aider Workload
o3-mini
In: 4,500
Out: 2,500
$0.0175Aider Workload
Gemini 3.5 Flash
In: 3,850
Out: 1,450
$0.0207Aider Workload
GPT-4o
In: 4,100
Out: 1,900
$0.0322Aider Workload
Gemini 3.1 Pro
In: 4,300
Out: 2,000
$0.0359Aider Workload
Claude 4.6 Sonnet
In: 4,000
Out: 1,800
$0.0429Aider Workload
Claude 4.7 Opus
In: 4,200
Out: 2,200
$0.2508Aider Workload
The Ground Truth Registry (⬇️ Sorted by True Cost)

What is measured: The raw underlying calculations proving the final costs. It maps the workload token counts, verified accuracy (Quality Score), and the calculated True Cost.

Verifiability & Dual Sourcing: (1) Quality Source: Click any highlighted link under Quality Score to audit the accuracy evaluation (e.g. LMSys battles or AgentNoah OWASP audits). (2) Token Source: The rightmost badge identifies the benchmark workload used to audit raw input/output token counts.

Want this methodology on your own audits?

AgentNoah BUILD uses the same 16-phase pipeline + cross-audit memory + provenance discipline shown in this calculator — except on YOUR repo, not someone else's pricing data.

Free 14-day trial · No credit card · Bring your own IDE LLM (Claude Code, Cursor, Gemini CLI, Antigravity)

Try AgentNoah free