Methodology & Data Sources — LLM Cost Calculator

Built by AgentNoah BUILD ⚡

This calculator was built end-to-end using AgentNoah's BYOL BUILD pipeline on Google Antigravity (Gemini 3.5 Flash) in one evening. Read the full case study → including the 6 fabrications the review loop caught before this site went live.

Visit AgentNoah

The Formula: Cheap ≠ Value

Standard vendor tables show pricing in simple "Dollars per Million input/output tokens". This promotes models that look extremely inexpensive (e.g. Gemini Flash or GPT-4o mini) while hiding the real economic cost of completing tasks in software development workflows.

When a model has lower quality, it makes mistakes, leading to failed builds, lint errors, or broken tests. In practice, a developer or orchestrator must retry the call (often multiple times) until a successful outcome is achieved.

Calculation Formula

Outcome Cost = [ (In_Tokens / 1,000,000) × In_Price + (Out_Tokens / 1,000,000) × Out_Price ] × Retry_Rate

Where the Retry Rate acts as a linear multiplier representing the average attempts needed to pass a verification loop (tests, compile, manual validation).

Proven Data & Source URLs

Vendor / Source	Provenance Link	Last Verified
Anthropic Claude	anthropic.com/pricing	2026-05-23
OpenAI GPT & o-series	openai.com/pricing	2026-05-23
Google Gemini	ai.google.dev/pricing	2026-05-23
DeepSeek (V4 Pro)	api-docs.deepseek.com/quick_start/pricing	2026-05-25
Aider Leaderboard	aider.chat/docs/leaderboards	2026-05-23
AgentNoah Benchmark	AgentNoah OWASP K=3 security sweep	2026-05-23

Limits + Caveats

Token estimate provenance: Cells tagged aider are calibrated to typical patterns from the Aider coding leaderboard — they are our best estimate of typical input/output token usage per model per task, NOT direct per-cell measurements (Aider does not publish per-task token counts in this format). Cells tagged agentnoah-owaspfor the security-audit task come from AgentNoah's K=3 BYOL benchmark. PRs welcome at github.com/guevae2/llm-cost-per-outcome/issues to refine specific cells with real measurements.
Token estimates vary: Real usage varies wildly depending on prompt templates, system instructions, few-shot examples, and framework overhead. The estimates are static baseline measurements.
Retry Rate is a proxy: In actual systems, failed attempts are not always complete retries. They might be smaller recovery steps, although the overall cost contribution trends similarly.
Open source contributed: Data is contributed and updated by the community. If you notice any outdated or incorrect price cells, please open an Issue.

Submit corrections or suggest models on GitHub Issues