Methodology & Provenance Data

Details on how cost per outcome is calculated, model data sources, limitations, and how to contribute.

Built by AgentNoah BUILD ⚡

This calculator was built end-to-end in one evening using AgentNoah's BYOL BUILD pipeline on Google Antigravity (Gemini 3.5 Flash). The full case study covers the 6 fabrications the review loop caught before this site went live.


The Formula: Cheap ≠ Value

Standard vendor tables quote prices in dollars per million input and output tokens. That framing favors models that look extremely inexpensive (e.g. Gemini Flash or GPT-4o mini) while hiding the real economic cost of completing tasks in software development workflows.

A lower-quality model makes more mistakes, which surface as failed builds, lint errors, or broken tests. In practice, a developer or orchestrator must retry the call (often multiple times) until it produces a successful outcome.

Calculation Formula

Outcome Cost = [ (In_Tokens / 1,000,000) × In_Price + (Out_Tokens / 1,000,000) × Out_Price ] × Retry_Rate

Here In_Price and Out_Price are in dollars per million tokens, and Retry_Rate is a linear multiplier: the average number of attempts needed to pass a verification loop (tests, compilation, manual validation).
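The formula can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the names ModelProfile, attempt_cost, and outcome_cost are hypothetical, and the prices, token counts, and retry rates are invented purely to show the arithmetic.

```python
from dataclasses import dataclass

TOKENS_PER_MILLION = 1_000_000


@dataclass(frozen=True)
class ModelProfile:
    """Per-task inputs to the formula; every value used below is hypothetical."""
    name: str
    in_price: float    # dollars per million input tokens
    out_price: float   # dollars per million output tokens
    in_tokens: int     # input tokens consumed by one attempt
    out_tokens: int    # output tokens produced by one attempt
    retry_rate: float  # average attempts needed to pass verification (>= 1)


def attempt_cost(m: ModelProfile) -> float:
    """Dollar cost of a single attempt, before retries."""
    input_cost = (m.in_tokens / TOKENS_PER_MILLION) * m.in_price
    output_cost = (m.out_tokens / TOKENS_PER_MILLION) * m.out_price
    return input_cost + output_cost


def outcome_cost(m: ModelProfile) -> float:
    """Outcome Cost = single-attempt cost x Retry_Rate."""
    if m.retry_rate < 1:
        raise ValueError("retry_rate must be at least 1 (one attempt minimum)")
    return attempt_cost(m) * m.retry_rate


# Hypothetical figures chosen only to illustrate the arithmetic.
cheap = ModelProfile("cheap-model", 0.50, 2.00, 20_000, 4_000, 6.0)
strong = ModelProfile("strong-model", 1.50, 6.00, 20_000, 4_000, 1.5)

for m in (cheap, strong):
    print(f"{m.name}: ${attempt_cost(m):.3f} per attempt, "
          f"${outcome_cost(m):.3f} per outcome")
```

With these made-up inputs, the cheap model costs $0.018 per attempt against $0.054 for the strong one, yet its outcome cost is $0.108 against $0.081 once retries are counted. Real results depend entirely on the measured token counts and retry rates.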

Provenance Data & Source URLs

Vendor / Source       | Provenance Link                           | Last Verified
Anthropic Claude      | anthropic.com/pricing                     | 2026-05-23
OpenAI GPT & o-series | openai.com/pricing                        | 2026-05-23
Google Gemini         | ai.google.dev/pricing                     | 2026-05-23
DeepSeek (V4 Pro)     | api-docs.deepseek.com/quick_start/pricing | 2026-05-25
Aider Leaderboard     | aider.chat/docs/leaderboards              | 2026-05-23
AgentNoah Benchmark   | AgentNoah OWASP K=3 security sweep        | 2026-05-23

Limits + Caveats

  • Token estimate provenance: Cells tagged aider are calibrated to typical patterns from the Aider coding leaderboard. They are our best estimate of typical input/output token usage per model per task, NOT direct per-cell measurements (Aider does not publish per-task token counts in this format). Cells tagged agentnoah-owasp for the security-audit task come from AgentNoah's K=3 BYOL benchmark. PRs welcome at github.com/guevae2/llm-cost-per-outcome/issues to refine specific cells with real measurements.
  • Token estimates vary: Real usage varies widely with prompt templates, system instructions, few-shot examples, and framework overhead. The estimates are static baselines, not live measurements of your workload.
  • Retry Rate is a proxy: In real systems, a failed attempt is not always followed by a complete retry. It may trigger a smaller recovery step instead, although the overall cost contribution trends similarly.
  • Community-contributed data: Data is contributed and updated by the community. If you notice any outdated or incorrect price cells, please open an Issue.
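To make the Retry Rate caveat concrete, here is a rough sketch of how the linear multiplier could overstate cost when follow-up attempts are smaller recovery steps. The recovery_fraction parameter is a hypothetical knob that is not part of the calculator: 1.0 reproduces the linear formula, and lower values assume each follow-up attempt costs only that fraction of a full attempt.

```python
def outcome_cost_with_recovery(
    single_attempt_cost: float,
    retry_rate: float,
    recovery_fraction: float = 1.0,
) -> float:
    """Cost per outcome when follow-up attempts cost a fraction of a full one.

    recovery_fraction is an illustrative assumption, not a calculator input:
    1.0 means every retry is a full re-run (the linear formula), 0.5 means
    each follow-up attempt costs half of a full attempt.
    """
    if retry_rate < 1:
        raise ValueError("retry_rate must be at least 1")
    if not 0.0 <= recovery_fraction <= 1.0:
        raise ValueError("recovery_fraction must be between 0 and 1")
    extra_attempts = retry_rate - 1.0  # attempts beyond the first
    return single_attempt_cost * (1.0 + extra_attempts * recovery_fraction)


# Hypothetical inputs: $0.018 per full attempt, 6 attempts on average.
full_retries = outcome_cost_with_recovery(0.018, 6.0)        # linear formula
half_cost_retries = outcome_cost_with_recovery(0.018, 6.0, 0.5)
print(f"full retries: ${full_retries:.3f}, half-cost recovery: ${half_cost_retries:.3f}")
```

Under these invented numbers the linear formula gives $0.108 while half-cost recovery steps give $0.063, so the calculator's figures are best read as an upper-leaning estimate wherever recovery steps are cheaper than full retries.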