01 - Pricing Basics

Every AI-powered application needs to answer the same fundamental question: how many credits does this request cost? Without a consistent pricing foundation, costs become opaque - different engineers hard-code different rates in different places, audit trails vanish, and changing your pricing requires a code deployment.

ducto's PricingEngine solves this by letting you define pricing formulas as simple math expressions in configuration, completely separate from your application code. Think of it like a restaurant bill: each item - tokens, tools, search queries - has its own line-item cost, and the total is the sum of those items. The UsageMetrics class bundles all the inputs (token counts, tool calls, etc.) into a single object that the engine evaluates against your configured formulas.

This separation of concerns is a core design principle. Formulas live in a database table or a Python dictionary, not scattered across your codebase. Changing how gpt-4o is priced means updating one formula string, not hunting down every place that calculates costs. This makes your pricing auditable, testable, and adjustable.
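The idea that formulas are data rather than code can be illustrated without ducto at all. The sketch below is only an illustration: the JSON text stands in for a database row or config file, and the names PRICING_JSON, load_pricing, and pricing_config are hypothetical, not part of ducto.

```python
import json

# Hypothetical stand-in for a row in a pricing table or a config file on disk.
PRICING_JSON = """
{
  "models": {
    "gpt-4o": "input_tokens * 5 + output_tokens * 15",
    "claude-sonnet-4": "input_tokens * 3 + output_tokens * 15"
  }
}
"""


def load_pricing(raw: str) -> dict:
    """Parse pricing formulas from JSON text into a plain dictionary."""
    return json.loads(raw)


pricing_config = load_pricing(PRICING_JSON)

# Repricing gpt-4o is a data change: one formula string is replaced,
# and no cost-calculation code is touched.
pricing_config["models"]["gpt-4o"] = "input_tokens * 4 + output_tokens * 12"
print(pricing_config["models"]["gpt-4o"])
```

A dictionary shaped like this is the kind of input that PricingEngine.from_dict() is shown consuming later in this notebook.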

The PricingEngine is also completely stateless - it performs pure computation without any storage or database dependency. Give it a formula set and usage metrics, and it returns the same cost every time. This makes it trivially testable and safe to call from anywhere in your application.

In this notebook, we will walk through four common pricing scenarios: a basic token-only call (the most common pattern), a call with tool invocations, a call with search or RAG operations, and a call that benefits from the LLM provider's context caching discount. Each scenario demonstrates how the same engine handles different metric combinations.

Setup

Before we can calculate any costs, we need to import the core ducto classes that make up the pricing pipeline. Each import serves a distinct purpose.

PricingEngine is the calculator - it takes input variables (token counts, tool calls, search queries) and evaluates them against formulas defined in configuration. UsageMetrics is the data object that bundles all those input variables together. ToolCall is a simple named type used when the engine needs to include per-tool costs in its calculation.

# PricingEngine: the core calculator that evaluates math expressions against usage metrics.
# It parses formula strings into safe AST trees at construction time.
from ducto.engine import PricingEngine

# UsageMetrics: bundles all input variables that pricing formulas reference.
# Each field maps to a variable name available in the expression syntax.
from ducto.metrics import UsageMetrics, ToolCall

Static config via from_dict

The PricingEngine accepts configuration as a Python dictionary with four optional sections. Each section targets a different dimension of AI application costs.

The models section defines per-model token pricing. Each model name maps to a math expression string. The engine parses these strings using a safe AST-based evaluator - not Python's eval() - so expressions are sandboxed and cannot access the filesystem or network.
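To make the difference from eval() concrete, here is a minimal sketch of an AST-based arithmetic evaluator built on Python's standard ast module. It is not ducto's implementation, and the name safe_eval is hypothetical. It only shows the general technique: parse the string, then walk the tree and allow nothing beyond numbers, known variable names, and basic arithmetic.

```python
import ast
import operator

# Whitelisted binary and unary operators; any other syntax is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_eval(expression: str, variables: dict):
    """Evaluate an arithmetic expression using only whitelisted AST nodes."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant):
            # Accept plain numbers only (bool is a subclass of int, so exclude it).
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
            raise ValueError(f"unsupported constant: {node.value!r}")
        if isinstance(node, ast.Name):
            if node.id not in variables:
                raise ValueError(f"unknown variable: {node.id}")
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Function calls, attribute access, subscripts, etc. all end up here.
        raise ValueError(f"unsupported syntax: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))


result = safe_eval(
    "input_tokens * 5 + output_tokens * 15",
    {"input_tokens": 500, "output_tokens": 200},
)
print(result)  # 5500

try:
    safe_eval("__import__('os').getcwd()", {})
except ValueError as exc:
    print(f"rejected: {exc}")
```

Because function calls and attribute access are never on the whitelist, an expression that tries to reach the filesystem or network fails before anything is executed.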

Available metric variables in expressions:

  • input_tokens and output_tokens: The prompt and completion token counts from an LLM call. These are the most commonly used variables since every model call produces both input and output.
  • cache_read_tokens and cache_write_tokens: Tokens served from or written to an LLM provider's context cache. These are non-zero only when your application uses prompt caching.
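  • tool_calls: The number of tool invocations the model makes. This is non-zero when your request includes tool definitions and the model decides to call them. In the examples below, UsageMetrics receives these as a list of ToolCall objects, while formulas use tool_calls as a count.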
  • search_queries and search_results: The number of search queries issued and results processed. These are non-zero in RAG (Retrieval-Augmented Generation) or web-search-augmented generation flows.
  • web_search_calls and code_exec_calls: Web search API calls and code execution sandbox invocations. These are used by agentic applications that have browsing or code-running capabilities.
  • fixed_job: A special variable for fixed-cost operations that don't scale with token count.
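One way to picture how the variables listed above reach a formula is as a flat name-to-number mapping built from the metrics object. The sketch below uses a hypothetical MetricsSketch dataclass, not ducto's UsageMetrics, and covers only some of the variables. It assumes that unset metrics default to zero and that a list of tool calls is exposed to formulas as its length.

```python
from dataclasses import dataclass, field


@dataclass
class MetricsSketch:
    """Hypothetical stand-in for a usage-metrics object (not ducto's UsageMetrics)."""

    model: str
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0
    tool_calls: list = field(default_factory=list)
    search_queries: int = 0
    search_results: int = 0

    def as_variables(self) -> dict:
        """Flatten the metrics into the name -> number mapping a formula sees."""
        return {
            "input_tokens": self.input_tokens,
            "output_tokens": self.output_tokens,
            "cache_read_tokens": self.cache_read_tokens,
            "cache_write_tokens": self.cache_write_tokens,
            # Assumption: a list of tool calls is exposed to formulas as a count.
            "tool_calls": len(self.tool_calls),
            "search_queries": self.search_queries,
            "search_results": self.search_results,
        }


metrics = MetricsSketch(
    model="gpt-4o", input_tokens=500, output_tokens=200, tool_calls=["code_exec"]
)
variables = metrics.as_variables()
print(variables["tool_calls"])      # 1
print(variables["search_queries"])  # 0
```

Defaulting every unused metric to zero is what lets a single formula set serve both a tokens-only call and a call with tools or search: the terms that do not apply contribute nothing.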

The tools, search, and cache sections follow the same pattern: each key maps to a formula using the appropriate variables. Once the dictionary is assembled, pass it to PricingEngine.from_dict() to build the engine.

# Build a pricing configuration dictionary with four sections.
# Each section's formula uses variable names that UsageMetrics understands.

# "models": per-model token pricing. Keyed by model name, valued as an expression string.
# gpt-4o: input tokens cost 5 credits each, output tokens cost 15 credits each.
# claude-sonnet-4: input 3 credits per token, output 15 credits per token.
# claude-haiku-3.5: input 1 credit per token, output 4 credits per token.
config = {
    "models": {
        "gpt-4o": "input_tokens * 5 + output_tokens * 15",
        "claude-sonnet-4": "input_tokens * 3 + output_tokens * 15",
        "claude-haiku-3.5": "input_tokens * 1 + output_tokens * 4",
    },
    # "tools": per-tool-call cost added on top of the model token cost.
    "tools": {"code_exec": "tool_calls * 50"},
    # "search": cost for RAG or search-augmented generation operations.
    "search": {"costs": "search_queries * 10 + search_results * 1"},
    # "cache": discount applied for LLM context cache usage.
    "cache": {"discount": "cache_read_tokens * 1 + cache_write_tokens * 5"},
}

# Build the engine: from_dict() parses all formulas into internal AST trees.
# No database or storage is needed at this stage — pure computation.
engine = PricingEngine.from_dict(config)

# Inspect what was registered via the pricing schema.
schema = engine.pricing_schema()
print(f"Engine ready — {len(schema.models)} models registered (gpt-4o, claude-sonnet-4, claude-haiku-3.5)")

Basic call (tokens only)

The simplest and most common pricing scenario is a pure chat completion: no tools, no search, no caching. We provide just the model name, input token count, and output token count. The engine returns a CreditCost object with a detailed breakdown of each cost component.

In this case, we ask gpt-4o to process 500 input tokens and 200 output tokens. The formula is input_tokens * 5 + output_tokens * 15, which gives us 500 * 5 + 200 * 15 = 2,500 + 3,000 = 5,500 credits total. No tool credits or search credits are added because we did not specify those metrics.

# Tokens-only is the most common case: a simple chat completion with no extras.
# We provide the model name, input token count, and output token count.
# The engine matches the model name to its pricing formula and evaluates it.
cost = engine.calculate(UsageMetrics(
    model="gpt-4o", input_tokens=500, output_tokens=200,
))

# The CreditCost object has separate fields for each cost component.
# model_credits: the cost from the model's token formula
# tool_credits: the cost from any tool invocations (zero here since none were specified)
# total: the sum of all cost components
print(f" Model: {cost.model_credits} (500x5 + 200x15 = 2,500 + 3,000)")
print(f" Tools: {cost.tool_credits}")  # Zero because no tools were called
print(f" Total: {cost.total}")
assert cost.total == 5500  # 500*5 + 200*15 = 2,500 + 3,000 = 5,500

With tool calls

Many AI applications go beyond simple chat by giving the model tools to call, such as code execution, database queries, or external API invocations. Each tool invocation adds cost on top of the token consumption.

In this scenario, we use claude-sonnet-4 with 1,000 input tokens and 400 output tokens, plus a single code_exec tool call. The engine evaluates two separate formulas: the model formula for tokens (1,000 * 3 + 400 * 15 = 3,000 + 6,000 = 9,000 credits) and the tool formula for the invocation (1 * 50 = 50 credits). The total is 9,050 credits.

# Adding tools shows how pricing stacks when using multiple dimensions.
# The engine applies the model formula AND the tool formula, then sums them.
cost = engine.calculate(UsageMetrics(
    model="claude-sonnet-4", input_tokens=1000, output_tokens=400,
    tool_calls=[ToolCall(name="code_exec")],  # Single tool invocation
))
print(f" Model: {cost.model_credits} Tools: {cost.tool_credits} Total: {cost.total}")
# Expected: model = 1,000*3 + 400*15 = 3,000 + 6,000 = 9,000; tool = 1*50 = 50; total = 9,050
assert cost.total == 9050

With search / RAG

Applications that augment LLM calls with external knowledge retrieval add another cost dimension. Search queries and result processing each incur their own charges.

In a RAG (Retrieval-Augmented Generation) flow, the application typically issues one or more search queries, retrieves multiple results, and feeds those results into the LLM context. The search cost metric tracks both the number of queries issued and the number of results processed.

In this example, we use gpt-4o with moderate token counts (200 input, 50 output), plus 3 search queries that return 45 total results. The engine applies both the model formula and the search formula.

# Search/RAG flow: the application issues search queries and processes results.
# The engine applies the search formula from config in addition to the model formula.
cost = engine.calculate(UsageMetrics(
    model="gpt-4o", input_tokens=200, output_tokens=50,
    search_queries=3, search_results=45,  # Search/RAG metrics
))
print(f" Model: {cost.model_credits} Search: {cost.search_credits} Total: {cost.total}")
# Expected: model = 200*5 + 50*15 = 1,000 + 750 = 1,750; search = 3*10 + 45*1 = 30 + 45 = 75; total = 1,825
assert cost.total == 1825

With cache discount

LLM providers offer context caching to reduce costs when the same conversation prefix is reused across multiple requests. ducto models this as a separate cache section in the pricing config, which produces a discount (a reduction in total credits).

In this scenario, we use claude-haiku-3.5 with 300 input tokens and 100 output tokens, plus 200 cache read tokens and 50 cache write tokens. The cache savings are calculated from the cache formula and subtracted from the total.

# Cache discount: the engine applies savings (negative cost) based on cache usage.
# This models how LLM providers charge less for cache hits versus cache writes.
cost = engine.calculate(UsageMetrics(
    model="claude-haiku-3.5", input_tokens=300, output_tokens=100,
    cache_read_tokens=200, cache_write_tokens=50,  # Context caching metrics
))
# The cache_savings field reflects the discount from the cache formula (positive value = amount saved).
print(f" Model: {cost.model_credits} Cache: {cost.cache_savings} Total: {cost.total}")
# Inspect every component cost through the breakdown dictionary.
# The breakdown includes model, tools, search, and cache entries separately.
print(f"Breakdown keys: {list(cost.breakdown.keys())}")
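Unlike the earlier scenarios, this one does not spell out the expected numbers. They can be worked out by hand from the configured formulas, as in the plain-Python sketch below. It assumes, as the text above describes, that the cache formula's result is subtracted from the model credits. Whether ducto rounds, clamps, or otherwise adjusts the result is not covered here, so treat these figures as a cross-check against the printed output rather than a guaranteed result.

```python
# Metrics from the cache scenario above.
input_tokens, output_tokens = 300, 100
cache_read_tokens, cache_write_tokens = 200, 50

# claude-haiku-3.5 model formula: input_tokens * 1 + output_tokens * 4
model_credits = input_tokens * 1 + output_tokens * 4

# Cache formula: cache_read_tokens * 1 + cache_write_tokens * 5
cache_savings = cache_read_tokens * 1 + cache_write_tokens * 5

# Assumption: the savings are subtracted from the model credits.
expected_total = model_credits - cache_savings

print(model_credits, cache_savings, expected_total)  # 700 450 250
```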