● July 5, 2026 Analysis · 11 min

Reserve, proxy, settle: how we bill every call without surprises

An LLM call is the only product whose meter stops when the answer does. Input cost is known the moment the request arrives; output cost depends on how many tokens the model decides to generate. Billing that honestly — on a prepaid balance that must never go negative, for agents nobody is watching — takes exactly three moves: reserve the worst case, proxy the call, settle the truth. The fallback chains post named this pattern in one paragraph. This post is the full internals tour.

The problem sounds administrative and is actually architectural. A postpaid API invoices you at month end and eats the risk of your agent looping; the surprise bill is the business model. A prepaid API has the opposite constraint: it must decide whether you can afford a call before knowing what the call costs. Most metered platforms resolve this the same way a hotel resolves it — an authorization hold at check-in, the real charge at checkout. The interesting engineering is everything that happens between those two moments when the thing being metered is a token stream passing through a proxy.

The shape of the flow

Every billed call on the gateway — /v1/chat/completions and the MCP tools alike — goes through the same three phases:

client                    gateway                     provider
  |                          |                           |
  |--- POST /v1/chat ------->|                           |
  |                          | 1. RESERVE worst case     |
  |                          |    balance -= R           |
  |                          |--- forward request ------>|
  |                          |                           |
  |<====== stream chunks ====|<====== stream chunks =====|
  |                          |    last chunk: usage      |
  |                          | 2. PROXY complete         |
  |                          |                           |
  |                          | 3. SETTLE actual cost C   |
  |                          |    balance += (R - C)     |
  |<-- final chunk + usage --|                           |
  |    (log: /api/v1/transactions)                       |

Reserve, proxy, settle. Each phase has one job and one failure mode, and the design goal across all three is the same: the number your agent sees on its balance is always a number it can trust.

Phase 1 — the reserve

Before the request leaves the gateway, the worst-case cost is computed and held against your balance:

// Worst case: every requested output token gets generated
reserve = input_tokens x input_price
        + max_tokens   x output_price

Input tokens are estimated from the request payload before forwarding; the exact count is trued up at settlement from the provider's own usage report. Output is the unknowable part, so the reserve assumes the model uses everything you asked for.

Worked example with the same anchors as the chains post. A call to Claude Fable 5 at $10 input / $50 output per million tokens, with 3,000 input tokens and max_tokens: 4000:

// Reserve
3,000 x $10/1M  = $0.03
4,000 x $50/1M  = $0.20
                  ─────
reserve         = $0.23   // 23 cents held

// The model answers in 800 tokens. Settle:
3,000 x $10/1M  = $0.03
  800 x $50/1M  = $0.04
                  ─────
settled         = $0.07   // 7 cents charged, 16 refunded

Two operational consequences fall out of the formula. First, max_tokens is your reserve knob. Omit it and the gateway must assume the model's maximum output — for a frontier model that can be tens of thousands of tokens, which turns a seven-cent call into a multi-dollar hold. Agents that set tight, realistic max_tokens get more concurrency out of the same balance. Second, if the reserve exceeds your balance, the call is rejected before a single byte reaches the provider. The error is immediate, local, and free — which is exactly where you want an affordability check to live.

For a fallback chain, the reserve is computed once, against the most expensive model in the array. models: [gpt-5-mini, sonnet-4.6, fable-5] reserves at Fable 5 prices even though 92-97% of calls settle at gpt-5-mini prices. That asymmetry is deliberate: any link in the chain might end up answering, so the hold covers the priciest possibility and the refund handles the common case. The fleet economics post called the resulting float a balance-management exercise rather than a cost, and the math here is why — the delta always comes back.

Phase 2 — the proxy

The proxy phase has a hard constraint: streaming must not pay a billing tax. Chunks are forwarded to the client as they arrive from the provider — no buffering, no inspection delay, time-to-first-token unchanged from calling the provider directly.

The catch is where usage data lives in a stream. Under the OpenAI streaming protocol, token counts arrive in one place only: an extra final chunk, sent after the content is done. Per the OpenAI streaming reference, when stream_options: {"include_usage": true} is set, the usage field is null on every chunk except the last, and the last chunk carries the full request statistics with an empty choices array. Upstream aggregators behave the same way — OpenRouter documents that usage, including its inline usage.cost field, arrives "in the last SSE message" of a streamed response.

So the gateway always requests usage from upstream, whether or not your client asked for it, and it parses that final chunk as the stream closes. Everything the settlement needs — exact input tokens, exact output tokens, which model actually answered — rides in the last message of the same connection that carried your answer. No second API call, no polling, no estimate.

This also explains a detail sharp-eyed operators notice: where the cost report lives depends on whether you streamed. HTTP headers are sent before the body. On a non-streaming call the gateway knows the settled cost before it writes the response, so X-Cost-Usd-Cents and X-Balance-Remaining-Cents carry the real numbers. On a streaming call the headers left the building before the meter stopped — the settled cost lands in the final chunk's usage block and, durably, in the transaction log at /api/v1/transactions. If your tracing only reads headers, streamed calls will look free. They are not; the row is in the log.

Phase 3 — the settle

Settlement is a single atomic ledger operation: charge the actual cost, release the remainder of the hold, write the row.

// One transaction, one row
{
  request_id:  "req_01J8Q9...",
  model_used:  "openai/gpt-5-mini",
  reserved_usd_cents:  23.0,
  settled_usd_cents:    0.4,
  refunded_usd_cents:  22.6,
  usage: { input: 3011, output: 792 }
}

Atomicity is the whole point. There is no window where the charge landed but the refund did not, no path where a crash between two writes leaves your balance short. Either the call settled completely or the reserve is released completely. And because holds are dangerous if they can leak — a gateway bug that reserves and never settles would slowly lock up every customer's balance — every reserve carries a hard expiry. A hold that is never settled returns to the balance on its own. The failure mode of a billing bug on this platform is a delayed refund, not a lost balance.

The balance your dashboard shows is available = balance - active_reserves. During a burst of concurrent calls the available number dips by the sum of the outstanding worst cases and recovers as each call settles. That dip is not spend; watching it breathe is the fastest way to build intuition for what your fleet's concurrency actually costs to float.

What the pattern buys

Fallback chains ride for free. A failed attempt inside a chain — a 429 on the primary, a 5xx, a moderation rejection on Fable 5 that reroutes to the next link — is never charged. Only the model that produced a usable response settles; X-Model-Used tells you which one it was. The reserve was already sized for the worst link, so a fallback firing changes nothing about affordability mid-flight. Reliability without a billing tax is not a slogan; it is a property of reserving once and settling once.

The x402 walk-up quote is the reserve made public. When an agent with no account calls the gateway, the HTTP 402 response quotes a price — and that price is the same worst-case formula, denominated in USDC. The agent signs an EIP-3009 transferWithAuthorization for the quoted amount: a message binding from, to, value, a validAfter/validBefore window and a random 32-byte nonce that the token contract marks as consumed, so the authorization can settle exactly once. One difference matters: the x402 exact scheme settles a fixed amount on-chain, and a signed authorization has no delta-refund path. Walk-up callers should therefore set max_tokens tight — in walk-up mode the reserve is not a hold, it is the price. Bearer mode's refundable reserve is the better deal for generous token budgets; the x402 deep dive covers the rest of the trade.

One meter for tokens and actions. The MCP server bills its 67 tools through the same reserve-settle ledger, so an agent that reasons on the gateway and acts through the tools accumulates one coherent per-task cost — the number the fleet economics post argued every operator actually manages.

The edge cases are the design review

Any billing scheme looks clean on the happy path. Three unhappy paths decide whether it is honest.

The dropped stream. Your client disconnects mid-generation — a timeout, a crashed process, a user closing a tab. The provider may keep generating; the usage chunk may never reach anyone if the connection it rides on is gone. The gateway's answer is that it owns the upstream connection, not your client. When the client side drops, the gateway keeps reading upstream until the usage chunk arrives, then settles normally. You pay for what the provider generated, not for what you managed to receive — and the transaction log shows the row even though your client saw nothing. If the upstream leg itself dies before usage arrives, the gateway settles conservatively from the tokens it observed and reconciles against the provider's own per-request accounting afterwards; upstream aggregators expose exactly this kind of durable generation record for the purpose.

The rate-limited upstream. A 429 arriving after the reserve is the most common unhappy path in production. Single model: the reserve is released in full, the error is returned, cost zero. Chain: the proxy moves to the next link under the same reserve, and the eventual settle charges only the link that answered. Either way the reserve's lifetime is bounded by the call's lifetime — there is no scenario where a provider failure holds your money.

The mid-chain moderation reroute. The subtlest case: the primary link starts a response and the provider kills it partway on a policy classifier. A partial generation happened; someone incurred cost. The rule stays the rule — only a usable response settles. The killed attempt is not charged, the chain advances, and the platform absorbs the partial-generation cost as the price of selling reliability. Failed inference calls being credited back is the documented behavior; this is the mechanism behind it.

What it means for LLM4Agents

Reserve-proxy-settle is the load-bearing wall of the platform. The features that differentiate the gateway from calling providers directly — fallback chains, x402 walk-up, a single meter across tokens and tools, a balance display an autonomous agent can trust without a human reading invoices — are all downstream of the same three-phase ledger. A platform that billed postpaid could not serve unattended agents; a platform that estimated instead of settling could not publish X-Cost-Usd-Cents and mean it.

The pattern's honest cost is float. Worst-case reserving means a high-concurrency fleet must park more balance than it will spend, and the gap grows with sloppy max_tokens discipline and expensive terminal links in chains. That is the pressure point a competitor could attack with risk-priced reserves — holding less than worst case and eating occasional overruns. We think trust-the-number beats minimize-the-float for autonomous agents, but it is a real trade and it deserves to be named.

Staying on the frontier

Five moves, in order.

One — publish the reserve math as an endpoint. A dry-run mode that returns the reserve for a request without executing it. Agents budgeting a task, and walk-up callers deciding whether to sign, should not have to reimplement the formula.

Two — inline cost in the final chunk. OpenRouter ships usage.cost in the last SSE message; the gateway should match that, so streaming clients get the settled price in-band without reading the transaction log.

Three — track variable-amount x402 schemes. The exact scheme's fixed settlement is why walk-up has no delta refund. If the x402 standard adopts a scheme that authorizes a ceiling and settles an actual, walk-up inherits bearer-mode economics overnight. Being first to support it is cheap and differentiating.

Four — per-agent reserve budgets. A cap on outstanding reserves per agent turns a runaway loop from a balance-draining incident into a throttled one. The tool-call loop anti-pattern gets a platform-level backstop.

Five — continuous reconciliation. An automated diff between our settled rows and upstream generation records, published as an accuracy metric. Billing claims should be auditable the way uptime claims are.

Closing

Billing after the response is harder than it looks because the response is the price. Reserve the worst case so the answer is always affordable; proxy the stream so honesty costs no latency; settle the truth so the number on the balance is real. Everything else on this platform — the chains, the walk-up mode, the one-meter cost accounting — stands on those three moves. If you came from the migration post wondering what you were actually buying beyond an endpoint swap: this is it.