Claude Extended Thinking: budget_tokens Billed as Output Tokens (Node.js SDK)
The first time I turned on Claude extended thinking for a real agent, the run went from 4 seconds to 47. The output was better. The bill was worse. That tradeoff is the whole story.
Claude extended thinking lets Opus or Sonnet produce a block of visible reasoning tokens before the final answer. You give it a budget, it spends that budget thinking, and you pay for every thinking token at the output rate. The upside is measurable quality gains on multi-step problems. The downside is latency and cost that scale with the budget you set.
My verdict after shipping this on agent loops, code generation, and planning tasks: default off, enable selectively. Extended thinking is a power tool, not a universal upgrade. This post walks through what it does, what it costs, and the exact task shapes where the budget pays for itself.
How are Claude’s extended thinking tokens billed, and what does budget_tokens control?
Claude’s extended thinking tokens are billed as output tokens at the output rate. The budget_tokens parameter caps how many thinking tokens the model may spend before answering. On Opus 4.7, a 10,000-token budget costs roughly $0.75 per request before the final answer cost. Actual usage typically lands at 40 to 90 percent of the cap depending on task difficulty.
What extended thinking actually does
When you call the Claude API with a thinking parameter, the model generates a thinking content block before the normal text block. You see the reasoning. So does the model, on the next turn if you keep it in the history.
The API shape is minimal:
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
const response = await client.messages.create({
model: "claude-opus-4-7",
max_tokens: 16000,
thinking: { type: "enabled", budget_tokens: 10000 },
messages: [
{ role: "user", content: "Refactor this function to be pure without changing its signature: ..." }
]
});
for (const block of response.content) {
if (block.type === "thinking") {
console.log("reasoning:", block.thinking);
} else if (block.type === "text") {
console.log("answer:", block.text);
}
}
Three constraints to know up front:
- Opus and Sonnet only. Haiku does not support thinking. If you need reasoning at Haiku prices, you are out of luck.
- Thinking tokens are billed as output tokens. A 10,000-token budget on Opus is a real line item.
- `budget_tokens` must be less than `max_tokens`. The thinking budget is carved out of your output allocation.
The model can stop early. If it finishes reasoning in 2,400 tokens, you pay for 2,400, not 10,000. The budget is a cap, not a target. In practice Opus uses between 40 and 90 percent of the budget on tasks that actually need it, and almost none on tasks that do not.
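The usage object does not report thinking tokens as a separate field (they are folded into `output_tokens`), so a rough way to see how much of the budget a response actually spent is to approximate from the thinking block itself. A sketch, assuming the usual ~4-characters-per-token rule of thumb; the helper name is mine:

```typescript
// Hypothetical helper: approximate thinking-token usage from the thinking
// block's text. ~4 characters per token is a rough heuristic, not an exact
// count, but it is good enough to decide whether a budget is over- or
// under-sized.
type ContentBlock = { type: string; thinking?: string; text?: string };

function estimateThinkingTokens(content: ContentBlock[]): number {
  const block = content.find((b) => b.type === "thinking");
  return block?.thinking ? Math.round(block.thinking.length / 4) : 0;
}
```

If the estimate sits near the `budget_tokens` cap, the model was likely cut off mid-reasoning; if it is consistently under half the cap, the budget can come down.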
The cost of thinking
Output tokens on Opus are roughly 5x the input price. That ratio is why thinking budgets matter.
Here is what a single request looks like at each common tier, assuming Opus 4.7 at current pricing (output $75 per million tokens):
| Budget tier | Thinking tokens | Cost per request | Typical use |
|---|---|---|---|
| Light | 1,024 | ~$0.08 | Quick disambiguation, small plans |
| Medium | 5,000 | ~$0.38 | Single-hop reasoning, short code gen |
| Heavy | 16,000 | ~$1.20 | Multi-step planning, complex refactors |
| Max | 64,000 | ~$4.80 | Research-grade analysis, architectural decisions |
That is per request, before the final answer’s output tokens. On Sonnet 4.6 the numbers are about one-fifth of Opus, which is why a lot of production thinking setups run Sonnet even when the team defaults to Opus for non-thinking work.
If you are doing 10,000 requests a day with a 10k budget on Opus, you are spending $7,500 a day just on thinking. For a customer-facing feature, that math does not work. For a once-a-day architectural planning agent, it is trivial.
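That arithmetic is easy to fold into a quick estimator before shipping a budget change. A minimal sketch, assuming the Opus output price of $75 per million tokens quoted above (the helper name is mine):

```typescript
// Hypothetical helper: worst-case daily thinking spend for a request path,
// at Opus output pricing of $75 per million tokens.
// utilization < 1 models the fact that the budget is a cap, not a target.
function dailyThinkingCostUSD(
  requestsPerDay: number,
  budgetTokens: number,
  utilization = 1.0
): number {
  const OPUS_OUTPUT_USD_PER_MILLION = 75;
  return (
    (requestsPerDay * budgetTokens * utilization * OPUS_OUTPUT_USD_PER_MILLION) /
    1_000_000
  );
}

console.log(dailyThinkingCostUSD(10_000, 10_000));      // 7500: the 10k x 10k worst case
console.log(dailyThinkingCostUSD(10_000, 10_000, 0.6)); // 4500 at 60% typical utilization
```

Running the estimate at 40 and 90 percent utilization brackets the realistic bill, since that is the usage range observed on tasks that actually need the budget.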
The other hidden cost is latency. A 10k-token think takes roughly 15 to 30 seconds on Opus, depending on load. A 64k think can run over a minute. Your p99 is no longer measured in seconds.
Production cost diary: 4 of my own running agents
Rather than another spend-curve in the abstract, here is how I actually set the thinking budget across the agents I run on my own VPS. Cost estimates use Anthropic’s published Opus 4.7 pricing ($15/M input, $75/M output) against measured prompt sizes. Your numbers will differ based on prompt shape, but the direction is the lesson.
| Agent | Cadence | Task shape | budget_tokens | ~Cost/run | Why this setting |
|---|---|---|---|---|---|
| `morning-briefing` | Daily, 365/yr | Read TickTick + calendar, produce a 3-paragraph Telegram briefing | 0 (off) | ~$0.19 | Pure summarization. No decision cascade. Thinking would add ~$0.30/run for no measurable lift. |
| `agenda-followups` | Weekdays, 250/yr | Track ongoing client threads, draft follow-up messages | 0 (off) | ~$0.12 | Template-based generation. Decision surface is low; thinking would pay for nothing. |
| `weekly-planning` | Weekly, 52/yr | Read whole week of TickTick + notes + last week’s report, produce structured plan with priority calls | 16,000 | ~$1.30 | Multi-hop synthesis across many decisions. ~$0.90 extra × 52 weeks = $47/yr for visibly better prioritization. Worth it. |
| `telegram-bot` (interactive) | Live, 50-200/day | User chat with tool access from phone | 0 (off) | $0.05-$0.20 | Interactive. Even a 2k thinking budget pushes p50 latency past what feels alive. Thinking goes off when humans wait. |
Pattern: of four agents, only the once-a-week planning step earns thinking. The interactive ones leave it off because of latency. The summarization ones leave it off because there’s no cascading wrong choice to prevent.
If I switched all four agents to “thinking on at 8k”, the monthly bill would jump roughly 4x for, generously, a 5-10 percent quality lift on three of them. That math is what kills the “default everything to thinking” instinct.
Workload archetype matrix: where each budget tier pays for itself
| Workload archetype | Recommended budget | Why this tier |
|---|---|---|
| Interactive chat (user is waiting) | 0 (off) | Latency hit > quality lift. p99 kills UX. |
| Bulk classification (>1k/day) | 0 (off) | Per-item cost dominates; small accuracy hit is acceptable |
| Summarization / extraction | 0 (off) | Model already knows the task. No reasoning surface. |
| Single-tool agent step | 0-2k | Light disambiguation helps; more is waste |
| Multi-tool agent planning step | 4k-8k | Wrong tool selection cascades. Budget here, off elsewhere. |
| Code generation under constraints | 8k-16k | Constraint-tracking (“don’t change the signature”) is exactly what thinking handles |
| Multi-hop reasoning over structured data | 8k-16k | Hypothesis tracking beats one-shot inference |
| Architectural / design decisions | 16k-32k | Once-a-week cadence makes cost trivial; quality matters most |
| Customer support automation | 0 (off) | Volume × thinking cost is brutal. Use Sonnet without thinking instead. |
| Agentic research / deep analysis | 32k-64k | Rare but high-value. Thinking budget is the cheap part of the bill. |
The pattern that reliably emerges: thinking is a pricing decision more than an engineering one. The right question is rarely “would thinking help here” (almost always yes, marginally) but “is the marginal quality lift worth the marginal cost given how often this runs”.
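In code, that pricing-first framing collapses to a lookup table plus a guard. A sketch mirroring the matrix above; the archetype names and the map itself are mine, not an SDK feature:

```typescript
// Hypothetical budget map mirroring the archetype matrix above.
// Budgets are the midpoints-or-floors of the recommended ranges.
const BUDGET_BY_ARCHETYPE = {
  interactive_chat: 0,
  bulk_classification: 0,
  summarization: 0,
  single_tool_step: 2_000,
  multi_tool_planning: 8_000,
  constrained_codegen: 16_000,
  multi_hop_reasoning: 16_000,
  architecture: 32_000,
  deep_research: 64_000,
} as const;

type Archetype = keyof typeof BUDGET_BY_ARCHETYPE;

// Returns the `thinking` request parameter, or undefined to omit it entirely.
function thinkingFor(archetype: Archetype) {
  const budget = BUDGET_BY_ARCHETYPE[archetype];
  return budget > 0
    ? { type: "enabled" as const, budget_tokens: budget }
    : undefined;
}
```

Keeping every budget in one map also makes thinking-enabled call sites greppable, which pays off later when auditing a cost spike.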
Tasks where thinking earns its keep
I have run thinking on and off across several production systems. The pattern is clear: it helps where one wrong step cascades, and it wastes money everywhere else.
Agentic loops with tool use. When an agent has to pick between five tools, each with different parameter shapes, a thinking block before the tool call reduces wrong-tool picks and parameter hallucinations. My Graffiti profiling pipeline calls three sequential tools per customer, and enabling a 4k thinking budget on the planning step cut retry rate by a visible margin. The thinking cost is small because it only runs once per agent session, not per tool call.
Code generation with constraints. “Refactor this function to be pure without changing the type signature” is exactly the kind of problem where Opus without thinking will sometimes rewrite the signature anyway. With thinking enabled it notices the constraint, reasons about which lines violate it, and produces output that passes the original tests. This is my strongest use case for thinking.
Multi-hop reasoning over structured data. If the question is “given this JSON of 40 customer events, which user is most likely to churn and why”, thinking helps. The model walks through the events, forms hypotheses, rejects some, and commits. Without thinking, it tends to latch onto the first signal.
Complex planning. Building an agent plan, writing a migration strategy, or designing an API contract. Anywhere the output needs internal consistency across 10+ decisions. I run my weekly planning cron with a 16k thinking budget on Opus, once a week. Cost is negligible at that cadence.
Tasks where it’s pure overhead
These are the ones where I have turned thinking back off after measuring:
Customer-facing chat. The latency kills the feel. A 15-second wait with no streaming makes users think the service is down. Even with streaming (more on that below), the time-to-first-visible-text is too long for any interactive UX.
Summarization and extraction. “Summarize this email in three bullets” does not need reasoning. The model already knows how to summarize. You are paying extra tokens to watch it think about a task it would get right on the first try.
High-volume classification. If you are labeling 100k support tickets a day, the cost per item matters more than the quality bump from thinking. Run Haiku without thinking, or Sonnet without thinking, and accept the small accuracy hit.
Simple retrieval and formatting. “Pull the total from this invoice” or “convert this markdown to HTML” has no reasoning surface area. Thinking adds cost and zero quality.
The pattern: if the task has one obvious path, thinking is waste. If the task has decision points where picking wrong costs real money downstream, the budget pays off.
Streaming thinking to users
When you stream a response with thinking enabled, the thinking block comes first, token by token, then the final text block. You have three choices:
- Hide the thinking entirely. Show a “thinking…” spinner. User waits. Works for background jobs, not interactive UIs.
- Show the thinking live. Render the reasoning as it streams. This is what Claude.ai does in its UI. Feels transparent and sometimes educational, but most users do not want to read 10,000 tokens of reasoning.
- Summarize and show a progress pulse. Stream the thinking into a collapsed panel, show a one-line “analyzing inputs… considering tradeoffs…” summary. This is the best UX I have found for production apps.
The SDK gives you thinking_delta events in the stream, separate from text_delta. Route them to different UI surfaces:
const stream = await client.messages.stream({
model: "claude-opus-4-7",
max_tokens: 16000,
thinking: { type: "enabled", budget_tokens: 8000 },
messages: [{ role: "user", content: prompt }]
});
for await (const event of stream) {
if (event.type === "content_block_delta") {
if (event.delta.type === "thinking_delta") {
renderReasoningPanel(event.delta.thinking);
} else if (event.delta.type === "text_delta") {
renderAnswer(event.delta.text);
}
}
}
One rule I follow: never show thinking output verbatim to end users in a professional context. It is raw, sometimes rambles, and can reveal system prompt details. Summarize or hide it.
How thinking interacts with other features
Prompt caching. Thinking output is not cacheable (it is generated fresh each turn), but once a thinking block is in the message history, it counts as input for the next turn and can be cached like any other input. If you are keeping a conversation going, the previous turn’s thinking becomes cached context. See Claude API prompt caching for the full caching model.
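If you carry the conversation forward, one way to lean on that is to tag the end of the reusable prefix (prior thinking block included) with a `cache_control: { type: "ephemeral" }` breakpoint before the next call. A sketch under those assumptions; the helper is mine and the block shapes are simplified:

```typescript
// Simplified content-block shape; the real SDK types are richer.
type Block = { type: string; [key: string]: unknown };

// Hypothetical helper: mark the last block of the conversation prefix with a
// cache breakpoint so the next turn can read the prefix (thinking included)
// from the prompt cache instead of paying full input price.
function withCacheBreakpoint(blocks: Block[]): Block[] {
  return blocks.map((block, i) =>
    i === blocks.length - 1
      ? { ...block, cache_control: { type: "ephemeral" } }
      : block
  );
}
```

Verify the win by watching `cache_read_input_tokens` in the usage object across turns.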
Tool use. Thinking happens before tool calls. The model reasons, then decides which tool to invoke. This is where thinking shines for agents, because the reasoning influences tool selection. The Claude Code SDK agents pattern uses this exact combination: thinking for planning, tool calls for execution.
Structured output. Do not combine thinking with prefill-based JSON extraction. The thinking block will break your prefill expectations. Use the tool-use pattern instead: define a tool with your JSON schema, let the model think, then have it call the tool. See Claude API structured output for why tool use beats prefill when thinking is in play.
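A sketch of that tool-use extraction pattern; the tool name and schema are illustrative stand-ins for your real shape, and the request body is shown as a plain object rather than a live call:

```typescript
// Illustrative extraction tool. The model thinks first, then emits a
// tool_use block whose input already conforms to this schema - no prefill.
const recordAssessment = {
  name: "record_churn_assessment",
  description: "Record the final churn assessment as structured data",
  input_schema: {
    type: "object",
    properties: {
      customer_id: { type: "string" },
      risk: { type: "number", description: "0 to 1" },
      reasons: { type: "array", items: { type: "string" } },
    },
    required: ["customer_id", "risk", "reasons"],
  },
} as const;

// Request body sketch: thinking and the schema-bearing tool together.
const request = {
  model: "claude-opus-4-7",
  max_tokens: 16_000,
  thinking: { type: "enabled", budget_tokens: 8_000 },
  tools: [recordAssessment],
  messages: [
    {
      role: "user",
      content:
        "Assess churn risk from these events and record it with the tool: ...",
    },
  ],
};
```

The final structured payload then comes out of the `tool_use` block's `input`, not from parsing free text.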
An agent loop that uses thinking selectively
Here is the pattern I run in production. Thinking is enabled only for the planning step, not for each tool execution. That keeps cost bounded and latency acceptable.
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
const tools = [
{
name: "fetch_customer_events",
description: "Fetch recent events for a customer by ID",
input_schema: {
type: "object",
properties: { customer_id: { type: "string" } },
required: ["customer_id"]
}
},
{
name: "score_churn_risk",
description: "Score a customer's churn risk given their event history",
input_schema: {
type: "object",
properties: { events: { type: "array", items: { type: "object" } } },
required: ["events"]
}
}
];
async function runAgent(userPrompt: string) {
const messages: Anthropic.MessageParam[] = [
{ role: "user", content: userPrompt }
];
// First turn: thinking enabled for planning
let response = await client.messages.create({
model: "claude-opus-4-7",
max_tokens: 16000,
thinking: { type: "enabled", budget_tokens: 4000 },
tools,
messages
});
// Subsequent turns: no thinking, just tool execution
while (response.stop_reason === "tool_use") {
const toolUse = response.content.find(b => b.type === "tool_use");
if (!toolUse || toolUse.type !== "tool_use") break;
const toolResult = await executeTool(toolUse.name, toolUse.input);
messages.push({ role: "assistant", content: response.content });
messages.push({
role: "user",
content: [{ type: "tool_result", tool_use_id: toolUse.id, content: toolResult }]
});
response = await client.messages.create({
model: "claude-opus-4-7",
max_tokens: 4000,
tools,
messages
});
}
return response.content.find(b => b.type === "text");
}
async function executeTool(name: string, input: unknown): Promise<string> {
// your real tool dispatch
return JSON.stringify({ ok: true });
}
The first call costs more (thinking budget plus planning output). Every follow-up is cheap and fast because thinking is off for tool dispatch. I have seen this pattern cut total session cost by 60 percent versus naive “thinking on every turn” setups, while keeping the quality benefit where it matters.
When to enable it
Here is my decision flow:
Is the task interactive (user is waiting)?
Yes > thinking off. No > continue.
Does a wrong step cost real money downstream?
Yes > enable thinking with a small budget (2k to 8k) and measure. No > thinking off.
Are you doing more than 1,000 of these a day?
Yes > measure the cost delta carefully before shipping. No > budget freely, it does not matter at low volume.
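The same flow, written down so it can live in a code review rather than a doc. A sketch, not a policy engine; the thresholds are the ones above:

```typescript
interface TaskProfile {
  interactive: boolean;     // is a user actively waiting on the response?
  wrongStepCostly: boolean; // does a bad choice cascade downstream?
  requestsPerDay: number;
}

// Hypothetical encoding of the decision flow above. Returns a budget_tokens
// value, where 0 means "leave thinking off".
function chooseThinkingBudget(task: TaskProfile): number {
  if (task.interactive) return 0;      // user is waiting: latency wins
  if (!task.wrongStepCostly) return 0; // nothing cascades: skip the spend
  if (task.requestsPerDay > 1_000) {
    // High volume: start at the bottom of the 2k-8k range and measure the
    // cost delta carefully before shipping anything larger.
    return 2_000;
  }
  return 4_000; // low volume: the standard starting budget
}
```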
Start at 4,000 tokens. Go up only if you see the model hitting the budget ceiling (check for `stop_reason: "max_tokens"` with a truncated thinking block). Go down if it consistently uses less than half.
Do not default-enable thinking across your whole application. The bill will surprise you, the p99 latency will degrade, and for most tasks the quality gain is not there. Pick the two or three steps in your system where a cascading wrong choice is expensive, turn it on there, and leave it off everywhere else.
Debugging extended thinking in production
Five issues I’ve hit on real client projects, with the fix for each.
1. budget_tokens is being silently ignored
Symptom: you set budget_tokens: 8000 and the response contains no thinking block.
Three causes, in order of likelihood:
- You’re calling Haiku. Haiku does not support extended thinking. The API silently strips the parameter. Switch to Sonnet or Opus.
- `budget_tokens >= max_tokens`. The thinking budget must be strictly less than `max_tokens`. The SDK throws a clear error in newer versions; older versions just don’t think.
- You sent `thinking: { type: "disabled" }` somewhere upstream and it got merged. Log the request body before sending, not the config object.
2. Cost spike out of nowhere
Symptom: the monthly Anthropic bill jumps 3-5x without any traffic increase.
Check token usage in the response’s usage object. There is no separate thinking field; thinking tokens are billed and counted inside output_tokens:
console.log({
input: response.usage.input_tokens,
output: response.usage.output_tokens, // thinking tokens land here
cached: response.usage.cache_read_input_tokens
});
The most common cause is a budget that started at 4k for testing and never came down, applied to a high-volume request path that doesn’t actually need thinking. Audit which call sites have thinking enabled and whether they should.
3. stop_reason: "max_tokens" on the thinking block
The model hit your thinking budget mid-reasoning. The final answer is often degraded because the model didn’t finish its plan. Two responses:
- If quality is acceptable, leave the budget where it is. The model adapts.
- If quality drops noticeably, raise the budget by 50 percent and re-test. Don’t double it; thinking budgets have diminishing returns past a certain point per task type.
4. Thinking blocks leaking to end users
Symptom: a customer screenshot shows raw thinking text in your UI.
Thinking blocks come back as type: "thinking" content blocks. If your frontend renders all blocks generically, raw reasoning shows up. The fix is one filter:
const userVisible = response.content.filter(block => block.type === "text");
Render only text blocks to users. Surface thinking in admin or debug views if you want the trace, but never in the customer-facing path.
5. Streaming thinking deltas in the wrong order
Symptom: the streamed response looks scrambled — partial answer text mixed with reasoning fragments.
Thinking and text blocks stream as distinct events (thinking_delta, text_delta). If your stream handler treats them as one channel, you get garbage. Maintain separate buffers for thinking and text, render text live, and either hide thinking or surface it behind a “show reasoning” toggle. The Anthropic SDK’s typed event handlers make this routine; it only goes wrong when teams write their own SSE parser.
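A minimal two-buffer router over those deltas, with the event shapes simplified to the two types in play:

```typescript
// Simplified delta shapes; the real SDK events carry more fields.
type StreamDelta =
  | { type: "thinking_delta"; thinking: string }
  | { type: "text_delta"; text: string };

// Sketch: accumulate the two channels separately so the answer never
// interleaves with reasoning fragments. Render `text` live; gate `thinking`
// behind a toggle or drop it.
function routeDeltas(deltas: StreamDelta[]): { thinking: string; text: string } {
  let thinking = "";
  let text = "";
  for (const delta of deltas) {
    if (delta.type === "thinking_delta") thinking += delta.thinking;
    else text += delta.text;
  }
  return { thinking, text };
}
```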
What changed in 2026
Two updates worth flagging if you wrote thinking-enabled code earlier:
- Caching of thinking blocks. Earlier turns’ thinking content can now enter the prompt cache on subsequent turns, which materially cuts cost for multi-turn agents. Confirm your SDK version is current and check `cache_read_input_tokens` in the usage object.
- Sonnet 4.6 thinking quality. The Sonnet 4.6 release narrowed the quality gap with Opus on thinking-heavy tasks while keeping the 5x price advantage. For most production thinking workloads, Sonnet 4.6 is now the right default; reserve Opus for the genuinely hard cases.