If your app sends the same long system prompt or the same big document on every call, prompt caching is the single most valuable optimization. Cache hits cost about 10% of the un-cached price. On document-heavy workloads, this turns a $1,000/month bill into $100.
"cache_control": {"type": "ephemeral"} on that content block.messages = [{
"role":"user",
"content":[
{"type":"text",
"text": LONG_SYSTEM_PROMPT,
"cache_control":{"type":"ephemeral"}},
{"type":"text", "text": "Now answer my question: …"}
]
}]
You can mark up to four cache breakpoints per request. Stack them from most-stable to least:
The user's current question goes after all four — un-cached.
cache_creation_input_tokens and cache_read_input_tokens.