Prompt Caching — Claude on WholeTech

01 — Why it matters

Your biggest cost lever.

If your app sends the same long system prompt or the same big document on every call, prompt caching is the single most valuable optimization. Cache hits cost about 10% of the un-cached price. On document-heavy workloads, this turns a $1,000/month bill into $100.

02 — How it works

Mark a block.

Identify the repeated content — system prompt, large reference doc, fixed instruction set.
Add a cache breakpoint: "cache_control": {"type": "ephemeral"} on that content block.
Send the call. The block is hashed; subsequent calls within the cache lifetime hit the cache.
Cache lifetime is currently 5 minutes by default; longer durations are available on some plans.

messages = [{
  "role":"user",
  "content":[
    {"type":"text",
     "text": LONG_SYSTEM_PROMPT,
     "cache_control":{"type":"ephemeral"}},
    {"type":"text", "text": "Now answer my question: …"}
  ]
}]

03 — When you'll see hits

Cache windows.

Bursts of related calls. User asks 10 questions about the same document — cache hits all the way down.
Pipelines that fire repeatedly. A daily news job that runs many calls with the same instructions — cache between successive calls in the run.
Multi-turn agents. Earlier turns become the cached prefix for later turns.

If your latency between calls exceeds the cache TTL, you won't see hits. Either pace your calls, or — for longer durations — request extended cache lifetimes from Anthropic.

04 — The cache ladder

Where to put breakpoints.

You can mark up to four cache breakpoints per request. Stack them from most-stable to least:

System prompt — almost never changes per call.
Tool definitions — change rarely.
Reference documents — same document across many calls in a session.
Conversation history — accumulating prefix.

The user's current question goes after all four — un-cached.

05 — Watch the metrics

Don't fly blind.

Every response carries usage fields including cache_creation_input_tokens and cache_read_input_tokens.
Build a dashboard that tracks cache hit ratio and blended cost per call. Two of the three best optimization wins of the past year for this network came from pushing this ratio up.
Don't change the cached prefix unnecessarily — even one whitespace difference invalidates the cache. Templating discipline matters.

Related products.

Every Anthropic surface has its own page on this guide.

Claude Codeagent in your terminal Claude APIbuild on top Agent SDKbuild agents MCPtool protocol Skillsslash commands Artifactslive previews Projectsworkspaces Files APIupload · cite Memorypersistent Tool Usefunction calling Computer Usescreen control Batch API50% cheaper Citationsgrounded Extended Thinkingdeep reason Managed Agentscloud · preview Mythosforward-looking Claude Coworkagentic desktop Claude Designprototypes · slides