Claude on WholeTech network
home/products/caching
90% off repeated context

Prompt Caching shipping

Mark long, repeated context as cacheable. Cache hits cost about 10% of the un-cached price.

01 — Why it matters

Your biggest cost lever.

If your app sends the same long system prompt or the same big document on every call, prompt caching is the single most valuable optimization. Cache hits cost about 10% of the un-cached price. On document-heavy workloads, this turns a $1,000/month bill into $100.

02 — How it works

Mark a block.

  1. Identify the repeated content — system prompt, large reference doc, fixed instruction set.
  2. Add a cache breakpoint: "cache_control": {"type": "ephemeral"} on that content block.
  3. Send the call. The block is hashed; subsequent calls within the cache lifetime hit the cache.
  4. Cache lifetime is currently 5 minutes by default; longer durations are available on some plans.
messages = [{
  "role":"user",
  "content":[
    {"type":"text",
     "text": LONG_SYSTEM_PROMPT,
     "cache_control":{"type":"ephemeral"}},
    {"type":"text", "text": "Now answer my question: …"}
  ]
}]
03 — When you'll see hits

Cache windows.

If your latency between calls exceeds the cache TTL, you won't see hits. Either pace your calls, or — for longer durations — request extended cache lifetimes from Anthropic.
04 — The cache ladder

Where to put breakpoints.

You can mark up to four cache breakpoints per request. Stack them from most-stable to least:

  1. System prompt — almost never changes per call.
  2. Tool definitions — change rarely.
  3. Reference documents — same document across many calls in a session.
  4. Conversation history — accumulating prefix.

The user's current question goes after all four — un-cached.

05 — Watch the metrics

Don't fly blind.