Get prompt caching right once and cut your API bill by 60%

Test setup: Claude 3.5 Sonnet · 7.1K-token prompt · 50K requests/day · bursts of 800 requests within 5 minutes

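What prompt caching is

In short: when you keep sending the same system prompt, Anthropic caches the first N tokens of the request for 5 minutes. Any request that repeats that prefix within those 5 minutes pays only 0.1x the normal input price for the cached tokens.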

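Status	Price (per 1M tokens, claude-3-5-sonnet)
Cache miss (normal input)	$3.00
Cache write (5-minute TTL)	$3.75
Cache write (1-hour TTL, beta)	$6.00
Cache hit (either TTL)	$0.30

As multiples of the base input price:

5-minute cache: writes 1.25x, hits 0.1x
1-hour cache: writes 2x, hits 0.1x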

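Prerequisites:

The cached prefix must be at least 1024 tokens
The first N tokens of system + tools must be exactly identical on every request (not even a whitespace difference)
The cache TTL defaults to 5 minutes

As long as your system prompt is stable and your request density is high enough (at least 2 requests with the same prefix inside 5 minutes), caching saves money: the first request pays 1.25x for the write and the second pays 0.1x, so two requests cost 1.35x instead of 2x.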
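The break-even point can be checked with a few lines of arithmetic. This is a minimal sketch using the 5-minute prices from the table above; `window_cost` is a hypothetical helper, and it assumes every request after the first one inside the window is a hit.

```python
BASE = 3.00         # $ per 1M input tokens on a cache miss
CACHE_WRITE = 3.75  # $ per 1M tokens on the request that writes the cache
CACHE_HIT = 0.30    # $ per 1M tokens on a cache hit


def window_cost(prefix_tokens: int, requests: int, cached: bool) -> float:
    """Dollar cost of the shared prefix for `requests` calls inside one TTL window."""
    if requests <= 0:
        return 0.0
    if not cached:
        return prefix_tokens * requests * BASE / 1_000_000
    # One write, then every later request in the window is assumed to hit.
    return prefix_tokens * (CACHE_WRITE + (requests - 1) * CACHE_HIT) / 1_000_000


for n in (1, 2, 10):
    plain = window_cost(7_100, n, cached=False)
    cached = window_cost(7_100, n, cached=True)
    print(f"{n:>2} requests: no cache ${plain:.4f}, cached ${cached:.4f}")
```

A single request is slightly more expensive with caching (it pays the write premium and never gets a hit); from the second request onward the cached path is cheaper.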

The actual bill, before and after

My crawler agent, before and after the change:

Before (no cache)

system prompt: 7,100 tokens
requests/day: 50,000
input tokens/day: 7,100 × 50,000 = 355M tokens
cost: 355M × $3.00 / 1M = $1,065 / day

$1,065/day · about $32,000/month

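After (5-minute cache enabled)

first request (cache write): 7,100 × $3.75 / 1M = $0.027
remaining 49,999 requests (cache hits, assuming every one hits): 7,100 × $0.30 / 1M × 49,999 = $106.50
input cost/day ≈ $107

$107/day · $3,210/month

That is a 90% saving.

Real traffic is never this clean (entries expire, prompts change), but a steady-state saving of 60-80% is realistic.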
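To see how the saving depends on the hit rate, here is a rough sketch. `daily_prefix_cost` is a hypothetical helper, and it makes one simplifying assumption that is not from the measurements above: every request that misses pays the cache-write price, because a miss re-creates the entry.

```python
def daily_prefix_cost(prefix_tokens, requests_per_day, hit_rate,
                      write=3.75, hit=0.30):
    """Daily dollar cost of the cached prefix at a given request-level hit rate.

    Assumption: a miss always rewrites the cache, so it is billed at `write`.
    Prices are $ per 1M tokens.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    per_million = hit_rate * hit + (1.0 - hit_rate) * write
    return prefix_tokens * requests_per_day * per_million / 1_000_000


baseline = 7_100 * 50_000 * 3.00 / 1_000_000  # the $1,065/day uncached bill
for rate in (0.6, 0.8, 0.95):
    cost = daily_prefix_cost(7_100, 50_000, rate)
    print(f"hit rate {rate:.0%}: ${cost:,.0f}/day, saving {1 - cost / baseline:.0%}")
```

Under that assumption an 80% hit rate works out to roughly a two-thirds saving, which is in line with the 60-80% range.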

How to turn it on

Anthropic's caching is controlled by a cache_control field placed on the last block of the system / tools content you want cached.

Python SDK

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """<your 7K-token prompt here>"""

def ask(user_msg: str) -> str:
    r = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}  # 5 minutes
                # "cache_control": {"type": "ephemeral", "ttl": "1h"}  # 1 hour (needs a beta header)
            }
        ],
        messages=[{"role": "user", "content": user_msg}]
    )
    return r.content[0].text

Key point: put cache_control on the last block of the prefix. Anthropic caches everything up to and including that block.

Multiple blocks

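system=[
    {"type": "text", "text": LONG_STATIC_PROMPT},          # 7K tokens, static
    {"type": "text", "text": USER_RULES,                    # 0.5K tokens, static
     "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": DYNAMIC_SESSION_CONTEXT},     # 1K tokens, dynamic, not cached
]

Put cache_control only on the last static block. Everything up to and including it is cached. The dynamic block has to come after that breakpoint: if it sat in the middle, the prefix would change whenever the session context changes and the cache would miss.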

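⚠️ Common mistake: putting cache_control on every block. It doesn't save anything extra. Each marker is a separate breakpoint (a request allows at most 4), and a breakpoint that comes after a changing block still misses. For a single static prefix, one breakpoint at its end is enough.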
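One way to make the ordering hard to get wrong is to build the system list through a small helper. This is only a sketch: `build_system` is a hypothetical function, not part of the Anthropic SDK. It relies on the rule that dynamic content must stay out of the cached prefix, so it always emits the static blocks first, marks the last one, and appends the dynamic blocks after the breakpoint.

```python
def build_system(static_parts, dynamic_parts=()):
    """Return a `system` list with a single cache breakpoint after the static text."""
    if not static_parts:
        raise ValueError("need at least one static part to cache")
    blocks = [{"type": "text", "text": text} for text in static_parts]
    # The breakpoint goes on the last static block: everything up to and
    # including it forms the cached prefix.
    blocks[-1]["cache_control"] = {"type": "ephemeral"}
    # Dynamic text is appended after the breakpoint, so changing it
    # never alters the cached prefix.
    blocks += [{"type": "text", "text": text} for text in dynamic_parts]
    return blocks


system = build_system(
    ["<long static prompt>", "<user rules>"],
    ["<session context that changes per request>"],
)
print([("cache_control" in block) for block in system])
```

The result can be passed as the `system` argument of `client.messages.create`.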

How to confirm it really hit

Every response carries a usage field:

r = client.messages.create(...)
# On a hit, input_tokens covers only the uncached part (illustrative values):
print(r.usage)  # e.g. Usage(input_tokens=25, cache_creation_input_tokens=0, cache_read_input_tokens=7100)

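Key fields:

Field	Meaning
`input_tokens`	tokens in this request that were neither read from nor written to the cache
`cache_creation_input_tokens`	tokens newly written to the cache by this request
`cache_read_input_tokens`	tokens read from the cache by this request

If cache_read_input_tokens stays at 0, you are not hitting the cache at all, usually because the prefix doesn't line up.

A small debugging script: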

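def ask_with_log(user_msg: str):
    r = client.messages.create(...)  # same arguments as in ask() above
    u = r.usage
    # The SDK types the cache counters as optional, so default them to 0.
    hit = u.cache_read_input_tokens or 0
    written = u.cache_creation_input_tokens or 0
    miss = u.input_tokens
    # Include cache writes in the total; otherwise a write request looks like a tiny miss.
    total = hit + written + miss
    hit_pct = (hit / total) * 100 if total else 0
    print(f"hit={hit} written={written} miss={miss} hit%={hit_pct:.1f}")
    return r.content[0].text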

Run it 100 times. If the hit rate is below 80%, start investigating.
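For the "run it 100 times" check it helps to aggregate across requests rather than read individual log lines. A minimal sketch: `CacheStats` is a hypothetical helper, and the usage objects below are stand-ins built with `SimpleNamespace` carrying made-up values, not real API responses. It counts a request as a hit when it read at least one token from the cache.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class CacheStats:
    """Running totals over the `usage` objects of many responses."""
    requests: int = 0
    hits: int = 0            # requests that read at least one token from the cache
    read_tokens: int = 0
    written_tokens: int = 0

    def record(self, usage) -> None:
        # Treat a missing or None counter as 0.
        read = getattr(usage, "cache_read_input_tokens", 0) or 0
        written = getattr(usage, "cache_creation_input_tokens", 0) or 0
        self.requests += 1
        self.hits += 1 if read > 0 else 0
        self.read_tokens += read
        self.written_tokens += written

    @property
    def hit_rate(self) -> float:
        return self.hits / self.requests if self.requests else 0.0


# Stand-in usage objects (hypothetical values): one write followed by three hits.
stats = CacheStats()
stats.record(SimpleNamespace(input_tokens=20, cache_creation_input_tokens=7100,
                             cache_read_input_tokens=0))
for _ in range(3):
    stats.record(SimpleNamespace(input_tokens=20, cache_creation_input_tokens=0,
                                 cache_read_input_tokens=7100))
print(f"request-level hit rate: {stats.hit_rate:.0%}")
```

In real use, call `stats.record(r.usage)` after each `client.messages.create` call and compare `stats.hit_rate` against the 80% threshold.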

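5 common reasons for a miss

1. More than 5 minutes between requests. Two requests 6 minutes apart means the entry has already expired.

2. The prefix doesn't line up. One extra space, a stray newline, or inconsistent quotes is enough.

3. Dynamic content inside the cached prefix. For example:

system=[
    {"type": "text", "text": f"The time is now {datetime.now()} ..."}  # ❌ changes on every request
]

Move the timestamp into a block after the one carrying cache_control, or keep it out of system entirely.

4. The cached prefix is under 1024 tokens. Anthropic does not cache prefixes shorter than 1K tokens; the request simply runs uncached. If your system prompt is short (a brief customer-service prompt, say), don't bother with caching.

5. The model changed. claude-3-5-sonnet-20241022 and claude-3-5-sonnet-20240620 have separate caches. Switch models and the cache has to be rebuilt.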
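Reasons 2 and 3 are easier to track down if you fingerprint the prefix you actually send and, when two fingerprints disagree, locate the first character that differs. A small sketch using only the standard library; both helper names and the sample strings are hypothetical.

```python
import hashlib


def prefix_fingerprint(text: str) -> str:
    """Short, stable hash of the prefix; log it next to every request."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def first_difference(a: str, b: str) -> int:
    """Index of the first differing character, or -1 if the strings are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No mismatch in the shared part: either identical, or one is a prefix of the other.
    return -1 if len(a) == len(b) else min(len(a), len(b))


old = "You are a crawler agent.\nFollow the rules."
new = "You are a crawler agent. \nFollow the rules."  # one stray space before the newline
if prefix_fingerprint(old) != prefix_fingerprint(new):
    i = first_difference(old, new)
    print(f"prefix drift at char {i}: {old[i:i + 10]!r} vs {new[i:i + 10]!r}")
```

If the fingerprint changes between two requests that should have shared a cache entry, the prefix itself changed and the miss is expected.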

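How this relates to system prompt optimization

Prompt caching doesn't replace the 7 tricks from post #002. It stacks on top of them.

If your system prompt is 30K tokens (completely unoptimized), caching brings the input cost down to 0.1x, the price of 3K tokens.

But if you could have used trick 2 (JSON-ification) and trick 4 (XML tools) to squeeze the prompt down to 7K anyway, then with caching you pay the price of 700 tokens.

The two layers multiply: 1+1=3.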

My crawler agent:

Before: 30K system × 50K requests × $3.00 / 1M = $4,500/day
After: 7K system at the 0.1x cache-hit price × 50K requests = $107/day

$4,500 → $107, a 42x cost reduction.

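Is the 1-hour cache (beta) worth enabling?

With the 5-minute TTL my hit rate sits around 80% (measured). A 1-hour TTL keeps entries alive across longer gaps, so the hit rate should go up rather than down, and hits are billed at the same 0.1x. The catch is the write: 2x the base price instead of 1.25x, so every miss costs more.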

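Bottom line:

Scenario	Recommended TTL
High frequency, small volume (requests every second)	5min
Medium frequency, medium volume (1-10 per minute)	5min
Low frequency, large volume (one request every 5-10 minutes)	1h
Occasional requests	no cache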
The 1h cache needs a beta header:

client = anthropic.Anthropic(
    default_headers={"anthropic-beta": "prompt-caching-2024-07-31"}
)
# the 1h ttl needs an additional beta flag on top of this one
# check the official docs for how to enable it: https://docs.anthropic.com

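Summary checklist

Only consider caching when the system prompt is ≥ 1024 tokens
Put cache_control: {"type": "ephemeral"} on the last static block of system
The prefix must match exactly: keep dynamic content out of the cached prefix
After launch, monitor the hit rate through cache_read_input_tokens
Hit rate < 80% → go through the 5 common causes
Shrink the prompt first, then add caching on top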

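Anthropic's prompt cache was the most underrated cost optimization of 2024. Many LLM engineers in the Chinese-speaking community only discovered it in 2025. If you still aren't using it, you can ship it today.

Next post: Mavis's cron self-reminder mechanism, or how to use cron so a long-running task doesn't have to start over when the agent's session dies halfway through.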

— 怪招本 #003 · 2026-06-21