
MaaS API

openai | anthropic | google
parallel_tool_calls | disable_parallel_tool_use |
max_completion_tokens | max_tokens |
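The parameter renames above can be captured in a small translation helper. This is a sketch: the field names come from the mapping above, everything else (the helper itself, the flat request shape) is illustrative — in the real Anthropic API, `disable_parallel_tool_use` actually lives inside `tool_choice`.

```python
# Map a provider-neutral request onto provider-specific parameter names.
PARAM_MAP = {
    "openai":    {"max_tokens": "max_completion_tokens",
                  "parallel_tool_calls": "parallel_tool_calls"},
    "anthropic": {"max_tokens": "max_tokens",
                  "parallel_tool_calls": "disable_parallel_tool_use"},
}

def translate(provider, request):
    out = {}
    for key, value in request.items():
        mapped = PARAM_MAP[provider].get(key, key)
        # Anthropic expresses the flag negatively (disable_...), so invert it.
        if mapped == "disable_parallel_tool_use":
            value = not value
        out[mapped] = value
    return out

print(translate("anthropic", {"max_tokens": 1024, "parallel_tool_calls": True}))
# {'max_tokens': 1024, 'disable_parallel_tool_use': False}
```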
  • "Long-tail distribution"
  • "Burstiness"
  • "Fat tail"
  • Use a 3-sigma rule over a 15-30 min window to detect anomalies
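The 3-sigma window check can be sketched as follows (illustrative; assumes per-minute request counts or latencies as the window):

```python
from statistics import mean, stdev

def is_anomalous(window, latest, k=3.0):
    """Flag `latest` if it deviates more than k standard deviations
    from the mean of the trailing window (e.g. 15-30 one-minute buckets)."""
    mu = mean(window)
    sigma = stdev(window)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) > k * sigma

# Steady traffic: within 3 sigma, no alert.
print(is_anomalous([100, 102, 98, 101, 99, 100], 103))  # False
# Burst well outside 3 sigma: alert.
print(is_anomalous([100, 102, 98, 101, 99, 100], 180))  # True
```

With heavy-tailed ("fat tail") metrics, a plain sigma rule over-fires; the windowing limits that by comparing only against recent behavior.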

Gemini API

Multiple tools are supported only when they are all search tools

  • Built-in tools and functionDeclaration tools cannot be used together
  • Each OpenAI-style tool maps to one functionDeclaration
  • The other tools are built-in tools, with slightly different semantics
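The one-tool-per-functionDeclaration mapping can be sketched as a minimal conversion (illustrative shapes; built-in Gemini tools must be kept out of this list, per the restriction above):

```python
def openai_tools_to_gemini(tools):
    """Convert OpenAI-style function tools into a single Gemini tool
    carrying one functionDeclaration per source tool."""
    declarations = []
    for tool in tools:
        fn = tool["function"]
        declarations.append({
            "name": fn["name"],
            "description": fn.get("description", ""),
            "parameters": fn.get("parameters", {}),
        })
    # Gemini groups all declarations under one tool entry.
    return [{"functionDeclarations": declarations}]

tools = [{"type": "function",
          "function": {"name": "get_weather",
                       "description": "Get current weather",
                       "parameters": {"type": "object", "properties": {}}}}]
print(openai_tools_to_gemini(tools)[0]["functionDeclarations"][0]["name"])
# get_weather
```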

OpenAI API

streaming

first chunk

  • Some implementations, to be compact, include content in the first chunk
  • Normally the first chunk should not contain content

last chunk

  • vLLM and OpenAI send an empty content in the last chunk:
{
  "index": 0,
  "delta": {
    "content": ""
  },
  "logprobs": null,
  "finish_reason": "stop",
  "stop_reason": null
}
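A defensive consumer therefore tolerates both quirks: content in the first chunk and an empty content in the last. A sketch over already-parsed chunk dicts (no network):

```python
def accumulate(chunks):
    """Join streamed deltas, tolerating an empty final delta and a
    first chunk that may or may not carry content."""
    text, finish = [], None
    for chunk in chunks:
        choice = chunk["choices"][0]
        delta = choice.get("delta", {})
        piece = delta.get("content")
        if piece:                      # skips both None and ""
            text.append(piece)
        if choice.get("finish_reason"):
            finish = choice["finish_reason"]
    return "".join(text), finish

chunks = [
    {"choices": [{"delta": {"role": "assistant"}, "finish_reason": None}]},
    {"choices": [{"delta": {"content": "Hello"}, "finish_reason": None}]},
    {"choices": [{"delta": {"content": ""}, "finish_reason": "stop"}]},
]
print(accumulate(chunks))  # ('Hello', 'stop')
```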

ToolChoice

  • auto
    • the model decides whether to use a tool
  • required
    • a tool must be used
  • none
    • no tools are used
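These OpenAI values map onto Anthropic's tool_choice object roughly as follows (a sketch; Anthropic spells forced tool use as "any"):

```python
def to_anthropic_tool_choice(openai_choice):
    """Map OpenAI tool_choice strings to Anthropic's object form."""
    mapping = {"auto": "auto", "required": "any", "none": "none"}
    return {"type": mapping[openai_choice]}

print(to_anthropic_tool_choice("required"))  # {'type': 'any'}
```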

Thinking

{
"contents": [
{
"parts": [
{
"text": "Provide a list of 3 famous physicists and their key contributions"
}
]
}
],
"generationConfig": {
"thinkingConfig": {
"thinkingLevel": "low"
}
}
}

Interleaved thinking

Tool calls can be issued from within the thinking process

  • Claude 4+
    • interleaved-thinking-2025-05-14
    • supported only via the Messages API
  • MiniMax-M2
  • Kimi-K2-Thinking

reasoning_details

{
  "type": "reasoning.summary",
  "summary": "The model analyzed the problem by first identifying key constraints, then evaluating possible solutions...",
  "id": "reasoning-summary-1",
  "format": "anthropic-claude-v1",
  "index": 0
}
  • type
    • reasoning.summary
    • reasoning.encrypted
    • reasoning.text
  • maintains the model's thinking detail information
    • OpenAI o
    • Claude 3.7+ thinking
    • Gemini Reasoning
    • xAI Reasoning
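Handling the three reasoning_details types can be sketched with a small dispatcher (field names taken from the example above; the function itself is illustrative):

```python
def render_reasoning(detail):
    """Turn one reasoning_details block into display text.
    Encrypted blocks are opaque and must be passed back to the
    provider verbatim, never shown or modified."""
    kind = detail["type"]
    if kind == "reasoning.summary":
        return detail["summary"]
    if kind == "reasoning.text":
        return detail.get("text", "")
    if kind == "reasoning.encrypted":
        return "[encrypted reasoning: preserved for the provider]"
    raise ValueError(f"unknown reasoning detail type: {kind}")

print(render_reasoning({"type": "reasoning.summary",
                        "summary": "Checked constraints first."}))
# Checked constraints first.
```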

Preserved thinking


role

  • developer
  • system
  • user
  • assistant
  • tool
    • newer OpenAI API versions
    • Anthropic uses the user role instead
  • function
    • older OpenAI API versions
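The tool-role difference is the tricky one when converting histories: Anthropic carries tool results inside a user message. A minimal conversion sketch (shapes illustrative):

```python
def to_anthropic_message(msg):
    """Convert one OpenAI-style chat message to Anthropic's shape.
    OpenAI's 'tool' role becomes a user message holding a tool_result block."""
    if msg["role"] == "tool":
        return {"role": "user",
                "content": [{"type": "tool_result",
                             "tool_use_id": msg["tool_call_id"],
                             "content": msg["content"]}]}
    if msg["role"] in ("system", "developer"):
        # Anthropic takes system prompts as a top-level parameter, not a message.
        raise ValueError("system/developer prompts go in the 'system' parameter")
    return {"role": msg["role"], "content": msg["content"]}

out = to_anthropic_message({"role": "tool", "tool_call_id": "call_1",
                            "content": "42"})
print(out["role"], out["content"][0]["type"])  # user tool_result
```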

usage

  • Billing
    • compute
    • pay per token
    • pay per request
    • pay per item
      • images, audio

abort

  • Aborting a stream (HTTP 499) still incurs charges
  • Aborting a non-stream request also incurs charges
    • in the extreme case, the full charge
  • Agent implementations need to estimate usage on abort
    • otherwise context-window accounting drifts
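A crude usage estimator for aborted streams, so the agent's context-window accounting does not drift. This is purely illustrative: it uses a ~4-characters-per-token heuristic for English text; a real tokenizer (e.g. tiktoken) is more accurate.

```python
def estimate_usage_on_abort(prompt_text, partial_output, chars_per_token=4):
    """Approximate token usage when a stream is aborted before the
    final usage chunk arrives."""
    prompt_tokens = max(1, len(prompt_text) // chars_per_token)
    completion_tokens = len(partial_output) // chars_per_token
    return {"prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens}

u = estimate_usage_on_abort("Summarize this document.", "The doc says")
print(u["total_tokens"])  # 9
```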

Prompt Cache

Model / Scenario | Minimum cached tokens
Claude Opus 4.5 | 4096
Claude Opus 4.1, 4 | 1024
Claude Sonnet 4.5, 4, 3.7 | 1024
Claude Haiku 4.5 | 4096
Claude Haiku 3.5, 3 | 2048
Gemini 3 Pro Preview | 4096
Gemini 3 Flash Preview | 1024
Gemini 2.5 Pro | 4096
Gemini 2.5 Flash | 1024
Gemini Explicit Caching (Vertex AI) | 4096
Gemini Context Caching (Early Versions) | 32768
OpenAI GPT | 1024
  • Implicit caching: 75%-90% discount on input tokens.
  • Explicit caching: storage billed by time-to-live (TTL).
  • Capacity: the maximum cache size equals the model's full context window (can exceed 1M tokens).
  • Gemini 3 optimization: for the Gemini 3 series, make the prompt prefix or cached data at least 4096 tokens so that caching takes effect and API cost actually drops.
  • Google OpenAI API extra body
  • ⚠️ Tool-call caching actually caches the schema, description, etc.
{
  "google": {
    "cached_content": "cachedContents/XXX",
    "thinking_config": {
      "thinking_level": "low",
      "include_thoughts": true
    }
  }
}
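Whether a prefix is even eligible for caching can be gated on the minimums in the table above. A sketch: the threshold values come from the table, but the model-ID strings are illustrative and the token count is assumed precomputed.

```python
# Minimum cacheable prefix sizes (tokens), from the table above.
MIN_CACHE_TOKENS = {
    "claude-opus-4-5": 4096,
    "claude-sonnet-4-5": 1024,
    "claude-haiku-4-5": 4096,
    "gemini-3-pro-preview": 4096,
    "gemini-2.5-flash": 1024,
    "openai-gpt": 1024,
}

def cacheable(model, prefix_tokens):
    """True if the prompt prefix meets the model's minimum cacheable size.
    Unknown models fall back to the conservative 4096 threshold."""
    return prefix_tokens >= MIN_CACHE_TOKENS.get(model, 4096)

print(cacheable("gemini-2.5-flash", 1500))   # True
print(cacheable("claude-opus-4-5", 2000))    # False
```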

Anthropic

beta

date | flag | for | provider | fields
2024-07-31 | prompt-caching-2024-07-31 | Prompt caching breakpoints | A,B,V,F | *.cache_control, usage.cache_creation_input_tokens, usage.cache_read_input_tokens
2024-09-24 | message-batches-2024-09-24 | Batch message processing | A | requests[*].custom_id, requests[*].params, processing_status, request_counts.*, results_url
2024-09-25 | pdfs-2024-09-25 | PDF document support | A,B,V,F | messages[*].content[*].source.media_type, messages[*].content[*].source.data, messages[*].content[*].source.type
2024-10-22 | computer-use-2024-10-22 | Computer use tools (v1) | A,B,V,F | tools[*].type:"computer_20241022", tools[*].display_width_px, tools[*].display_height_px, tools[*].display_number
2024-11-01 | token-counting-2024-11-01 | Token counting endpoint | A,B,V,F | input_tokens, cache_creation_input_tokens, cache_read_input_tokens
2025-01-24 | computer-use-2025-01-24 | Computer use tools (v2) | A,B,V,F | tools[*].type:"computer_20250124", tools[*].display_width_px, tools[*].display_height_px
2025-02-19 | token-efficient-tools-2025-02-19 | Reduce tool definition tokens | A,B,V,F | tools[*].defer_loading, tools[*].strict
2025-02-19 | output-128k-2025-02-19 | Extend max output to 128k | A,B,V,F | max_tokens (up to 128000)
2025-04-04 | mcp-client-2025-04-04 | MCP server integration (v1) | A,B,V,F | mcp_servers[*].url, mcp_servers[*].tool_configuration, content[*].type:"mcp_tool_use", content[*].type:"mcp_tool_result"
2025-04-11 | extended-cache-ttl-2025-04-11 | Extended cache TTL to 1h | A,B,V,F | *.cache_control.ttl:"1h"
2025-04-14 | files-api-2025-04-14 | File upload/download API | A,B,V,F | source.type:"file", source.file_id
2025-05-14 | dev-full-thinking-2025-05-14 | Full thinking content (dev) | A,B,V,F | thinking.type:"enabled", thinking.budget_tokens, content[*].type:"thinking"
2025-05-14 | interleaved-thinking-2025-05-14 | Thinking interleaved with tool use | A,B,V,F | content[*].type:"thinking" interleaved with content[*].type:"tool_use"
2025-05-22 | code-execution-2025-05-22 | Code execution sandbox | A,B,V,F | tools[*].type:"code_execution_20250522", tools[*].allowed_callers, content[*].caller, content[*].type:"code_execution_tool_result"
2025-06-27 | context-management-2025-06-27 | Auto context management | A,B,V,F | context_management.edits[*], context_management.edits[*].trigger, context_management.edits[*].keep
2025-08-07 | context-1m-2025-08-07 | 1M token context window | A,B,V,F | max_tokens (model context extended to 1M)
2025-08-26 | model-context-window-exceeded-2025-08-26 | Context window exceeded stop reason | A,B,V,F | stop_reason:"model_context_window_exceeded"
2025-10-02 | skills-2025-10-02 | Skills/container support | A,B,V,F | container.skills[*].id, container.skills[*].type, container.id
2025-11-20 | mcp-client-2025-11-20 | MCP server integration (v2) | A,B,V,F | mcp_servers[*].url, mcp_servers[*].tool_configuration
2026-02-01 | fast-mode-2026-02-01 | Fast inference mode | A | speed:"fast"

A=Anthropic API, B=Bedrock, V=Vertex AI, F=Foundry

  • prompt-caching-2024-07-31
    • 5-minute cache
    • no longer needed: enabled by default, controlled via cache_control
    • cache_control: {type: "ephemeral"}
  • extended-cache-ttl-2025-04-11
    • 1-hour cache
    • cache_control: {type: "ephemeral", ttl: "1h"}
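Placing a breakpoint on the last reusable block can be sketched as follows (Anthropic Messages shapes; note the 1h TTL additionally requires the anthropic-beta: extended-cache-ttl-2025-04-11 header):

```python
def add_cache_breakpoint(messages, ttl=None):
    """Mark the last content block of the last message as a cache
    breakpoint. ttl='1h' needs the extended-cache-ttl beta header;
    omitting ttl gives the default 5-minute ephemeral cache."""
    control = {"type": "ephemeral"}
    if ttl:
        control["ttl"] = ttl
    messages[-1]["content"][-1]["cache_control"] = control
    return messages

msgs = [{"role": "user",
         "content": [{"type": "text", "text": "big shared prefix ..."}]}]
add_cache_breakpoint(msgs, ttl="1h")
print(msgs[0]["content"][-1]["cache_control"])
# {'type': 'ephemeral', 'ttl': '1h'}
```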

anthropic-beta: A,B
# SSE: drops data: [DONE], uses named events
anthropic-version: 2023-06-01
# the first version
anthropic-version: 2023-01-01

output-128k-2025-02-19

Allows up to 128K output tokens

extended-cache-ttl-2025-04-11

  • messages[*].content[*].cache_control.ephemeral.ttl

code-execution-2025-05-22

curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "content-type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -H "anthropic-beta: code-execution-2025-05-22" \
  -d '{
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 4096,
    "tools": [
      {
        "type": "code_execution_20250522",
        "name": "code_execution"
      }
    ],
    "messages": [
      {
        "role": "user",
        "content": "Calculate the first 20 Fibonacci numbers"
      }
    ]
  }'
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "content-type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -H "anthropic-beta: code-execution-2025-05-22" \
  -d '{
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 4096,
    "tools": [
      {
        "type": "code_execution_20250522",
        "name": "code_execution"
      },
      {
        "name": "get_stock_price",
        "description": "Get the current stock price for a given ticker symbol",
        "allowed_callers": ["direct", "code_execution_20250825"],
        "input_schema": {
          "type": "object",
          "properties": {
            "ticker": { "type": "string" }
          },
          "required": ["ticker"]
        }
      }
    ],
    "messages": [
      {
        "role": "user",
        "content": "Write Python code to fetch AAPL and GOOGL stock prices and calculate which one is more expensive"
      },
      {
        "role": "assistant",
        "content": [
          { "type": "text", "text": "I will write code to compare the stock prices." },
          {
            "type": "tool_use",
            "id": "toolu_code_01",
            "name": "code_execution",
            "input": { "code": "aapl = tool.get_stock_price(ticker='\''AAPL'\'')" }
          },
          {
            "type": "tool_use",
            "id": "toolu_stock_01",
            "name": "get_stock_price",
            "input": { "ticker": "AAPL" },
            "caller": {
              "type": "code_execution_20250825",
              "tool_id": "toolu_code_01"
            }
          }
        ]
      },
      {
        "role": "user",
        "content": [
          {
            "type": "tool_result",
            "tool_use_id": "toolu_stock_01",
            "content": "{\"price\": 195.50, \"currency\": \"USD\"}"
          }
        ]
      }
    ]
  }'
User request

Model issues code_execution (toolu_code_01)

Sandbox code calls get_stock_price → emits tool_use + caller
↓ ↑
↓ caller.tool_id = "toolu_code_01"
↓ caller.type = "code_execution_20250825"

API returns stop_reason: "tool_use"; you handle it and send back the tool_result

Model receives the result; the sandbox resumes (or requests the next tool)

Final output
  • allowed_callers: lives in the tools definition; controls who may call → request side
  • caller: lives in the tool_use block; identifies who actually triggered the call → response side
  • If "code_execution_20250825" is removed, leaving only ["direct"], sandbox code can no longer call get_stock_price, and no tool_use with a caller will appear
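The request-side/response-side split can be enforced with a small check: given the tool definitions, verify that each tool_use block's caller is permitted (a sketch over the shapes shown above):

```python
def caller_allowed(tool_defs, tool_use):
    """Check a tool_use block's caller against the tool's allowed_callers.
    A block without a caller is 'direct' (issued by the model itself)."""
    spec = next(t for t in tool_defs if t.get("name") == tool_use["name"])
    allowed = spec.get("allowed_callers", ["direct"])
    caller = tool_use.get("caller", {}).get("type", "direct")
    return caller in allowed

defs = [{"name": "get_stock_price",
         "allowed_callers": ["direct", "code_execution_20250825"]}]
use = {"name": "get_stock_price",
       "caller": {"type": "code_execution_20250825", "tool_id": "toolu_code_01"}}
print(caller_allowed(defs, use))                          # True
print(caller_allowed([{"name": "get_stock_price"}], use)) # False
```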

context-management-2025-06-27

advanced-tool-use-2025-11-20

  • Claude API, Microsoft Foundry, all models
  • the model searches for and fetches tools on demand
  • tools can set defer_loading: true
    • only name and description are kept
    • essentially the SDK performs RAG-style logic and automatically supplies the relevant retrieval
    • conceptually similar to SKILLs
    • progressive disclosure
  • enables use of thousands of tools
  • ⚠️ the user implements search_tools
    • efficient, accurate tool search
    • string matching, semantic, categorical
{
  "tools": [
    { "type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex" },
    {
      "name": "github.createPullRequest",
      "description": "Create a pull request",
      "input_schema": {},
      "defer_loading": true
    }
  ]
}
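A user-supplied tool search over deferred tools can start as simple regex matching over names and descriptions (a sketch in the spirit of the regex search tool above; real implementations layer on semantic and categorical search for accuracy):

```python
import re

def search_tools(tools, pattern):
    """Return names of deferred tools whose name or description
    matches the regex (case-insensitive)."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [t["name"] for t in tools
            if t.get("defer_loading")
            and (rx.search(t["name"]) or rx.search(t.get("description", "")))]

tools = [
    {"name": "github.createPullRequest",
     "description": "Create a pull request", "defer_loading": True},
    {"name": "jira.createTicket",
     "description": "Create a Jira ticket", "defer_loading": True},
]
print(search_tools(tools, r"pull request"))  # ['github.createPullRequest']
```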

tool-examples-2025-10-29

  • Opus 4.5+
  • Vertex AI, Amazon Bedrock
  • input_examples field

fast-mode-2026-02-01

  • Support
    • Claude Opus 4.6
    • 2.5x speed
    • 6x price
  • speed: "fast"
curl https://api.anthropic.com/v1/messages \
  --header "x-api-key: $ANTHROPIC_API_KEY" \
  --header "anthropic-version: 2023-06-01" \
  --header "anthropic-beta: fast-mode-2026-02-01" \
  --header "content-type: application/json" \
  --data '{
    "model": "claude-opus-4-6",
    "max_tokens": 4096,
    "speed": "fast",
    "messages": [{
      "role": "user",
      "content": "Refactor this module to use dependency injection"
    }]
  }'

US-Only Inference

  • 1.1x price
  • inference_geo: us

effort

  • output: { effort: 'high' }
  • supported on Opus 4.5 and Opus 4.6
  • max, high, medium, low
  • defaults to high
  • on Opus 4.6+, replaces the earlier budget_tokens

adaptive thinking

  • Opus 4.6+
  • thinking.type=adaptive
  • deprecates thinking: {type: "enabled", budget_tokens: N}
curl https://api.anthropic.com/v1/messages \
  --header "x-api-key: $ANTHROPIC_API_KEY" \
  --header "anthropic-version: 2023-06-01" \
  --header "content-type: application/json" \
  --data '{
    "model": "claude-opus-4-6",
    "max_tokens": 16000,
    "thinking": {
      "type": "adaptive"
    },
    "messages": [
      {
        "role": "user",
        "content": "Explain why the sum of two even numbers is always even."
      }
    ]
  }'
Unsupported models return the error: adaptive thinking is not supported on this model