Problem
Context length calculation appears to be significantly off for Anthropic Claude models. The pruning/compaction logic reports the context at ~130k tokens (~70% full) when the actual token count is around 200k, i.e. essentially the full 200k context window these Claude models have.
Root Cause Analysis
Based on codebase investigation, the issue likely stems from one or more of the following:
1. Tokenizer Mismatch
- Location: `core/llm/countTokens.ts`, lines 73-88
- Claude models go through `autodetectTemplateType()`, which returns `"none"` for Claude models (see `core/llm/autodetect.ts`, line 343)
- When the template type is `"none"`, the system falls back to the GPT-4 tiktoken encoder (`encodingForModel()`)
- Issue: Anthropic uses a different tokenizer than OpenAI's tiktoken. Using the GPT-4 tokenizer to count Claude tokens produces inaccurate counts
```ts
function encodingForModel(modelName: string): Encoding {
  const modelType = autodetectTemplateType(modelName);
  if (!modelType || modelType === "none") {
    if (!gptEncoding) {
      gptEncoding = _encodingForModel("gpt-4"); // ❌ Wrong tokenizer for Claude
    }
    return gptEncoding;
  }
  return llamaEncoding;
}
```
2. Context Percentage Calculation
- Location: `core/llm/countTokens.ts`, lines 530-537
- The context percentage is calculated as `inputTokens / availableTokens`
- If token counting undercounts by ~35%, ~200k actual tokens would register as ~130k, which displays as ~70% context usage instead of close to 100%
```ts
const inputTokens =
  currentTotal + systemMsgTokens + toolTokens + lastMessagesTokens;
const availableTokens =
  contextLength - countingSafetyBuffer - minOutputTokens;
const contextPercentage = inputTokens / availableTokens;
```
3. Unaccounted Token Overhead
Additional sources of token discrepancy to investigate:
- Message formatting tokens: `BASE_TOKENS = 4` per message (line 186) - may not match Anthropic's actual per-message overhead
- Tool call tokens: `TOOL_CALL_EXTRA_TOKENS = 10` per tool call (line 187) - estimate may be off
- Tool definition tokens: `countToolsTokens()` uses OpenAI's formula (lines 137-181), not Anthropic's
- Image tokens: fixed at 1024 tokens per image (line 90) - may differ for Anthropic
- Thinking/reasoning tokens: special Claude feature with `redactedThinking` and `signature` fields - overhead not fully accounted for
- Cache control blocks: when using prompt caching, special tokens may not be counted
4. Safety Buffer May Be Masking Issues
- Location: `core/llm/countTokens.ts`, lines 361-368
- The safety buffer is 2% of the context length, capped at 1000 tokens
- For a 200k context, that means a 1000-token buffer
- This buffer is meant to absorb tokenizer inaccuracies, but a ~35% error (~70k tokens) far exceeds it (see the sketch below this list)
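A minimal sketch of the buffer math described above (variable names are illustrative, not the exact code in `core/llm/countTokens.ts`), showing why the buffer cannot absorb a 35% tokenizer error:

```ts
// Hypothetical reconstruction of the buffer logic described above.
const contextLength = 200_000;
const countingSafetyBuffer = Math.min(Math.floor(contextLength * 0.02), 1000); // => 1000

// A ~35% undercount on a full 200k context is ~70,000 tokens,
// i.e. 70x larger than the buffer that is supposed to cover it.
const tokenizerError = Math.floor(contextLength * 0.35); // => 70000
console.log(tokenizerError > countingSafetyBuffer); // true
```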
Reproduction
1. Use an Anthropic Claude model (e.g., `claude-3-5-sonnet-latest` or `claude-sonnet-4-5-20250929`)
2. Send messages that should result in ~200k tokens of actual usage
3. Observe that the UI shows ~130k tokens at ~70% context usage
4. Verify the actual token usage from the Anthropic API response (`usage.input_tokens`), e.g. with the snippet below
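For step 4, one way to read the ground-truth count (a sketch using the official `@anthropic-ai/sdk` package; the model name and message content are placeholders):

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await client.messages.create({
  model: "claude-3-5-sonnet-latest",
  max_tokens: 1024,
  messages: [{ role: "user", content: "<the same ~200k-token conversation>" }],
});

// Anthropic reports exact token counts on every response.
console.log("actual input tokens:", response.usage.input_tokens);
console.log("actual output tokens:", response.usage.output_tokens);
```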
Expected Behavior
- Token counting should be accurate to within ~5% of actual usage
- Context percentage should reflect true token consumption
- 200k tokens should show as ~100% context usage for a 200k context window model
Proposed Solutions
Option 1: Use Anthropic's Actual Token Counts (Recommended)
- Anthropic returns exact token counts in API responses: `usage.input_tokens` and `usage.output_tokens`
- Location: `core/llm/llms/Anthropic.ts`, lines 307-315 already capture this
- Instead of estimating tokens with the GPT tokenizer, use these real values for context calculations
- Cache and reuse these counts for subsequent message compilation (see the sketch after this list)
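A rough sketch of the idea, assuming a hypothetical calibration cache and a `contextPercentage` helper (the names are illustrative, not existing Continue APIs):

```ts
// Hypothetical calibration cache: remember what Anthropic last reported.
let lastReportedInputTokens: number | undefined;
let lastEstimatedInputTokens: number | undefined;

// Called from the Anthropic provider after each response is received.
function recordUsage(reportedInputTokens: number, estimatedInputTokens: number) {
  lastReportedInputTokens = reportedInputTokens;
  lastEstimatedInputTokens = estimatedInputTokens;
}

// Used when compiling the next request: scale the local estimate by the
// observed ratio between Anthropic's real count and our tiktoken estimate.
function calibratedInputTokens(estimatedInputTokens: number): number {
  if (!lastReportedInputTokens || !lastEstimatedInputTokens) {
    return estimatedInputTokens; // no data yet, fall back to the raw estimate
  }
  const ratio = lastReportedInputTokens / lastEstimatedInputTokens;
  return Math.ceil(estimatedInputTokens * ratio);
}

function contextPercentage(estimatedInputTokens: number, availableTokens: number): number {
  return calibratedInputTokens(estimatedInputTokens) / availableTokens;
}
```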
Option 2: Implement Proper Anthropic Tokenizer
- Add support for Anthropic's actual tokenizer (if available via SDK/library)
- Update `encodingForModel()` to use the correct tokenizer for Anthropic models
- This would be more accurate for pre-flight token estimation (see the sketch below)
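If no local Anthropic tokenizer is available, Anthropic's count-tokens endpoint could serve the same purpose for pre-flight estimates. A sketch, assuming the `messages.countTokens` method exposed by recent versions of `@anthropic-ai/sdk` (treat the exact signature as an assumption):

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Pre-flight estimate via Anthropic's count-tokens endpoint instead of tiktoken.
// Note: this is a network call, so it would need caching/debouncing inside
// countTokens.ts rather than running on every recalculation.
async function countClaudeTokens(
  model: string,
  messages: Anthropic.Messages.MessageParam[],
): Promise<number> {
  const result = await client.messages.countTokens({ model, messages });
  return result.input_tokens;
}
```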
Option 3: Use LLM-Specific Token Counting
- Add a method on `BaseLLM`: `estimateTokens(content: MessageContent): number`
- Let each provider override it with its own tokenization logic
- The Anthropic class can use actual counts from previous responses as calibration (see the sketch below)
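A sketch of what the override hook could look like (class and type names mirror this option's proposal; the `calibrationRatio` field and the length/4 heuristic are hypothetical placeholders):

```ts
// Sketch only: BaseLLM and MessageContent stand in for the real Continue types.
type MessageContent = string;

abstract class BaseLLM {
  // Default: generic estimate (placeholder heuristic, roughly 4 chars per token).
  estimateTokens(content: MessageContent): number {
    return Math.ceil(content.length / 4);
  }
}

class AnthropicLLM extends BaseLLM {
  // Updated from usage.input_tokens after each real API response.
  private calibrationRatio = 1.0;

  recordActualUsage(actualInputTokens: number, estimatedInputTokens: number) {
    if (estimatedInputTokens > 0) {
      this.calibrationRatio = actualInputTokens / estimatedInputTokens;
    }
  }

  override estimateTokens(content: MessageContent): number {
    return Math.ceil(super.estimateTokens(content) * this.calibrationRatio);
  }
}
```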
Option 4: Improve Safety Buffer for Known Inaccurate Cases
- Detect when a mismatched tokenizer is in use (e.g., the GPT tokenizer for Claude)
- Increase the safety buffer proportionally (e.g., ~35% for this case)
- This is a band-aid, but it would prevent context overflow errors (see the sketch below)
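A minimal sketch of that band-aid, assuming a hypothetical `usesMismatchedTokenizer()` check (not an existing Continue function):

```ts
// Hypothetical: true when we know we are counting Claude tokens with tiktoken.
function usesMismatchedTokenizer(modelName: string): boolean {
  return modelName.toLowerCase().includes("claude");
}

function safetyBufferFor(modelName: string, contextLength: number): number {
  const baseBuffer = Math.min(Math.floor(contextLength * 0.02), 1000);
  if (usesMismatchedTokenizer(modelName)) {
    // Pad generously for the known ~35% undercount instead of the 1000-token cap.
    return Math.floor(contextLength * 0.35);
  }
  return baseBuffer;
}
```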
Implementation Plan
- Immediate Fix:
  - Store and use actual token counts from Anthropic API responses
  - Update `compileChatMessages` to accept an optional `actualTokenCounts` parameter
  - Use these for calculating the context percentage instead of estimates
- Medium Term:
  - Research whether Anthropic provides a tokenizer library
  - If available, integrate a proper Anthropic tokenizer
  - Add tests comparing estimated vs. actual token counts
- Long Term:
  - Refactor token counting to be provider-specific
  - Each LLM provider handles its own tokenization
  - Fall back to conservative estimates when exact counting is unavailable
Related Files
- `core/llm/countTokens.ts` - token counting logic
- `core/llm/llms/Anthropic.ts` - Anthropic provider (captures real usage)
- `core/llm/autodetect.ts` - template detection (returns `"none"` for Claude)
- `core/llm/constants.ts` - `DEFAULT_PRUNING_LENGTH = 128000`
- `packages/llm-info/src/providers/anthropic.ts` - model definitions (200k context)
- `core/llm/index.ts` - `compileChatMessages` usage
Additional Context
- This affects all Anthropic Claude models
- `DEFAULT_PRUNING_LENGTH` is 128000, but Claude models have a 200k context window
- The `contextLength` from llm-info correctly shows 200000 for Claude models
- Issue manifests in UI showing incorrect context usage percentage