
Context length calculation incorrect for Anthropic Claude models (shows 130k tokens at 70% when actually 200k) #9231


Description


Problem

The context length calculation appears to be significantly off for Anthropic Claude models. The pruning/compaction logic believes the context is at ~130k tokens (70% full) when the actual token count reported by the Anthropic API is around 200k tokens, i.e. the full 200k context window that Claude models have.

Root Cause Analysis

Based on codebase investigation, the issue likely stems from one or more of the following:

1. Tokenizer Mismatch

  • Location: core/llm/countTokens.ts lines 73-88
  • autodetectTemplateType() returns "none" for Claude models (see core/llm/autodetect.ts line 343)
  • When the template type is "none", the system falls back to the GPT-4 tiktoken encoder (encodingForModel())
  • Issue: Anthropic uses a different tokenizer than OpenAI's tiktoken, so counting Claude tokens with the GPT-4 encoder produces inaccurate counts (a drift-measurement sketch follows the excerpt below)
function encodingForModel(modelName: string): Encoding {
  const modelType = autodetectTemplateType(modelName);

  if (!modelType || modelType === "none") {
    if (!gptEncoding) {
      gptEncoding = _encodingForModel("gpt-4");  // ❌ Wrong tokenizer for Claude
    }
    return gptEncoding;
  }
  return llamaEncoding;
}
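One way to quantify the drift would be to compare the GPT-4 estimate against Anthropic's token-counting endpoint. A minimal sketch, assuming the js-tiktoken package (which the fallback path above appears to use) and the messages.countTokens method from recent versions of @anthropic-ai/sdk; neither is necessarily wired up this way in the repo:

import Anthropic from "@anthropic-ai/sdk";
import { encodingForModel } from "js-tiktoken";

// Returns actual/estimated; a ratio well above 1 means the GPT-4
// encoder undercounts Claude tokens by that factor.
async function measureTokenizerDrift(text: string): Promise<number> {
  // Estimate via the same GPT-4 encoder the fallback path uses
  const gptEstimate = encodingForModel("gpt-4").encode(text).length;

  // Ground truth from Anthropic's count_tokens endpoint
  const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
  const actual = await client.messages.countTokens({
    model: "claude-3-5-sonnet-latest",
    messages: [{ role: "user", content: text }],
  });

  return actual.input_tokens / gptEstimate;
}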

2. Context Percentage Calculation

  • Location: core/llm/countTokens.ts lines 530-537
  • The context percentage is calculated as inputTokens / availableTokens
  • If the tokenizer undercounts by ~35%, roughly 200k actual tokens are estimated as ~130k; divided by availableTokens (~185k once the safety buffer and reserved output tokens are subtracted), that yields the ~70% shown in the UI instead of the true ~100% (worked through after the snippet below)
const inputTokens = currentTotal + systemMsgTokens + toolTokens + lastMessagesTokens;
const availableTokens = contextLength - countingSafetyBuffer - minOutputTokens;
const contextPercentage = inputTokens / availableTokens;
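To make the arithmetic concrete, here are the reported numbers plugged into that formula (minOutputTokens is an illustrative value, not the repo's actual constant):

const contextLength = 200_000;        // Claude context window
const countingSafetyBuffer = 1_000;   // 2% capped at 1000; see section 4
const minOutputTokens = 14_000;       // illustrative; the real constant lives in countTokens.ts

const availableTokens = contextLength - countingSafetyBuffer - minOutputTokens; // 185_000
const estimatedInput = 130_000;       // GPT-tokenizer estimate of ~200k actual tokens

console.log(estimatedInput / availableTokens); // ≈ 0.70, displayed as "70%"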

3. Unaccounted Token Overhead

Additional sources of token discrepancy to investigate:

  • Message formatting tokens: BASE_TOKENS = 4 per message (line 186) - may not match Anthropic's actual overhead
  • Tool call tokens: TOOL_CALL_EXTRA_TOKENS = 10 per tool call (line 187) - this estimate may be off (both flat surcharges are sketched in use after this list)
  • Tool definition tokens: countToolsTokens() uses OpenAI's formula (lines 137-181), not Anthropic's
  • Image tokens: Fixed at 1024 tokens per image (line 90) - may differ for Anthropic
  • Thinking/reasoning tokens: Special Claude feature with redactedThinking and signature fields - overhead not fully accounted for
  • Cache control blocks: When using prompt caching, special tokens may not be counted
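For the first two constants, the implied accounting pattern is an OpenAI-style flat surcharge per message and per tool call. A sketch of that pattern under assumed names (the real logic lives in core/llm/countTokens.ts):

const BASE_TOKENS = 4;             // flat per-message wrapper overhead
const TOOL_CALL_EXTRA_TOKENS = 10; // flat surcharge per tool call

// encode() stands in for the tokenizer; returns token ids for a string.
function estimateMessageTokens(
  content: string,
  toolCallCount: number,
  encode: (s: string) => number[],
): number {
  return BASE_TOKENS + encode(content).length + toolCallCount * TOOL_CALL_EXTRA_TOKENS;
}

If Anthropic's real wrapper overhead differs from these flat values, the error compounds across every message in a long conversation.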

4. Safety Buffer May Be Masking Issues

  • Location: core/llm/countTokens.ts lines 361-368
  • Safety buffer is 2% of context length (max 1000 tokens)
  • For a 200k context, the buffer caps out at 1000 tokens
  • The buffer is meant to absorb tokenizer inaccuracies, but a 35% error (~70k tokens) far exceeds it (the rule is written out as code below)
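The buffer rule above, written out as code (the function name is hypothetical; the real logic is at the cited lines):

function countingSafetyBuffer(contextLength: number): number {
  return Math.min(Math.floor(contextLength * 0.02), 1000); // 2%, capped at 1000
}

// countingSafetyBuffer(200_000) === 1000 — roughly 70x smaller than the
// ~70k-token discrepancy described in this issue.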

Reproduction

  1. Use an Anthropic Claude model (e.g., claude-3-5-sonnet-latest or claude-sonnet-4-5-20250929)
  2. Send messages that should result in ~200k tokens actual usage
  3. Observe the UI shows ~130k tokens at ~70% context usage
  4. Verify actual token usage from the Anthropic API response (usage.input_tokens); a snippet for this follows
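For step 4, a minimal way to read the ground-truth counts with @anthropic-ai/sdk (assuming ANTHROPIC_API_KEY is set; the message content is a stand-in for the long conversation under test):

import Anthropic from "@anthropic-ai/sdk";

async function checkActualUsage(): Promise<void> {
  const client = new Anthropic();
  const response = await client.messages.create({
    model: "claude-3-5-sonnet-latest",
    max_tokens: 1024,
    messages: [{ role: "user", content: "..." }], // the ~200k-token conversation
  });
  // Compare these against the ~130k estimate shown in the UI
  console.log(response.usage.input_tokens, response.usage.output_tokens);
}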

Expected Behavior

  • Token counting should be accurate to within ~5% of actual usage
  • Context percentage should reflect true token consumption
  • 200k tokens should show as ~100% context usage for a 200k context window model

Proposed Solutions

Option 1: Use Anthropic's Actual Token Counts (Recommended)

  • Anthropic returns exact token counts in API responses: usage.input_tokens and usage.output_tokens
  • core/llm/llms/Anthropic.ts (lines 307-315) already captures these values
  • Instead of estimating tokens with the GPT tokenizer, use these real values for context calculations
  • Cache and reuse these counts for subsequent message compilation (a rough sketch follows)
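A rough sketch of the calibration idea; class and method names are hypothetical, not the repo's current API:

// Scale fresh tiktoken estimates by the last observed estimate-to-actual ratio.
class AnthropicTokenCalibrator {
  private lastActual?: number;   // usage.input_tokens from the latest response
  private lastEstimate?: number; // our estimate for that same request

  record(actualInputTokens: number, estimatedInputTokens: number): void {
    this.lastActual = actualInputTokens;
    this.lastEstimate = estimatedInputTokens;
  }

  correct(estimate: number): number {
    if (!this.lastActual || !this.lastEstimate) {
      return estimate; // no calibration data yet; pass the estimate through
    }
    return Math.round(estimate * (this.lastActual / this.lastEstimate));
  }
}

The first request in a session would still rely on the raw estimate, so this helps most in long-running conversations where the drift accumulates.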

Option 2: Implement Proper Anthropic Tokenizer

  • Add support for Anthropic's actual tokenizer, if one is available via an SDK/library (note: for current Claude models Anthropic exposes a count_tokens API endpoint rather than a local tokenizer)
  • Update encodingForModel() to use the correct tokenizer for Anthropic models
  • This would be more accurate for pre-flight token estimation

Option 3: Use LLM-Specific Token Counting

  • Add a method in BaseLLM: estimateTokens(content: MessageContent): number
  • Let each provider override with their own tokenization logic
  • The Anthropic class can use actual counts from previous responses as calibration (one possible shape is sketched below)
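One possible shape for that hook (all names illustrative; the crude length/4 fallback merely stands in for the existing tiktoken path):

type MessageContent = string; // stand-in for the repo's richer MessageContent type

abstract class BaseLLM {
  // Default: shared estimate, overridable per provider.
  estimateTokens(content: MessageContent): number {
    return Math.ceil(content.length / 4); // placeholder for the tiktoken-based path
  }
}

class AnthropicLLM extends BaseLLM {
  private calibration = 1.0; // refreshed from usage.input_tokens / prior estimate

  override estimateTokens(content: MessageContent): number {
    return Math.round(super.estimateTokens(content) * this.calibration);
  }
}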

Option 4: Improve Safety Buffer for Known Inaccurate Cases

  • Detect when using mismatched tokenizer (e.g., GPT tokenizer for Claude)
  • Increase safety buffer proportionally (e.g., 35% for this case)
  • This is a band-aid, but it would prevent context-overflow errors (see the sketch below)
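As a sketch, with the 35% factor taken from the discrepancy reported above (names hypothetical):

function safetyBufferFor(contextLength: number, tokenizerMismatch: boolean): number {
  if (tokenizerMismatch) {
    // Conservative: absorb the observed ~35% undercount for Claude models
    return Math.floor(contextLength * 0.35);
  }
  return Math.min(Math.floor(contextLength * 0.02), 1000); // current behavior
}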

Implementation Plan

  1. Immediate fix:
    • Store and use actual token counts from Anthropic API responses
    • Update compileChatMessages to accept an optional actualTokenCounts parameter
    • Use these for calculating the context percentage instead of estimates
  2. Medium term:
    • Research whether Anthropic provides a tokenizer library
    • If available, integrate the proper Anthropic tokenizer
    • Add tests comparing estimated vs. actual token counts
  3. Long term:
    • Refactor token counting to be provider-specific
    • Let each LLM provider handle its own tokenization
    • Fall back to conservative estimates when exact counting is unavailable

Related Files

  • core/llm/countTokens.ts - Token counting logic
  • core/llm/llms/Anthropic.ts - Anthropic provider (captures real usage)
  • core/llm/autodetect.ts - Template detection (returns 'none' for Claude)
  • core/llm/constants.ts - DEFAULT_PRUNING_LENGTH = 128000
  • packages/llm-info/src/providers/anthropic.ts - Model definitions (200k context)
  • core/llm/index.ts - compileChatMessages usage

Additional Context

  • This affects all Anthropic Claude models
  • The DEFAULT_PRUNING_LENGTH is 128000, but Claude models have 200k context
  • The contextLength from llm-info correctly shows 200000 for Claude models
  • Issue manifests in UI showing incorrect context usage percentage

Metadata

Labels: area:context-providers (Relates to context providers), kind:bug (Indicates an unexpected problem or unintended behavior)
