
Context length calculation incorrect for Anthropic Claude models (shows 130k tokens at 70% when actually 200k) #9231


Description


Problem

The context length calculation appears to be significantly off for Anthropic Claude models. The pruning/compaction logic believes the context is at ~130k tokens (70% full) when the actual token count reported by the Anthropic API is around 200k tokens, i.e. the full 200k context window that Claude models have.

Root Cause Analysis

Based on codebase investigation, the issue likely stems from one or more of the following:

1. Tokenizer Mismatch

  • Location: core/llm/countTokens.ts lines 73-88
  • autodetectTemplateType() returns "none" for Claude models (see core/llm/autodetect.ts line 343)
  • When the template type is "none", the system falls back to the GPT-4 tiktoken encoder (encodingForModel())
  • Issue: Anthropic uses a different tokenizer than OpenAI's tiktoken, so counting Claude tokens with the GPT-4 encoder produces inaccurate counts (a drift-measurement sketch follows the excerpt below)
function encodingForModel(modelName: string): Encoding {
  const modelType = autodetectTemplateType(modelName);

  if (!modelType || modelType === "none") {
    if (!gptEncoding) {
      gptEncoding = _encodingForModel("gpt-4");  // ❌ Wrong tokenizer for Claude
    }
    return gptEncoding;
  }
  return llamaEncoding;
}
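One way to quantify the drift would be to compare the GPT-4 estimate against Anthropic's token-counting endpoint. A minimal sketch, assuming the js-tiktoken package (which the fallback path above appears to use) and the messages.countTokens method from recent versions of @anthropic-ai/sdk; neither is necessarily wired up this way in the repo:

import Anthropic from "@anthropic-ai/sdk";
import { encodingForModel } from "js-tiktoken";

// Returns actual/estimated; a ratio well above 1 means the GPT-4
// encoder undercounts Claude tokens by that factor.
async function measureTokenizerDrift(text: string): Promise<number> {
  // Estimate via the same GPT-4 encoder the fallback path uses
  const gptEstimate = encodingForModel("gpt-4").encode(text).length;

  // Ground truth from Anthropic's count_tokens endpoint
  const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
  const actual = await client.messages.countTokens({
    model: "claude-3-5-sonnet-latest",
    messages: [{ role: "user", content: text }],
  });

  return actual.input_tokens / gptEstimate;
}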

2. Context Percentage Calculation

  • Location: core/llm/countTokens.ts lines 530-537
  • The context percentage is calculated as inputTokens / availableTokens
  • If the tokenizer undercounts by ~35%, roughly 200k actual tokens are estimated as ~130k; divided by availableTokens (~185k once the safety buffer and reserved output tokens are subtracted), that yields the ~70% shown in the UI instead of the true ~100% (worked through after the snippet below)
const inputTokens = currentTotal + systemMsgTokens + toolTokens + lastMessagesTokens;
const availableTokens = contextLength - countingSafetyBuffer - minOutputTokens;
const contextPercentage = inputTokens / availableTokens;
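To make the arithmetic concrete, here are the reported numbers plugged into that formula (minOutputTokens is an illustrative value, not the repo's actual constant):

const contextLength = 200_000;        // Claude context window
const countingSafetyBuffer = 1_000;   // 2% capped at 1000; see section 4
const minOutputTokens = 14_000;       // illustrative; the real constant lives in countTokens.ts

const availableTokens = contextLength - countingSafetyBuffer - minOutputTokens; // 185_000
const estimatedInput = 130_000;       // GPT-tokenizer estimate of ~200k actual tokens

console.log(estimatedInput / availableTokens); // ≈ 0.70, displayed as "70%"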

3. Unaccounted Token Overhead

Additional sources of token discrepancy to investigate:

  • Message formatting tokens: BASE_TOKENS = 4 per message (line 186) - may not match Anthropic's actual overhead
  • Tool call tokens: TOOL_CALL_EXTRA_TOKENS = 10 per tool call (line 187) - this estimate may be off (both flat surcharges are sketched in use after this list)
  • Tool definition tokens: countToolsTokens() uses OpenAI's formula (lines 137-181), not Anthropic's
  • Image tokens: Fixed at 1024 tokens per image (line 90) - may differ for Anthropic
  • Thinking/reasoning tokens: Special Claude feature with redactedThinking and signature fields - overhead not fully accounted for
  • Cache control blocks: When using prompt caching, special tokens may not be counted
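For the first two constants, the implied accounting pattern is an OpenAI-style flat surcharge per message and per tool call. A sketch of that pattern under assumed names (the real logic lives in core/llm/countTokens.ts):

const BASE_TOKENS = 4;             // flat per-message wrapper overhead
const TOOL_CALL_EXTRA_TOKENS = 10; // flat surcharge per tool call

// encode() stands in for the tokenizer; returns token ids for a string.
function estimateMessageTokens(
  content: string,
  toolCallCount: number,
  encode: (s: string) => number[],
): number {
  return BASE_TOKENS + encode(content).length + toolCallCount * TOOL_CALL_EXTRA_TOKENS;
}

If Anthropic's real wrapper overhead differs from these flat values, the error compounds across every message in a long conversation.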

4. Safety Buffer May Be Masking Issues

  • Location: core/llm/countTokens.ts lines 361-368
  • Safety buffer is 2% of context length (max 1000 tokens)
  • For a 200k context, the buffer caps out at 1000 tokens
  • The buffer is meant to absorb tokenizer inaccuracies, but a 35% error (~70k tokens) far exceeds it (the rule is written out as code below)
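The buffer rule above, written out as code (the function name is hypothetical; the real logic is at the cited lines):

function countingSafetyBuffer(contextLength: number): number {
  return Math.min(Math.floor(contextLength * 0.02), 1000); // 2%, capped at 1000
}

// countingSafetyBuffer(200_000) === 1000 — roughly 70x smaller than the
// ~70k-token discrepancy described in this issue.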

Reproduction

  1. Use an Anthropic Claude model (e.g., claude-3-5-sonnet-latest or claude-sonnet-4-5-20250929)
  2. Send messages that should result in ~200k tokens actual usage
  3. Observe the UI shows ~130k tokens at ~70% context usage
  4. Verify actual token usage from the Anthropic API response (usage.input_tokens); a snippet for this follows
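For step 4, a minimal way to read the ground-truth counts with @anthropic-ai/sdk (assuming ANTHROPIC_API_KEY is set; the message content is a stand-in for the long conversation under test):

import Anthropic from "@anthropic-ai/sdk";

async function checkActualUsage(): Promise<void> {
  const client = new Anthropic();
  const response = await client.messages.create({
    model: "claude-3-5-sonnet-latest",
    max_tokens: 1024,
    messages: [{ role: "user", content: "..." }], // the ~200k-token conversation
  });
  // Compare these against the ~130k estimate shown in the UI
  console.log(response.usage.input_tokens, response.usage.output_tokens);
}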

Expected Behavior

  • Token counting should be accurate to within ~5% of actual usage
  • Context percentage should reflect true token consumption
  • 200k tokens should show as ~100% context usage for a 200k context window model

Proposed Solutions

Option 1: Use Anthropic's Actual Token Counts (Recommended)

  • Anthropic returns exact token counts in API responses: usage.input_tokens and usage.output_tokens
  • core/llm/llms/Anthropic.ts (lines 307-315) already captures these values
  • Instead of estimating tokens with the GPT tokenizer, use these real values for context calculations
  • Cache and reuse these counts for subsequent message compilation (a rough sketch follows)
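A rough sketch of the calibration idea; class and method names are hypothetical, not the repo's current API:

// Scale fresh tiktoken estimates by the last observed estimate-to-actual ratio.
class AnthropicTokenCalibrator {
  private lastActual?: number;   // usage.input_tokens from the latest response
  private lastEstimate?: number; // our estimate for that same request

  record(actualInputTokens: number, estimatedInputTokens: number): void {
    this.lastActual = actualInputTokens;
    this.lastEstimate = estimatedInputTokens;
  }

  correct(estimate: number): number {
    if (!this.lastActual || !this.lastEstimate) {
      return estimate; // no calibration data yet; pass the estimate through
    }
    return Math.round(estimate * (this.lastActual / this.lastEstimate));
  }
}

The first request in a session would still rely on the raw estimate, so this helps most in long-running conversations where the drift accumulates.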

Option 2: Implement Proper Anthropic Tokenizer

  • Add support for Anthropic's actual tokenizer, if one is available via an SDK/library (note: for current Claude models Anthropic exposes a count_tokens API endpoint rather than a local tokenizer)
  • Update encodingForModel() to use the correct tokenizer for Anthropic models
  • This would be more accurate for pre-flight token estimation

Option 3: Use LLM-Specific Token Counting

  • Add a method in BaseLLM: estimateTokens(content: MessageContent): number
  • Let each provider override with their own tokenization logic
  • The Anthropic class can use actual counts from previous responses as calibration (one possible shape is sketched below)
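One possible shape for that hook (all names illustrative; the crude length/4 fallback merely stands in for the existing tiktoken path):

type MessageContent = string; // stand-in for the repo's richer MessageContent type

abstract class BaseLLM {
  // Default: shared estimate, overridable per provider.
  estimateTokens(content: MessageContent): number {
    return Math.ceil(content.length / 4); // placeholder for the tiktoken-based path
  }
}

class AnthropicLLM extends BaseLLM {
  private calibration = 1.0; // refreshed from usage.input_tokens / prior estimate

  override estimateTokens(content: MessageContent): number {
    return Math.round(super.estimateTokens(content) * this.calibration);
  }
}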

Option 4: Improve Safety Buffer for Known Inaccurate Cases

  • Detect when using mismatched tokenizer (e.g., GPT tokenizer for Claude)
  • Increase safety buffer proportionally (e.g., 35% for this case)
  • This is a band-aid, but it would prevent context-overflow errors (see the sketch below)
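As a sketch, with the 35% factor taken from the discrepancy reported above (names hypothetical):

function safetyBufferFor(contextLength: number, tokenizerMismatch: boolean): number {
  if (tokenizerMismatch) {
    // Conservative: absorb the observed ~35% undercount for Claude models
    return Math.floor(contextLength * 0.35);
  }
  return Math.min(Math.floor(contextLength * 0.02), 1000); // current behavior
}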

Implementation Plan

  1. Immediate fix:
    • Store and use actual token counts from Anthropic API responses
    • Update compileChatMessages to accept an optional actualTokenCounts parameter
    • Use these for calculating the context percentage instead of estimates
  2. Medium term:
    • Research whether Anthropic provides a tokenizer library
    • If available, integrate the proper Anthropic tokenizer
    • Add tests comparing estimated vs. actual token counts
  3. Long term:
    • Refactor token counting to be provider-specific
    • Let each LLM provider handle its own tokenization
    • Fall back to conservative estimates when exact counting is unavailable

Related Files

  • core/llm/countTokens.ts - Token counting logic
  • core/llm/llms/Anthropic.ts - Anthropic provider (captures real usage)
  • core/llm/autodetect.ts - Template detection (returns 'none' for Claude)
  • core/llm/constants.ts - DEFAULT_PRUNING_LENGTH = 128000
  • packages/llm-info/src/providers/anthropic.ts - Model definitions (200k context)
  • core/llm/index.ts - compileChatMessages usage

Additional Context

  • This affects all Anthropic Claude models
  • The DEFAULT_PRUNING_LENGTH is 128000, but Claude models have 200k context
  • The contextLength from llm-info correctly shows 200000 for Claude models
  • Issue manifests in UI showing incorrect context usage percentage

Metadata

Labels: area:context-providers (Relates to context providers), kind:bug (Indicates an unexpected problem or unintended behavior)
