Merged
Changes from 9 commits
21 changes: 20 additions & 1 deletion CLAUDE.md
@@ -107,9 +107,17 @@ olla/
- `config.yaml` - Main configuration
- `internal/app/handlers/server_routes.go` - Route registration & API setup
- `internal/app/handlers/handler_proxy.go` - Request routing logic
- `internal/app/handlers/handler_translation.go` - Translation handler with passthrough logic
- `internal/adapter/proxy/sherpa/service.go` - Sherpa proxy implementation
- `internal/adapter/proxy/olla/service.go` - Olla proxy implementation
- `internal/adapter/translator/` - API translation layer (OpenAI ↔ Provider formats)
- `internal/adapter/translator/types.go` - PassthroughCapable interface and translator types
- `internal/adapter/translator/anthropic/` - Anthropic translator implementation
- `internal/adapter/stats/translator_collector.go` - Translator metrics collector
- `internal/core/constants/translator.go` - TranslatorMode and FallbackReason constants
- `internal/core/ports/stats.go` - StatsCollector interface with translator tracking
- `internal/core/domain/profile_config.go` - AnthropicSupportConfig for backend profiles
- `config/profiles/*.yaml` - Backend profiles with `anthropic_support` sections
- `internal/version/version.go` - Version information embedded at build time
- `/test/scripts/logic/test-model-routing.sh` - Test routing & headers

@@ -121,6 +129,7 @@ olla/
- `/internal/status/endpoints` - Endpoints status details
- `/internal/status/models` - Models status details
- `/internal/stats/models` - Model statistics
- `/internal/stats/translators` - Translator statistics
- `/internal/process` - Process statistics
- `/version` - Version information

@@ -135,12 +144,20 @@ olla/
### Translator Endpoints
Dynamically registered based on configured translators (e.g., Anthropic Messages API)

- `/olla/anthropic/v1/messages` - Anthropic Messages API (POST) - supports passthrough and translation modes
- `/olla/anthropic/v1/models` - List models in Anthropic format (GET)
- `/olla/anthropic/v1/messages/count_tokens` - Token count estimation (POST)

## Response Headers
- `X-Olla-Endpoint`: Backend name
- `X-Olla-Model`: Model used
- `X-Olla-Backend-Type`: ollama/openai/openai-compatible/lm-studio/vllm/sglang/llamacpp/lemonade
- `X-Olla-Request-ID`: Request ID
- `X-Olla-Response-Time`: Total processing time
- `X-Olla-Mode`: Translator mode used (`passthrough` or absent for translation) - set on Anthropic translator requests
- `X-Olla-Routing-Strategy`: Routing strategy used (when model routing is active)
- `X-Olla-Routing-Decision`: Routing decision made (routed/fallback/rejected)
- `X-Olla-Routing-Reason`: Human-readable reason for routing decision

## Testing

@@ -181,7 +198,9 @@ Always run `make ready` before committing changes.
- **Application Layer** (`internal/app`): HTTP handlers, middleware, and services

### Key Components
- **Translator Layer**: Enables API format translation (e.g., OpenAI ↔ Anthropic)
- **Translator Layer**: Enables API format translation (e.g., OpenAI ↔ Anthropic) with passthrough optimisation for backends with native support
- **Passthrough Mode**: When a backend natively supports the Anthropic Messages API (vLLM, llama.cpp, LM Studio, Ollama), requests bypass translation entirely
- **Translator Metrics**: Thread-safe per-translator statistics tracking passthrough/translation rates, fallback reasons, latency, and streaming breakdown (`internal/adapter/stats/translator_collector.go`)
- **Proxy Engines**: Choose Sherpa (simple) or Olla (high-performance)
- **Load Balancing**: Priority-based recommended for production
- **Version Management**: Build-time version injection via `internal/version`
Binary file modified assets/diagrams/features.excalidraw.png
11 changes: 7 additions & 4 deletions config/config.yaml
@@ -130,12 +130,15 @@ model_registry:

translators:
#####
# !Experimental! v0.0.20+
# Anthropic translation is very early stages of development, so please let us know
# if you come across issues or have feedback.
# Anthropic Messages API Translation (v0.0.20+)
# Enabled by default. Still actively being improved - please report any issues or feedback.
#####
anthropic:
enabled: false
enabled: true
# passthrough_enabled only applies when enabled=true
# When true: Forwards requests directly to backends with native Anthropic support (optimal performance)
# When false: Always translates Anthropic ↔ OpenAI format (useful for debugging/testing)
passthrough_enabled: true
max_message_size: 10485760 # 10MB - Anthropic API limit
# !! WARNING: Do not enable inspector in production without reviewing data privacy !!
# Anthropic messages may contain sensitive user data.
10 changes: 10 additions & 0 deletions config/profiles/llamacpp.yaml
@@ -16,6 +16,16 @@ routing:
# API compatibility
api:
openai_compatible: true

# Anthropic Messages API support (b4847+)
# llama.cpp is the ONLY backend that supports full token counting via /v1/messages/count_tokens
# This enables accurate prompt token estimation without making actual inference requests
anthropic_support:
enabled: true
messages_path: /v1/messages
token_count: true
min_version: "b4847"

paths:
# Model management (OpenAI-compatible)
- /v1/models # 4: list models (typically returns single model)
10 changes: 10 additions & 0 deletions config/profiles/lmstudio.yaml
@@ -14,6 +14,16 @@ routing:
# API compatibility
api:
openai_compatible: true

# Anthropic Messages API support (v0.4.1+)
# Added specifically for Claude Code integration, enabling native Anthropic API support
# without requiring translation middleware
anthropic_support:
enabled: true
messages_path: /v1/messages
token_count: false
min_version: "0.4.1"

paths:
- /v1/models # 0: health check & models
- /v1/chat/completions # 1: chat completions
13 changes: 13 additions & 0 deletions config/profiles/ollama.yaml
@@ -12,6 +12,19 @@ routing:
# API compatibility
api:
openai_compatible: true

# Anthropic Messages API support (v0.14.0+)
# UNSUPPORTED:
# - /v1/messages/count_tokens
# [11-01-2026]: https://docs.ollama.com/api/anthropic-compatibility#not-supported
anthropic_support:
enabled: true
messages_path: /v1/messages
token_count: false
min_version: "0.14.0"
limitations:
- token_counting_404

paths:
- / # 0: health check
- /api/generate # 1: text completion
12 changes: 12 additions & 0 deletions config/profiles/vllm.yaml
@@ -13,6 +13,18 @@ routing:
# API compatibility
api:
openai_compatible: true

# Anthropic Messages API support (v0.11.1+)
# vLLM v0.11.1+ natively supports the Anthropic Messages API, allowing direct forwarding
# of Anthropic-format requests without translation overhead
anthropic_support:
enabled: true
messages_path: /v1/messages
token_count: false
min_version: "0.11.1"
limitations:
- no_token_counting

paths:
# Health and system endpoints
- /health # 0: health check (vLLM-specific endpoint)
96 changes: 84 additions & 12 deletions docs/content/api-reference/anthropic.md
@@ -15,10 +15,14 @@ The Anthropic translator accepts requests in Anthropic Messages API format at `/
**Key Features**:

- ✅ Full Anthropic Messages API compatibility
- ✅ **Passthrough mode** for backends with native Anthropic support (vLLM, llama.cpp, LM Studio, Ollama)
- ✅ **Translation mode** for OpenAI-compatible backends without native support
- ✅ Automatic fallback from passthrough to translation when needed
- ✅ Streaming via Server-Sent Events (SSE)
- ✅ Tool use (function calling)
- ✅ Works with all OpenAI-compatible backends
- ✅ Zero backend changes required
- ✅ Translator metrics for observability (passthrough/translation rates, latency, fallback tracking)
- ⚠️ **Vision Support**: Image content blocks accepted but not yet processed
- ⛔ **Async Support**: Asynchronous workflows are not supported

@@ -31,6 +35,37 @@

## How it Works

Olla supports two modes for handling Anthropic API requests:

### Passthrough Mode (Preferred)

When a backend natively supports the Anthropic Messages API, requests are forwarded directly without any translation overhead.

```mermaid
sequenceDiagram
participant Client as Claude Code
participant Olla as Olla (Passthrough)
participant Backend as Anthropic-Compatible Backend

Client->>Olla: POST /olla/anthropic/v1/messages<br/>(Anthropic format)

Note over Olla: 1. Detect native Anthropic support
Note over Olla: 2. Forward request as-is

Olla->>Backend: POST /v1/messages<br/>(Anthropic format - unchanged)
Backend->>Olla: Response (Anthropic format)

Olla->>Client: Response (Anthropic format - unchanged)
```

**Compatible backends**: vLLM (v0.11.1+), llama.cpp (b4847+), LM Studio (v0.4.1+), Ollama (v0.14.0+)

**Observability**: Responses include `X-Olla-Mode: passthrough` header.
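
To make the forwarded payload concrete, here is a minimal sketch in Go of the kind of Anthropic-format request body that passthrough mode forwards to the backend unchanged. The struct covers only the core fields, and `llama4:latest` is a placeholder model name:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// messagesRequest models the minimal fields of an Anthropic Messages
// API request body; in passthrough mode Olla forwards bodies like this
// verbatim, without translating them to OpenAI format.
type messagesRequest struct {
	Model     string    `json:"model"`
	MaxTokens int       `json:"max_tokens"`
	Messages  []message `json:"messages"`
}

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// newMessagesRequest builds a single-turn request for the given model.
func newMessagesRequest(model, prompt string, maxTokens int) messagesRequest {
	return messagesRequest{
		Model:     model,
		MaxTokens: maxTokens,
		Messages:  []message{{Role: "user", Content: prompt}},
	}
}

func main() {
	// "llama4:latest" is a placeholder model name.
	body, _ := json.Marshal(newMessagesRequest("llama4:latest", "Hello", 256))
	fmt.Println(string(body))
}
```

The same JSON shape reaches the backend's `/v1/messages` endpoint unmodified, which is why no translation overhead is incurred.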

### Translation Mode (Fallback)

When no backend supports native Anthropic format, requests are translated to OpenAI format and responses are translated back.

```mermaid
sequenceDiagram
participant Client as Claude Code
%% @@ -51,16 +86,9 @@ (intermediate steps collapsed in diff view)
Olla->>Client: Response (Anthropic format)
```

**Translation Process**:

1. Client sends Anthropic-formatted request
2. Olla translates request to OpenAI format
3. Request routed through standard Olla pipeline (load balancing, health checks)
4. Backend processes request (unaware of original format)
5. Olla translates OpenAI response back to Anthropic format
6. Client receives Anthropic-formatted response
**Mode Selection**: Olla automatically selects the best mode based on available backend capabilities. No client-side configuration is required.

For detailed explanation, see [API Translation Concept](../concepts/api-translation.md).
For detailed explanation of both modes, see [API Translation Concept](../concepts/api-translation.md).

## Endpoints Overview

@@ -600,6 +628,13 @@ All responses include standard Olla headers:
| `X-Olla-Model` | Actual model used | `llama4:latest` |
| `X-Olla-Backend-Type` | Backend type | `ollama` |
| `X-Olla-Response-Time` | Total processing time | `1.234s` |
| `X-Olla-Mode` | Translator mode (only present for passthrough) | `passthrough` |

!!! tip "Detecting Passthrough Mode"
When passthrough mode is active, the response includes the `X-Olla-Mode: passthrough` header; in translation mode the header is absent. This lets monitoring and debugging tools distinguish between the two modes.

!!! info "Translator Statistics"
For aggregate translator metrics including passthrough rates, success rates, fallback reasons, and latency data, query the [`GET /internal/stats/translators`](system.md#get-internalstatstranslators) endpoint.


## Authentication
@@ -681,6 +716,7 @@ Errors follow Anthropic API format:
- Stop sequences
- Temperature, top_p, top_k parameters
- Content blocks (text, tool_use, tool_result)
- **Passthrough mode** for backends with native Anthropic support (zero translation overhead)

**Tool Choice Mapping**:

@@ -707,12 +743,13 @@

## Configuration

Enable Anthropic translation in `config.yaml`:
Anthropic translation is enabled by default. To customise, edit `config.yaml`:

```yaml
translators:
anthropic:
enabled: true # Enable Anthropic API translator
enabled: true # Enabled by default
passthrough_enabled: true # Forward directly to backends with native Anthropic support (default)
max_message_size: 10485760 # Max request size (10MB)

# Standard Olla configuration
@@ -730,8 +767,43 @@ discovery:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `enabled` | boolean | `false` | Enable Anthropic translator |
| `enabled` | boolean | `true` | Enable the Anthropic translator |
| `max_message_size` | integer | `10485760` | Max request size in bytes (10MB) |
| `passthrough_enabled` | boolean | `true` | Passthrough optimisation mode. When `true` (default), requests are forwarded directly to backends with native Anthropic support for zero translation overhead. When `false`, all requests go through translation regardless of backend capabilities. Only applies when `enabled: true`. Individual backends must also declare `anthropic_support` in their profile. |

### Passthrough Configuration

Passthrough mode requires two things to be active:

1. The `passthrough_enabled` field must be set to `true` in the translator configuration
2. Backend profiles must declare native Anthropic support via `anthropic_support.enabled: true`

```yaml
translators:
anthropic:
enabled: true
passthrough_enabled: true # Required to enable passthrough mode
```

When `passthrough_enabled` is `true` (the default), Olla forwards requests directly to backends with native Anthropic support. Set `passthrough_enabled` to `false` to force all requests through the translation pipeline regardless of backend capabilities, which can be useful for debugging or testing the translation layer.
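
The combined decision can be expressed as a simple predicate. This is an illustrative sketch of the documented behaviour, not Olla's actual code:

```go
package main

import "fmt"

// usePassthrough mirrors the documented mode selection: passthrough is
// used only when the translator is enabled, passthrough_enabled is true,
// and the selected backend's profile declares native Anthropic support
// via anthropic_support.enabled. Otherwise requests go through the
// translation pipeline.
func usePassthrough(translatorEnabled, passthroughEnabled, backendNative bool) bool {
	return translatorEnabled && passthroughEnabled && backendNative
}

func main() {
	fmt.Println(usePassthrough(true, true, true))  // true: forwarded as-is
	fmt.Println(usePassthrough(true, false, true)) // false: translation forced
	fmt.Println(usePassthrough(true, true, false)) // false: backend lacks native support
}
```

Flipping any one of the three inputs to `false` is enough to route the request through translation instead.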

**Backends with native Anthropic support**:

| Backend | Profile | Min Version | Notes |
|---------|---------|-------------|-------|
| vLLM | `config/profiles/vllm.yaml` | v0.11.1+ | No token counting |
| llama.cpp | `config/profiles/llamacpp.yaml` | b4847+ | Supports token counting |
| LM Studio | `config/profiles/lmstudio.yaml` | v0.4.1+ | No token counting |
| Ollama | `config/profiles/ollama.yaml` | v0.14.0+ | No token counting |

To disable passthrough for a specific backend, set `anthropic_support.enabled: false` in the profile:

```yaml
# config/profiles/vllm.yaml (custom override)
api:
anthropic_support:
enabled: false # Force translation mode for this backend
```


## Performance Considerations
14 changes: 11 additions & 3 deletions docs/content/api-reference/overview.md
@@ -19,10 +19,14 @@ If you ever need to remember the port, think - what's the port, 4 OLLA?!
## API Sections

### [System Endpoints](system.md)
Internal endpoints for health monitoring and system status.
Internal endpoints for health monitoring, system status, and statistics.

- `/internal/health` - Health check endpoint
- `/internal/status` - System status and statistics
- `/internal/status/endpoints` - Endpoint status details
- `/internal/status/models` - Model registry status
- `/internal/stats/models` - Model usage statistics
- `/internal/stats/translators` - Translator usage and performance statistics
- `/internal/process` - Process information

### [Unified Models API](models.md)
@@ -88,21 +92,24 @@ Anthropic-compatible API endpoints for Claude clients.
**Endpoints**:
- `POST /olla/anthropic/v1/messages` - Create a message (chat)
- `GET /olla/anthropic/v1/models` - List available models
- `POST /olla/anthropic/v1/messages/count_tokens` - Estimate token count

**Features**:
- Full Anthropic Messages API v1 support
- Automatic translation to OpenAI format
- **Passthrough mode** for backends with native Anthropic support (vLLM, llama.cpp, LM Studio, Ollama)
- Automatic fallback to translation mode when needed
- Streaming with Server-Sent Events
- Tool use (function calling)
- Vision support (multi-modal)
- Translator metrics for observability

**Use With**:
- Claude Code
- OpenCode
- Crush CLI
- Any Anthropic API client

See [API Translation](../concepts/api-translation.md) for how translation works.
See [API Translation](../concepts/api-translation.md) for how passthrough and translation modes work.

## Authentication

@@ -145,6 +152,7 @@ All responses include:
| `X-Olla-Routing-Strategy` | Routing strategy used (when model routing is active) |
| `X-Olla-Routing-Decision` | Routing decision made (routed/fallback/rejected) |
| `X-Olla-Routing-Reason` | Human-readable reason for routing decision |
| `X-Olla-Mode` | Translator mode (`passthrough` when native format used; absent for translation mode) |

### Provider Metrics (Debug Logs)
