Merged
Changes from 9 commits
21 changes: 20 additions & 1 deletion CLAUDE.md
@@ -107,9 +107,17 @@ olla/
- `config.yaml` - Main configuration
- `internal/app/handlers/server_routes.go` - Route registration & API setup
- `internal/app/handlers/handler_proxy.go` - Request routing logic
- `internal/app/handlers/handler_translation.go` - Translation handler with passthrough logic
- `internal/adapter/proxy/sherpa/service.go` - Sherpa proxy implementation
- `internal/adapter/proxy/olla/service.go` - Olla proxy implementation
- `internal/adapter/translator/` - API translation layer (OpenAI ↔ Provider formats)
- `internal/adapter/translator/types.go` - PassthroughCapable interface and translator types
- `internal/adapter/translator/anthropic/` - Anthropic translator implementation
- `internal/adapter/stats/translator_collector.go` - Translator metrics collector
- `internal/core/constants/translator.go` - TranslatorMode and FallbackReason constants
- `internal/core/ports/stats.go` - StatsCollector interface with translator tracking
- `internal/core/domain/profile_config.go` - AnthropicSupportConfig for backend profiles
- `config/profiles/*.yaml` - Backend profiles with `anthropic_support` sections
- `internal/version/version.go` - Version information embedded at build time
- `/test/scripts/logic/test-model-routing.sh` - Test routing & headers

@@ -121,6 +129,7 @@ olla/
- `/internal/status/endpoints` - Endpoints status details
- `/internal/status/models` - Models status details
- `/internal/stats/models` - Model statistics
- `/internal/stats/translators` - Translator statistics
- `/internal/process` - Process statistics
- `/version` - Version information

@@ -135,12 +144,20 @@ olla/
### Translator Endpoints
Dynamically registered based on configured translators (e.g., Anthropic Messages API)

- `/olla/anthropic/v1/messages` - Anthropic Messages API (POST) - supports passthrough and translation modes
- `/olla/anthropic/v1/models` - List models in Anthropic format (GET)
- `/olla/anthropic/v1/messages/count_tokens` - Token count estimation (POST)

## Response Headers
- `X-Olla-Endpoint`: Backend name
- `X-Olla-Model`: Model used
- `X-Olla-Backend-Type`: ollama/openai/openai-compatible/lm-studio/vllm/sglang/llamacpp/lemonade
- `X-Olla-Request-ID`: Request ID
- `X-Olla-Response-Time`: Total processing time
- `X-Olla-Mode`: Translator mode used (`passthrough` or absent for translation) - set on Anthropic translator requests
- `X-Olla-Routing-Strategy`: Routing strategy used (when model routing is active)
- `X-Olla-Routing-Decision`: Routing decision made (routed/fallback/rejected)
- `X-Olla-Routing-Reason`: Human-readable reason for routing decision

## Testing

@@ -181,7 +198,9 @@ Always run `make ready` before committing changes.
- **Application Layer** (`internal/app`): HTTP handlers, middleware, and services

### Key Components
- **Translator Layer**: Enables API format translation (e.g., OpenAI ↔ Anthropic)
- **Translator Layer**: Enables API format translation (e.g., OpenAI ↔ Anthropic) with passthrough optimisation for backends with native support
- **Passthrough Mode**: When a backend natively supports the Anthropic Messages API (vLLM, llama.cpp, LM Studio, Ollama), requests bypass translation entirely
- **Translator Metrics**: Thread-safe per-translator statistics tracking passthrough/translation rates, fallback reasons, latency, and streaming breakdown (`internal/adapter/stats/translator_collector.go`)
- **Proxy Engines**: Choose Sherpa (simple) or Olla (high-performance)
- **Load Balancing**: Priority-based recommended for production
- **Version Management**: Build-time version injection via `internal/version`
Binary file modified assets/diagrams/features.excalidraw.png
11 changes: 7 additions & 4 deletions config/config.yaml
@@ -130,12 +130,15 @@ model_registry:

translators:
#####
# !Experimental! v0.0.20+
# Anthropic translation is very early stages of development, so please let us know
# if you come across issues or have feedback.
# Anthropic Messages API Translation (v0.0.20+)
# Enabled by default. Still actively being improved - please report any issues or feedback.
#####
anthropic:
enabled: false
enabled: true
# passthrough_enabled only applies when enabled=true
# When true: Forwards requests directly to backends with native Anthropic support (optimal performance)
# When false: Always translates Anthropic ↔ OpenAI format (useful for debugging/testing)
passthrough_enabled: true
max_message_size: 10485760 # 10MB - Anthropic API limit
# !! WARNING: Do not enable inspector in production without reviewing data privacy !!
# Anthropic messages may contain sensitive user data.
10 changes: 10 additions & 0 deletions config/profiles/llamacpp.yaml
@@ -16,6 +16,16 @@ routing:
# API compatibility
api:
openai_compatible: true

# Anthropic Messages API support (b4847+)
# llama.cpp is the ONLY backend that supports full token counting via /v1/messages/count_tokens
# This enables accurate prompt token estimation without making actual inference requests
anthropic_support:
enabled: true
messages_path: /v1/messages
token_count: true
min_version: "b4847"

paths:
# Model management (OpenAI-compatible)
- /v1/models # 4: list models (typically returns single model)
10 changes: 10 additions & 0 deletions config/profiles/lmstudio.yaml
@@ -14,6 +14,16 @@ routing:
# API compatibility
api:
openai_compatible: true

# Anthropic Messages API support (v0.4.1+)
# Added specifically for Claude Code integration, enabling native Anthropic API support
# without requiring translation middleware
anthropic_support:
enabled: true
messages_path: /v1/messages
token_count: false
min_version: "0.4.1"

paths:
- /v1/models # 0: health check & models
- /v1/chat/completions # 1: chat completions
13 changes: 13 additions & 0 deletions config/profiles/ollama.yaml
@@ -12,6 +12,19 @@ routing:
# API compatibility
api:
openai_compatible: true

# Anthropic Messages API support (v0.14.0+)
# UNSUPPORTED:
# - /v1/messages/count_tokens
# [11-01-2026]: https://docs.ollama.com/api/anthropic-compatibility#not-supported
anthropic_support:
enabled: true
messages_path: /v1/messages
token_count: false
min_version: "0.14.0"
limitations:
- token_counting_404

paths:
- / # 0: health check
- /api/generate # 1: text completion
12 changes: 12 additions & 0 deletions config/profiles/vllm.yaml
@@ -13,6 +13,18 @@ routing:
# API compatibility
api:
openai_compatible: true

# Anthropic Messages API support (v0.11.1+)
# vLLM v0.11.1+ natively supports the Anthropic Messages API, allowing direct forwarding
# of Anthropic-format requests without translation overhead
anthropic_support:
enabled: true
messages_path: /v1/messages
token_count: false
min_version: "0.11.1"
limitations:
- no_token_counting

paths:
# Health and system endpoints
- /health # 0: health check (vLLM-specific endpoint)
96 changes: 84 additions & 12 deletions docs/content/api-reference/anthropic.md
@@ -15,10 +15,14 @@ The Anthropic translator accepts requests in Anthropic Messages API format at `/
**Key Features**:

- ✅ Full Anthropic Messages API compatibility
- ✅ **Passthrough mode** for backends with native Anthropic support (vLLM, llama.cpp, LM Studio, Ollama)
- ✅ **Translation mode** for OpenAI-compatible backends without native support
- ✅ Automatic fallback from passthrough to translation when needed
- ✅ Streaming via Server-Sent Events (SSE)
- ✅ Tool use (function calling)
- ✅ Works with all OpenAI-compatible backends
- ✅ Zero backend changes required
- ✅ Translator metrics for observability (passthrough/translation rates, latency, fallback tracking)
- ⚠️ **Vision Support**: Image content blocks accepted but not yet processed
- ⛔ **Async Support**: Asynchronous workflows are not supported

@@ -31,6 +35,37 @@

## How it Works

Olla supports two modes for handling Anthropic API requests:

### Passthrough Mode (Preferred)

When a backend natively supports the Anthropic Messages API, requests are forwarded directly without any translation overhead.

```mermaid
sequenceDiagram
participant Client as Claude Code
participant Olla as Olla (Passthrough)
participant Backend as Anthropic-Compatible Backend

Client->>Olla: POST /olla/anthropic/v1/messages<br/>(Anthropic format)

Note over Olla: 1. Detect native Anthropic support
Note over Olla: 2. Forward request as-is

Olla->>Backend: POST /v1/messages<br/>(Anthropic format - unchanged)
Backend->>Olla: Response (Anthropic format)

Olla->>Client: Response (Anthropic format - unchanged)
```

**Compatible backends**: vLLM (v0.11.1+), llama.cpp (b4847+), LM Studio (v0.4.1+), Ollama (v0.14.0+)

**Observability**: Responses include `X-Olla-Mode: passthrough` header.
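
To make the forwarded payload concrete, here is a minimal sketch in Go of the kind of Anthropic-format request body that passthrough mode forwards to the backend unchanged. The struct covers only the core fields, and `llama4:latest` is a placeholder model name:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// messagesRequest models the minimal fields of an Anthropic Messages
// API request body; in passthrough mode Olla forwards bodies like this
// verbatim, without translating them to OpenAI format.
type messagesRequest struct {
	Model     string    `json:"model"`
	MaxTokens int       `json:"max_tokens"`
	Messages  []message `json:"messages"`
}

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// newMessagesRequest builds a single-turn request for the given model.
func newMessagesRequest(model, prompt string, maxTokens int) messagesRequest {
	return messagesRequest{
		Model:     model,
		MaxTokens: maxTokens,
		Messages:  []message{{Role: "user", Content: prompt}},
	}
}

func main() {
	// "llama4:latest" is a placeholder model name.
	body, _ := json.Marshal(newMessagesRequest("llama4:latest", "Hello", 256))
	fmt.Println(string(body))
}
```

The same JSON shape reaches the backend's `/v1/messages` endpoint unmodified, which is why no translation overhead is incurred.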

### Translation Mode (Fallback)

When no backend supports native Anthropic format, requests are translated to OpenAI format and responses are translated back.

```mermaid
sequenceDiagram
participant Client as Claude Code
%% @@ -51,16 +86,9 @@ (intermediate steps collapsed in diff view)
Olla->>Client: Response (Anthropic format)
```

**Translation Process**:

1. Client sends Anthropic-formatted request
2. Olla translates request to OpenAI format
3. Request routed through standard Olla pipeline (load balancing, health checks)
4. Backend processes request (unaware of original format)
5. Olla translates OpenAI response back to Anthropic format
6. Client receives Anthropic-formatted response
**Mode Selection**: Olla automatically selects the best mode based on available backend capabilities. No client-side configuration is required.

For detailed explanation, see [API Translation Concept](../concepts/api-translation.md).
For detailed explanation of both modes, see [API Translation Concept](../concepts/api-translation.md).

## Endpoints Overview

@@ -600,6 +628,13 @@ All responses include standard Olla headers:
| `X-Olla-Model` | Actual model used | `llama4:latest` |
| `X-Olla-Backend-Type` | Backend type | `ollama` |
| `X-Olla-Response-Time` | Total processing time | `1.234s` |
| `X-Olla-Mode` | Translator mode (only present for passthrough) | `passthrough` |

!!! tip "Detecting Passthrough Mode"
When passthrough mode is active, the response includes the `X-Olla-Mode: passthrough` header; in translation mode the header is absent. This lets monitoring and debugging tools distinguish between the two modes.

!!! info "Translator Statistics"
For aggregate translator metrics including passthrough rates, success rates, fallback reasons, and latency data, query the [`GET /internal/stats/translators`](system.md#get-internalstatstranslators) endpoint.


## Authentication
@@ -681,6 +716,7 @@ Errors follow Anthropic API format:
- Stop sequences
- Temperature, top_p, top_k parameters
- Content blocks (text, tool_use, tool_result)
- **Passthrough mode** for backends with native Anthropic support (zero translation overhead)

**Tool Choice Mapping**:

@@ -707,12 +743,13 @@

## Configuration

Enable Anthropic translation in `config.yaml`:
Anthropic translation is enabled by default. To customise, edit `config.yaml`:

```yaml
translators:
anthropic:
enabled: true # Enable Anthropic API translator
enabled: true # Enabled by default
passthrough_enabled: true # Forward directly to backends with native Anthropic support (default)
max_message_size: 10485760 # Max request size (10MB)

# Standard Olla configuration
@@ -730,8 +767,43 @@ discovery:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `enabled` | boolean | `false` | Enable Anthropic translator |
| `enabled` | boolean | `true` | Enable the Anthropic translator |
| `max_message_size` | integer | `10485760` | Max request size in bytes (10MB) |
| `passthrough_enabled` | boolean | `true` | Passthrough optimisation mode. When `true` (default), requests are forwarded directly to backends with native Anthropic support for zero translation overhead. When `false`, all requests go through translation regardless of backend capabilities. Only applies when `enabled: true`. Individual backends must also declare `anthropic_support` in their profile. |

### Passthrough Configuration

Passthrough mode requires two things to be active:

1. The `passthrough_enabled` field must be set to `true` in the translator configuration
2. Backend profiles must declare native Anthropic support via `anthropic_support.enabled: true`

```yaml
translators:
anthropic:
enabled: true
passthrough_enabled: true # Required to enable passthrough mode
```

When `passthrough_enabled` is `true` (the default), Olla forwards requests directly to backends with native Anthropic support. Set `passthrough_enabled` to `false` to force all requests through the translation pipeline regardless of backend capabilities, which can be useful for debugging or testing the translation layer.
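
The combined decision can be expressed as a simple predicate. This is an illustrative sketch of the documented behaviour, not Olla's actual code:

```go
package main

import "fmt"

// usePassthrough mirrors the documented mode selection: passthrough is
// used only when the translator is enabled, passthrough_enabled is true,
// and the selected backend's profile declares native Anthropic support
// via anthropic_support.enabled. Otherwise requests go through the
// translation pipeline.
func usePassthrough(translatorEnabled, passthroughEnabled, backendNative bool) bool {
	return translatorEnabled && passthroughEnabled && backendNative
}

func main() {
	fmt.Println(usePassthrough(true, true, true))  // true: forwarded as-is
	fmt.Println(usePassthrough(true, false, true)) // false: translation forced
	fmt.Println(usePassthrough(true, true, false)) // false: backend lacks native support
}
```

Flipping any one of the three inputs to `false` is enough to route the request through translation instead.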

**Backends with native Anthropic support**:

| Backend | Profile | Min Version | Notes |
|---------|---------|-------------|-------|
| vLLM | `config/profiles/vllm.yaml` | v0.11.1+ | No token counting |
| llama.cpp | `config/profiles/llamacpp.yaml` | b4847+ | Supports token counting |
| LM Studio | `config/profiles/lmstudio.yaml` | v0.4.1+ | No token counting |
| Ollama | `config/profiles/ollama.yaml` | v0.14.0+ | No token counting |

To disable passthrough for a specific backend, set `anthropic_support.enabled: false` in the profile:

```yaml
# config/profiles/vllm.yaml (custom override)
api:
anthropic_support:
enabled: false # Force translation mode for this backend
```


## Performance Considerations
14 changes: 11 additions & 3 deletions docs/content/api-reference/overview.md
@@ -19,10 +19,14 @@ If you ever need to remember the port, think - what's the port, 4 OLLA?!
## API Sections

### [System Endpoints](system.md)
Internal endpoints for health monitoring and system status.
Internal endpoints for health monitoring, system status, and statistics.

- `/internal/health` - Health check endpoint
- `/internal/status` - System status and statistics
- `/internal/status/endpoints` - Endpoint status details
- `/internal/status/models` - Model registry status
- `/internal/stats/models` - Model usage statistics
- `/internal/stats/translators` - Translator usage and performance statistics
- `/internal/process` - Process information

### [Unified Models API](models.md)
@@ -88,21 +92,24 @@ Anthropic-compatible API endpoints for Claude clients.
**Endpoints**:
- `POST /olla/anthropic/v1/messages` - Create a message (chat)
- `GET /olla/anthropic/v1/models` - List available models
- `POST /olla/anthropic/v1/messages/count_tokens` - Estimate token count

**Features**:
- Full Anthropic Messages API v1 support
- Automatic translation to OpenAI format
- **Passthrough mode** for backends with native Anthropic support (vLLM, llama.cpp, LM Studio, Ollama)
- Automatic fallback to translation mode when needed
- Streaming with Server-Sent Events
- Tool use (function calling)
- Vision support (multi-modal)
- Translator metrics for observability

**Use With**:
- Claude Code
- OpenCode
- Crush CLI
- Any Anthropic API client

See [API Translation](../concepts/api-translation.md) for how translation works.
See [API Translation](../concepts/api-translation.md) for how passthrough and translation modes work.

## Authentication

@@ -145,6 +152,7 @@ All responses include:
| `X-Olla-Routing-Strategy` | Routing strategy used (when model routing is active) |
| `X-Olla-Routing-Decision` | Routing decision made (routed/fallback/rejected) |
| `X-Olla-Routing-Reason` | Human-readable reason for routing decision |
| `X-Olla-Mode` | Translator mode (`passthrough` when native format used; absent for translation mode) |

### Provider Metrics (Debug Logs)
