21 changes: 21 additions & 0 deletions CLAUDE.md
@@ -114,6 +114,8 @@ olla/
- `internal/adapter/translator/types.go` - PassthroughCapable interface and translator types
- `internal/adapter/translator/anthropic/` - Anthropic translator implementation
- `internal/adapter/stats/translator_collector.go` - Translator metrics collector
- `internal/adapter/balancer/sticky.go` - Sticky session wrapper
- `internal/app/handlers/handler_stats_sticky.go` - Sticky session stats endpoint
- `internal/core/constants/translator.go` - TranslatorMode and FallbackReason constants
- `internal/core/ports/stats.go` - StatsCollector interface with translator tracking
- `internal/core/domain/profile_config.go` - AnthropicSupportConfig for backend profiles
@@ -130,6 +132,7 @@ olla/
- `/internal/status/models` - Models status details
- `/internal/stats/models` - Model statistics
- `/internal/stats/translators` - Translator statistics
- `/internal/stats/sticky` - Sticky session statistics (returns `{"enabled":false}` when disabled)
- `/internal/process` - Process statistics
- `/version` - Version information

@@ -158,6 +161,9 @@ Dynamically registered based on configured translators (e.g., Anthropic Messages
- `X-Olla-Routing-Strategy`: Routing strategy used (when model routing is active)
- `X-Olla-Routing-Decision`: Routing decision made (routed/fallback/rejected)
- `X-Olla-Routing-Reason`: Human-readable reason for routing decision
- `X-Olla-Sticky-Session`: Sticky session status (hit/miss/repin/disabled)
- `X-Olla-Sticky-Key-Source`: Key source used (session_header/prefix_hash/auth_header/ip)
- `X-Olla-Session-ID`: Echoed session ID when client supplies one

## Testing

@@ -201,6 +207,7 @@ Always run `make ready` before committing changes.
- **Translator Layer**: Enables API format translation (e.g., OpenAI ↔ Anthropic) with passthrough optimisation for backends with native support
- **Passthrough Mode**: When a backend natively supports the Anthropic Messages API (vLLM, llama.cpp, LM Studio, Ollama), requests bypass translation entirely
- **Translator Metrics**: Thread-safe per-translator statistics tracking passthrough/translation rates, fallback reasons, latency, and streaming breakdown (`internal/adapter/stats/translator_collector.go`)
- **Sticky Sessions**: Optional decorator on the endpoint selector that pins multi-turn LLM conversations to the backend that handled the first turn, maximising KV-cache reuse. FNV-64a hashed keys, TTL + LRU bounded, purged on routable→non-routable health transitions (`internal/adapter/balancer/sticky.go`)
- **Proxy Engines**: Choose Sherpa (simple) or Olla (high-performance)
- **Load Balancing**: Priority-based recommended for production
- **Version Management**: Build-time version injection via `internal/version`
@@ -212,6 +219,20 @@ Always run `make ready` before committing changes.
- Always run `make ready` before committing
- Use `make help` to see all available commands

## Dependencies (Endorsed)

```go
"github.com/docker/go-units" // Human-readable sizes
"github.com/json-iterator/go" // High-performance JSON encoding/decoding
"github.com/puzpuzpuz/xsync/v4" // Concurrent maps/counters
"github.com/tidwall/gjson" // Fast JSON parsing
"github.com/jellydator/ttlcache" // Time-to-live cache
"golang.org/x/sync" // errgroup
"golang.org/x/time" // rate limiting
```

Do not add additional dependencies unless explicitly asked.

## SUB-AGENT DELEGATION

CRITICAL: Always delegate tasks to the appropriate subagent. Do NOT perform work directly in the main context.
14 changes: 14 additions & 0 deletions config/config.yaml
@@ -41,6 +41,20 @@ proxy:
  response_timeout: 900s
  read_timeout: 600s

  # KV-cache affinity routing (opt-in)
  # Routes repeat turns in a conversation to the same backend, maximising KV-cache
  # hit rates for long multi-turn sessions at the cost of reduced load distribution.
  sticky_sessions:
    enabled: false              # opt-in: route same-prefix conversations to same backend
    idle_ttl_seconds: 600       # 10-min sliding window (refreshed on each matched request)
    max_sessions: 10000         # LRU evicts oldest sessions when full
    key_sources:                # tried in order; first match wins
      - "session_header"        # X-Olla-Session-ID header (explicit client opt-in)
      - "prefix_hash"           # FNV-64a hash of first N bytes of messages JSON (best cache locality)
      - "auth_header"           # Authorization header hash (per-user affinity)
      # - "ip"                  # client IP (opt-in; unreliable behind NAT/Docker)
    prefix_hash_bytes: 512      # how many leading bytes of the messages field to hash

  # DEPRECATED as of v0.0.16 - These fields are no longer used
  # max_retries: 3              # Replaced by retry.max_attempts
  # retry_backoff: 500ms        # Now uses intelligent exponential backoff
2 changes: 2 additions & 0 deletions docs/content/concepts/load-balancing.md
@@ -15,6 +15,8 @@

Olla provides multiple load balancing strategies to distribute requests across backend endpoints efficiently. Each strategy has specific use cases and characteristics to optimise for different deployment scenarios.

For multi-turn LLM workloads, combine any strategy with [Sticky Sessions](sticky-sessions.md) to preserve KV-cache affinity across turns.

## Overview

Load balancing determines which backend endpoint receives each incoming request. The strategy you choose affects:
219 changes: 219 additions & 0 deletions docs/content/concepts/sticky-sessions.md
@@ -0,0 +1,219 @@
# Sticky Sessions

> :memo: **Default Configuration**
> ```yaml
> proxy:
>   sticky_sessions:
>     enabled: false
> ```
> Sticky sessions are **opt-in**. Set `enabled: true` under `proxy.sticky_sessions` to activate KV-cache affinity routing.
>
> **Environment Variable**: `OLLA_PROXY_STICKY_SESSIONS_ENABLED`

Sticky sessions route repeat turns in a multi-turn conversation to the same backend endpoint, maximising KV-cache reuse across turns. The feature wraps the configured load balancer as a decorator. The underlying strategy (priority, round-robin, least-connections) is unchanged for new sessions and fallback cases.

## Why sticky sessions

Modern LLM backends maintain a KV-cache for the token sequence they have already processed. When the next turn of a conversation lands on the **same** backend, the backend can skip re-ingesting the full context and jump straight to generating new tokens. For long conversations this produces a substantial reduction in both time-to-first-token and compute cost; the benefit scales with context length.

Without affinity, a load balancer may distribute turn N and turn N+1 to different backends. The receiving backend for turn N+1 has a cold cache and must process the entire prompt from scratch. For workloads with short prompts or single-turn completions this overhead is negligible; for chat-style applications with growing context it compounds with every turn.

## How it works

On each request, Olla computes a **session key** from one of the configured key sources (see [Key sources](#key-sources) below). The key is looked up in an in-memory LRU/TTL store:

- **Hit**: the pinned backend is still routable; the request is sent there and the TTL is refreshed.
- **Miss**: no entry exists; the request is forwarded to the underlying balancer, and the selected backend is stored.
- **Repin**: an entry exists but the pinned backend is no longer routable; the underlying balancer selects a new backend and the entry is overwritten.
- **Disabled**: no key source produced a usable key (e.g. no `X-Olla-Session-ID` header and no other sources configured); the request is passed through to the underlying balancer without recording anything.

```mermaid
sequenceDiagram
participant C as Client
participant O as Olla
participant Store as Session Store
participant B as Backend

C->>O: POST /olla/proxy/... (turn N+1)
O->>O: Derive session key
O->>Store: Lookup key
alt Cache hit (backend routable)
Store-->>O: Pinned backend URL
O->>B: Forward to pinned backend
B-->>O: Response
O-->>C: Response + X-Olla-Sticky-Session: hit
else Cache miss / repin
Store-->>O: No entry (or dead backend)
O->>O: Delegate to underlying balancer
O->>Store: Store key → selected backend
O->>B: Forward to selected backend
B-->>O: Response
O-->>C: Response + X-Olla-Sticky-Session: miss
end
```
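
The same decision logic can be sketched in a few lines of Go. This is an illustrative simplification, not the actual `StickySessionWrapper` code: the `Endpoint` type, the plain map standing in for the TTL/LRU store, and the `fallback` closure (the underlying balancer) are assumptions for the sketch.

```go
package main

import "fmt"

// Hypothetical, simplified types; the real logic lives in
// internal/adapter/balancer/sticky.go and works on Olla's domain types.
type Endpoint struct {
	Name     string
	Routable bool
}

// sessionStore stands in for the TTL/LRU session store.
type sessionStore map[string]*Endpoint

// selectEndpoint sketches the hit / miss / repin / disabled outcomes described above.
func selectEndpoint(key string, store sessionStore, fallback func() *Endpoint) (*Endpoint, string) {
	if key == "" {
		// No key source produced a usable key: pass straight through.
		return fallback(), "disabled"
	}
	if pinned, ok := store[key]; ok {
		if pinned.Routable {
			// Hit: reuse the pinned backend (the real store also refreshes the TTL here).
			return pinned, "hit"
		}
		// Repin: pinned backend is no longer routable; ask the inner balancer again.
		chosen := fallback()
		store[key] = chosen
		return chosen, "repin"
	}
	// Miss: first time this key is seen; record the inner balancer's choice.
	chosen := fallback()
	store[key] = chosen
	return chosen, "miss"
}

func main() {
	store := sessionStore{}
	inner := func() *Endpoint { return &Endpoint{Name: "gpu-server-1", Routable: true} }

	_, outcome := selectEndpoint("conv-abc123", store, inner)
	fmt.Println(outcome) // miss
	_, outcome = selectEndpoint("conv-abc123", store, inner)
	fmt.Println(outcome) // hit
}
```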

## Key sources

The `key_sources` list is evaluated in order; the first source that produces a non-empty value wins. All keys are scoped to the model name so the same client talking to different models maintains independent session state.

| Source | How the key is derived | When to prefer it | Caveats |
|---|---|---|---|
| `session_header` | FNV-64a hash of the `X-Olla-Session-ID` request header | Explicit client opt-in; most reliable | Client must send the header consistently |
| `prefix_hash` | FNV-64a hash of the first `prefix_hash_bytes` bytes of the `messages` JSON field | No client changes needed; best cache locality | Two conversations with identical opening messages share a session |
| `auth_header` | FNV-64a hash of the `Authorization` header value | Per-user affinity without client changes | Breaks if the token rotates mid-conversation; unreliable with shared tokens |
| `ip` | Client IP address (extracted via `net.SplitHostPort`) | Simple deployments with no NAT | Unreliable behind NAT, load balancers, or Docker networking |

All header and token values are hashed before storage; plaintext secrets are never written to the session store.

The default configuration enables `session_header`, `prefix_hash`, and `auth_header` (in that order) and comments out `ip` because it is unreliable behind typical container networking. Adjust the list to suit your deployment.
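
For illustration, the cascade can be sketched with the standard library's `hash/fnv`. The function name, signature, and the exact way the model name is folded into the hash are assumptions; only the first-match-wins ordering and the FNV-64a hashing are taken from the table above.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"net"
	"net/http"
)

// deriveKey walks the configured sources in order and hashes the first value found.
func deriveKey(r *http.Request, model string, messagesJSON []byte, prefixBytes int, sources []string) (key, source string) {
	for _, s := range sources {
		var raw string
		switch s {
		case "session_header":
			raw = r.Header.Get("X-Olla-Session-ID")
		case "prefix_hash":
			n := prefixBytes
			if n > len(messagesJSON) {
				n = len(messagesJSON)
			}
			raw = string(messagesJSON[:n])
		case "auth_header":
			raw = r.Header.Get("Authorization")
		case "ip":
			if host, _, err := net.SplitHostPort(r.RemoteAddr); err == nil {
				raw = host
			}
		}
		if raw == "" {
			continue // this source produced nothing; try the next one
		}
		// Hash with FNV-64a, scoped to the model name so the same client gets
		// independent sessions per model. Plaintext is never stored.
		h := fnv.New64a()
		h.Write([]byte(model))
		h.Write([]byte{0})
		h.Write([]byte(raw))
		return fmt.Sprintf("%016x", h.Sum64()), s
	}
	return "", "none"
}

func main() {
	req, _ := http.NewRequest("POST", "http://localhost:40114/olla/proxy/api/chat", nil)
	req.Header.Set("X-Olla-Session-ID", "conv-abc123")

	key, src := deriveKey(req, "llama3.2", nil, 512, []string{"session_header", "prefix_hash", "auth_header"})
	fmt.Println(key, src) // deterministic hash, "session_header"
}
```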

## Session lifecycle and eviction

Sessions do not live forever. Three mechanisms remove them:

**Sliding TTL**: every cache hit refreshes the expiry timer. A session that goes idle for longer than `idle_ttl_seconds` is expired automatically. Active conversations are never interrupted mid-session.

**LRU eviction**: when the store reaches `max_sessions`, the least-recently-used entry is evicted to make room. Under normal load this should never occur; it acts as a safety cap to bound memory usage.

**Health-based purge**: when the health checker transitions a backend to an unhealthy state, Olla immediately calls `PurgeDeadEndpoints` with the current routable set. Any session entry pointing to the now-dead backend is deleted without waiting for TTL. The next request for that session falls through to the underlying balancer and receives a `repin`.

!!! note "Busy endpoints are not purged"
    A backend in the **Busy** state is still considered routable (`IsRoutable() == true`). Sticky sessions are preserved through Busy transitions; the backend is overloaded but still serving. Only transitions to Unhealthy, Offline, or Unknown trigger a purge.
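
A minimal sketch of that purge pass, assuming a plain map of session key → endpoint name in place of the ttlcache-backed store and Olla's domain endpoint type:

```go
package main

import "fmt"

// purgeDeadEndpoints drops any session pinned to a backend that is no longer routable.
func purgeDeadEndpoints(sessions map[string]string, routable []string) {
	routableSet := make(map[string]struct{}, len(routable))
	for _, name := range routable {
		routableSet[name] = struct{}{}
	}
	for key, pinned := range sessions {
		if _, ok := routableSet[pinned]; !ok {
			// The pinned backend is no longer routable: drop the session now
			// rather than waiting for the TTL. The next request for this key repins.
			delete(sessions, key)
		}
	}
}

func main() {
	sessions := map[string]string{
		"conv-a": "gpu-server-1",
		"conv-b": "gpu-server-2",
	}
	purgeDeadEndpoints(sessions, []string{"gpu-server-1"}) // gpu-server-2 went unhealthy
	fmt.Println(sessions)                                  // map[conv-a:gpu-server-1]
}
```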

```mermaid
stateDiagram-v2
[*] --> Active: First request (miss)
Active --> Active: Subsequent requests (hit, TTL refreshed)
Active --> Expired: Idle longer than idle_ttl_seconds
Active --> Purged: Backend becomes unhealthy
Active --> Evicted: LRU cap (max_sessions) reached
Expired --> [*]
Purged --> [*]
Evicted --> [*]
```

## Response headers

Olla writes three response headers so clients and operators can observe affinity decisions:

| Header | Values | Meaning |
|---|---|---|
| `X-Olla-Sticky-Session` | `hit` / `miss` / `repin` / `disabled` | Outcome of the affinity lookup for this request |
| `X-Olla-Sticky-Key-Source` | `session_header` / `prefix_hash` / `auth_header` / `ip` / `none` | Which key source was used (absent when outcome is `disabled`) |
| `X-Olla-Session-ID` | _(echoed from request)_ | Present in the response only when the client sent `X-Olla-Session-ID`; lets stateless clients confirm the header was received |

Example: first request (miss), client provides explicit session ID:

```bash
curl -i -X POST http://localhost:40114/olla/proxy/api/chat \
-H "X-Olla-Session-ID: conv-abc123" \
-H "Content-Type: application/json" \
-d '{"model":"llama3.2","messages":[{"role":"user","content":"Hello"}]}'
```

```http
HTTP/1.1 200 OK
X-Olla-Endpoint: gpu-server-1
X-Olla-Sticky-Session: miss
X-Olla-Sticky-Key-Source: session_header
X-Olla-Session-ID: conv-abc123
```

Subsequent request (hit):

```http
HTTP/1.1 200 OK
X-Olla-Endpoint: gpu-server-1
X-Olla-Sticky-Session: hit
X-Olla-Sticky-Key-Source: session_header
X-Olla-Session-ID: conv-abc123
```

## Configuration

All fields live under `proxy.sticky_sessions`:

```yaml
proxy:
  sticky_sessions:
    enabled: false            # opt-in: set true to activate affinity routing

    idle_ttl_seconds: 600     # sliding TTL in seconds; 0 = sessions never expire by TTL
                              # (not recommended, sessions accumulate until LRU eviction)

    max_sessions: 10000       # LRU capacity; oldest entries are evicted when full

    key_sources:              # ordered cascade, first match wins
      - "session_header"      # X-Olla-Session-ID header (explicit client opt-in)
      - "prefix_hash"         # hash of first N bytes of messages JSON
      - "auth_header"         # hash of Authorization header (per-user affinity)
      # - "ip"                # client IP, opt-in; unreliable behind NAT/Docker

    prefix_hash_bytes: 512    # bytes of the messages field to hash for prefix_hash source;
                              # larger values reduce false collisions at a small CPU cost
```

The only env var exposed for this feature is `OLLA_PROXY_STICKY_SESSIONS_ENABLED` (boolean). The remaining fields are configuration-file only.

## Observability

### Stats endpoint

```bash
curl http://localhost:40114/internal/stats/sticky
```

When sticky sessions are **enabled**:

```json
{
  "enabled": true,
  "active_sessions": 142,
  "insertions": 1500,
  "hits": 9231,
  "misses": 1500,
  "evictions": 0,
  "max_sessions": 10000,
  "idle_ttl_seconds": 600
}
```

When sticky sessions are **disabled** (stable shape for scripting):

```json
{
  "enabled": false
}
```

Scripts should branch on the `enabled` field; the endpoint always returns `200 OK` regardless of whether the feature is active.
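
A small Go client sketch that follows this advice; the struct fields mirror the example payloads above and are an assumption rather than a published API type.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// stickyStats mirrors the example JSON shown above.
type stickyStats struct {
	Enabled        bool   `json:"enabled"`
	ActiveSessions int    `json:"active_sessions"`
	Hits           uint64 `json:"hits"`
	Misses         uint64 `json:"misses"`
}

func main() {
	resp, err := http.Get("http://localhost:40114/internal/stats/sticky")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var s stickyStats
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		panic(err)
	}

	// The endpoint always returns 200, so branch on the enabled field.
	if !s.Enabled {
		fmt.Println("sticky sessions disabled")
		return
	}
	total := s.Hits + s.Misses
	if total == 0 {
		fmt.Printf("active_sessions=%d (no lookups yet)\n", s.ActiveSessions)
		return
	}
	fmt.Printf("active_sessions=%d hit_rate=%.1f%%\n", s.ActiveSessions, 100*float64(s.Hits)/float64(total))
}
```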

## When NOT to use sticky sessions

Sticky sessions trade load distribution for cache locality. They are not always the right choice:

- **Stateless or single-turn workloads**: embeddings, one-shot completions, and batch jobs gain nothing from affinity; use the plain load balancer.
- **Model-routing-dominated traffic**: if requests are already hard-routed to specific endpoints by model routing, the sticky wrapper adds overhead with no benefit.
- **Very small deployments**: two endpoints with priority load balancing already behave predictably; adding stickiness is unnecessary complexity.
- **Homogeneous short-prompt workloads**: when prompts are short and vary widely, KV-cache hit rates on the backend are already low; affinity provides little gain and reduces load distribution.
- **Deployments with aggressive autoscaling**: if backends are added and removed frequently, sessions will repin often and the affinity benefit is diluted.

## Developer notes

**Decorator pattern**: `StickySessionWrapper` in `internal/adapter/balancer/sticky.go` wraps any `domain.EndpointSelector` implementation. No factory or registry changes are needed to add a new inner balancer; the wrapper is applied in `ProxyServiceWrapper.applyStickySessions()` inside `internal/app/services/proxy.go`.

**Hashing**: FNV-64a (`hash/fnv`) is used for all key derivation. It is non-cryptographic and intended only as a compact routing hint; collisions are acceptable (two different sessions occasionally land on the same backend). Do not use these keys as security tokens.

**Import cycle avoidance**: `StickyOutcome` is defined in `internal/core/domain/routing.go` rather than in `internal/adapter/balancer/sticky.go`. This allows `internal/adapter/proxy/core` (the proxy engine shared layer) to read the outcome and write response headers without importing the balancer package, which would create a cycle.

**Purge wiring order**: `applyStickySessions()` assigns `s.stickyWrapper` and then immediately calls `s.discoverySvc.SetPurgeDeadEndpointsFn(s.PurgeDeadEndpoints)`. This ensures the write of `stickyWrapper` happens-before the health-checker goroutine can observe it via the purge callback. The registration happens inside `ProxyServiceWrapper.Start()`, not at construction time, to respect service startup ordering.

**Session store**: backed by `github.com/jellydator/ttlcache/v3` with `WithCapacity` (LRU) and `WithTTL` (sliding expiry). The `ttlcache.Get` call inside `Select` refreshes the TTL automatically on every hit.
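
A minimal wiring sketch with `ttlcache/v3`, assuming simplified string keys and values in place of the real hashed session keys and endpoint state:

```go
package main

import (
	"fmt"
	"time"

	"github.com/jellydator/ttlcache/v3"
)

func main() {
	// WithCapacity bounds the store (LRU eviction), WithTTL sets the sliding expiry.
	store := ttlcache.New[string, string](
		ttlcache.WithTTL[string, string](600*time.Second), // idle_ttl_seconds
		ttlcache.WithCapacity[string, string](10000),      // max_sessions
	)
	go store.Start() // background loop that deletes expired entries

	store.Set("hashed-session-key", "gpu-server-1", ttlcache.DefaultTTL)

	if item := store.Get("hashed-session-key"); item != nil {
		// Get refreshes the TTL by default, which gives the sliding-window behaviour.
		fmt.Println("pinned to", item.Value())
	}
}
```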

Relevant source files: `internal/adapter/balancer/sticky.go`, `internal/core/domain/routing.go`, `internal/app/services/proxy.go`, `internal/app/services/discovery.go`, `internal/app/handlers/handler_proxy.go`, `internal/app/handlers/handler_stats_sticky.go`.

## See also

- [Load Balancing](load-balancing.md): underlying strategies that sticky sessions wrap
- [Health Checking](health-checking.md): health states and the routable concept
- [Configuration Reference](../configuration/reference.md#sticky-sessions): complete field reference
30 changes: 30 additions & 0 deletions docs/content/configuration/reference.md
@@ -211,6 +211,36 @@ proxy:
- "*debug*" # Exclude debug profiles
```

### Sticky Sessions {#sticky-sessions}

KV-cache affinity routing for multi-turn LLM conversations. See [Sticky Sessions](../concepts/sticky-sessions.md) for a full explanation.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `sticky_sessions.enabled` | bool | `false` | Enable affinity routing (opt-in) |
| `sticky_sessions.idle_ttl_seconds` | int | `600` | Sliding TTL in seconds; 0 = no TTL expiry |
| `sticky_sessions.max_sessions` | uint64 | `10000` | LRU capacity; oldest entry evicted when full |
| `sticky_sessions.key_sources` | []string | `["session_header","prefix_hash","auth_header"]` | Ordered key source cascade; first match wins |
| `sticky_sessions.prefix_hash_bytes` | int | `512` | Bytes of the messages field to hash for `prefix_hash` |

**Environment Variable**: `OLLA_PROXY_STICKY_SESSIONS_ENABLED` (only `enabled` is exposed as an env var)

Example:

```yaml
proxy:
  sticky_sessions:
    enabled: true             # opt-in
    idle_ttl_seconds: 600     # 10-min sliding window
    max_sessions: 10000       # LRU cap
    key_sources:
      - "session_header"      # X-Olla-Session-ID header
      - "prefix_hash"         # hash of messages prefix
      - "auth_header"         # hash of Authorization header
      # - "ip"                # client IP (unreliable behind NAT)
    prefix_hash_bytes: 512
```

## Discovery Configuration

Endpoint discovery and health checking.
1 change: 1 addition & 0 deletions docs/mkdocs.yml
@@ -129,6 +129,7 @@ nav:
  - Concepts:
      - Overview: concepts/overview.md
      - Load Balancing: concepts/load-balancing.md
      - Sticky Sessions: concepts/sticky-sessions.md
      - Model Routing: concepts/model-routing.md
      - Model Aliases: concepts/model-aliases.md
      - Model Unification: concepts/model-unification.md
1 change: 1 addition & 0 deletions go.mod
@@ -25,6 +25,7 @@ require (
github.com/containerd/console v1.0.5 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/gookit/color v1.6.0 // indirect
github.com/jellydator/ttlcache/v3 v3.4.0 // indirect
github.com/kr/pretty v0.3.1 // indirect
github.com/lithammer/fuzzysearch v1.1.8 // indirect
github.com/mattn/go-runewidth v0.0.20 // indirect