Merged
Changes from 8 commits
174 changes: 174 additions & 0 deletions config/profiles/lmdeploy.yaml
@@ -0,0 +1,174 @@
# LMDeploy inference platform profile
name: lmdeploy
home: "https://github.com/InternLM/lmdeploy"
version: "1.0"
display_name: "LMDeploy"
description: "LMDeploy efficient inference and serving for LLMs and VLMs"

# Routing configuration
routing:
  prefixes:
    - lmdeploy

# API compatibility
api:
  openai_compatible: true
  paths:
    # Health and system endpoints
    - /health              # 0: health check (HTTP 200, empty body)

    # Model management
    - /v1/models           # 1: list models (OpenAI-compatible ModelList)

    # Text generation endpoints (OpenAI-compatible)
    - /v1/chat/completions # 2: chat completions with SSE streaming
    - /v1/completions      # 3: text completions

    # LMDeploy-specific endpoints
    - /v1/encode           # 4: token encoding (LMDeploy-specific)
    - /generate            # 5: native generation endpoint
    - /pooling             # 6: reward/score pooling (the real embeddings path)
    # NOTE: /v1/embeddings is intentionally excluded — LMDeploy returns HTTP 400
    # unconditionally for all backends on that path.

    # Sleep state probe (LMDeploy-specific, used for auto-detection)
    - /is_sleeping         # 7: sleep state probe
Comment on lines +27 to +35

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

What are the exact URL paths exposed by lmdeploy serve api_server for /generate, /pooling, and /is_sleeping endpoints? Are they prefixed with /v1?

💡 Result:

The lmdeploy serve api_server exposes the following endpoints for the specified paths, based on source code and documentation:

- /generate: An LMDeploy-specific endpoint for raw generation requests (e.g., POST /generate with prompt, input_ids, etc.). It is NOT prefixed with /v1.
- /pooling: An endpoint for reward model pooling (e.g., POST /pooling with model and input). It is NOT prefixed with /v1. Available when using the PyTorch backend with reward models like internlm2-1_8b-reward.
- /is_sleeping: No evidence of this exact endpoint in api_server. There is a "sleep engine" feature in the PyTorch backend for resource management (e.g., a PR fixing sleep for dp=1), but no exposed HTTP endpoint named /is_sleeping. Likely internal or not publicly exposed. Not prefixed with /v1 if it exists.

Primary endpoints are OpenAI-compatible under /v1 (e.g., /v1/chat/completions, /v1/models, /v1/completions), plus custom ones like /generate and /pooling at root level. The full list is available via /openapi.json after starting the server.

Remove or verify the /is_sleeping endpoint with LMDeploy upstream.

The web search against LMDeploy's source code confirms that /generate and /pooling are correctly listed without the /v1 prefix—they are custom root-level endpoints, not OpenAI-compatible ones. However, there is no evidence that /is_sleeping is an actual exposed HTTP endpoint in lmdeploy serve api_server. LMDeploy does have an internal sleep engine feature, but it does not appear to expose an HTTP endpoint for it. Including /is_sleeping in the config's api.paths and detection.path_indicators may cause detection to fail or requests to 404.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/profiles/lmdeploy.yaml` around lines 27 - 35, The config lists an
LMDeploy-only probe endpoint `/is_sleeping` which upstream does not expose;
remove `/is_sleeping` from api.paths and any detection.path_indicators, or
verify and replace it with a real LMDeploy probe if you confirm one exists.
Specifically, update the entries that mention `/is_sleeping` in the profile
(look for api.paths and detection.path_indicators) so detection relies only on
verified endpoints like `/generate` and `/pooling`, avoiding a probe that will
404.
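
Rather than relying on the web search alone, the disputed path can be checked against a live server: the query result above notes that the full route list is available from /openapi.json once the server is up. A minimal sketch, assuming an api_server on the default port 23333 and jq on the path:

```bash
# List every route the running api_server actually registers.
curl -s http://localhost:23333/openapi.json | jq -r '.paths | keys[]'

# Probe the disputed endpoint directly; a 404 status here would confirm
# the finding that /is_sleeping is not exposed over HTTP.
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:23333/is_sleeping
```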


  model_discovery_path: /v1/models
  health_check_path: /health

# Platform characteristics
characteristics:
  timeout: 2m
  max_concurrent_requests: 100
  default_priority: 82 # between SGLang (85) and vLLM (80)
  streaming_support: true

# Detection hints for auto-discovery
detection:
  path_indicators:
    - "/v1/encode"   # LMDeploy-specific token encoding
    - "/generate"    # LMDeploy native generation
    - "/pooling"     # LMDeploy reward/score path
    - "/is_sleeping" # distinct from vLLM and SGLang
  default_ports:
    - 23333 # api_server default (not 8000 which is proxy_server)

# Request/response handling
request:
  model_field_paths:
    - "model"
  response_format: "lmdeploy"
  parsing_rules:
    chat_completions_path: "/v1/chat/completions"
    completions_path: "/v1/completions"
    model_field_name: "model"
    supports_streaming: true

# Path indices for specific functions
path_indices:
  health: 0
  models: 1
  chat_completions: 2
  completions: 3

# Model handling
models:
  name_format: "{{.Name}}"
  capability_patterns:
    chat:
      - "*-Chat-*"
      - "*-Instruct*"
      - "*-chat-*"
    vision:
      - "*vision*"
      - "*llava*"
      - "*VL*"
    code:
      - "*code*"
      - "*Code*"
  # Context window patterns for common LMDeploy models
  context_patterns:
    - pattern: "*llama-3.1*"
      context: 131072
    - pattern: "*llama-3*"
      context: 8192
    - pattern: "*internlm2_5*"
      context: 32768
    - pattern: "*internlm2*"
      context: 32768
    - pattern: "*mistral*"
      context: 32768
    - pattern: "*qwen2*"
      context: 32768

# Resource management
resources:
  model_sizes:
    - patterns: ["*70b*", "*72b*"]
      min_memory_gb: 140
      recommended_memory_gb: 160
      min_gpu_memory_gb: 140
      estimated_load_time_ms: 60000
    - patterns: ["*34b*", "*33b*", "*30b*"]
      min_memory_gb: 70
      recommended_memory_gb: 80
      min_gpu_memory_gb: 70
      estimated_load_time_ms: 45000
    - patterns: ["*13b*", "*14b*"]
      min_memory_gb: 30
      recommended_memory_gb: 40
      min_gpu_memory_gb: 30
      estimated_load_time_ms: 30000
    - patterns: ["*7b*", "*8b*"]
      min_memory_gb: 16
      recommended_memory_gb: 24
      min_gpu_memory_gb: 16
      estimated_load_time_ms: 20000
    - patterns: ["*3b*"]
      min_memory_gb: 8
      recommended_memory_gb: 12
      min_gpu_memory_gb: 8
      estimated_load_time_ms: 15000
    - patterns: ["*1b*", "*1.1b*", "*1.5b*"]
      min_memory_gb: 4
      recommended_memory_gb: 8
      min_gpu_memory_gb: 4
      estimated_load_time_ms: 10000

  defaults:
    min_memory_gb: 8
    recommended_memory_gb: 16
    min_gpu_memory_gb: 8
    requires_gpu: true
    estimated_load_time_ms: 30000

  concurrency_limits:
    - min_memory_gb: 100
      max_concurrent: 10
    - min_memory_gb: 50
      max_concurrent: 20
    - min_memory_gb: 20
      max_concurrent: 50
    - min_memory_gb: 0
      max_concurrent: 100

  timeout_scaling:
    base_timeout_seconds: 120
    load_time_buffer: true

# Metrics extraction for LMDeploy responses
metrics:
  extraction:
    enabled: true
    source: response_body
    format: json
    paths:
      model: "$.model"
      finish_reason: "$.choices[0].finish_reason"
      input_tokens: "$.usage.prompt_tokens"
      output_tokens: "$.usage.completion_tokens"
      total_tokens: "$.usage.total_tokens"
    calculations:
      is_complete: 'len(finish_reason) > 0'
      tokens_per_second: "generation_time_ms > 0 ? (output_tokens * 1000.0) / generation_time_ms : 0"
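
Both review comments on this file hinge on which keys the profile actually declares, and that is cheap to inspect mechanically. A minimal sketch, assuming python3 with PyYAML is available and the command is run from the repository root (nothing Olla-specific is invoked; the key names come straight from the file above):

```bash
# Dump the sections the review comments reference; requires python3 + PyYAML.
python3 - <<'EOF'
import yaml

with open("config/profiles/lmdeploy.yaml") as f:
    profile = yaml.safe_load(f)

print("api.paths:           ", profile["api"]["paths"])
print("detection indicators:", profile["detection"]["path_indicators"])
print("extraction paths:    ", sorted(profile["metrics"]["extraction"]["paths"]))
EOF
```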

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check whether generation_time_ms is supplied by Olla's metrics layer
# (rather than being parsed from the response body).
rg -nP --type=go -C3 '\bgeneration_time_ms\b'
rg -nP --type=yaml -C2 '\bgeneration_time_ms\b' config/profiles

Repository: thushan/olla

Length of output: 2560


🏁 Script executed:

cat -n config/profiles/lmdeploy.yaml | sed -n '160,180p'

Repository: thushan/olla

Length of output: 673


generation_time_ms is missing from paths but used in calculations.

The tokens_per_second calculation references generation_time_ms, which is not extracted in the metrics.extraction.paths section of this profile. Other profiles (vllm, vllm-mlx, sglang) define generation_time_ms: "$.metrics.generation_time_ms" in their paths sections. Add the missing extraction to lmdeploy.yaml, otherwise the calculation will always evaluate to 0.
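
A minimal sketch of the fix, mirroring the extraction key cited from the vllm, vllm-mlx and sglang profiles; whether LMDeploy's response body actually carries a metrics.generation_time_ms field is an assumption to verify against a live response before merging:

```yaml
# Proposed addition to metrics.extraction.paths in config/profiles/lmdeploy.yaml.
# The JSONPath mirrors the other profiles; confirm LMDeploy emits this field,
# otherwise tokens_per_second will still evaluate to 0.
metrics:
  extraction:
    paths:
      generation_time_ms: "$.metrics.generation_time_ms"
```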

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/profiles/lmdeploy.yaml` around lines 167 - 174, The calculations block
uses generation_time_ms in tokens_per_second but lmdeploy.yaml's
metrics.extraction.paths lacks extraction for generation_time_ms; update the
metrics.extraction.paths section to include generation_time_ms with the same
extraction key used in other profiles (e.g., generation_time_ms:
"$.metrics.generation_time_ms") so the tokens_per_second calculation in
calculations.tokens_per_second can compute correctly.
