
Support vLLM-deployed Qwen3 models (matryoshka rejection + missing instruction prefix + reasoning model output handling) #573

@igs-rogenlo

Description


Summary

When deploying Qwen3 models via vLLM (with the standard OpenAI-compatible API), memory-lancedb-pro fails in three ways:

  1. vLLM rejects the dimensions parameter for Qwen3-Embedding-8B because the model is not loaded in matryoshka mode → all embed calls error with HTTP 400.
  2. Query recall quality is sub-optimal because the plugin doesn't apply Qwen3's recommended instruction-prefix format on the query side.
  3. Smart extraction silently fails for reasoning models like Qwen3.5-27B-FP8 because the LLM client passes thinking-mode output (Thinking Process: ...</think>{json}) directly to the JSON parser, which then fails or extracts garbage.

This issue describes all three problems, suggests minimal fixes, and includes a working patch script that handles all of them.


Environment

  • memory-lancedb-pro version: 1.1.0-beta.10
  • OpenClaw: 2026.4.8
  • Embedding endpoint: vllm (OpenAI-compatible API), running Qwen/Qwen3-Embedding-8B
  • Embedding model native dimension: 4096
  • Plugin config:
    "embedding": {
      "provider": "openai-compatible",
      "baseURL": "http://<host>:15001/v1",
      "model": "/models/Qwen3-Embedding-8B",
      "dimensions": 4096
    }

Problem 1 — vLLM 400 on dimensions parameter

Reproduce

curl -s -X POST "http://<vllm-host>/v1/embeddings" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen3-Embedding-8B",
    "input": "test",
    "dimensions": 4096
  }'

Returns:

{
  "error": {
    "message": "Model \"/models/Qwen3-Embedding-8B\" does not support matryoshka representation, changing output dimensions will lead to poor results.",
    "type": "BadRequestError",
    "code": 400
  }
}

Even sending dimensions equal to the model's native dimension is rejected.

Root cause in plugin

src/embedder.ts always sends dimensions for openai-compatible providers when it is configured, because the provider profile sets:

case "openai-compatible":
  return {
    ...,
    taskField: null,
    dimensionsField: "dimensions",   // ← always sent
  };

Workaround currently required

  1. Add the model to EMBEDDING_DIMENSIONS lookup so plugin uses internal dim metadata
  2. Remove dimensions from user config
  3. The plugin then doesn't include dimensions in the API request

But the model isn't in upstream's lookup table:

// Existing entries
"ai/qwen3-embedding": 1024,
"ai/qwen3-embedding:0.6B-F16": 1024,
"ai/qwen3-embedding:4B": 1024,
"ai/qwen3-embedding:4B-Q4_K_M": 1024,
"ai/qwen3-embedding:8B-Q4_K_M": 1024,

These are Docker Model Runner names. vLLM exposes the model with a different name (the path-prefixed /models/Qwen3-Embedding-8B) and the full 8B model has dim 4096, not 1024.

Suggested fix

Two options:

A. Add vLLM Qwen3 entries (minimal):

// Qwen3 (vLLM hosted, native 4096-dim)
"/models/Qwen3-Embedding-8B": 4096,
"Qwen3-Embedding-8B": 4096,

B. Add config flag to suppress dimensions parameter (more general):

embedding: {
  ...
  sendDimensions?: boolean;  // default true; set false for vLLM/non-matryoshka
}

Option B is more general and benefits any non-matryoshka model deployed via OpenAI-compatible APIs.
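As a sketch of how Option B could be wired up (note: `EmbeddingConfig` and `buildEmbeddingBody` are illustrative names, not the plugin's actual internals):

```typescript
// Hypothetical sketch of Option B: omit `dimensions` from the request body
// when sendDimensions is false. Names here are illustrative, not upstream APIs.
interface EmbeddingConfig {
  model: string;
  dimensions?: number;
  sendDimensions?: boolean; // default true; set false for vLLM / non-matryoshka models
}

function buildEmbeddingBody(
  cfg: EmbeddingConfig,
  input: string
): Record<string, unknown> {
  const body: Record<string, unknown> = { model: cfg.model, input };
  // Include `dimensions` only when it is configured AND sending is not disabled.
  if (cfg.dimensions !== undefined && cfg.sendDimensions !== false) {
    body.dimensions = cfg.dimensions;
  }
  return body;
}
```

With `sendDimensions: false`, the vLLM endpoint receives a plain request and never triggers the matryoshka 400.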


Problem 2 — Missing query-side instruction prefix for Qwen3

Background

Qwen3-Embedding model card explicitly recommends adding an instruction prefix on queries only (documents are embedded as-is):

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

Omitting the instruction on the query side causes a documented drop of roughly 1–5% in retrieval performance.

Current state in plugin

embedQuery and embedBatchQuery pass the text unchanged. There's no mechanism to apply a query-side prefix (the existing taskQuery field maps to a separate API parameter, which Qwen3 doesn't use).

Real-world impact

Tested on a memory store of 410 entries, query "PlayerAction" against a BigQuery PlayerAction table document:

| Mode | Top 3 vector results |
| --- | --- |
| No prefix | PlayGD Project (sim 0.15) 🟡 unrelated, DimPlayerAction (0.13), gwt mummy query (0.10) 🟡 unrelated |
| With Qwen3 instruction prefix | DimPlayerAction (0.22) ✓, PlayerAction full schema (0.15) ✓, _p_PlayerAction (0.14) |

For short queries the improvement is dramatic (top-3 went from 1/3 relevant to 3/3 relevant). For longer queries it's the documented 1-5%.

Suggested fix

Add a query-only wrapping helper, gated by model detection:

private isQwen3Model(): boolean {
  const m = this._model.toLowerCase();
  return m.includes("qwen3-embedding") || m.includes("qwen3_embedding");
}

private wrapQwen3Query(text: string): string {
  if (!this.isQwen3Model()) return text;
  const task = this._taskQuery && this._taskQuery.length > 0
    ? this._taskQuery
    : "Given a memory search query, retrieve relevant stored knowledge entries that match the query";
  return `Instruct: ${task}\nQuery:${text}`;
}

async embedQuery(text: string): Promise<number[]> {
  const wrapped = this.wrapQwen3Query(text);
  return this.embedSingle(wrapped, this._taskQuery);
}

async embedBatchQuery(texts: string[]): Promise<number[][]> {
  const wrapped = this.isQwen3Model() ? texts.map((t) => this.wrapQwen3Query(t)) : texts;
  return this.embedMany(wrapped, this._taskQuery);
}

Important: only embedQuery/embedBatchQuery are affected. embedPassage/embedBatchPassage remain unchanged so existing document embeddings remain valid (no need to re-embed).

A more general approach would be a config-driven prefix template:

embedding: {
  ...
  queryPrefixTemplate?: string;  // e.g. "Instruct: {task}\nQuery:{text}"
}

This would handle Qwen3 today and any future model that needs prefix-based query instructions.
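A minimal sketch of how the template could be applied (the `applyQueryPrefix` helper is illustrative, not an existing plugin function):

```typescript
// Hypothetical helper for the config-driven approach: substitute {task} and
// {text} placeholders in a user-supplied template; pass through untouched
// when no template is configured.
function applyQueryPrefix(
  template: string | undefined,
  task: string,
  text: string
): string {
  if (!template) return text; // no template configured → unchanged behavior
  return template.replace("{task}", task).replace("{text}", text);
}
```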



Problem 3 — Smart extraction breaks on reasoning models (Qwen3.5, DeepSeek-R1, etc.)

Background

Modern reasoning models (Qwen3.5, DeepSeek-R1) output their answer with a thinking trace prefix:

Thinking Process:
1. Analyze the request...
2. Draft output...
3. Final check...
</think>

{"actual": "json"}

The thinking text appears in response.choices[0].message.content, NOT in a separate reasoning_content field (depends on vLLM build).

Reproduce

Using Qwen3.5-27B-FP8 via vLLM as the smart-extractor LLM:

  1. Trigger any conversation that produces a memory candidate
  2. src/llm-client.ts calls extractJsonFromResponse(raw) directly on the unfiltered response
  3. JSON parse fails because the response starts with Thinking Process: text
  4. Smart extractor returns null and the whole turn's extraction is silently dropped

Current code (src/llm-client.ts)

const raw = response.choices?.[0]?.message?.content;
if (!raw) { ... return null; }
if (typeof raw !== "string") { ... return null; }

const jsonStr = extractJsonFromResponse(raw);  // ← parses raw including thinking text

Suggested fix

Strip the thinking block before passing to extractJsonFromResponse:

// Strip Qwen3/DeepSeek-R1-style reasoning thinking blocks before JSON extraction.
// Reasoning models output `Thinking Process: ...</think>{actual content}`.
// Take everything after the LAST </think> marker if present.
const cleaned = (() => {
  const idx = raw.lastIndexOf("</think>");
  return idx === -1 ? raw : raw.slice(idx + "</think>".length).trim();
})();

const jsonStr = extractJsonFromResponse(cleaned);

This is universal — it handles any model that uses the </think> reasoning marker (Qwen3.x, DeepSeek-R1, future reasoning models). For non-reasoning models, the lastIndexOf returns -1 and the original raw is used unchanged. Zero impact on existing models.
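For reference, the same logic as a self-contained helper that can be exercised against sample outputs (the name `stripThinking` is illustrative):

```typescript
// Standalone form of the stripping logic above: drop everything up to and
// including the LAST </think> marker; leave non-reasoning output untouched.
function stripThinking(raw: string): string {
  const idx = raw.lastIndexOf("</think>");
  return idx === -1 ? raw : raw.slice(idx + "</think>".length).trim();
}
```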


Test results after applying all fixes

$ openclaw memory-pro search "PlayerAction" --limit 3
Found 3 memories:
1. [73f0e38d] _p_PlayerAction field structure... (62%, vector)
2. [ca331a36] PlayerAction full field structure... (61%, vector)
3. [f46b34eb] DimPlayerAction four fields... (60%, vector)

All top 3 are PlayerAction-related (versus 1/3 without the fix). Existing 410 stored embeddings remain compatible — only query-side embedding paths change.


Compatibility / migration notes

  • No re-embedding required — passage embedder is unchanged
  • Backwards compatible — non-Qwen3 models hit if (!this.isQwen3Model()) return text; and behave identically
  • Cosine similarity check: same model deployed on Ollama (qwen3-embedding:8b-fp16) and vLLM (/models/Qwen3-Embedding-8B) produces vectors with cosine similarity 0.999999 — they are interchangeable for stored embeddings
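The cosine check in the last bullet can be reproduced with a small helper (a sketch; any vector-math utility would do):

```typescript
// Cosine similarity between two equal-length embedding vectors:
// dot(a, b) / (|a| * |b|). Values near 1.0 mean the vectors are interchangeable.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Running this over the same text embedded via the Ollama and vLLM deployments is how the 0.999999 figure above was obtained.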

Patch (working, tested)

I have a working, idempotent drop-in patch script (Python-based, with regex anchors so it survives upstream version drift) that applies all three fixes and works against both 1.1.0-beta.10 and current main. Happy to submit it as a PR with one of the suggested approaches above; please indicate a preference.

