Summary
When deploying Qwen3 models via vLLM (with the standard OpenAI-compatible API), memory-lancedb-pro fails in three ways:
- vLLM rejects the `dimensions` parameter for Qwen3-Embedding-8B because the model is not loaded in matryoshka mode → all embed calls error with HTTP 400.
- Query recall quality is sub-optimal because the plugin doesn't apply Qwen3's recommended instruction-prefix format on the query side.
- Smart extraction silently fails for reasoning models like Qwen3.5-27B-FP8 because the LLM client passes thinking-mode output (`Thinking Process: ...</think>{json}`) directly to the JSON parser, which then fails or extracts garbage.
This issue describes all three problems, suggests minimal fixes, and includes a working patch script that handles all of them.
Environment
- memory-lancedb-pro version: 1.1.0-beta.10
- OpenClaw: 2026.4.8
- Embedding endpoint: vLLM (OpenAI-compatible API), running Qwen/Qwen3-Embedding-8B
- Embedding model native dimension: 4096
- Plugin config:
"embedding": {
"provider": "openai-compatible",
"baseURL": "http://<host>:15001/v1",
"model": "/models/Qwen3-Embedding-8B",
"dimensions": 4096
}
Problem 1 — vLLM 400 on dimensions parameter
Reproduce
```shell
curl -s -X POST "http://<vllm-host>/v1/embeddings" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen3-Embedding-8B",
    "input": "test",
    "dimensions": 4096
  }'
```
Returns:
```json
{
  "error": {
    "message": "Model \"/models/Qwen3-Embedding-8B\" does not support matryoshka representation, changing output dimensions will lead to poor results.",
    "type": "BadRequestError",
    "code": 400
  }
}
```
Even sending `dimensions` equal to the model's native dimension is rejected.
Root cause in plugin
`src/embedder.ts` always sends `dimensions` for `openai-compatible` providers when it is configured, because the provider profile sets:
```typescript
case "openai-compatible":
  return {
    ...,
    taskField: null,
    dimensionsField: "dimensions", // ← always sent
  };
```
Workaround currently required
- Add the model to the `EMBEDDING_DIMENSIONS` lookup so the plugin uses internal dimension metadata
- Remove `dimensions` from user config
- The plugin then doesn't include `dimensions` in the API request
But the model isn't in upstream's lookup table:
```typescript
// Existing entries
"ai/qwen3-embedding": 1024,
"ai/qwen3-embedding:0.6B-F16": 1024,
"ai/qwen3-embedding:4B": 1024,
"ai/qwen3-embedding:4B-Q4_K_M": 1024,
"ai/qwen3-embedding:8B-Q4_K_M": 1024,
```
These are Docker Model Runner names. vLLM exposes the model under a different name (the path-prefixed `/models/Qwen3-Embedding-8B`), and the full 8B model's native dimension is 4096, not 1024.
Suggested fix
Two options:
A. Add vLLM Qwen3 entries (minimal):

```typescript
// Qwen3 (vLLM hosted, native 4096-dim)
"/models/Qwen3-Embedding-8B": 4096,
"Qwen3-Embedding-8B": 4096,
```
B. Add a config flag to suppress the `dimensions` parameter (more general):

```typescript
embedding: {
  ...
  sendDimensions?: boolean; // default true; set false for vLLM/non-matryoshka
}
```
Option B is more general and benefits any non-matryoshka model deployed via OpenAI-compatible APIs.
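Option B amounts to a one-line gate in the request builder. A minimal sketch of the idea, using illustrative names (`EmbedConfig`, `buildEmbedBody`) rather than the plugin's actual internals:

```typescript
// Sketch of option B (names are illustrative, not the plugin's real API):
// omit `dimensions` from the request body when `sendDimensions` is false.
interface EmbedConfig {
  model: string;
  dimensions?: number;
  sendDimensions?: boolean; // default true
}

function buildEmbedBody(cfg: EmbedConfig, input: string): Record<string, unknown> {
  const body: Record<string, unknown> = { model: cfg.model, input };
  // Attach `dimensions` only when configured AND not explicitly suppressed.
  if (cfg.dimensions !== undefined && cfg.sendDimensions !== false) {
    body.dimensions = cfg.dimensions;
  }
  return body;
}
```

With `sendDimensions: false`, a non-matryoshka vLLM deployment never sees the parameter (so the 400 disappears), while existing configs keep sending it unchanged.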
Problem 2 — Missing query-side instruction prefix for Qwen3
Background
Qwen3-Embedding model card explicitly recommends adding an instruction prefix on queries only (documents are embedded as-is):
```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'
```
Omitting the instruction on the query side can lead to a ~1-5% drop in retrieval performance.
Current state in plugin
`embedQuery` and `embedBatchQuery` pass the text unchanged. There's no mechanism to apply a query-side prefix (the existing `taskQuery` field maps to a separate API parameter, which Qwen3 doesn't use).
Real-world impact
Tested on a memory store of 410 entries, query "PlayerAction" against a BigQuery PlayerAction table document:
| Mode | Top 3 vector results |
| --- | --- |
| No prefix | PlayGD Project (sim 0.15) 🟡 unrelated, DimPlayerAction (0.13), gwt mummy query (0.10) 🟡 unrelated |
| With Qwen3 instruction prefix | DimPlayerAction (0.22) ✓, PlayerAction full schema (0.15) ✓, _p_PlayerAction (0.14) ✓ |
For short queries the improvement is dramatic (top-3 went from 1/3 relevant to 3/3 relevant). For longer queries it's the documented 1-5%.
Suggested fix
Add a query-only wrapping helper, gated by model detection:
```typescript
private isQwen3Model(): boolean {
  const m = this._model.toLowerCase();
  return m.includes("qwen3-embedding") || m.includes("qwen3_embedding");
}

private wrapQwen3Query(text: string): string {
  if (!this.isQwen3Model()) return text;
  const task = this._taskQuery && this._taskQuery.length > 0
    ? this._taskQuery
    : "Given a memory search query, retrieve relevant stored knowledge entries that match the query";
  return `Instruct: ${task}\nQuery:${text}`;
}

async embedQuery(text: string): Promise<number[]> {
  const wrapped = this.wrapQwen3Query(text);
  return this.embedSingle(wrapped, this._taskQuery);
}

async embedBatchQuery(texts: string[]): Promise<number[][]> {
  const wrapped = this.isQwen3Model() ? texts.map((t) => this.wrapQwen3Query(t)) : texts;
  return this.embedMany(wrapped, this._taskQuery);
}
```
Important: only `embedQuery`/`embedBatchQuery` are affected. `embedPassage`/`embedBatchPassage` remain unchanged, so existing document embeddings remain valid (no need to re-embed).
A more general approach would be a config-driven prefix template:
```typescript
embedding: {
  ...
  queryPrefixTemplate?: string; // e.g. "Instruct: {task}\nQuery:{text}"
}
```
This would handle Qwen3 today and any future model that needs prefix-based query instructions.
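Rendering such a template is plain placeholder substitution. A sketch, assuming a hypothetical `renderQueryPrefix` helper with `{task}` and `{text}` placeholders (applied on the query side only):

```typescript
// Hypothetical helper: substitute {task} and {text} in a user-supplied
// query-prefix template before embedding the query.
function renderQueryPrefix(template: string, task: string, text: string): string {
  return template.replace("{task}", task).replace("{text}", text);
}

const tpl = "Instruct: {task}\nQuery:{text}";
const wrapped = renderQueryPrefix(tpl, "Retrieve relevant stored knowledge", "PlayerAction");
// wrapped === "Instruct: Retrieve relevant stored knowledge\nQuery:PlayerAction"
```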
Problem 3 — Smart extraction breaks on reasoning models (Qwen3.5, DeepSeek-R1, etc.)
Background
Modern reasoning models (Qwen3.5, DeepSeek-R1) output their answer with a thinking trace prefix:
```text
Thinking Process:
1. Analyze the request...
2. Draft output...
3. Final check...
</think>
{"actual": "json"}
```
The thinking text appears in `response.choices[0].message.content`, NOT in a separate `reasoning_content` field (depends on the vLLM build).
Reproduce
Using Qwen3.5-27B-FP8 via vLLM as the smart-extractor LLM:
- Trigger any conversation that produces a memory candidate
- `src/llm-client.ts` calls `extractJsonFromResponse(raw)` directly on the unfiltered response
- JSON parse fails because the response starts with `Thinking Process:` text
- The smart extractor returns `null` and the whole turn's extraction is silently dropped
Current code (src/llm-client.ts)
```typescript
const raw = response.choices?.[0]?.message?.content;
if (!raw) { ... return null; }
if (typeof raw !== "string") { ... return null; }
const jsonStr = extractJsonFromResponse(raw); // ← parses raw including thinking text
```
Suggested fix
Strip the thinking block before passing to `extractJsonFromResponse`:

```typescript
// Strip Qwen3/DeepSeek-R1-style reasoning thinking blocks before JSON extraction.
// Reasoning models output `Thinking Process: ...</think>{actual content}`.
// Take everything after the LAST </think> marker if present.
const cleaned = (() => {
  const idx = raw.lastIndexOf("</think>");
  return idx === -1 ? raw : raw.slice(idx + "</think>".length).trim();
})();
const jsonStr = extractJsonFromResponse(cleaned);
```
This is universal — it handles any model that uses the `</think>` reasoning marker (Qwen3.x, DeepSeek-R1, future reasoning models). For non-reasoning models, `lastIndexOf` returns -1 and the original `raw` is used unchanged. Zero impact on existing models.
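Extracted into a standalone helper (illustrative name `stripThinking`), the behavior is easy to verify:

```typescript
// Strip everything up to and including the LAST </think> marker;
// pass non-reasoning output through untouched.
function stripThinking(raw: string): string {
  const idx = raw.lastIndexOf("</think>");
  return idx === -1 ? raw : raw.slice(idx + "</think>".length).trim();
}

// Reasoning-model output: thinking trace, then the real JSON payload.
stripThinking('Thinking Process:\n1. Analyze...\n</think>\n{"actual": "json"}');
// → '{"actual": "json"}'

// Non-reasoning output is returned unchanged.
stripThinking('{"plain": true}');
// → '{"plain": true}'
```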
Test results after applying all fixes
```shell
$ openclaw memory-pro search "PlayerAction" --limit 3
Found 3 memories:
1. [73f0e38d] _p_PlayerAction 欄位結構... (62%, vector)
2. [ca331a36] PlayerAction 完整欄位結構... (61%, vector)
3. [f46b34eb] DimPlayerAction 四欄位... (60%, vector)
```
All top 3 are PlayerAction-related (versus 1/3 without the fix). Existing 410 stored embeddings remain compatible — only query-side embedding paths change.
Compatibility / migration notes
- ✅ No re-embedding required — the passage embedder is unchanged
- ✅ Backwards compatible — non-Qwen3 models hit `if (!this.isQwen3Model()) return text;` and behave identically
- ✅ Cosine similarity check: the same model deployed on Ollama (`qwen3-embedding:8b-fp16`) and vLLM (`/models/Qwen3-Embedding-8B`) produces vectors with cosine similarity 0.999999 — they are interchangeable for stored embeddings
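The cosine-similarity check can be reproduced with a few lines. A minimal sketch — feed it one embedding per deployment, fetched separately from each endpoint:

```typescript
// Cosine similarity between two embedding vectors; values near 1.0 mean
// the two deployments produce interchangeable vectors for stored embeddings.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```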
Patch (working, tested)
I have a working drop-in patch script (Python-based, idempotent, regex-anchored so it survives upstream version drift) that applies all three changes and works against both 1.1.0-beta.10 and current main. Happy to submit a PR with one of the suggested approaches above — please indicate a preference.