
Support vLLM-deployed Qwen3 models (matryoshka rejection + missing instruction prefix + reasoning model output handling) #573

@igs-rogenlo

Description


Summary

When deploying Qwen3 models via vLLM (with the standard OpenAI-compatible API), memory-lancedb-pro fails in three ways:

  1. vLLM rejects the dimensions parameter for Qwen3-Embedding-8B because the model is not loaded in matryoshka mode → all embed calls error with HTTP 400.
  2. Query recall quality is sub-optimal because the plugin doesn't apply Qwen3's recommended instruction-prefix format on the query side.
  3. Smart extraction silently fails for reasoning models like Qwen3.5-27B-FP8 because the LLM client passes thinking-mode output (Thinking Process: ...</think>{json}) directly to the JSON parser, which then fails or extracts garbage.

This issue describes all three problems, suggests minimal fixes, and includes a working patch script that handles all of them.


Environment

  • memory-lancedb-pro version: 1.1.0-beta.10
  • OpenClaw: 2026.4.8
  • Embedding endpoint: vllm (OpenAI-compatible API), running Qwen/Qwen3-Embedding-8B
  • Embedding model native dimension: 4096
  • Plugin config:
    "embedding": {
      "provider": "openai-compatible",
      "baseURL": "http://<host>:15001/v1",
      "model": "/models/Qwen3-Embedding-8B",
      "dimensions": 4096
    }

Problem 1 — vLLM 400 on dimensions parameter

Reproduce

curl -s -X POST "http://<vllm-host>/v1/embeddings" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen3-Embedding-8B",
    "input": "test",
    "dimensions": 4096
  }'

Returns:

{
  "error": {
    "message": "Model \"/models/Qwen3-Embedding-8B\" does not support matryoshka representation, changing output dimensions will lead to poor results.",
    "type": "BadRequestError",
    "code": 400
  }
}

Even sending dimensions equal to the model's native dimension is rejected.

Root cause in plugin

src/embedder.ts always sends dimensions for openai-compatible providers when it is configured, because the provider profile sets:

case "openai-compatible":
  return {
    ...,
    taskField: null,
    dimensionsField: "dimensions",   // ← always sent
  };

Workaround currently required

  1. Add the model to EMBEDDING_DIMENSIONS lookup so plugin uses internal dim metadata
  2. Remove dimensions from user config
  3. The plugin then doesn't include dimensions in the API request

But the model isn't in upstream's lookup table:

// Existing entries
"ai/qwen3-embedding": 1024,
"ai/qwen3-embedding:0.6B-F16": 1024,
"ai/qwen3-embedding:4B": 1024,
"ai/qwen3-embedding:4B-Q4_K_M": 1024,
"ai/qwen3-embedding:8B-Q4_K_M": 1024,

These are Docker Model Runner names. vLLM exposes the model with a different name (the path-prefixed /models/Qwen3-Embedding-8B) and the full 8B model has dim 4096, not 1024.

Suggested fix

Two options:

A. Add vLLM Qwen3 entries (minimal):

// Qwen3 (vLLM hosted, native 4096-dim)
"/models/Qwen3-Embedding-8B": 4096,
"Qwen3-Embedding-8B": 4096,

B. Add config flag to suppress dimensions parameter (more general):

embedding: {
  ...
  sendDimensions?: boolean;  // default true; set false for vLLM/non-matryoshka
}

Option B is more general and benefits any non-matryoshka model deployed via OpenAI-compatible APIs.
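As a sketch of how Option B could be wired up (note: `EmbeddingConfig` and `buildEmbeddingBody` are illustrative names, not the plugin's actual internals):

```typescript
// Hypothetical sketch of Option B: omit `dimensions` from the request body
// when sendDimensions is false. Names here are illustrative, not upstream APIs.
interface EmbeddingConfig {
  model: string;
  dimensions?: number;
  sendDimensions?: boolean; // default true; set false for vLLM / non-matryoshka models
}

function buildEmbeddingBody(
  cfg: EmbeddingConfig,
  input: string
): Record<string, unknown> {
  const body: Record<string, unknown> = { model: cfg.model, input };
  // Include `dimensions` only when it is configured AND sending is not disabled.
  if (cfg.dimensions !== undefined && cfg.sendDimensions !== false) {
    body.dimensions = cfg.dimensions;
  }
  return body;
}
```

With `sendDimensions: false`, the vLLM endpoint receives a plain request and never triggers the matryoshka 400.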


Problem 2 — Missing query-side instruction prefix for Qwen3

Background

Qwen3-Embedding model card explicitly recommends adding an instruction prefix on queries only (documents are embedded as-is):

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

Omitting the instruction on the query side causes a documented drop of roughly 1–5% in retrieval performance.

Current state in plugin

embedQuery and embedBatchQuery pass the text unchanged. There's no mechanism to apply a query-side prefix (the existing taskQuery field maps to a separate API parameter, which Qwen3 doesn't use).

Real-world impact

Tested on a memory store of 410 entries, query "PlayerAction" against a BigQuery PlayerAction table document:

| Mode | Top 3 vector results |
| --- | --- |
| No prefix | PlayGD Project (sim 0.15) 🟡 unrelated, DimPlayerAction (0.13), gwt mummy query (0.10) 🟡 unrelated |
| With Qwen3 instruction prefix | DimPlayerAction (0.22) ✓, PlayerAction full schema (0.15) ✓, _p_PlayerAction (0.14) |

For short queries the improvement is dramatic (top-3 went from 1/3 relevant to 3/3 relevant). For longer queries it's the documented 1-5%.

Suggested fix

Add a query-only wrapping helper, gated by model detection:

private isQwen3Model(): boolean {
  const m = this._model.toLowerCase();
  return m.includes("qwen3-embedding") || m.includes("qwen3_embedding");
}

private wrapQwen3Query(text: string): string {
  if (!this.isQwen3Model()) return text;
  const task = this._taskQuery && this._taskQuery.length > 0
    ? this._taskQuery
    : "Given a memory search query, retrieve relevant stored knowledge entries that match the query";
  return `Instruct: ${task}\nQuery:${text}`;
}

async embedQuery(text: string): Promise<number[]> {
  const wrapped = this.wrapQwen3Query(text);
  return this.embedSingle(wrapped, this._taskQuery);
}

async embedBatchQuery(texts: string[]): Promise<number[][]> {
  const wrapped = this.isQwen3Model() ? texts.map((t) => this.wrapQwen3Query(t)) : texts;
  return this.embedMany(wrapped, this._taskQuery);
}

Important: only embedQuery/embedBatchQuery are affected. embedPassage/embedBatchPassage remain unchanged so existing document embeddings remain valid (no need to re-embed).

A more general approach would be a config-driven prefix template:

embedding: {
  ...
  queryPrefixTemplate?: string;  // e.g. "Instruct: {task}\nQuery:{text}"
}

This would handle Qwen3 today and any future model that needs prefix-based query instructions.
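A minimal sketch of how the template could be applied (the `applyQueryPrefix` helper is illustrative, not an existing plugin function):

```typescript
// Hypothetical helper for the config-driven approach: substitute {task} and
// {text} placeholders in a user-supplied template; pass through untouched
// when no template is configured.
function applyQueryPrefix(
  template: string | undefined,
  task: string,
  text: string
): string {
  if (!template) return text; // no template configured → unchanged behavior
  return template.replace("{task}", task).replace("{text}", text);
}
```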



Problem 3 — Smart extraction breaks on reasoning models (Qwen3.5, DeepSeek-R1, etc.)

Background

Modern reasoning models (Qwen3.5, DeepSeek-R1) output their answer with a thinking trace prefix:

Thinking Process:
1. Analyze the request...
2. Draft output...
3. Final check...
</think>

{"actual": "json"}

The thinking text appears in response.choices[0].message.content, NOT in a separate reasoning_content field (depends on vLLM build).

Reproduce

Using Qwen3.5-27B-FP8 via vLLM as the smart-extractor LLM:

  1. Trigger any conversation that produces a memory candidate
  2. src/llm-client.ts calls extractJsonFromResponse(raw) directly on the unfiltered response
  3. JSON parse fails because the response starts with Thinking Process: text
  4. Smart extractor returns null and the whole turn's extraction is silently dropped

Current code (src/llm-client.ts)

const raw = response.choices?.[0]?.message?.content;
if (!raw) { ... return null; }
if (typeof raw !== "string") { ... return null; }

const jsonStr = extractJsonFromResponse(raw);  // ← parses raw including thinking text

Suggested fix

Strip the thinking block before passing to extractJsonFromResponse:

// Strip Qwen3/DeepSeek-R1-style reasoning thinking blocks before JSON extraction.
// Reasoning models output `Thinking Process: ...</think>{actual content}`.
// Take everything after the LAST </think> marker if present.
const cleaned = (() => {
  const idx = raw.lastIndexOf("</think>");
  return idx === -1 ? raw : raw.slice(idx + "</think>".length).trim();
})();

const jsonStr = extractJsonFromResponse(cleaned);

This is universal — it handles any model that uses the </think> reasoning marker (Qwen3.x, DeepSeek-R1, future reasoning models). For non-reasoning models, the lastIndexOf returns -1 and the original raw is used unchanged. Zero impact on existing models.
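For reference, the same logic as a self-contained helper that can be exercised against sample outputs (the name `stripThinking` is illustrative):

```typescript
// Standalone form of the stripping logic above: drop everything up to and
// including the LAST </think> marker; leave non-reasoning output untouched.
function stripThinking(raw: string): string {
  const idx = raw.lastIndexOf("</think>");
  return idx === -1 ? raw : raw.slice(idx + "</think>".length).trim();
}
```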


Test results after applying all fixes

$ openclaw memory-pro search "PlayerAction" --limit 3
Found 3 memories:
1. [73f0e38d] _p_PlayerAction field structure... (62%, vector)
2. [ca331a36] PlayerAction full field structure... (61%, vector)
3. [f46b34eb] DimPlayerAction four fields... (60%, vector)

All top 3 are PlayerAction-related (versus 1/3 without the fix). Existing 410 stored embeddings remain compatible — only query-side embedding paths change.


Compatibility / migration notes

  • No re-embedding required — passage embedder is unchanged
  • Backwards compatible — non-Qwen3 models hit if (!this.isQwen3Model()) return text; and behave identically
  • Cosine similarity check: same model deployed on Ollama (qwen3-embedding:8b-fp16) and vLLM (/models/Qwen3-Embedding-8B) produces vectors with cosine similarity 0.999999 — they are interchangeable for stored embeddings
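The cosine check in the last bullet can be reproduced with a small helper (a sketch; any vector-math utility would do):

```typescript
// Cosine similarity between two equal-length embedding vectors:
// dot(a, b) / (|a| * |b|). Values near 1.0 mean the vectors are interchangeable.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Running this over the same text embedded via the Ollama and vLLM deployments is how the 0.999999 figure above was obtained.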

Patch (working, tested)

I have a working, idempotent drop-in patch script (Python-based, with regex anchors so it survives upstream version drift) that applies all three fixes and works against both 1.1.0-beta.10 and current main. Happy to submit it as a PR with one of the suggested approaches above; please indicate a preference.

