Autocomplete not working with VLLM Backend #8617

tg-arraylabs · 2025-11-06T17:55:14Z

tg-arraylabs
Nov 6, 2025

Hi, I'm self hosting VLLM using Qwen/Qwen3-Coder-30B-A3B-Instruct as my model. I'm able to use this model without issue for the chat role, but autocomplete doesn't work. I can see the autocomplete requests getting POSTed to VLLM, but nothing comes back in my IDE.

Any advice would be helpful.

Here is my continue.dev configuration:

name: config
version: 0.1.0
schema: v1
models:
  - name: Qwen3-Coder
    provider: vllm
    model: Qwen3
    apiBase: https://vllm.myserver.com/v1
    roles:
      - autocomplete
      - chat 
    autocompleteOptions:
      disable: false
      debounceDelay: 250
      modelTimeout: 5000
      maxSuffixPercentage: 0.2
      prefixPercentage: 0.3

VLLM configuration:

services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    ports:
      - "8080:8080"
    runtime: nvidia
    ipc: host
    volumes:
      - ./models:/models
    environment:
      HF_HOME: /models
      HUGGINGFACE_HUB_CACHE: /models
    command: >
        --model Qwen/Qwen3-Coder-30B-A3B-Instruct
        --served-model-name Qwen3
        --gpu-memory-utilization 0.95
        --port 8080
        --max-model-len 32768
        --max-num-seqs 16
        --tensor-parallel-size 4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [gpu]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Autocomplete not working with VLLM Backend #8617

Uh oh!

{{title}}

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

Autocomplete not working with VLLM Backend #8617

Uh oh!

tg-arraylabs Nov 6, 2025

Replies: 0 comments

tg-arraylabs
Nov 6, 2025