Add Support for Guided Decoding to On Device Sampling #624

quic-sanising · 2025-11-19T01:59:19Z

✨ Add Support for Guided Decoding to On Device Sampling

📌 Overview

This PR introduces guided decoding capabilities in On Device Sampling for QEffForCausalLM and QEffCausalLMForTextImageToTextModel models.

🚀 Motivation

As outlined in this blog on structured decoding, structured decoding represents a fundamental shift in controlling LLM outputs. Instead of relying on post-processing, constraints are enforced during token generation via logits manipulation. This approach ensures:

Format compliance at generation time.
Reduced error rates for structured outputs.
Performance improvements through optimized backends like XGrammar, which can deliver up to 5× faster token generation under load.

The constraints are provided through token_bitmasks which is a Boolean matrix of shape (batch_size, vocab_size). Here, each element indicates whether a token should be kept (1) or masked (0). During sampling, this mask is applied to the logits before token selection, ensuring that only allowed tokens are considered.

By performing this operation directly on the device, we eliminate host-device transfers, reduce latency, and improve throughput for structured decoding workloads.

🛠️ Implementation Details

The guided decoding logic is injected via include_guided_decoding=True during model loading. No changes to the model architecture are required.

from QEfficient import QEFFAutoModelForCausalLM as AutoModelForCausalLM

# Load model with On Device Sampler enabled
qeff_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    continuous_batching=True,
    qaic_config={
        "include_sampler": True,
        "return_pdfs": False,
        "max_top_k_ids": 512,
        "include_guided_decoding": True,
    },
)

# Compile as usual
qeff_model.compile(
    prefill_seq_length=128,
    ctx_len=256,
    full_batch_size=16,
    num_devices=4,
    num_speculative_tokens=0,
    mxint8_kv_cache=True,
    mxfp6_matmul=True,
)

To disable guided decoding, simply set include_guided_decoding=False.

Signed-off-by: quic-xiyushi <[email protected]>

Signed-off-by: quic-sanising <[email protected]> Signed-off-by: sanising <[email protected]>

Signed-off-by: quic-xiyushi <[email protected]>

Signed-off-by: sanising <[email protected]>

quic-sanising · 2025-11-19T01:59:46Z

Depends on #597

Signed-off-by: sanising <[email protected]>

Signed-off-by: quic-xiyushi <[email protected]>

Signed-off-by: sanising <[email protected]>

Signed-off-by: quic-xiyushi <[email protected]>

Signed-off-by: sanising <[email protected]>

Signed-off-by: quic-xiyushi <[email protected]>

Signed-off-by: sanising <[email protected]>

quic-sanising · 2025-11-21T21:04:54Z

Ready for review

Signed-off-by: sanising <[email protected]>

Signed-off-by: quic-xiyushi <[email protected]>

Signed-off-by: Mamta Singh <[email protected]>

quic-mamta · 2025-12-07T18:24:02Z

QEfficient/transformers/sampler/sampler.py

    top_ps: Optional[torch.Tensor] = None,
    min_ps: Optional[torch.Tensor] = None,
    random_numbers: Optional[torch.Tensor] = None,
+    vision_embeds: Optional[torch.Tensor] = None,


please keep dtype of these 2 consistent as per lines 27-28. also update function docstring for these newly added args.

quic-mamta · 2025-12-07T20:38:38Z