release/v1.21.0

@quic-rishinr released this 22 Dec 18:08 · 39 commits to main since this release · commit 9757c11

Newly Onboarded Models

Causal Models

GPT OSS (Text)

Example script: efficient-transformers/examples/disagg_serving

Mistral 3.1 (Text)

Example script: mistral3_example.py

Qwen 3 / Qwen3-MoE

Example script: qwen3moe_example

Olmo

Example script: efficient-transformers/examples/text_generation

Multimodal Models

Qwen 2.5-VL (Vision-Language)

Example script: qwen2_5_vl_example.py

Molmo (Vision-Language)

Example script: molmo_example.py

Gemma 3 (Vision-Language)

Example script: gemma3_example

InternVL 3.5 (Vision-Language)

Example script: intern_example

Audio

Wav2Vec2 (ASR)

Example script: wav2vec2_example

Diffusion Models

Example scripts: efficient-transformers/examples/diffusers


New Features

Diffusers Pipeline Support

Diffusers pipeline support enables seamless integration of diffusion models into the QEfficient library. (#604) (#669)

Supported models and example scripts: efficient-transformers/examples/diffusers

GPT OSS with Disaggregated Serving

Support for GPT OSS using the disaggregated serving model. (#608)

Example scripts: efficient-transformers/examples/disagg_serving

Compute-Context-Length (CCL) Support

Allows users to optimize the throughput of large language models (LLMs) when handling very long context lengths. (#576) (#663)

Example scripts: efficient-transformers/examples/performance/compute_context_length
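The idea behind CCL can be sketched without the library (the function name and bucket values below are ours, not the QEfficient API): the model is compiled for several compute-context-length buckets, and at each step the runtime picks the smallest bucket that covers the tokens seen so far, so attention cost tracks the actual context rather than the maximum context length.

```python
# Illustrative sketch of CCL bucket selection (hypothetical names, not the
# library's API). Buckets are hypothetical compile-time CCL values.

def pick_ccl(num_tokens: int, ccl_buckets: list) -> int:
    """Return the smallest compiled bucket that fits the current context."""
    for bucket in sorted(ccl_buckets):
        if num_tokens <= bucket:
            return bucket
    raise ValueError(f"context {num_tokens} exceeds largest bucket {max(ccl_buckets)}")

buckets = [1024, 4096, 16384]
print(pick_ccl(800, buckets))    # 1024 — short prompt pays for a small window
print(pick_ccl(5000, buckets))   # 16384 — only long contexts pay full price
```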

ONNX Subfunctions Export Feature for AutoModelForCausalLM

Enables more efficient model compilation and execution on hardware. Enable the feature by passing `use_onnx_subfunctions=True` during export. (#621) (#642)

```python
model.export(tmp_path, use_onnx_subfunctions=True)
```

Note: Currently, we are seeing some performance degradation and output discrepancies with the subfunction. We will continue to monitor and evaluate its behavior, and once these issues are resolved, the subfunction will be enabled by default.

Continuous Batching for VLMs

VLMs now support continuous batching, including scenarios with multiple images and prompts. (#610)

BlockedKV Attention in CausalLM

Implements a blocked K/V cache layout so attention reads/processes the cache block-by-block, improving long-context decode performance. (#618)
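The block-by-block idea can be illustrated with a minimal single-head, scalar-key sketch (our illustration, not the library's kernel): keys and values are stored in fixed-size blocks and attention walks the cache one block at a time, accumulating the softmax online so only one block needs to be resident at once.

```python
import math

BLOCK = 4  # illustrative block size

def blocked_attention(q, k_blocks, v_blocks):
    """Softmax attention over a blocked K/V cache, one block at a time."""
    max_score, denom, acc = float("-inf"), 0.0, 0.0
    for kb, vb in zip(k_blocks, v_blocks):
        for k, v in zip(kb, vb):
            s = q * k                       # scalar dot product in this toy
            new_max = max(max_score, s)
            scale = math.exp(max_score - new_max)  # rescale running sums
            w = math.exp(s - new_max)
            denom = denom * scale + w
            acc = acc * scale + w * v
            max_score = new_max
    return acc / denom

keys = [0.1 * i for i in range(10)]
vals = [float(i) for i in range(10)]
k_blocks = [keys[i:i + BLOCK] for i in range(0, len(keys), BLOCK)]
v_blocks = [vals[i:i + BLOCK] for i in range(0, len(vals), BLOCK)]

out = blocked_attention(0.5, k_blocks, v_blocks)
# Matches unblocked softmax attention over the same cache:
flat = sum(math.exp(0.5 * k) * v for k, v in zip(keys, vals)) / \
       sum(math.exp(0.5 * k) for k in keys)
assert abs(out - flat) < 1e-9
```

The online rescaling is what lets blocks be processed independently; the real implementation applies the same idea to batched, multi-head tensors.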

Memory Profiling Tool

Adds scripts to profile memory during export/compile/infer (peak usage, cache footprint) for quicker diagnosis. (#674)

Scripts: efficient-transformers/scripts/memory_profiling
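As a generic illustration of the kind of peak-memory measurement such scripts automate (this is plain `tracemalloc`, not the scripts' actual implementation):

```python
import tracemalloc

def peak_memory_mb(fn, *args, **kwargs):
    """Run fn and return (result, peak Python-heap usage in MiB)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)

def build_fake_cache(n):
    # Stand-in for a KV-cache allocation: n blocks of 1 KiB each.
    return [bytearray(1024) for _ in range(n)]

cache, peak_mb = peak_memory_mb(build_fake_cache, 4096)
print(f"peak heap during build: {peak_mb:.1f} MiB")
```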

Extended On-Device Sampling

Extends on-device sampling support to the language decoder of dual-QPC vision-language models and adds guided decoding capabilities to on-device sampling. (#597) (#624)

Example script: efficient-transformers/examples/performance/on_device_sampling.py
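Guided decoding on top of sampling can be sketched conceptually (our pure-Python illustration, not the on-device kernel): disallowed token ids are masked to negative infinity before the softmax, so the sampled token is always drawn from the allowed set.

```python
import math
import random

def guided_sample(logits, allowed_ids, rng):
    """Sample one token id from logits, restricted to allowed_ids (non-empty)."""
    masked = [l if i in allowed_ids else float("-inf")
              for i, l in enumerate(logits)]
    m = max(masked)                              # for numerical stability
    probs = [math.exp(l - m) for l in masked]    # disallowed ids get 0.0
    r = rng.random() * sum(probs)
    last = None
    for i, p in enumerate(probs):
        if p > 0.0:
            r -= p
            if r <= 0.0:
                return i
            last = i
    return last                                   # guard against float rounding

rng = random.Random(0)
token = guided_sample([2.0, 1.0, 0.5, 3.0], allowed_ids={1, 2}, rng=rng)
assert token in {1, 2}
```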

ONNX Transform, Memory & Time Optimizations

Adds periodic memory cleanup (e.g., to FP16ClipTransform / SplitTensorsTransform) during large-tensor processing, and avoids redundant external data loading when already present. (#640)

Dependency Upgrades

  • Transformers 4.55
  • Torch 2.7.0+cpu
  • Torchvision 0.22.0+cpu
  • Python ≥3.9

Removed Platform SDK Dependency

Supports QPC generation on systems without the Platform SDK. (#609)

Example Scripts Revamp

This includes:

  • Onboarding Guide for adding new Causal models (#574)
  • Onboarding Guide for adding new Custom ops in QEff (#638)
  • Organized examples into domain-specific subdirectories (#615)

Fine Tuning

Checkpoint Management

Resume from epochs with proper state restoration. Adds resume-from-epoch & epoch checkpoint loading so runs can be restarted with the correct optimizer/scaler/model state. (#614)
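A minimal, framework-agnostic sketch of the resume-from-epoch pattern (the actual implementation uses the trainer's own state dicts rather than pickle): the epoch index is saved alongside model, optimizer, and scaler state so a restart restores all of them together and resumes at the next epoch.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, epoch, model_state, optim_state, scaler_state):
    # One file per epoch, bundling everything needed to resume.
    with open(path, "wb") as f:
        pickle.dump({"epoch": epoch, "model": model_state,
                     "optimizer": optim_state, "scaler": scaler_state}, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["epoch"] + 1, ckpt   # resume from the *next* epoch

path = os.path.join(tempfile.mkdtemp(), "epoch_3.ckpt")
save_checkpoint(path, epoch=3, model_state={"w": [0.1]},
                optim_state={"lr": 1e-4}, scaler_state={"scale": 2.0 ** 16})
start_epoch, ckpt = load_checkpoint(path)
assert start_epoch == 4 and ckpt["optimizer"]["lr"] == 1e-4
```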

Enhanced Loss Tracking

Corrected data type handling for accurate loss computation. Refinements in the finetune/eval path improve numerical stability when computing losses and metrics. (#606)

Custom Dataset Support

Improved handling with better tokenization. Fixes around padding/token typing (e.g., pad_to_max_length) ensure robust dataset ingestion across varied corpora. (#599)

Device-Aware Scaling

Optimized GradScaler for multi-device training. DDP + pipeline-parallel fixes improve device mapping/scaling behavior during mixed-precision training. (#544)