Newly Onboarded Models
Causal Models
GPT OSS (Text)
Example script: efficient-transformers/examples/disagg_serving
Mistral 3.1 (Text)
Example script: mistral3_example.py
Qwen 3 / Qwen3-MoE
Example script: qwen3moe_example
Olmo
Example script: efficient-transformers/examples/text_generation
Multimodal Models
Qwen 2.5-VL (Vision-Language)
Example script: qwen2_5_vl_example.py
Molmo (Vision-Language)
Example script: molmo_example.py
Gemma 3 (Vision-Language)
Example script: gemma3_example
InternVL 3.5 (Vision-Language)
Example script: intern_example
Audio
Wav2Vec2 (ASR)
Example script: wav2vec2_example
Diffusion Models
Example scripts: efficient-transformers/examples/diffusers
New Features
Diffusers Pipeline Support
Diffusers pipeline support enables seamless integration of diffusion models into the QEfficient library. (#604) (#669)
Supported models:
Example scripts: efficient-transformers/examples/diffusers
GPT OSS with Disaggregated Serving
Support for GPT OSS using disaggregated serving. (#608)
Example scripts: efficient-transformers/examples/disagg_serving
Compute-Context-Length (CCL) Support
This feature allows users to optimize the throughput of large language models (LLMs) when handling very long context lengths. (#576) (#663)
Example scripts: efficient-transformers/examples/performance/compute_context_length
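The idea behind CCL can be sketched as follows: instead of always running attention over the full compiled context window, the runtime picks the smallest specialized "compute" window that covers the tokens seen so far. This is a pure-Python toy; the `pick_ccl_bucket` helper and the bucket sizes are illustrative assumptions, not QEfficient API.

```python
def pick_ccl_bucket(seq_len, ccl_buckets, max_ctx_len):
    """Return the smallest compute-context-length bucket that covers seq_len.

    Falls back to the full context length when no bucket is large enough.
    """
    for bucket in sorted(ccl_buckets):
        if seq_len <= bucket:
            return bucket
    return max_ctx_len

# During decode, attention only needs to span the tokens generated so far,
# so a 3k-token conversation can run with a 4k compute window even if the
# model was compiled for a 128k maximum context.
assert pick_ccl_bucket(3000, [4096, 16384, 65536], 131072) == 4096
assert pick_ccl_bucket(70000, [4096, 16384, 65536], 131072) == 131072
```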
ONNX Sub Functions Export Feature for AutoModelForCausalLM
This feature enables more efficient model compilation and execution on hardware. Users can enable it by passing use_onnx_subfunctions=True during export. (#621) (#642)
model.export(tmp_path, use_onnx_subfunctions=True)
Note: Currently, we are seeing some performance degradation and output discrepancies with the subfunction feature. We will continue to monitor and evaluate its behavior, and once these issues are resolved, it will be enabled by default.
Continuous Batching for VLMs
VLMs now support continuous batching, including scenarios with multiple images and prompts. (#610)
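The scheduling idea can be sketched with a pure-Python toy: finished requests free their batch slot immediately, so queued requests join mid-flight instead of waiting for the whole batch to drain. The `continuous_batching` helper and its request tuples are illustrative assumptions, not QEfficient API.

```python
from collections import deque

def continuous_batching(requests, num_slots):
    """Toy continuous-batching loop: each request is (id, decode_steps).

    A finished request frees its slot right away, letting a waiting
    request (e.g. a new image+prompt pair) start without a full drain.
    """
    queue = deque(requests)
    slots = [None] * num_slots
    completed = []
    while queue or any(slots):
        # Fill empty slots from the waiting queue.
        for i in range(num_slots):
            if slots[i] is None and queue:
                rid, steps = queue.popleft()
                slots[i] = [rid, steps]
        # One decode step for every active request.
        for i in range(num_slots):
            if slots[i] is not None:
                slots[i][1] -= 1
                if slots[i][1] == 0:
                    completed.append(slots[i][0])
                    slots[i] = None
    return completed

# "c" slips into the slot freed by "a" while "b" is still decoding.
print(continuous_batching([("a", 2), ("b", 5), ("c", 1)], num_slots=2))  # → ['a', 'c', 'b']
```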
BlockedKV Attention in CausalLM
Implements a blocked K/V cache layout so attention reads/processes the cache block-by-block, improving long-context decode performance. (#618)
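A blocked cache layout can be illustrated with a minimal sketch: attention scores are computed one fixed-size block of keys at a time rather than over the whole cache at once. The `blocked_attention_scores` helper is a hypothetical pure-Python illustration, not the library's implementation.

```python
import math

def blocked_attention_scores(query, k_cache, block_size):
    """Compute dot-product scores against a K cache stored in fixed-size
    blocks, reading and processing one block at a time.

    query: list[float]; k_cache: list of key vectors (list[float]).
    """
    scores = []
    num_blocks = math.ceil(len(k_cache) / block_size)
    for b in range(num_blocks):
        # One contiguous block read instead of touching the whole cache.
        block = k_cache[b * block_size:(b + 1) * block_size]
        for key in block:
            scores.append(sum(q * k for q, k in zip(query, key)))
    return scores

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
assert blocked_attention_scores(q, keys, block_size=2) == [1.0, 0.0, 2.0]
```

Block-by-block access keeps the working set small and cache-friendly, which is where the long-context decode gains come from.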
Memory Profiling Tool
Adds scripts to profile memory during export/compile/infer (peak usage, cache footprint) for quicker diagnosis. (#674)
Scripts: efficient-transformers/scripts/memory_profiling
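The kind of measurement these scripts perform can be sketched with the standard-library tracemalloc module; the `profile_peak_memory` helper below is an illustrative stand-in, not the tool's actual interface.

```python
import tracemalloc

def profile_peak_memory(fn, *args, **kwargs):
    """Run fn and report its peak Python-heap allocation in bytes."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

def build_cache(n):
    # Stand-in for a large allocation during export/compile/infer.
    return [0] * n

cache, peak = profile_peak_memory(build_cache, 1_000_000)
print(f"peak heap usage: {peak / 1e6:.1f} MB")
```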
Extended On-Device Sampling
This feature extends on-device sampling support to the language decoder of dual-QPC vision-language models and adds guided decoding capabilities to on-device sampling. (#597) (#624)
Example script: efficient-transformers/examples/performance/on_device_sampling.py
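The core of guided decoding can be sketched as logit masking: tokens the guide disallows are set to negative infinity before sampling. The `guided_logits` and `greedy_pick` helpers are illustrative assumptions, not QEfficient API.

```python
def guided_logits(logits, allowed_token_ids):
    """Mask logits so sampling can only pick tokens permitted by the
    guide (e.g. the current state of a JSON grammar)."""
    neg_inf = float("-inf")
    allowed = set(allowed_token_ids)
    return [l if i in allowed else neg_inf for i, l in enumerate(logits)]

def greedy_pick(logits):
    """Argmax over the (possibly masked) logits."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [0.1, 2.5, 0.3, 1.9]
# Unconstrained decoding would pick token 1; the guide forbids it,
# so decoding falls back to the best allowed token.
assert greedy_pick(guided_logits(logits, allowed_token_ids=[0, 3])) == 3
```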
ONNX Transform, Memory & Time Optimizations
Adds periodic memory cleanup (e.g., to FP16ClipTransform / SplitTensorsTransform) during large-tensor processing, and avoids redundant external-data loading when the data is already present. (#640)
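The cleanup pattern can be sketched in a few lines: drop references to source tensors as soon as they are transformed and trigger collection every few iterations so peak memory stays bounded. The `transform_tensors` helper and its `cleanup_every` parameter are illustrative assumptions, not the library's implementation.

```python
import gc

def transform_tensors(tensors, transform, cleanup_every=8):
    """Apply transform to each tensor, releasing each source as it is
    consumed and forcing a collection every few tensors."""
    out = []
    for i, t in enumerate(tensors):
        out.append(transform(t))
        tensors[i] = None          # drop the reference to the source tensor
        if (i + 1) % cleanup_every == 0:
            gc.collect()           # periodic cleanup between large tensors
    return out

halved = transform_tensors([[2.0, 4.0], [6.0]], lambda t: [x / 2 for x in t])
assert halved == [[1.0, 2.0], [3.0]]
```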
Dependency Upgrades
- Transformers 4.55
- Torch 2.7.0+cpu
- Torchvision 0.22.0+cpu
- Python ≥3.9
Removed Platform SDK Dependency
Adds support for QPC generation on systems without the Platform SDK. (#609)
Example Scripts Revamp
This includes:
- Onboarding Guide for adding new Causal models (#574)
- Onboarding Guide for adding new Custom ops in QEff (#638)
- Organized examples into domain-specific subdirectories (#615)
Fine Tuning
Checkpoint Management
Resume from epochs with proper state restoration. Adds resume-from-epoch & epoch checkpoint loading so runs can be restarted with the correct optimizer/scaler/model state. (#614)
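Resume-from-epoch can be sketched as follows: each epoch checkpoints the model, optimizer, and scaler state, and a restarted run picks up at the epoch after the last saved one instead of epoch 0. This is a JSON-based toy with hypothetical `save_checkpoint` / `load_checkpoint` / `train` helpers, not the fine-tuning code's actual format.

```python
import json
import os
import tempfile

def save_checkpoint(path, epoch, model_state, optimizer_state, scaler_state):
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "model": model_state,
                   "optimizer": optimizer_state, "scaler": scaler_state}, f)

def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)

def train(total_epochs, ckpt_path):
    start, state = 0, {"w": 0.0}
    if os.path.exists(ckpt_path):
        ckpt = load_checkpoint(ckpt_path)
        # Resume after the last completed epoch with the saved state.
        start, state = ckpt["epoch"] + 1, ckpt["model"]
    for epoch in range(start, total_epochs):
        state["w"] += 1.0  # stand-in for one epoch of parameter updates
        save_checkpoint(ckpt_path, epoch, state, {"lr": 1e-4}, {"scale": 2.0})
    return state

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
train(2, path)          # "interrupted" run finishes epochs 0-1
final = train(4, path)  # restart resumes at epoch 2, not epoch 0
assert final == {"w": 4.0}
```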
Enhanced Loss Tracking
Corrected data type handling for accurate loss computation. Refinements in the finetune/eval path improve numerical stability when computing losses and metrics. (#606)
Custom Dataset Support
Improved handling with better tokenization. Fixes around padding/token typing (e.g., pad_to_max_length) ensure robust dataset ingestion across varied corpora. (#599)
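The padding behavior involved can be sketched with a minimal helper: sequences are right-padded (or truncated) to a fixed length with a matching attention mask. The `pad_to_max_length` function below is an illustrative sketch, not the dataset code's actual signature.

```python
def pad_to_max_length(token_ids, max_length, pad_id=0):
    """Right-pad (or truncate) a token sequence to max_length and return
    an attention mask: 1 for real tokens, 0 for padding."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)
    pad = max_length - len(ids)
    return ids + [pad_id] * pad, mask + [0] * pad

ids, mask = pad_to_max_length([101, 7592, 102], max_length=5)
assert ids == [101, 7592, 102, 0, 0]
assert mask == [1, 1, 1, 0, 0]
```

Keeping ids and mask as plain integer lists (rather than mixed types) is the kind of token-typing consistency the fix targets.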
Device-Aware Scaling
Optimized GradScaler for multi-device training. DDP + pipeline-parallel fixes improve device mapping/scaling behavior during mixed-precision training. (#544)