Create meeting_4_date_May_15.md #109

42 changes: 42 additions & 0 deletions generative-ai/meetings/meeting_4_date_May_15.md
# Recording and Transcript:

https://zoom.us/rec/share/Zp0h_Wni7ERIsa4ixCIUffc_H6Kh9vkgr4Heh9kvtXYAwHTx6uqhczZ-psWy3EU.Ihb-C-mKMKL8a0Vl

# Meeting Minutes:

- ONNX Release Timeline
  - The next release is more likely to be 1.19 rather than 2.0, to avoid breaking compatibility. A 2.0 discussion might occur at a community meetup. The potential timeframe mentioned for 1.19 was July-September.
  - Yuan is the release manager. We can potentially align some of the proposals from the GenAI group with the next release.
- KV Cache Operator Proposal (Document: Attention op proposal):
  - Separation from Attention: Yuan proposed separating KV cache updates into a new, distinct operator rather than folding them into the main Attention operator.
  - Reasoning: Handling batched KV cache updates (where new tokens are appended after each sequence's valid data, accounting for padding) becomes too complex to express with primitive ops if it is integrated directly into the Attention op.
  - Functionality: The new KV Cache op would take past_kv, present_kv (the new tokens), and past_sequence_length as inputs and output the updated present_k and present_v. It essentially performs a "scatter" or "in-place update" into a larger tensor representing the cache (see the reference sketch after this list).
  - In-Place Updates:
    - The design supports in-place updates where past_k/v and present_k/v have the same shape (with the sequence dimension sized to max_sequence_length).
    - It was clarified that the ONNX op itself remains functional; true in-place memory optimization is a backend/compiler responsibility. The op design facilitates this by ensuring shape compatibility and leaving it to backends to use aliasing APIs or lifetime analysis to reuse buffers when the user provides the same buffer for the cache input and output.
  - Circular Buffers:
    - An open question from Gaurav was how to handle circular buffer updates, i.e., when kv_sequence_length exceeds max_sequence_length (see the wrap-around sketch after this list).
    - If supported, the modulo logic for wrap-around indexing would need to be part of the op's specification.
    - max_sequence_length would likely be derived from the input tensor shapes rather than being an explicit attribute of the op, allowing a single model to work with varying sequence lengths.
- Attention Operator Updates (Related to KV Cache):
  - With a separate KV Cache op, the Attention op would take q, k, v (where k and v are outputs of the KV Cache op), attention_mask, and kv_sequence_length as inputs (see the dataflow sketch after this list).
- Quantization Subgraphs:
  - Yuan discussed incorporating attribute subgraphs for quantization/dequantization (Q/DQ) operations related to attention inputs/outputs.
  - Rama expressed reservations about adding prologue/epilogue Q/DQ ops as attributes, suggesting they could remain outside the Attention op; he was more open to it for operations within the attention mechanism. Yuan suggested that internalizing them could simplify fusion for backends. Rama planned to review this further.
  - How restrictive these subgraphs should be (e.g., allowing only Q/DQ/Cast vs. more complex operations like score_mod for Flex Attention) was left as an open point.
- Flex Attention Representation:
  - Yamini raised the issue of how to represent Flex Attention models in ONNX, since current PyTorch exports produce a higher-order op rather than standard ONNX ops (see the score_mod example after this list).
  - Justin Chu suggested that submodules captured within the exported program could potentially be represented as ONNX functions; the main logic of the higher-order op could then reference these functions (similar to how the ‘If’ operator works).
  - This approach requires further investigation into the PyTorch-side compilation and main logic. Performance optimizations for such a representation in ONNX would be crucial.
- Exporter Considerations:
  - The team acknowledged the need to examine how current model exporters (e.g., Optimum, torch.dynamo via the Transformers ExecuTorch integration) handle KV caching, to ensure the new ops align with existing practices.
  - Yamini mentioned recent experience with the Transformers ExecuTorch integration, using torch.dynamo with a static cache.

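To make the "Functionality" bullet for the proposed KV Cache op concrete, below is a minimal NumPy sketch of the described scatter/in-place-update semantics: new tokens are written into a preallocated cache starting at each batch element's past_sequence_length. The tensor names and layout here are assumptions based on the discussion, not the final operator spec.

```python
import numpy as np

def kv_cache_update(past_k, past_v, new_k, new_v, past_sequence_length):
    """Illustrative semantics for the proposed KV Cache op (not the final spec).

    past_k, past_v       : (batch, num_heads, max_sequence_length, head_dim)
                           preallocated caches holding previously computed keys/values
    new_k, new_v         : (batch, num_heads, new_tokens, head_dim)
                           keys/values for the tokens produced in this step
    past_sequence_length : (batch,) int, number of valid cached tokens per batch element
                           (the per-sequence offset is what makes the batched, padded
                           case awkward to express with primitive ops inside Attention)

    Returns the updated caches. The op stays functional; a backend may alias the
    input and output buffers to turn this into a true in-place update.
    """
    present_k = past_k.copy()
    present_v = past_v.copy()
    new_tokens = new_k.shape[2]
    for b in range(past_k.shape[0]):
        start = int(past_sequence_length[b])
        present_k[b, :, start:start + new_tokens, :] = new_k[b]
        present_v[b, :, start:start + new_tokens, :] = new_v[b]
    return present_k, present_v
```
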
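The circular-buffer question raised by Gaurav would change only the indexing: once the valid length reaches max_sequence_length (read from the cache tensor's shape), writes wrap around and overwrite the oldest entries. A hypothetical sketch of that wrap-around variant, again not part of the current proposal:

```python
import numpy as np

def kv_cache_update_circular(past_k, new_k, past_sequence_length):
    """Hypothetical wrap-around variant (keys only; values would be analogous)."""
    present_k = past_k.copy()
    # max_sequence_length comes from the cache shape rather than an op attribute,
    # so a single model can be used with different cache sizes.
    max_sequence_length = past_k.shape[2]
    new_tokens = new_k.shape[2]
    for b in range(past_k.shape[0]):
        start = int(past_sequence_length[b])
        for t in range(new_tokens):
            # Modulo indexing overwrites the oldest entries once the cache is full;
            # this wrap-around logic would have to be spelled out in the op spec.
            present_k[b, :, (start + t) % max_sequence_length, :] = new_k[b, :, t, :]
    return present_k
```
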
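For the Attention-side change, the intent is that the Attention op consumes the caches produced by the KV Cache op and attends only to the valid portion indicated by kv_sequence_length. The NumPy sketch below illustrates that dataflow with plain scaled dot-product attention; it is a simplification for illustration, not the actual ONNX Attention specification under discussion.

```python
import numpy as np

def attention_over_cache(q, k_cache, v_cache, kv_sequence_length):
    """Scaled dot-product attention over the valid part of the cache.

    q                  : (batch, num_heads, q_tokens, head_dim)
    k_cache, v_cache   : (batch, num_heads, max_sequence_length, head_dim), from the KV Cache op
    kv_sequence_length : (batch,) number of valid cache entries (assumed >= 1)
    """
    head_dim = q.shape[-1]
    max_len = k_cache.shape[2]
    scores = q @ k_cache.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    # Mask out cache slots beyond each batch element's valid length.
    valid = np.arange(max_len)[None, :] < kv_sequence_length[:, None]   # (batch, max_len)
    scores = np.where(valid[:, None, None, :], scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_cache

# Hypothetical wiring of one decode step under the proposed split:
#   present_k, present_v = kv_cache_update(past_k, past_v, new_k, new_v, past_len)
#   out = attention_over_cache(q, present_k, present_v, past_len + new_k.shape[2])
```
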
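For context on the Flex Attention discussion: in PyTorch, flex_attention takes a user-supplied score_mod callable that rewrites individual attention scores, and it is this higher-order argument that has no direct counterpart in standard ONNX ops. A minimal PyTorch example of a relative-position-bias score_mod (not from the meeting; flex_attention was introduced in PyTorch 2.5):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def relative_position_bias(score, batch, head, q_idx, kv_idx):
    # score_mod receives one attention score plus its (batch, head, query, key)
    # indices and returns the modified score.
    return score + (q_idx - kv_idx)

q = torch.randn(1, 8, 128, 64)   # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
out = flex_attention(q, k, v, score_mod=relative_position_bias)
```
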
## Action Items:
- Yuan: Continue developing the specifications for the new KV Cache operator and the updates to the Attention operator.
- Rama: Review the attention proposal regarding attribute subgraphs for quantization in the Attention operator and provide feedback.
- Yamini: Post a link on Slack regarding the Transformers ExecuTorch exporter with static cache, and experiment with more score_mod and mask_mod configurations for Flex Attention.
- Justin Chu: Take a deeper look into representing Flex Attention's exported program logic in ONNX, exploring the use of ONNX functions.
- All Attendees:
  - Review Yuan's proposed spec (Attention op proposal) updates for the KV Cache and Attention operators.
  - Contribute to the backend representations document shared by Yamini on Slack.