Linearize gpt_oss model and add separate example to quantize it to w4a8 #2091

isharif168 · 2025-12-03T13:22:19Z

SUMMARY:
Add a test that converts the gpt_oss model and then quantizes the model to w8a8.

TEST PLAN:
Tested with the gpt_oss model and ran the result using vLLM.

Sample output

=== Prompt 0 ===
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-12-04

Reasoning: medium

You are a helpful assistant.

<|end|><|start|>user<|message|>Give 3 reasons to use AI.<|end|><|start|>assistant

--- Candidate 0 ---
analysisWe need to answer with three reasons to use AI. Should be short. Provide 3 reasons. Let's do it.assistantcommentaryHere are three reasons to use AI:

Personalized learning – AI can adapt to your learning style and pace, providing customized lessons that suit your learning style and progress.
Skill development – AI can help you develop skills that are in demand, such as coding, data analysis, and data science.
Supportive design

We should provide a short answer.assistantfinalHere are three reasons to use AI:

Personalized learning – AI adapts to your learning style and pace, giving you personalized lessons that fit your needs.
Skill development – AI helps you learn and practice skills that are in demand, like coding, data analysis, and data‑science skills.
Real‑world impact – AI helps you practice real-world skills in real life, such as learning new skills and applying them in real life.
finish_reason: stop
num_tokens: 226

gemini-code-assist · 2025-12-03T13:22:34Z

Summary of Changes

Hello @isharif168, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces new modeling utilities to enable the quantization of GPT-OSS Mixture-of-Experts (MoE) models. It provides a mechanism to transform the MoE architecture from a fused-expert design to a sequential, per-expert structure, which is a crucial step for applying certain quantization techniques. The changes aim to improve compatibility and efficiency when compressing these advanced language models.

Highlights

New GPT-OSS Model Utilities: A new Python file, src/llmcompressor/modeling/gpt_oss.py, has been added to introduce utilities specifically for handling GPT-OSS Mixture-of-Experts (MoE) models.
MoE Layer Conversion for Quantization: The convert_model_for_quantization_gptoss function is introduced, designed to refactor fused-expert MoE modules into a sequential, per-expert format, which is often necessary for quantization processes.
Sequential Expert Implementation: The SequentialGPTOSSMoE class replaces the original fused MoE layer, creating individual GPTOSSMLP modules for each expert and carefully transferring weights from the original fused tensors.
Individual Expert MLP Definition: The GPTOSSMLP class defines the structure and forward pass for a single expert's Multi-Layer Perceptron, including gate, up, and down projection layers with activation functions.
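The fused-to-sequential conversion described in the highlights can be illustrated with a minimal sketch. It uses plain Python lists instead of tensors so that it runs without dependencies, and it assumes a fused weight laid out as `[experts][in_features][out_features]`. The names `ExpertProjection` and `split_fused_experts` are hypothetical and are not taken from the PR.

```python
from typing import List

Matrix = List[List[float]]  # [in_features][out_features]


def matvec(x: List[float], w: Matrix) -> List[float]:
    """Compute x @ w for a single token."""
    return [sum(x[h] * w[h][i] for h in range(len(x))) for i in range(len(w[0]))]


class ExpertProjection:
    """Hypothetical per-expert layer that owns its own weight matrix."""

    def __init__(self, weight: Matrix) -> None:
        # Copy the slice so the expert no longer aliases the fused storage.
        self.weight = [row[:] for row in weight]

    def __call__(self, x: List[float]) -> List[float]:
        return matvec(x, self.weight)


def split_fused_experts(fused: List[Matrix]) -> List[ExpertProjection]:
    """Turn a fused [E][H][I] weight into E independent modules."""
    return [ExpertProjection(w) for w in fused]


# Two experts, hidden size 2, output size 3 (made-up values).
fused_weight = [
    [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]],
    [[0.5, 0.5, 0.5], [2.0, 0.0, -1.0]],
]
experts = split_fused_experts(fused_weight)
token = [1.0, 2.0]

fused_out = [matvec(token, w) for w in fused_weight]    # indexes the fused weight
sequential_out = [expert(token) for expert in experts]  # calls each module in turn
print(sequential_out == fused_out)  # True: the split preserves the outputs
```

Because each expert now holds its own weight, tooling that works module by module can treat every expert as a separate layer.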

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a new file, src/llmcompressor/modeling/gpt_oss.py, to facilitate the quantization of GPT-OSS models by patching their Mixture-of-Experts (MoE) layers into a more quantization-friendly sequential format. The overall implementation is solid, but I've identified a few areas for improvement regarding code quality, correctness, and performance. My feedback focuses on removing unused imports, ensuring CUDA-specific code is handled safely, replacing magic numbers with constants, and optimizing the MoE forward pass for better efficiency. Please review the detailed comments for specific suggestions.

gemini-code-assist · 2025-12-03T13:25:19Z

src/llmcompressor/modeling/gpt_oss.py

+import os
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+from llmcompressor import oneshot
+from llmcompressor.utils.dev import skip_weights_initialize
+from llmcompressor.modifiers.quantization import QuantizationModifier


There are several unused imports in this file: os, AutoModelForCausalLM, AutoTokenizer, oneshot, and QuantizationModifier. These should be removed to keep the code clean and maintainable.

Suggested change

-import os
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-from llmcompressor import oneshot
-from llmcompressor.utils.dev import skip_weights_initialize
-from llmcompressor.modifiers.quantization import QuantizationModifier
+from llmcompressor.utils.dev import skip_weights_initialize

github-actions · 2025-12-03T13:38:24Z

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

cfRod · 2025-12-05T09:51:27Z

src/llmcompressor/modeling/gpt_oss.py

+        del m
+    if to_delete:
+        gc.collect()
+        try:


Is this script expected to be used on GPUs? Has it been tested on GPU?
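For context on the question above, a cleanup step that behaves the same on CPU-only and GPU hosts might look like the sketch below. The `release_memory` helper is hypothetical and is not the PR's code. It only assumes that PyTorch may or may not be installed and that a CUDA device may or may not be present.

```python
import gc
import importlib.util


def release_memory() -> bool:
    """Run the garbage collector and, if CUDA is usable, free cached GPU blocks.

    Returns True only when the CUDA cache was actually emptied.
    """
    gc.collect()
    # Skip the GPU step entirely when PyTorch is not installed.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    # torch.cuda.is_available() is False on CPU-only builds and hosts.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False


print(release_memory())
```

Checking `torch.cuda.is_available()` explicitly, rather than wrapping the call in a broad `try`, keeps the CPU path free of swallowed exceptions.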

cfRod · 2025-12-05T10:09:54Z

src/llmcompressor/modeling/gpt_oss.py

+
+            dtype = gup.dtype
+            parent, child_name = _get_parent_and_child(model, name)
+            top_k = int(max(1, min(_get_top_k(model.config) or 1, E)))


Why can't top_k just be defined as in the original definition? https://github.com/huggingface/transformers/blob/390dca67e554b2b8f131064d4b6d991bf3ab3105/src/transformers/models/gpt_oss/modeling_gpt_oss.py#L154
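To make the question concrete: the expression in the diff clamps the configured value into the range [1, E] and falls back to 1 when the lookup returns nothing. Below is a minimal sketch, assuming `_get_top_k` returns an integer or `None` (its actual behavior is not shown in this thread). The name `clamp_top_k` is hypothetical.

```python
from typing import Optional


def clamp_top_k(configured: Optional[int], num_experts: int) -> int:
    # Mirrors int(max(1, min(configured or 1, num_experts))) from the diff.
    return int(max(1, min(configured or 1, num_experts)))


print(clamp_top_k(4, 32))     # 4: a valid setting passes through unchanged
print(clamp_top_k(None, 32))  # 1: a missing value silently becomes 1
print(clamp_top_k(64, 32))    # 32: an oversized value is capped at E
print(clamp_top_k(0, 32))     # 1: zero is falsy, so it also becomes 1
```

Reading the value straight from the model config would surface a missing or invalid setting as an error instead of masking it.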

cfRod · 2025-12-05T10:15:51Z

src/llmcompressor/modeling/gpt_oss.py

+        x = hidden_states.reshape(-1, H)
+
+        # Use the original router (it returns scores and indices already softmaxed over top-k)
+        router_scores, router_indices = self.router(x)   # scores: [tokens, E], indices: [tokens, k]


The original definition of the router returns router logits: https://github.com/huggingface/transformers/blob/390dca67e554b2b8f131064d4b6d991bf3ab3105/src/transformers/models/gpt_oss/modeling_gpt_oss.py#L177

Is this right?
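The code comment in the diff describes scores of shape [tokens, E] that are already softmaxed over the top-k experts. For a single token, that contract can be sketched in plain Python as follows. This illustrates top-k-then-softmax routing under that stated assumption. It is not a copy of the PR or of the transformers implementation, and `route_top_k` is a hypothetical name.

```python
import math
from typing import List, Tuple


def route_top_k(logits: List[float], top_k: int) -> Tuple[List[float], List[int]]:
    """Return (scores, indices) for one token.

    scores has one entry per expert and is zero outside the selected experts;
    the selected entries are a softmax over the top-k logits only.
    """
    indices = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
    peak = max(logits[e] for e in indices)  # subtract the max for numerical stability
    exps = [math.exp(logits[e] - peak) for e in indices]
    total = sum(exps)
    scores = [0.0] * len(logits)
    for e, w in zip(indices, exps):
        scores[e] = w / total
    return scores, indices


scores, indices = route_top_k([0.0, 2.0, 1.0, -1.0], top_k=2)
print(indices)                    # [1, 2]
print(abs(sum(scores) - 1.0) < 1e-9)  # True: selected weights sum to one
```

If the router returned raw logits instead, the caller would have to apply the top-k selection and softmax itself before weighting the expert outputs.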

gemini-code-assist bot reviewed Dec 3, 2025

Shubhra Pandit and others added 9 commits December 3, 2025 13:54

Add file to linearize and quantize the gpt-oss models

0d4f339

Update src/llmcompressor/modeling/gpt_oss.py

3174bd9

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update src/llmcompressor/modeling/gpt_oss.py

0706881

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update src/llmcompressor/modeling/gpt_oss.py

cb19b08

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Remove hardcoded paths

6bd6f6a

Remove dataset loading and processing

140a3dc

Address review comments

97026e6

Address review comments

a30a86a

Remove gpt_oss test code and add in examples

201d06c

- Add gpt_oss_20b_example.py which does the convert and quantization
- Clean up the gpt_oss.py from the test code

Signed-off-by: Sharif Inamdar <[email protected]>

isharif168 force-pushed the pr-1831-plus-fixes branch from 91089fe to 201d06c · December 4, 2025 15:20

isharif168 changed the title ~~Remove test code from PR 1831 and add separate example for w4a8~~ Linearize gpt_oss model and add separate example to quantize it to w4a8 Dec 4, 2025

Merge branch 'main' into pr-1831-plus-fixes

391fc92

isharif168 mentioned this pull request Dec 4, 2025

Support compressed-tensors W4A8 MoE checkpoints in GptOssModel weight loader for CPU vllm-project/vllm#29315

Open

5 tasks

cfRod reviewed Dec 5, 2025
