
Conversation

@isharif168 commented Dec 3, 2025

SUMMARY:
Add a test that converts the gpt_oss model and then quantizes it to w8a8.
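
For illustration, a minimal sketch of the intended convert-then-quantize flow (not the actual example script in this PR; the model ID, save path, recipe details, and the exact signature of convert_model_for_quantization_gptoss are assumptions):

from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modeling.gpt_oss import convert_model_for_quantization_gptoss

MODEL_ID = "openai/gpt-oss-20b"  # assumed checkpoint name
SAVE_DIR = "gpt-oss-20b-W8A8"    # assumed output directory

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Linearize the fused MoE experts into per-expert modules before quantizing.
convert_model_for_quantization_gptoss(model)  # assumed to patch the model in place

# Data-free W8A8 recipe; the real example may use a different scheme or ignore list.
recipe = QuantizationModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)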

TEST PLAN:
Tested with the gpt_oss model and ran the result with vLLM.
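
A quick way to smoke-test the saved checkpoint with vLLM might look like the sketch below (the path and sampling settings are placeholders, not the exact command that produced the output that follows):

from vllm import LLM, SamplingParams

llm = LLM(model="gpt-oss-20b-W8A8")  # assumed path of the quantized checkpoint
params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Give 3 reasons to use AI."], params)
print(outputs[0].outputs[0].text)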

Sample output

=== Prompt 0 ===
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-12-04

Reasoning: medium

Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions

You are a helpful assistant.

<|end|><|start|>user<|message|>Give 3 reasons to use AI.<|end|><|start|>assistant

--- Candidate 0 ---
analysisWe need to answer with three reasons to use AI. Should be short. Provide 3 reasons. Let's do it.assistantcommentaryHere are three reasons to use AI:

  1. Personalized learning – AI can adapt to your learning style and pace, providing customized lessons that suit your learning style and progress.
  2. Skill development – AI can help you develop skills that are in demand, such as coding, data analysis, and data science.
  3. Supportive design

We should provide a short answer.assistantfinalHere are three reasons to use AI:

  1. Personalized learning – AI adapts to your learning style and pace, giving you personalized lessons that fit your needs.
  2. Skill development – AI helps you learn and practice skills that are in demand, like coding, data analysis, and data‑science skills.
  3. Real‑world impact – AI helps you practice real-world skills in real life, such as learning new skills and applying them in real life.
    finish_reason: stop
    num_tokens: 226

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @isharif168, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces new modeling utilities to enable the quantization of GPT-OSS Mixture-of-Experts (MoE) models. It provides a mechanism to transform the MoE architecture from a fused-expert design to a sequential, per-expert structure, which is a crucial step for applying certain quantization techniques. The changes aim to improve compatibility and efficiency when compressing these advanced language models.

Highlights

  • New GPT-OSS Model Utilities: A new Python file, src/llmcompressor/modeling/gpt_oss.py, has been added to introduce utilities specifically for handling GPT-OSS Mixture-of-Experts (MoE) models.
  • MoE Layer Conversion for Quantization: The convert_model_for_quantization_gptoss function is introduced, designed to refactor fused-expert MoE modules into a sequential, per-expert format, which is often necessary for quantization processes.
  • Sequential Expert Implementation: The SequentialGPTOSSMoE class replaces the original fused MoE layer, creating individual GPTOSSMLP modules for each expert and carefully transferring weights from the original fused tensors.
  • Individual Expert MLP Definition: The GPTOSSMLP class defines the structure and forward pass for a single expert's Multi-Layer Perceptron, including gate, up, and down projection layers with activation functions (a rough sketch of these pieces follows this list)
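
A rough sketch of what this per-expert structure could look like is shown below. It assumes config fields like num_local_experts, a router attribute on the original MoE block, and a plain SiLU gate for illustration, and it omits the fused-weight slicing details, so it is not the exact code added in this PR:

import torch
from torch import nn

class GPTOSSMLP(nn.Module):
    """One expert's MLP with separate gate, up, and down projections."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size)
        self.up_proj = nn.Linear(hidden_size, intermediate_size)
        self.down_proj = nn.Linear(intermediate_size, hidden_size)
        self.act_fn = nn.SiLU()  # simplified; GPT-OSS may use a different gated activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))

class SequentialGPTOSSMoE(nn.Module):
    """Replaces a fused MoE block with one GPTOSSMLP per expert, reusing the router."""
    def __init__(self, config, original_moe):
        super().__init__()
        self.router = original_moe.router  # assumed attribute name on the original block
        self.experts = nn.ModuleList(
            GPTOSSMLP(config.hidden_size, config.intermediate_size)  # assumed config fields
            for _ in range(config.num_local_experts)
        )
        # Weight transfer would go here: slice each expert's slab out of the fused
        # gate_up_proj / down_proj tensors and copy it into the per-expert Linears.
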
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a new file, src/llmcompressor/modeling/gpt_oss.py, to facilitate the quantization of GPT-OSS models by patching their Mixture-of-Experts (MoE) layers into a more quantization-friendly sequential format. The overall implementation is solid, but I've identified a few areas for improvement regarding code quality, correctness, and performance. My feedback focuses on removing unused imports, ensuring CUDA-specific code is handled safely, replacing magic numbers with constants, and optimizing the MoE forward pass for better efficiency. Please review the detailed comments for specific suggestions.

Comment on lines 4 to 9
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.utils.dev import skip_weights_initialize
from llmcompressor.modifiers.quantization import QuantizationModifier

Severity: medium

There are several unused imports in this file: os, AutoModelForCausalLM, AutoTokenizer, oneshot, and QuantizationModifier. These should be removed to keep the code clean and maintainable.

Suggested change

-import os
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-from llmcompressor import oneshot
 from llmcompressor.utils.dev import skip_weights_initialize
-from llmcompressor.modifiers.quantization import QuantizationModifier

@github-actions bot commented Dec 3, 2025

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

Shubhra Pandit and others added 9 commits December 3, 2025 13:54
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Add gpt_oss_20b_example.py, which does the conversion and quantization
- Clean up gpt_oss.py by removing the test code

Signed-off-by: Sharif Inamdar <[email protected]>
@isharif168 changed the title from "Remove test code from PR 1831 and add separate example for w4a8" to "Linearize gpt_oss model and add separate example to quantize it to w4a8" on Dec 4, 2025
del m
if to_delete:
    gc.collect()
    try:

Is this script expected to be used on GPUs? Has it been tested on GPU?
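
One way to keep this cleanup path safe on CPU-only machines (a sketch of a hypothetical helper, not the code currently in the script) is to guard the CUDA call:

import gc

import torch

def free_module(module) -> None:
    """Hypothetical helper: drop a module and release GPU cache only if CUDA is present."""
    del module
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()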


dtype = gup.dtype
parent, child_name = _get_parent_and_child(model, name)
top_k = int(max(1, min(_get_top_k(model.config) or 1, E)))

x = hidden_states.reshape(-1, H)

# Use the original router (it returns scores and indices already softmaxed over top-k)
router_scores, router_indices = self.router(x) # scores: [tokens, E], indices: [tokens, k]
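
For context, a per-expert dispatch over these router outputs might look roughly like the sketch below, which only runs each expert on the tokens routed to it (assuming router_scores is [tokens, E] and router_indices is [tokens, k]; the actual forward pass in this PR may differ):

out = torch.zeros_like(x)
for e, expert in enumerate(self.experts):
    # Tokens whose top-k selection includes expert e.
    token_idx = (router_indices == e).any(dim=-1).nonzero(as_tuple=True)[0]
    if token_idx.numel() == 0:
        continue
    # Weight each expert's output by its (already softmaxed) router score.
    out[token_idx] += router_scores[token_idx, e].unsqueeze(-1) * expert(x[token_idx])
out = out.reshape(hidden_states.shape)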
