Skip to content

[DSV4] DeepSeekV4 Pro#2858

Open
kylesayrs wants to merge 3 commits into
mainfrom
kylesayrs/dsv4-pro
Open

[DSV4] DeepSeekV4 Pro#2858
kylesayrs wants to merge 3 commits into
mainfrom
kylesayrs/dsv4-pro

Conversation

@kylesayrs

Copy link
Copy Markdown
Collaborator

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@github-actions

Copy link
Copy Markdown

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 01473098-e511-40f1-94e0-5fea8ef00238

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

Adds a new standalone example script deepseek_v4_pro_example.py for post-training quantization of DeepSeek-V4-Pro. The script loads the model with auto-offloading, prepares calibration data from HuggingFaceH4/ultrachat_200k, and applies a mixed quantization recipe (NVFP4 for MLP/expert projections, FP8_BLOCK for attention projections) via oneshot() in sequential mode.

Changes

DeepSeek V4 Pro PTQ Example

Layer / File(s) Summary
End-to-end quantization script
examples/quantizing_moe/deepseek_v4_pro_example.py
Loads DeepSeek-V4-Pro with device_map="auto_offload" and bfloat16 norm override, builds a calibration split from ultrachat_200k with manual BOS/EOS message templating, defines a QuantizationModifier with NVFP4 for expert/MLP targets and FP8_BLOCK for attention projection targets, runs oneshot() with pipeline="sequential" over DeepseekV4DecoderLayer, and saves compressed model and tokenizer to a derived output path.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • vllm-project/llm-compressor#2638: Adds a comparable example script using the same QuantizationModifier + NVFP4 pattern, ultrachat_200k calibration, and llmcompressor.oneshot() save flow for a different model (Qwen3.6).
  • vllm-project/llm-compressor#2832: Adds another quantization example script sharing the same structural pattern: regex-targeted layers, ultrachat_200k calibration dataset, oneshot() invocation, and model/tokenizer saving.

Suggested labels

enhancement, deepseek, fp8, nvfp4, moe

Suggested reviewers

  • brian-dellabetta
  • HDCharles
🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (1 warning, 2 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title '[DSV4] DeepSeekV4 Pro' is vague and lacks specificity about what was actually added or changed in the changeset. Use a more descriptive title that explains the specific change, such as 'Add DeepSeekV4 Pro quantization example' or 'Add post-training quantization example for DeepSeek-V4-Pro'.
Description check ❓ Inconclusive The description lists prerequisites and dependencies but does not clearly describe what changes are included in this pull request or their purpose. Add a clear description of what this PR implements, such as the quantization example script being added and its key features.
✅ Passed checks (2 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch kylesayrs/dsv4-pro

Warning

Review ran into problems

🔥 Problems

Linked repositories: Your configuration references 1 linked repositories, but your current plan allows 0. Analyzed ``, skipped vllm-project/compressed-tensors.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@mergify mergify Bot added the documentation Improvements or additions to documentation label Jun 23, 2026
@coderabbitai coderabbitai Bot added deepseek enhancement New feature or request fp8 For any issue / PR related to FP8 support moe nvfp4 For any PR / issue related to NVFP4 support labels Jun 23, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new example script, deepseek_v4_pro_example.py, for quantizing the DeepSeek-V4 model using llmcompressor with FP8 and NVFP4 schemes. Feedback on the changes highlights several critical improvements to make the script reusable and functional: avoiding hardcoded absolute local paths for the model and offload directories, loading the tokenizer dynamically from the model ID, removing commented-out debug code, and uncommenting the recipe parameter in the oneshot function so that the quantization scheme is actually applied.

Comment on lines +22 to +24
# MODEL_ID = "RedHatAI/DeepSeek-V4-Flash-BF16"
MODEL_ID = "/mnt/nvme-data/engine/kylesayrs/DeepSeek-V4-Pro-BF16"
# MODEL_ID = "RedHatAI/DeepSeek-V4-Flash-BF16"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Avoid hardcoding absolute local paths specific to a user environment and clean up duplicate commented-out lines. Consider using a Hugging Face model ID or a configurable environment variable/parameter.

Suggested change
# MODEL_ID = "RedHatAI/DeepSeek-V4-Flash-BF16"
MODEL_ID = "/mnt/nvme-data/engine/kylesayrs/DeepSeek-V4-Pro-BF16"
# MODEL_ID = "RedHatAI/DeepSeek-V4-Flash-BF16"
MODEL_ID = "deepseek-ai/DeepSeek-V4" # Replace with your model ID or local path

MODEL_ID,
device_map="auto_offload",
max_memory={},
offload_folder="/mnt/nvme-data/engine/kylesayrs/offload_folder",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Avoid hardcoding absolute local paths for the offload folder. Use a relative path or a temporary directory instead.

Suggested change
offload_folder="/mnt/nvme-data/engine/kylesayrs/offload_folder",
offload_folder="./offload_folder",

model=model,
processor=tokenizer,
dataset=ds,
# recipe=recipe,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The recipe argument is commented out, which means the quantization scheme defined in recipe will not be applied during the oneshot calibration. Please uncomment this line to ensure the model is quantized.

Suggested change
# recipe=recipe,
recipe=recipe,

Comment on lines +34 to +38
# kluge for the way I saved the decompressed checkpoint
# mds = model.model.layers[-1].self_attn.wq_a._hf_hook.weights_map.dataset.index
# mds["model.hc_head.base"] = mds['model.hc_head.hc_base']
# mds["model.hc_head.fn"] = mds['model.hc_head.hc_fn']
# mds["model.hc_head.scale"] = mds['model.hc_head.hc_scale']

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Remove commented-out debug code / temporary workarounds to keep the example clean and maintainable.

# mds["model.hc_head.fn"] = mds['model.hc_head.hc_fn']
# mds["model.hc_head.scale"] = mds['model.hc_head.hc_scale']

tokenizer = AutoTokenizer.from_pretrained("RedHatAI/DeepSeek-V4-Flash-BF16")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Load the tokenizer from MODEL_ID instead of a hardcoded string to ensure consistency if the model ID is changed.

Suggested change
tokenizer = AutoTokenizer.from_pretrained("RedHatAI/DeepSeek-V4-Flash-BF16")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (3)
examples/quantizing_moe/deepseek_v4_pro_example.py (3)

81-88: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Consider adding a comment to explain add_special_tokens=False.

For educational clarity, it would help users to understand why add_special_tokens=False is used here (since special tokens like BOS/EOS were already added in the preprocess function).

📝 Suggested improvement
 def tokenize(sample):
     return tokenizer(
         sample["text"],
         padding=False,
         max_length=MAX_SEQUENCE_LENGTH,
         truncation=True,
-        add_special_tokens=False,
+        add_special_tokens=False,  # Special tokens already added in preprocess()
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/quantizing_moe/deepseek_v4_pro_example.py` around lines 81 - 88, In
the tokenize function, add a comment above or next to the
add_special_tokens=False parameter to explain why special tokens are not being
added here. The comment should clarify that special tokens like BOS/EOS were
already added during preprocessing in the preprocess function, so they should
not be added again during tokenization to avoid duplication.

Source: Path instructions


30-30: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Consider removing the empty max_memory parameter or add explanation.

The max_memory={} parameter is empty and appears to have no effect. For educational clarity, either:

  • Remove it if not needed
  • Provide an example value with a comment explaining how users can limit per-device memory allocation

Example with explanation:

max_memory={0: "40GB", 1: "40GB", "cpu": "100GB"},  # Optional: limit per-device memory
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/quantizing_moe/deepseek_v4_pro_example.py` at line 30, The
`max_memory={}` parameter being passed in the function call is empty and has no
practical effect. Either remove this parameter entirely since an empty
dictionary serves no purpose, or replace it with a meaningful example that
demonstrates how users can limit per-device memory allocation (such as
specifying device IDs and memory limits like {0: "40GB", 1: "40GB", "cpu":
"100GB"}) and add a comment explaining the parameter's purpose for educational
clarity.

Source: Path instructions


104-121: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Consider minor clarifications to the recipe configuration.

The quantization recipe is well-structured, but could benefit from small improvements:

  1. Line 109: The target r"re:.*attn\.compressor\.indexer\.q_b_proj$" is not mentioned in the comments above (lines 96-102). Adding it to the comments would help users understand the full scope.

  2. Line 120: The explicit ignore=[] parameter is unnecessary when empty and can be removed for cleaner code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/quantizing_moe/deepseek_v4_pro_example.py` around lines 104 - 121,
Add a clarifying comment above the QuantizationModifier to document the
compressor indexer target pattern `r"re:.*attn\.compressor\.indexer\.q_b_proj$"`
that appears in the attention config_groups, explaining its purpose alongside
the existing attention projection targets. Additionally, remove the empty
`ignore=[]` parameter from the QuantizationModifier initialization as it is
unnecessary when not explicitly needed.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/quantizing_moe/deepseek_v4_pro_example.py`:
- Around line 16-19: The comment for the
DeepseekV4PreTrainedModel._keep_in_fp32_modules_strict workaround lacks
sufficient context for educational purposes. Enhance the existing comment to
include a comprehensive explanation of the upstream bug being worked around,
link to relevant GitHub issues or pull requests that document this bug, and add
a TODO comment specifying the conditions or library version when this workaround
can be removed. This will help users understand why the workaround is necessary
and when it becomes obsolete.
- Around line 34-38: Remove the commented-out "kluge" workaround code block and
its explanatory comment from the example file. Delete the entire section
starting with the comment "# kluge for the way I saved the decompressed
checkpoint" and all subsequent commented-out lines that follow it which
manipulate the mds weights_map dataset index. This developer-specific workaround
code is not relevant to users learning from the example and should be removed to
keep the example clean and production-ready.
- Around line 126-139: The recipe parameter in the oneshot() function call is
commented out, preventing the quantization recipe from being applied during the
calibration process. Uncomment the recipe=recipe, parameter in the oneshot()
function call to ensure the quantization recipe defined earlier is actually
used. Additionally, improve the comments above and around the
propagate_error=False parameter to clearly explain what the workaround addresses
and when users might encounter similar issues with transformers cache behavior.
- Around line 21-32: Replace the hardcoded local file system paths with publicly
accessible alternatives in the model loading section. Change the MODEL_ID
variable from the local path to use the publicly available Hugging Face model
identifier that is already commented out, and replace the hardcoded
offload_folder path in the AutoModelForCausalLM.from_pretrained call with either
a generic temporary directory path or make it configurable through a variable
that users can easily modify. This will make the example script runnable for end
users without requiring access to specific developer machine paths.
- Around line 59-75: In the preprocess function, add a comment above or within
the assistant role handling block (the elif role == "assistant": section) to
explain that the </think> token is part of DeepSeek-V4's chain-of-thought
reasoning format for educational clarity. Additionally, add error handling to
the function to validate that each message in example["messages"] contains both
"role" and "content" fields before accessing them, either by wrapping the
message processing loop in a try-except block or by adding explicit validation
checks to handle missing or malformed message fields gracefully.
- Line 40: The AutoTokenizer.from_pretrained call hardcodes a model path
"RedHatAI/DeepSeek-V4-Flash-BF16" that differs from the model being used in the
example. Either replace the hardcoded string with the MODEL_ID variable to
ensure consistency throughout the example, or if the Flash and Pro model
tokenizers are intentionally the same, add a clear comment explaining why
different model identifiers are used for the model and tokenizer respectively.

---

Nitpick comments:
In `@examples/quantizing_moe/deepseek_v4_pro_example.py`:
- Around line 81-88: In the tokenize function, add a comment above or next to
the add_special_tokens=False parameter to explain why special tokens are not
being added here. The comment should clarify that special tokens like BOS/EOS
were already added during preprocessing in the preprocess function, so they
should not be added again during tokenization to avoid duplication.
- Line 30: The `max_memory={}` parameter being passed in the function call is
empty and has no practical effect. Either remove this parameter entirely since
an empty dictionary serves no purpose, or replace it with a meaningful example
that demonstrates how users can limit per-device memory allocation (such as
specifying device IDs and memory limits like {0: "40GB", 1: "40GB", "cpu":
"100GB"}) and add a comment explaining the parameter's purpose for educational
clarity.
- Around line 104-121: Add a clarifying comment above the QuantizationModifier
to document the compressor indexer target pattern
`r"re:.*attn\.compressor\.indexer\.q_b_proj$"` that appears in the attention
config_groups, explaining its purpose alongside the existing attention
projection targets. Additionally, remove the empty `ignore=[]` parameter from
the QuantizationModifier initialization as it is unnecessary when not explicitly
needed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1f787e13-b61c-4a72-bac3-e71a8952c37c

📥 Commits

Reviewing files that changed from the base of the PR and between 5bebc2b and 16f2df5.

📒 Files selected for processing (1)
  • examples/quantizing_moe/deepseek_v4_pro_example.py

Comment on lines +16 to +19
# Upstream BUG: norms should be loaded in float32, but usually aren't due to the base
# model having a quant_config which overrides this. Loading in float32 actually
# breaks the model definition (it expects bfloat16). Let's force load in bfloat16.
DeepseekV4PreTrainedModel._keep_in_fp32_modules_strict = set()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚖️ Poor tradeoff

Add detailed explanation or remove the fragile workaround.

Modifying a private class attribute (_keep_in_fp32_modules_strict) in an example script is problematic for educational purposes. Users learning from this example won't understand:

  • Why this workaround is necessary
  • What "upstream BUG" it addresses
  • When they need to apply it
  • Whether it will work with future library versions

Either provide a comprehensive explanation with links to relevant issues/PRs, or if this is truly a temporary workaround for incomplete dependencies (as noted in PR objectives), add a clear TODO comment explaining when this can be removed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/quantizing_moe/deepseek_v4_pro_example.py` around lines 16 - 19, The
comment for the DeepseekV4PreTrainedModel._keep_in_fp32_modules_strict
workaround lacks sufficient context for educational purposes. Enhance the
existing comment to include a comprehensive explanation of the upstream bug
being worked around, link to relevant GitHub issues or pull requests that
document this bug, and add a TODO comment specifying the conditions or library
version when this workaround can be removed. This will help users understand why
the workaround is necessary and when it becomes obsolete.

Source: Path instructions

Comment on lines +21 to +32
# Select model and load it.
# MODEL_ID = "RedHatAI/DeepSeek-V4-Flash-BF16"
MODEL_ID = "/mnt/nvme-data/engine/kylesayrs/DeepSeek-V4-Pro-BF16"
# MODEL_ID = "RedHatAI/DeepSeek-V4-Flash-BF16"

with load_context():
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
device_map="auto_offload",
max_memory={},
offload_folder="/mnt/nvme-data/engine/kylesayrs/offload_folder",
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win

Replace hardcoded local paths with publicly accessible model IDs.

The example uses developer-specific local paths that make it non-runnable for users:

  • Line 23: /mnt/nvme-data/engine/kylesayrs/DeepSeek-V4-Pro-BF16
  • Line 31: /mnt/nvme-data/engine/kylesayrs/offload_folder

For an example script, these should use publicly available Hugging Face model IDs (like the commented-out "RedHatAI/DeepSeek-V4-Flash-BF16") and a generic offload folder path or make it configurable.

📝 Suggested fix
-# Select model and load it.
-# MODEL_ID = "RedHatAI/DeepSeek-V4-Flash-BF16"
-MODEL_ID = "/mnt/nvme-data/engine/kylesayrs/DeepSeek-V4-Pro-BF16"
-# MODEL_ID = "RedHatAI/DeepSeek-V4-Flash-BF16"
+# Select model and load it. You can also use local paths if you have the model downloaded.
+MODEL_ID = "RedHatAI/DeepSeek-V4-Flash-BF16"  # or "deepseek-ai/DeepSeek-V4-Pro"
+OFFLOAD_FOLDER = "./offload_folder"  # Directory for offloading model layers
 
 with load_context():
     model = AutoModelForCausalLM.from_pretrained(
         MODEL_ID,
         device_map="auto_offload",
         max_memory={},
-        offload_folder="/mnt/nvme-data/engine/kylesayrs/offload_folder",
+        offload_folder=OFFLOAD_FOLDER,
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/quantizing_moe/deepseek_v4_pro_example.py` around lines 21 - 32,
Replace the hardcoded local file system paths with publicly accessible
alternatives in the model loading section. Change the MODEL_ID variable from the
local path to use the publicly available Hugging Face model identifier that is
already commented out, and replace the hardcoded offload_folder path in the
AutoModelForCausalLM.from_pretrained call with either a generic temporary
directory path or make it configurable through a variable that users can easily
modify. This will make the example script runnable for end users without
requiring access to specific developer machine paths.

Source: Path instructions

Comment on lines +34 to +38
# kluge for the way I saved the decompressed checkpoint
# mds = model.model.layers[-1].self_attn.wq_a._hf_hook.weights_map.dataset.index
# mds["model.hc_head.base"] = mds['model.hc_head.hc_base']
# mds["model.hc_head.fn"] = mds['model.hc_head.hc_fn']
# mds["model.hc_head.scale"] = mds['model.hc_head.hc_scale']

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Remove commented-out "kluge" code.

These commented lines describe a developer-specific workaround ("kluge") that has no educational value for users learning the library. Example scripts should be clean and production-ready to demonstrate best practices.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/quantizing_moe/deepseek_v4_pro_example.py` around lines 34 - 38,
Remove the commented-out "kluge" workaround code block and its explanatory
comment from the example file. Delete the entire section starting with the
comment "# kluge for the way I saved the decompressed checkpoint" and all
subsequent commented-out lines that follow it which manipulate the mds
weights_map dataset index. This developer-specific workaround code is not
relevant to users learning from the example and should be removed to keep the
example clean and production-ready.

Source: Path instructions

# mds["model.hc_head.fn"] = mds['model.hc_head.hc_fn']
# mds["model.hc_head.scale"] = mds['model.hc_head.hc_scale']

tokenizer = AutoTokenizer.from_pretrained("RedHatAI/DeepSeek-V4-Flash-BF16")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Use consistent model ID for tokenizer or explain the mismatch.

The tokenizer is loaded from "RedHatAI/DeepSeek-V4-Flash-BF16" while the model uses a different path. For clarity and correctness in an example:

  • If the tokenizers are identical, use the same MODEL_ID variable
  • If they differ intentionally, add a comment explaining why (e.g., if Flash and Pro models share the same tokenizer)
🔧 Suggested fix
-tokenizer = AutoTokenizer.from_pretrained("RedHatAI/DeepSeek-V4-Flash-BF16")
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
tokenizer = AutoTokenizer.from_pretrained("RedHatAI/DeepSeek-V4-Flash-BF16")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/quantizing_moe/deepseek_v4_pro_example.py` at line 40, The
AutoTokenizer.from_pretrained call hardcodes a model path
"RedHatAI/DeepSeek-V4-Flash-BF16" that differs from the model being used in the
example. Either replace the hardcoded string with the MODEL_ID variable to
ensure consistency throughout the example, or if the Flash and Pro model
tokenizers are intentionally the same, add a clear comment explaining why
different model identifiers are used for the model and tokenizer respectively.

Source: Path instructions

Comment on lines +59 to +75
def preprocess(example):
# DeepSeek-V4 does not have a traditional chat template.
# Encode manually per https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/tree/main/encoding
BOS = "<|begin▁of▁sentence|>"
EOS = "<|end▁of▁sentence|>"
text = BOS
for message in example["messages"]:
role = message["role"]
content = message["content"]
if role == "system":
text += content
elif role == "user":
text += f"<|User|>{content}"
elif role == "assistant":
text += f"<|Assistant|></think>{content}{EOS}"

return {"text": text}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Add explanation for the </think> token and consider error handling.

The preprocessing function has a few educational clarity issues:

  1. Unexplained token: Line 73 includes </think> without explanation. Users should understand this is part of DeepSeek-V4's chain-of-thought format.

  2. No error handling: The function assumes all messages have "role" and "content" fields. Consider adding validation or a try-except for robustness.

  3. Ambiguous characters: The fullwidth vertical lines (|) in special tokens are correct for DeepSeek-V4's format and can be safely ignored from the Ruff warnings.

📚 Suggested improvements
 def preprocess(example):
-    # DeepSeek-V4 does not have a traditional chat template.
-    # Encode manually per https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/tree/main/encoding
+    """
+    DeepSeek-V4 does not have a traditional chat template.
+    Encode manually per https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/tree/main/encoding
+    
+    Note: DeepSeek-V4 uses chain-of-thought reasoning. The </think> token marks
+    the end of the model's internal reasoning before providing the final answer.
+    """
     BOS = "<|begin▁of▁sentence|>"
     EOS = "<|end▁of▁sentence|>"
     text = BOS
     for message in example["messages"]:
         role = message["role"]
         content = message["content"]
         if role == "system":
             text += content
         elif role == "user":
             text += f"<|User|>{content}"
         elif role == "assistant":
             text += f"<|Assistant|></think>{content}{EOS}"
+        else:
+            # Skip unknown roles
+            continue
 
     return {"text": text}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def preprocess(example):
# DeepSeek-V4 does not have a traditional chat template.
# Encode manually per https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/tree/main/encoding
BOS = "<|begin▁of▁sentence|>"
EOS = "<|end▁of▁sentence|>"
text = BOS
for message in example["messages"]:
role = message["role"]
content = message["content"]
if role == "system":
text += content
elif role == "user":
text += f"<|User|>{content}"
elif role == "assistant":
text += f"<|Assistant|></think>{content}{EOS}"
return {"text": text}
def preprocess(example):
"""
DeepSeek-V4 does not have a traditional chat template.
Encode manually per https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/tree/main/encoding
Note: DeepSeek-V4 uses chain-of-thought reasoning. The </think> token marks
the end of the model's internal reasoning before providing the final answer.
"""
BOS = "<|begin▁of▁sentence|>"
EOS = "<|end▁of▁sentence|>"
text = BOS
for message in example["messages"]:
role = message["role"]
content = message["content"]
if role == "system":
text += content
elif role == "user":
text += f"<|User|>{content}"
elif role == "assistant":
text += f"<|Assistant|></think>{content}{EOS}"
else:
# Skip unknown roles
continue
return {"text": text}
🧰 Tools
🪛 Ruff (0.15.18)

[warning] 62-62: String contains ambiguous (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?

(RUF001)


[warning] 62-62: String contains ambiguous (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?

(RUF001)


[warning] 63-63: String contains ambiguous (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?

(RUF001)


[warning] 63-63: String contains ambiguous (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?

(RUF001)


[warning] 71-71: String contains ambiguous (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?

(RUF001)


[warning] 71-71: String contains ambiguous (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?

(RUF001)


[warning] 73-73: String contains ambiguous (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?

(RUF001)


[warning] 73-73: String contains ambiguous (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?

(RUF001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/quantizing_moe/deepseek_v4_pro_example.py` around lines 59 - 75, In
the preprocess function, add a comment above or within the assistant role
handling block (the elif role == "assistant": section) to explain that the
</think> token is part of DeepSeek-V4's chain-of-thought reasoning format for
educational clarity. Additionally, add error handling to the function to
validate that each message in example["messages"] contains both "role" and
"content" fields before accessing them, either by wrapping the message
processing loop in a try-except block or by adding explicit validation checks to
handle missing or malformed message fields gracefully.

Sources: Path instructions, Linters/SAST tools

Comment on lines +126 to +139
oneshot(
model=model,
processor=tokenizer,
dataset=ds,
# recipe=recipe,
pipeline="sequential",
max_seq_length=MAX_SEQUENCE_LENGTH,
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
sequential_targets=["DeepseekV4DecoderLayer"],
batch_size=1,
shuffle_calibration_samples=True,
propagate_error=False, # work around reliance on transformers cache
# something weird happens with the cache and propagation
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win

CRITICAL: The recipe parameter is commented out - quantization will not be applied.

Line 130 has # recipe=recipe, commented out, which means the quantization recipe defined in lines 104-121 is never used. Without this parameter, the oneshot() call will not apply any quantization, defeating the entire purpose of this example.

Additionally:

  • Line 137-138: The workaround comments for propagate_error=False are incomplete and don't help users understand when or why they might need this parameter.
🐛 Required fix
 oneshot(
     model=model,
     processor=tokenizer,
     dataset=ds,
-    # recipe=recipe,
+    recipe=recipe,
     pipeline="sequential",
     max_seq_length=MAX_SEQUENCE_LENGTH,
     num_calibration_samples=NUM_CALIBRATION_SAMPLES,
     sequential_targets=["DeepseekV4DecoderLayer"],
     batch_size=1,
     shuffle_calibration_samples=True,
-    propagate_error=False,  # work around reliance on transformers cache
-    # something weird happens with the cache and propagation
+    # Note: propagate_error=False works around an issue with transformers cache
+    # during sequential quantization of DeepSeek-V4 models
+    propagate_error=False,
 )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/quantizing_moe/deepseek_v4_pro_example.py` around lines 126 - 139,
The recipe parameter in the oneshot() function call is commented out, preventing
the quantization recipe from being applied during the calibration process.
Uncomment the recipe=recipe, parameter in the oneshot() function call to ensure
the quantization recipe defined earlier is actually used. Additionally, improve
the comments above and around the propagate_error=False parameter to clearly
explain what the workaround addresses and when users might encounter similar
issues with transformers cache behavior.

Source: Path instructions

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@mergify

mergify Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

The quality checks have failed. Please run make style and make quality under
the root directory to adddress the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

deepseek documentation Improvements or additions to documentation enhancement New feature or request fp8 For any issue / PR related to FP8 support moe nvfp4 For any PR / issue related to NVFP4 support quality-failed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant