[DSV4] DeepSeekV4 Pro by kylesayrs · Pull Request #2858 · vllm-project/llm-compressor

kylesayrs · 2026-06-23T17:16:11Z

Prerequisites

Transformers v5 #2647
[Tracing] Support tracing cache #2686
[Offloading] Support full disk offloading huggingface/accelerate#4083
[Offloading] Support full disk offloading huggingface/transformers#46749
[Offloading] [Bugfix] Fix disk offloading of models with explicit tensor dtypes huggingface/transformers#46849
Fix/understand increasing memory usage during calibration

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

github-actions · 2026-06-23T17:16:22Z

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

coderabbitai · 2026-06-23T17:16:28Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 01473098-e511-40f1-94e0-5fea8ef00238

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

Walkthrough

Adds a new standalone example script deepseek_v4_pro_example.py for post-training quantization of DeepSeek-V4-Pro. The script loads the model with auto-offloading, prepares calibration data from HuggingFaceH4/ultrachat_200k, and applies a mixed quantization recipe (NVFP4 for MLP/expert projections, FP8_BLOCK for attention projections) via oneshot() in sequential mode.

Changes

DeepSeek V4 Pro PTQ Example

Layer / File(s)	Summary
End-to-end quantization script `examples/quantizing_moe/deepseek_v4_pro_example.py`	Loads DeepSeek-V4-Pro with `device_map="auto_offload"` and bfloat16 norm override, builds a calibration split from `ultrachat_200k` with manual BOS/EOS message templating, defines a `QuantizationModifier` with NVFP4 for expert/MLP targets and FP8_BLOCK for attention projection targets, runs `oneshot()` with `pipeline="sequential"` over `DeepseekV4DecoderLayer`, and saves compressed model and tokenizer to a derived output path.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

vllm-project/llm-compressor#2638: Adds a comparable example script using the same QuantizationModifier + NVFP4 pattern, ultrachat_200k calibration, and llmcompressor.oneshot() save flow for a different model (Qwen3.6).
vllm-project/llm-compressor#2832: Adds another quantization example script sharing the same structural pattern: regex-targeted layers, ultrachat_200k calibration dataset, oneshot() invocation, and model/tokenizer saving.

Suggested labels

enhancement, deepseek, fp8, nvfp4, moe

Suggested reviewers

brian-dellabetta
HDCharles

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (1 warning, 2 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title '[DSV4] DeepSeekV4 Pro' is vague and lacks specificity about what was actually added or changed in the changeset.	Use a more descriptive title that explains the specific change, such as 'Add DeepSeekV4 Pro quantization example' or 'Add post-training quantization example for DeepSeek-V4-Pro'.
Description check	❓ Inconclusive	The description lists prerequisites and dependencies but does not clearly describe what changes are included in this pull request or their purpose.	Add a clear description of what this PR implements, such as the quantization example script being added and its key features.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch kylesayrs/dsv4-pro

Warning

Review ran into problems

🔥 Problems

Linked repositories: Your configuration references 1 linked repositories, but your current plan allows 0. Analyzed ``, skipped vllm-project/compressed-tensors.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

gemini-code-assist

Code Review

This pull request introduces a new example script, deepseek_v4_pro_example.py, for quantizing the DeepSeek-V4 model using llmcompressor with FP8 and NVFP4 schemes. Feedback on the changes highlights several critical improvements to make the script reusable and functional: avoiding hardcoded absolute local paths for the model and offload directories, loading the tokenizer dynamically from the model ID, removing commented-out debug code, and uncommenting the recipe parameter in the oneshot function so that the quantization scheme is actually applied.

gemini-code-assist · 2026-06-23T17:17:33Z

+# MODEL_ID = "RedHatAI/DeepSeek-V4-Flash-BF16"
+MODEL_ID = "/mnt/nvme-data/engine/kylesayrs/DeepSeek-V4-Pro-BF16"
+# MODEL_ID = "RedHatAI/DeepSeek-V4-Flash-BF16"


Avoid hardcoding absolute local paths specific to a user environment and clean up duplicate commented-out lines. Consider using a Hugging Face model ID or a configurable environment variable/parameter.

Suggested change

# MODEL_ID = "RedHatAI/DeepSeek-V4-Flash-BF16"

MODEL_ID = "/mnt/nvme-data/engine/kylesayrs/DeepSeek-V4-Pro-BF16"

# MODEL_ID = "RedHatAI/DeepSeek-V4-Flash-BF16"

MODEL_ID = "deepseek-ai/DeepSeek-V4" # Replace with your model ID or local path

gemini-code-assist · 2026-06-23T17:17:34Z

+        MODEL_ID,
+        device_map="auto_offload",
+        max_memory={},
+        offload_folder="/mnt/nvme-data/engine/kylesayrs/offload_folder",


Avoid hardcoding absolute local paths for the offload folder. Use a relative path or a temporary directory instead.

Suggested change

offload_folder="/mnt/nvme-data/engine/kylesayrs/offload_folder",

offload_folder="./offload_folder",

gemini-code-assist · 2026-06-23T17:17:34Z

+    model=model,
+    processor=tokenizer,
+    dataset=ds,
+    # recipe=recipe,


The recipe argument is commented out, which means the quantization scheme defined in recipe will not be applied during the oneshot calibration. Please uncomment this line to ensure the model is quantized.

Suggested change

# recipe=recipe,

recipe=recipe,

gemini-code-assist · 2026-06-23T17:17:34Z

+# kluge for the way I saved the decompressed checkpoint
+# mds = model.model.layers[-1].self_attn.wq_a._hf_hook.weights_map.dataset.index
+# mds["model.hc_head.base"] = mds['model.hc_head.hc_base']
+# mds["model.hc_head.fn"] = mds['model.hc_head.hc_fn']
+# mds["model.hc_head.scale"] = mds['model.hc_head.hc_scale']


Remove commented-out debug code / temporary workarounds to keep the example clean and maintainable.

gemini-code-assist · 2026-06-23T17:17:34Z

+# mds["model.hc_head.fn"] = mds['model.hc_head.hc_fn']
+# mds["model.hc_head.scale"] = mds['model.hc_head.hc_scale']
+
+tokenizer = AutoTokenizer.from_pretrained("RedHatAI/DeepSeek-V4-Flash-BF16")


Load the tokenizer from MODEL_ID instead of a hardcoded string to ensure consistency if the model ID is changed.

Suggested change

tokenizer = AutoTokenizer.from_pretrained("RedHatAI/DeepSeek-V4-Flash-BF16")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (3)

examples/quantizing_moe/deepseek_v4_pro_example.py (3)
81-88: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Consider adding a comment to explain add_special_tokens=False.

For educational clarity, it would help users to understand why add_special_tokens=False is used here (since special tokens like BOS/EOS were already added in the preprocess function).
📝 Suggested improvement
 def tokenize(sample):
     return tokenizer(
         sample["text"],
         padding=False,
         max_length=MAX_SEQUENCE_LENGTH,
         truncation=True,
-        add_special_tokens=False,
+        add_special_tokens=False,  # Special tokens already added in preprocess()
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/quantizing_moe/deepseek_v4_pro_example.py` around lines 81 - 88, In
the tokenize function, add a comment above or next to the
add_special_tokens=False parameter to explain why special tokens are not being
added here. The comment should clarify that special tokens like BOS/EOS were
already added during preprocessing in the preprocess function, so they should
not be added again during tokenization to avoid duplication.
Source: Path instructions

30-30: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Consider removing the empty max_memory parameter or add explanation.

The max_memory={} parameter is empty and appears to have no effect. For educational clarity, either:

Remove it if not needed

Provide an example value with a comment explaining how users can limit per-device memory allocation

Example with explanation:
max_memory={0: "40GB", 1: "40GB", "cpu": "100GB"},  # Optional: limit per-device memory
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/quantizing_moe/deepseek_v4_pro_example.py` at line 30, The
`max_memory={}` parameter being passed in the function call is empty and has no
practical effect. Either remove this parameter entirely since an empty
dictionary serves no purpose, or replace it with a meaningful example that
demonstrates how users can limit per-device memory allocation (such as
specifying device IDs and memory limits like {0: "40GB", 1: "40GB", "cpu":
"100GB"}) and add a comment explaining the parameter's purpose for educational
clarity.
Source: Path instructions

104-121: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Consider minor clarifications to the recipe configuration.

The quantization recipe is well-structured, but could benefit from small improvements:

Line 109: The target r"re:.*attn\.compressor\.indexer\.q_b_proj$" is not mentioned in the comments above (lines 96-102). Adding it to the comments would help users understand the full scope.

Line 120: The explicit ignore=[] parameter is unnecessary when empty and can be removed for cleaner code.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/quantizing_moe/deepseek_v4_pro_example.py` around lines 104 - 121,
Add a clarifying comment above the QuantizationModifier to document the
compressor indexer target pattern `r"re:.*attn\.compressor\.indexer\.q_b_proj$"`
that appears in the attention config_groups, explaining its purpose alongside
the existing attention projection targets. Additionally, remove the empty
`ignore=[]` parameter from the QuantizationModifier initialization as it is
unnecessary when not explicitly needed.
Source: Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/quantizing_moe/deepseek_v4_pro_example.py`:
- Around line 16-19: The comment for the
DeepseekV4PreTrainedModel._keep_in_fp32_modules_strict workaround lacks
sufficient context for educational purposes. Enhance the existing comment to
include a comprehensive explanation of the upstream bug being worked around,
link to relevant GitHub issues or pull requests that document this bug, and add
a TODO comment specifying the conditions or library version when this workaround
can be removed. This will help users understand why the workaround is necessary
and when it becomes obsolete.
- Around line 34-38: Remove the commented-out "kluge" workaround code block and
its explanatory comment from the example file. Delete the entire section
starting with the comment "# kluge for the way I saved the decompressed
checkpoint" and all subsequent commented-out lines that follow it which
manipulate the mds weights_map dataset index. This developer-specific workaround
code is not relevant to users learning from the example and should be removed to
keep the example clean and production-ready.
- Around line 126-139: The recipe parameter in the oneshot() function call is
commented out, preventing the quantization recipe from being applied during the
calibration process. Uncomment the recipe=recipe, parameter in the oneshot()
function call to ensure the quantization recipe defined earlier is actually
used. Additionally, improve the comments above and around the
propagate_error=False parameter to clearly explain what the workaround addresses
and when users might encounter similar issues with transformers cache behavior.
- Around line 21-32: Replace the hardcoded local file system paths with publicly
accessible alternatives in the model loading section. Change the MODEL_ID
variable from the local path to use the publicly available Hugging Face model
identifier that is already commented out, and replace the hardcoded
offload_folder path in the AutoModelForCausalLM.from_pretrained call with either
a generic temporary directory path or make it configurable through a variable
that users can easily modify. This will make the example script runnable for end
users without requiring access to specific developer machine paths.
- Around line 59-75: In the preprocess function, add a comment above or within
the assistant role handling block (the elif role == "assistant": section) to
explain that the </think> token is part of DeepSeek-V4's chain-of-thought
reasoning format for educational clarity. Additionally, add error handling to
the function to validate that each message in example["messages"] contains both
"role" and "content" fields before accessing them, either by wrapping the
message processing loop in a try-except block or by adding explicit validation
checks to handle missing or malformed message fields gracefully.
- Line 40: The AutoTokenizer.from_pretrained call hardcodes a model path
"RedHatAI/DeepSeek-V4-Flash-BF16" that differs from the model being used in the
example. Either replace the hardcoded string with the MODEL_ID variable to
ensure consistency throughout the example, or if the Flash and Pro model
tokenizers are intentionally the same, add a clear comment explaining why
different model identifiers are used for the model and tokenizer respectively.

---

Nitpick comments:
In `@examples/quantizing_moe/deepseek_v4_pro_example.py`:
- Around line 81-88: In the tokenize function, add a comment above or next to
the add_special_tokens=False parameter to explain why special tokens are not
being added here. The comment should clarify that special tokens like BOS/EOS
were already added during preprocessing in the preprocess function, so they
should not be added again during tokenization to avoid duplication.
- Line 30: The `max_memory={}` parameter being passed in the function call is
empty and has no practical effect. Either remove this parameter entirely since
an empty dictionary serves no purpose, or replace it with a meaningful example
that demonstrates how users can limit per-device memory allocation (such as
specifying device IDs and memory limits like {0: "40GB", 1: "40GB", "cpu":
"100GB"}) and add a comment explaining the parameter's purpose for educational
clarity.
- Around line 104-121: Add a clarifying comment above the QuantizationModifier
to document the compressor indexer target pattern
`r"re:.*attn\.compressor\.indexer\.q_b_proj$"` that appears in the attention
config_groups, explaining its purpose alongside the existing attention
projection targets. Additionally, remove the empty `ignore=[]` parameter from
the QuantizationModifier initialization as it is unnecessary when not explicitly
needed.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1f787e13-b61c-4a72-bac3-e71a8952c37c

📥 Commits

Reviewing files that changed from the base of the PR and between 5bebc2b and 16f2df5.

📒 Files selected for processing (1)

examples/quantizing_moe/deepseek_v4_pro_example.py

coderabbitai · 2026-06-23T17:20:46Z

+# Upstream BUG: norms should be loaded in float32, but usually aren't due to the base
+# model having a quant_config which overrides this. Loading in float32 actually
+# breaks the model definition (it expects bfloat16). Let's force load in bfloat16.
+DeepseekV4PreTrainedModel._keep_in_fp32_modules_strict = set()


📐 Maintainability & Code Quality | 🟠 Major | ⚖️ Poor tradeoff

Add detailed explanation or remove the fragile workaround.

Modifying a private class attribute (_keep_in_fp32_modules_strict) in an example script is problematic for educational purposes. Users learning from this example won't understand:

Why this workaround is necessary

What "upstream BUG" it addresses

When they need to apply it

Whether it will work with future library versions

Either provide a comprehensive explanation with links to relevant issues/PRs, or if this is truly a temporary workaround for incomplete dependencies (as noted in PR objectives), add a clear TODO comment explaining when this can be removed.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/quantizing_moe/deepseek_v4_pro_example.py` around lines 16 - 19, The comment for the DeepseekV4PreTrainedModel._keep_in_fp32_modules_strict workaround lacks sufficient context for educational purposes. Enhance the existing comment to include a comprehensive explanation of the upstream bug being worked around, link to relevant GitHub issues or pull requests that document this bug, and add a TODO comment specifying the conditions or library version when this workaround can be removed. This will help users understand why the workaround is necessary and when it becomes obsolete.

Source: Path instructions

coderabbitai · 2026-06-23T17:20:46Z

+# Select model and load it.
+# MODEL_ID = "RedHatAI/DeepSeek-V4-Flash-BF16"
+MODEL_ID = "/mnt/nvme-data/engine/kylesayrs/DeepSeek-V4-Pro-BF16"
+# MODEL_ID = "RedHatAI/DeepSeek-V4-Flash-BF16"
+
+with load_context():
+    model = AutoModelForCausalLM.from_pretrained(
+        MODEL_ID,
+        device_map="auto_offload",
+        max_memory={},
+        offload_folder="/mnt/nvme-data/engine/kylesayrs/offload_folder",
+    )


🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win

Replace hardcoded local paths with publicly accessible model IDs.

The example uses developer-specific local paths that make it non-runnable for users:

Line 23: /mnt/nvme-data/engine/kylesayrs/DeepSeek-V4-Pro-BF16

Line 31: /mnt/nvme-data/engine/kylesayrs/offload_folder

For an example script, these should use publicly available Hugging Face model IDs (like the commented-out "RedHatAI/DeepSeek-V4-Flash-BF16") and a generic offload folder path or make it configurable.

📝 Suggested fix

-# Select model and load it. -# MODEL_ID = "RedHatAI/DeepSeek-V4-Flash-BF16" -MODEL_ID = "/mnt/nvme-data/engine/kylesayrs/DeepSeek-V4-Pro-BF16" -# MODEL_ID = "RedHatAI/DeepSeek-V4-Flash-BF16" +# Select model and load it. You can also use local paths if you have the model downloaded. +MODEL_ID = "RedHatAI/DeepSeek-V4-Flash-BF16" # or "deepseek-ai/DeepSeek-V4-Pro" +OFFLOAD_FOLDER = "./offload_folder" # Directory for offloading model layers with load_context(): model = AutoModelForCausalLM.from_pretrained( MODEL_ID, device_map="auto_offload", max_memory={}, - offload_folder="/mnt/nvme-data/engine/kylesayrs/offload_folder", + offload_folder=OFFLOAD_FOLDER, )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/quantizing_moe/deepseek_v4_pro_example.py` around lines 21 - 32, Replace the hardcoded local file system paths with publicly accessible alternatives in the model loading section. Change the MODEL_ID variable from the local path to use the publicly available Hugging Face model identifier that is already commented out, and replace the hardcoded offload_folder path in the AutoModelForCausalLM.from_pretrained call with either a generic temporary directory path or make it configurable through a variable that users can easily modify. This will make the example script runnable for end users without requiring access to specific developer machine paths.

Source: Path instructions

coderabbitai · 2026-06-23T17:20:47Z

+# kluge for the way I saved the decompressed checkpoint
+# mds = model.model.layers[-1].self_attn.wq_a._hf_hook.weights_map.dataset.index
+# mds["model.hc_head.base"] = mds['model.hc_head.hc_base']
+# mds["model.hc_head.fn"] = mds['model.hc_head.hc_fn']
+# mds["model.hc_head.scale"] = mds['model.hc_head.hc_scale']


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Remove commented-out "kluge" code.

These commented lines describe a developer-specific workaround ("kluge") that has no educational value for users learning the library. Example scripts should be clean and production-ready to demonstrate best practices.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/quantizing_moe/deepseek_v4_pro_example.py` around lines 34 - 38, Remove the commented-out "kluge" workaround code block and its explanatory comment from the example file. Delete the entire section starting with the comment "# kluge for the way I saved the decompressed checkpoint" and all subsequent commented-out lines that follow it which manipulate the mds weights_map dataset index. This developer-specific workaround code is not relevant to users learning from the example and should be removed to keep the example clean and production-ready.

Source: Path instructions

coderabbitai · 2026-06-23T17:20:47Z

+# mds["model.hc_head.fn"] = mds['model.hc_head.hc_fn']
+# mds["model.hc_head.scale"] = mds['model.hc_head.hc_scale']
+
+tokenizer = AutoTokenizer.from_pretrained("RedHatAI/DeepSeek-V4-Flash-BF16")


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Use consistent model ID for tokenizer or explain the mismatch.

The tokenizer is loaded from "RedHatAI/DeepSeek-V4-Flash-BF16" while the model uses a different path. For clarity and correctness in an example:

If the tokenizers are identical, use the same MODEL_ID variable

If they differ intentionally, add a comment explaining why (e.g., if Flash and Pro models share the same tokenizer)

🔧 Suggested fix

-tokenizer = AutoTokenizer.from_pretrained("RedHatAI/DeepSeek-V4-Flash-BF16") +tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

tokenizer = AutoTokenizer.from_pretrained("RedHatAI/DeepSeek-V4-Flash-BF16")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/quantizing_moe/deepseek_v4_pro_example.py` at line 40, The AutoTokenizer.from_pretrained call hardcodes a model path "RedHatAI/DeepSeek-V4-Flash-BF16" that differs from the model being used in the example. Either replace the hardcoded string with the MODEL_ID variable to ensure consistency throughout the example, or if the Flash and Pro model tokenizers are intentionally the same, add a clear comment explaining why different model identifiers are used for the model and tokenizer respectively.

Source: Path instructions

coderabbitai · 2026-06-23T17:20:47Z

+def preprocess(example):
+    # DeepSeek-V4 does not have a traditional chat template.
+    # Encode manually per https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/tree/main/encoding
+    BOS = "<｜begin▁of▁sentence｜>"
+    EOS = "<｜end▁of▁sentence｜>"
+    text = BOS
+    for message in example["messages"]:
+        role = message["role"]
+        content = message["content"]
+        if role == "system":
+            text += content
+        elif role == "user":
+            text += f"<｜User｜>{content}"
+        elif role == "assistant":
+            text += f"<｜Assistant｜></think>{content}{EOS}"
+
+    return {"text": text}


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Add explanation for the </think> token and consider error handling.

The preprocessing function has a few educational clarity issues:

Unexplained token: Line 73 includes </think> without explanation. Users should understand this is part of DeepSeek-V4's chain-of-thought format.

No error handling: The function assumes all messages have "role" and "content" fields. Consider adding validation or a try-except for robustness.

Ambiguous characters: The fullwidth vertical lines (｜) in special tokens are correct for DeepSeek-V4's format and can be safely ignored from the Ruff warnings.

📚 Suggested improvements

def preprocess(example): - # DeepSeek-V4 does not have a traditional chat template. - # Encode manually per https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/tree/main/encoding + """ + DeepSeek-V4 does not have a traditional chat template. + Encode manually per https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/tree/main/encoding + + Note: DeepSeek-V4 uses chain-of-thought reasoning. The </think> token marks + the end of the model's internal reasoning before providing the final answer. + """ BOS = "<｜begin▁of▁sentence｜>" EOS = "<｜end▁of▁sentence｜>" text = BOS for message in example["messages"]: role = message["role"] content = message["content"] if role == "system": text += content elif role == "user": text += f"<｜User｜>{content}" elif role == "assistant": text += f"<｜Assistant｜></think>{content}{EOS}" + else: + # Skip unknown roles + continue return {"text": text}

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def preprocess(example):

# DeepSeek-V4 does not have a traditional chat template.

# Encode manually per https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/tree/main/encoding

BOS = "<｜begin▁of▁sentence｜>"

EOS = "<｜end▁of▁sentence｜>"

text = BOS

for message in example["messages"]:

role = message["role"]

content = message["content"]

if role == "system":

text += content

elif role == "user":

text += f"<｜User｜>{content}"

elif role == "assistant":

text += f"<｜Assistant｜></think>{content}{EOS}"

return {"text": text}

def preprocess(example):

"""

DeepSeek-V4 does not have a traditional chat template.

Encode manually per https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/tree/main/encoding

Note: DeepSeek-V4 uses chain-of-thought reasoning. The </think> token marks

the end of the model's internal reasoning before providing the final answer.

"""

BOS = "<｜begin▁of▁sentence｜>"

EOS = "<｜end▁of▁sentence｜>"

text = BOS

for message in example["messages"]:

role = message["role"]

content = message["content"]

if role == "system":

text += content

elif role == "user":

text += f"<｜User｜>{content}"

elif role == "assistant":

text += f"<｜Assistant｜></think>{content}{EOS}"

else:

# Skip unknown roles

continue

return {"text": text}

🧰 Tools

🪛 Ruff (0.15.18)

[warning] 62-62: String contains ambiguous ｜ (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?

(RUF001)

[warning] 62-62: String contains ambiguous ｜ (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?

(RUF001)

[warning] 63-63: String contains ambiguous ｜ (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?

(RUF001)

[warning] 63-63: String contains ambiguous ｜ (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?

(RUF001)

[warning] 71-71: String contains ambiguous ｜ (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?

(RUF001)

[warning] 71-71: String contains ambiguous ｜ (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?

(RUF001)

[warning] 73-73: String contains ambiguous ｜ (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?

(RUF001)

[warning] 73-73: String contains ambiguous ｜ (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?

(RUF001)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/quantizing_moe/deepseek_v4_pro_example.py` around lines 59 - 75, In the preprocess function, add a comment above or within the assistant role handling block (the elif role == "assistant": section) to explain that the </think> token is part of DeepSeek-V4's chain-of-thought reasoning format for educational clarity. Additionally, add error handling to the function to validate that each message in example["messages"] contains both "role" and "content" fields before accessing them, either by wrapping the message processing loop in a try-except block or by adding explicit validation checks to handle missing or malformed message fields gracefully.

Sources: Path instructions, Linters/SAST tools

coderabbitai · 2026-06-23T17:20:47Z

+oneshot(
+    model=model,
+    processor=tokenizer,
+    dataset=ds,
+    # recipe=recipe,
+    pipeline="sequential",
+    max_seq_length=MAX_SEQUENCE_LENGTH,
+    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+    sequential_targets=["DeepseekV4DecoderLayer"],
+    batch_size=1,
+    shuffle_calibration_samples=True,
+    propagate_error=False,  # work around reliance on transformers cache
+    # something weird happens with the cache and propagation
+)


🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win

CRITICAL: The recipe parameter is commented out - quantization will not be applied.

Line 130 has # recipe=recipe, commented out, which means the quantization recipe defined in lines 104-121 is never used. Without this parameter, the oneshot() call will not apply any quantization, defeating the entire purpose of this example.

Additionally:

Line 137-138: The workaround comments for propagate_error=False are incomplete and don't help users understand when or why they might need this parameter.

🐛 Required fix

oneshot( model=model, processor=tokenizer, dataset=ds, - # recipe=recipe, + recipe=recipe, pipeline="sequential", max_seq_length=MAX_SEQUENCE_LENGTH, num_calibration_samples=NUM_CALIBRATION_SAMPLES, sequential_targets=["DeepseekV4DecoderLayer"], batch_size=1, shuffle_calibration_samples=True, - propagate_error=False, # work around reliance on transformers cache - # something weird happens with the cache and propagation + # Note: propagate_error=False works around an issue with transformers cache + # during sequential quantization of DeepSeek-V4 models + propagate_error=False, )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/quantizing_moe/deepseek_v4_pro_example.py` around lines 126 - 139, The recipe parameter in the oneshot() function call is commented out, preventing the quantization recipe from being applied during the calibration process. Uncomment the recipe=recipe, parameter in the oneshot() function call to ensure the quantization recipe defined earlier is actually used. Additionally, improve the comments above and around the propagate_error=False parameter to clearly explain what the workaround addresses and when users might encounter similar issues with transformers cache behavior.

Source: Path instructions

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

mergify · 2026-06-24T15:57:47Z

The quality checks have failed. Please run make style and make quality under
the root directory to adddress the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

WIP: add example

16f2df5

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

mergify Bot added the documentation Improvements or additions to documentation label Jun 23, 2026

coderabbitai Bot added deepseek enhancement New feature or request fp8 For any issue / PR related to FP8 support moe nvfp4 For any PR / issue related to NVFP4 support labels Jun 23, 2026

gemini-code-assist Bot reviewed Jun 23, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

kylesayrs added 2 commits June 23, 2026 16:07

ddp, working with cache change

2102d01

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

WIP

0e3c77c

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

mergify Bot added the quality-failed label Jun 24, 2026

	offload_folder="/mnt/nvme-data/engine/kylesayrs/offload_folder",
	offload_folder="./offload_folder",

	tokenizer = AutoTokenizer.from_pretrained("RedHatAI/DeepSeek-V4-Flash-BF16")
	tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

Uh oh!

Conversation

kylesayrs commented Jun 23, 2026

Prerequisites

Uh oh!

github-actions Bot commented Jun 23, 2026

Uh oh!

coderabbitai Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning, 2 inconclusive)

Review ran into problems

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

mergify Bot commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading