
Conversation

@dcampora (Contributor) commented Dec 5, 2025

Support Mistral Large 3 NVFP4.

Depends on #14466.

• GSM8K test results:

```bash
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
python3 -m sglang.launch_server \
--model mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
--kv-cache-dtype fp8_e4m3 \
--tensor-parallel-size 8 \
--disable-radix-cache \
--stream-interval 20 \
--mem-fraction-static 0.9 \
--attention-backend trtllm_mla \
--model-loader-extra-config '{"enable_multithread_load": true}' \
--max-running-requests 1024 \
--cuda-graph-max-bs 1024 \
--chat-template mistral
```

```bash
lm_eval \
--model local-chat-completions \
--model_args model=mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4,\
base_url=http://127.0.0.1:30000/v1/chat/completions,\
num_concurrent=128,timeout=999999,max_gen_toks=8192 \
--tasks gsm8k \
--batch_size 128 \
--apply_chat_template \
--num_fewshot 8
```

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9249|±  |0.0073|
|     |       |strict-match    |     8|exact_match|↑  |0.7104|±  |0.0125|
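
For a quick sanity check before running lm_eval, a minimal request against the OpenAI-compatible endpoint can be used (same base URL as in the lm_eval command above; the prompt and generation settings are only illustrative):

```python
# Minimal smoke test against the launched server, using the same endpoint as the
# lm_eval command above. The prompt and generation settings are illustrative.
import requests

resp = requests.post(
    "http://127.0.0.1:30000/v1/chat/completions",
    json={
        "model": "mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4",
        "messages": [{"role": "user", "content": "What is 12 * 7?"}],
        "max_tokens": 64,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```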

Checklist

@gemini-code-assist (Contributor):

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions bot added the quant (LLM Quantization), deepseek, and blackwell (SM100/SM120) labels on Dec 5, 2025
@dcampora changed the title from "Mistral Large 3 Eagle and NVFP4 support" to "Mistral Large 3 NVFP4 support" on Dec 5, 2025
@JustinTong0323 (Collaborator):

/tag-and-rerun-ci

@github-actions bot added the run-ci label on Dec 5, 2025

Contributor:

This is a matmul op; why is it put under the attention layer?


```python
from .compressed_tensors_scheme import CompressedTensorsScheme
from .compressed_tensors_w4a4_nvfp4 import CompressedTensorsW4A4Fp4
from .compressed_tensors_w4a16_nvfp4 import CompressedTensorsW4A16Fp4
```

Contributor:

Have we tested the w4a16 code path? If not, it would be better to do it in another PR. We may need it on Hopper or earlier architectures, and we don't have w4a16 MoE support for now.

Contributor:

Same comment; this can be done in a follow-up PR.


```python
if is_activation_quantization_format(self.quant_format):
    if self._is_fp4a4_nvfp4(weight_quant, input_quant):
        if cutlass_fp4_supported():
```

Contributor:

w4a4 supports both FlashInfer and CUTLASS, right? I think we should do something similar to the method below and check the device capability.
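
A minimal sketch of the kind of capability check suggested here, assuming the NVFP4 GEMM path requires Blackwell-class GPUs (SM100/SM120, per this PR's labels); the helper name is illustrative, not the actual implementation:

```python
# Illustrative sketch only: the helper name and the SM100 threshold are assumptions
# based on this PR's Blackwell (SM100/SM120) labels, not the actual implementation.
import torch


def _nvfp4_gemm_supported() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    # Both the CUTLASS and FlashInfer NVFP4 GEMM backends need Blackwell (SM100+).
    return (major, minor) >= (10, 0)
```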

Contributor:

This seems to be only used by w4a16.

```python
)


def swizzle_blockscale(scale: torch.Tensor):
```

Contributor:

Add a comment to clarify that this method is NVFP4-specific.
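
A sketch of what such a clarifying note could look like; the wording is an assumption, not the PR's actual docstring:

```python
# Illustrative wording only; the actual swizzle logic is elided.
import torch


def swizzle_blockscale(scale: torch.Tensor):
    """Swizzle NVFP4 block scales into the layout expected by the FP4 GEMM kernels.

    NOTE: This helper is NVFP4-specific; it should not be reused for other
    quantization formats.
    """
    ...
```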
