add envs VLLM_KUNLUN_ENABLE_INT8_BMM to docs #150
Conversation
| Environment Variable | Default | Description |
|---|---|---|
| `export XMLIR_ENABLE_MOCK_TORCH_COMPILE` | `false` | **Disable Mock Torch Compile Function**. Set to `false` to ensure the actual compilation and optimization flow is used, rather than mock mode. |
| `FUSED_QK_ROPE_OP` | `0` | **Control whether to use the fused QK-Norm and RoPE implementation**. Default is `0` (use the original/standard RoPE). Setting to `1` may be used to enable the fused path for Qwen3. |
| `VLLM_KUNLUN_ENABLE_INT8_BMM` | `0` | **Control whether to enable int8 BMM**. Default is `0`. Setting to `1` can save some memory when using int8 quantization. |
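(For reference, a minimal sketch of how such a boolean env flag is typically read on the Python side; only the variable name comes from the table above, the helper itself is illustrative and not vllm-kunlun code.)

```python
import os

# Illustrative helper (not from vllm-kunlun): treat the env var as a
# boolean switch, defaulting to "0" (disabled).
def int8_bmm_enabled() -> bool:
    return os.environ.get("VLLM_KUNLUN_ENABLE_INT8_BMM", "0") == "1"
```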
If this is useful for INT8, why not set it to `1` by default?
It's not available for the unquantized case. Just set it to `1` when using int8.
I don't think we should use this environment variable to control whether quantization runs. There are standard ways to detect quantization. Also, do I understand correctly that if I run a quantized model but forget to enable this environment variable, it will cause an error?
If VLLM_KUNLUN_ENABLE_INT8_BMM=False, use torch.bmm with float16/bfloat16 for the W_UK/W_UV calculation.
Otherwise, use xtorch_ops.mla_bmm_I8 with int8. For DS V3.1 this doesn't improve performance; it just saves memory. I think this feature needs more testing, and I'd prefer to add a new environment variable to control the behavior.
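(A rough sketch of that dispatch, assuming `xtorch_ops.mla_bmm_I8` takes batched activations, int8 weights, and a dequantization scale; the exact signature and tensor layout are assumptions for illustration, not taken from the vllm-kunlun source.)

```python
from typing import Optional

import torch


def bmm_w_uk_uv(x: torch.Tensor,
                w_fp: torch.Tensor,
                w_int8: Optional[torch.Tensor] = None,
                w_scale: Optional[torch.Tensor] = None,
                use_int8_bmm: bool = False) -> torch.Tensor:
    """Batched matmul against W_UK / W_UV (illustrative only).

    Flag off: keep the weights in float16/bfloat16 and use torch.bmm.
    Flag on: keep an int8 copy of the weights plus a dequant scale and
    call the Kunlun int8 kernel (signature assumed).
    """
    if not use_int8_bmm:
        # Default path: plain fp16/bf16 batched matmul.
        return torch.bmm(x, w_fp)
    # Int8 path: stores W_UK / W_UV as int8, saving memory.
    import xtorch_ops  # Kunlun-specific extension; import path assumed
    return xtorch_ops.mla_bmm_I8(x, w_int8, w_scale)  # signature assumed
```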
PR Description
add envs VLLM_KUNLUN_ENABLE_INT8_BMM to docs
Checklist (Required)
Before submitting this PR, please ensure that all the following items are completed:
- The code passes all `pre-commit` checks.
- All commits are signed off with `git commit -s`.

PR Type
Please prefix the PR title with one or more of the following labels to help reviewers quickly understand the nature of the change:
- `[Feature]` – New features or enhancements (e.g. Attention, Communicator, Kernel, Worker, etc.)
- `[Bugfix]` – Bug fixes
- `[CI/Build]` – CI, build system, or infrastructure improvements
- `[Doc]` – Documentation updates or fixes
- `[Misc]` – Other changes that do not fit the above categories (use sparingly)

Detailed Checklist
Thank you for contributing to vLLM Kunlun! To help us maintain high code quality and streamline the review process, please ensure your PR meets the following requirements.
1. Code Quality
- Code passes all linting and formatting checks (e.g. via `pre-commit`).

2. Testing
3. DCO Compliance
This project follows the Developer Certificate of Origin (DCO).
- All commits must include a `Signed-off-by:` line.
- Use `git commit -s` to automatically add the sign-off.

4. Review Expectations
During the review process, maintainers may request changes, ask questions, or suggest improvements.
We appreciate your patience and collaboration throughout the review process!