
add envs VLLM_KUNLUN_ENABLE_INT8_BMM to docs #150

Open

zhihui96 wants to merge 1 commit into baidu:main from zhihui96:update-doc

Conversation

@zhihui96
Contributor

PR Description

add envs VLLM_KUNLUN_ENABLE_INT8_BMM to docs


Checklist (Required)

Before submitting this PR, please ensure that all the following items are completed:

  • All code changes pass the pre-commit checks.
  • Commits are signed off using git commit -s.
  • The PR title is properly classified (see below).

PR Type

Please prefix the PR title with one or more of the following labels to help reviewers quickly understand the nature of the change:

  • [Feature] – New features or enhancements (e.g. Attention, Communicator, Kernel, Worker, etc.)
  • [Bugfix] – Bug fixes
  • [CI/Build] – CI, build system, or infrastructure improvements
  • [Doc] – Documentation updates or fixes
  • [Misc] – Other changes that do not fit the above categories (use sparingly)

Note: If the PR spans multiple categories, include all relevant prefixes.


Detailed Checklist

Thank you for contributing to vLLM Kunlun! To help us maintain high code quality and streamline the review process, please ensure your PR meets the following requirements.

1. Code Quality

  • All linting and formatting checks pass (pre-commit).
  • The code is well-structured and sufficiently documented.
  • The change is designed with maintainability and readability in mind.

2. Testing

  • Relevant unit tests are added or updated.
  • Integration tests are included when applicable.
  • Existing tests continue to pass.

3. DCO Compliance

This project follows the Developer Certificate of Origin (DCO).

  • All commits include a Signed-off-by: line.
  • Use git commit -s to automatically add the sign-off.

4. Review Expectations

During the review process, maintainers may:

  • Request code refactoring or additional tests.
  • Ask for clarifications on design decisions.
  • Suggest performance, stability, or maintainability improvements.

We appreciate your patience and collaboration throughout the review process!

| Environment Variable | Default | Description |
| --- | --- | --- |
| `XMLIR_ENABLE_MOCK_TORCH_COMPILE` | `false` | **Disable the mock torch.compile function**. Set to `false` to ensure the actual compilation and optimization flow is used, rather than mock mode. |
| `FUSED_QK_ROPE_OP` | `0` | **Control whether to use the fused QK-Norm and RoPE implementation**. Default is `0` (use the original/standard RoPE). Setting to `1` enables the fused path, used for Qwen3. |
| `VLLM_KUNLUN_ENABLE_INT8_BMM` | `0` | **Control whether to enable int8 BMM**. Default is `0`. Setting to `1` can save some memory when using int8 quantization. |
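
For reference, a minimal sketch of how a boolean flag like `VLLM_KUNLUN_ENABLE_INT8_BMM` is typically parsed; the helper name `env_flag` is hypothetical, not part of the vllm-kunlun codebase:

```python
import os

def env_flag(name: str, default: str = "0") -> bool:
    # Treat "1"/"true"/"yes" (case-insensitive) as enabled.
    return os.environ.get(name, default).strip().lower() in ("1", "true", "yes")

# Disabled unless explicitly set, matching the documented default of 0.
ENABLE_INT8_BMM = env_flag("VLLM_KUNLUN_ENABLE_INT8_BMM")
```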
Collaborator

If this is useful for INT8, why not set it to `1` by default?

Contributor Author

It's not available for the unquantized case. Just set it to true when using int8.

Collaborator

I don't think we should use this environment variable to control whether to run quantization; there are standard methods for detecting quantization. And do I understand correctly that if I run a quantized model but forget to enable this environment variable, it will cause an error?

Contributor Author

If VLLM_KUNLUN_ENABLE_INT8_BMM=False, torch.bmm with float16/bfloat16 is used for the W_UK/W_UV calculation; otherwise, xtorch_ops.mla_bmm_I8 with int8 is used. For DS V3.1 this doesn't improve performance, it just saves memory. I think this feature requires more testing, so I prefer adding a new environment variable to control the behavior.
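
For clarity, a minimal sketch of the dispatch described above, assuming `xtorch_ops.mla_bmm_I8` accepts the same (input, weight) pair as `torch.bmm`; that signature and the function name `bmm_w_uk` are assumptions for illustration:

```python
import os
import torch

ENABLE_INT8_BMM = os.environ.get("VLLM_KUNLUN_ENABLE_INT8_BMM", "0") == "1"

def bmm_w_uk(x: torch.Tensor, w_uk: torch.Tensor) -> torch.Tensor:
    if not ENABLE_INT8_BMM:
        # Default path: plain batched matmul in float16/bfloat16.
        return torch.bmm(x, w_uk)
    # Int8 path: weights stay quantized, which saves memory; the exact
    # signature of xtorch_ops.mla_bmm_I8 is an assumption here.
    import xtorch_ops  # Kunlun-specific ops package
    return xtorch_ops.mla_bmm_I8(x, w_uk)
```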

@liwei109
Collaborator

`if not isinstance(layer.quant_method, UnquantizedLinearMethod)`
You can determine whether to run int8 BMM this way instead of adding an environment variable; with an environment variable, only someone who already knows about it can run the model correctly.
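
A sketch of that check, assuming the layer exposes a `quant_method` attribute and that `UnquantizedLinearMethod` is importable from vLLM's linear-layer module as in upstream vLLM; the helper name `should_use_int8_bmm` is hypothetical:

```python
from vllm.model_executor.layers.linear import UnquantizedLinearMethod

def should_use_int8_bmm(layer) -> bool:
    # A quantized layer carries a quant_method other than the
    # unquantized default, so no environment variable is needed.
    return not isinstance(layer.quant_method, UnquantizedLinearMethod)
```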

