Conversation

@JackChuang (Contributor)
Motivation

Prevent users from running KV4 (FP4 KV cache) with incompatible attention backends by documenting the supported backends and enforcing runtime checks.

This improves reliability: users can no longer accidentally start a server with a KV4 + backend combination that would fail at runtime.

Modifications

The check is placed in model_runner.py rather than server_args.py so that it runs after the default backend settings have been resolved.

Description / Changes:
1. Backend documentation updates
• Added an FP4 KV cache column to the MLA/MHA backend table.
• Clarified which backend combinations support FP4 KV caches.
2. ModelRunner updates
• Added _handle_kv4_compatibility() to check KV4 compatibility with the selected attention backends at runtime.
• Logs warnings for potential edge-case incompatibilities.
• Asserts the correct decode_attention_backend for FA4 + MLA/MHA and non-FA4 + MLA/MHA setups.
• Raises an error if KV4 is used on a non-CUDA platform.
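The runtime check described above could look roughly like the following sketch. This is illustrative only: the field names (kv_cache_dtype, attention_backend, decode_attention_backend), the backend string "fa4", and the "fp4" dtype prefix are assumptions, not the actual sglang implementation.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)

def handle_kv4_compatibility(server_args, device):
    """Reject or warn about FP4 KV cache (KV4) configurations.

    Sketch of the kind of validation described in the PR; attribute
    names and backend strings are hypothetical.
    """
    if not str(server_args.kv_cache_dtype).startswith("fp4"):
        return  # KV4 not enabled, nothing to validate

    if device != "cuda":
        # The PR restricts KV4 to CUDA platforms.
        raise ValueError("FP4 KV cache is only supported on CUDA platforms.")

    decode = server_args.decode_attention_backend
    if server_args.attention_backend == "fa4":
        # FA4 prefill must pair with an FA4 (or default) decode path.
        assert decode in (None, "fa4"), (
            f"fa4 + FP4 KV cache requires decode_attention_backend='fa4', "
            f"got {decode!r}"
        )
    else:
        # Non-FA4 backends must not mix in an FA4 decode path.
        assert decode != "fa4", (
            "decode_attention_backend='fa4' requires attention_backend='fa4' "
            "when KV4 is enabled"
        )
        # Combinations outside the documented matrix get a warning only.
        logger.warning(
            "FP4 KV cache with attention backend %r is only partially validated.",
            server_args.attention_backend,
        )
```

A valid configuration passes silently, while an unsupported platform raises immediately, e.g. `handle_kv4_compatibility(SimpleNamespace(kv_cache_dtype="fp4_e2m1", attention_backend="fa4", decode_attention_backend="fa4"), "cuda")`.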

Testing

Compatibility was verified on B200 (sm100), using Qwen3-235B-A22B as the MHA model and DeepSeek-R1-0528-FP4 as the MLA model.
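For context, a server launch for such a test might look like the command below. This is a sketch: the --kv-cache-dtype value for FP4 and the exact model path are assumptions, and the tensor-parallel size is illustrative.

```shell
# Hypothetical KV4 compatibility test launch on B200 (sm100).
# The fp4 dtype string and backend name are assumptions, not verified flags.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-0528-FP4 \
  --attention-backend fa4 \
  --kv-cache-dtype fp4_e2m1 \
  --tp-size 8
```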

Next (WIP)

Test KV4 with the fa3 and flashmla backends on sm90 to complete the table. I will send another PR for this.

Checklist

… FP4 note

- Introduce FP4 KV cache support in the backend matrix.
- Add note on FA4 + KV4 scenario.

Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]>
- Add _handle_kv4_compatibility() to validate backend choices for KV4 scenarios.
- Warns on potential edge-case incompatibilities.
- Asserts the correct decode_attention_backend for FA4 + MLA/MHA and non-FA4 + MLA/MHA setups.
- Raises an error if KV4 is used on non-CUDA platforms.

Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]>

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 5, 2025
port=self.dist_port,
)

def _handle_kv4_compatibility(self):
Collaborator

Can we move this function to server_args.py, somewhere after _handle_attention_backend_compatibility

Contributor Author

No problem. I will fix this as soon as I can. Thank you for your feedback!
