Add FP8 KV Support #737
base: main
Conversation
Signed-off-by: yiliu30 <[email protected]>
Pull Request Overview
This PR adds FP8 KV cache quantization support to the AutoRound quantization library, enabling the key-value cache of attention layers to be quantized to FP8 for improved memory efficiency.
- Adds a new `enable_fp8_kv` parameter to control FP8 KV cache quantization (see the usage sketch after this list)
- Implements FP8 KV cache infrastructure with calibration and quantization context
- Updates test coverage to verify FP8 KV cache serialization and functionality
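A rough usage sketch of the new flag, assuming it is exposed on the AutoRound constructor as the overview describes; the surrounding arguments follow the library's existing public API, and the exact names in this PR may differ:

```python
from auto_round import AutoRound

# `model` and `tokenizer` are assumed to be an already-loaded HF causal LM and its tokenizer.
autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    enable_fp8_kv=True,  # new flag from this PR: quant-dequant the KV cache in FP8
)
autoround.quantize()
autoround.save_quantized("./qmodel", format="auto_round")
```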
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
File | Description
---|---
auto_round/autoround.py | Adds `enable_fp8_kv` parameter and context manager integration
auto_round/experimental/fp8_kv_cache.py | Core FP8 KV cache implementation with quantization logic
test/test_cpu/test_export.py | Test updates to verify FP8 KV cache export functionality
from transformers.cache_utils import DynamicCache

logger.add(sys.stderr, level="TRACE")
Adding a sys.stderr sink at TRACE level at module import time can interfere with application-level logging configuration. Consider removing this global logger configuration or making it conditional.
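One way to make this opt-in, as a minimal sketch assuming the module uses loguru (the `AR_LOG_LEVEL` environment variable name is hypothetical):

```python
import os
import sys

from loguru import logger

# Only reconfigure logging when the user explicitly opts in, so importing the
# module does not override the application's own logging configuration.
_log_level = os.environ.get("AR_LOG_LEVEL")
if _log_level:
    logger.remove()  # drop loguru's default stderr sink before adding ours
    logger.add(sys.stderr, level=_log_level)
```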
Get the k_scale and v_scale and output the quant-dequant key_states and value_states
"""

# FIXME: Should we append the key_states/value_states to the cache?
The FIXME comment indicates uncertainty about the intended behavior. This should be resolved before production use, as it could affect the correctness of the KV cache implementation.
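For context on what the docstring describes, a minimal sketch of a per-tensor FP8 quant-dequant of the key/value states; the function name, the per-tensor scaling granularity, and the e4m3 format are assumptions here, not necessarily what the PR implements:

```python
import torch

FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0


def fake_quant_fp8(states: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Return (scale, quant-dequant states) using per-tensor FP8 e4m3 scaling."""
    scale = states.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    q = (states / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return scale, q.to(states.dtype) * scale


# k_scale/v_scale plus the quant-dequant key/value states, as in the docstring.
key_states = torch.randn(1, 8, 16, 64)
value_states = torch.randn(1, 8, 16, 64)
k_scale, key_states_qdq = fake_quant_fp8(key_states)
v_scale, value_states_qdq = fake_quant_fp8(value_states)
```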
# FIXME: Handle this better.
return "attention" in module.__class__.__name__.lower() and (
    hasattr(module, "k_proj") or hasattr(module, "v_proj") or hasattr(module, "qkv_proj")
)
The attention module detection logic uses a simple string check and attribute existence. This brittle approach should be improved with a more robust detection mechanism.
)
"""
Robustly detect if a module is an attention module.
Checks for known attention base classes and required attributes.
"""
# Check for known attention base classes (assumed to be imported elsewhere, or None if unavailable)
attention_bases = tuple(
    cls for cls in (MultiheadAttention, LlamaAttention) if cls is not None
)
if attention_bases and isinstance(module, attention_bases):
    return True
# Fallback: check for common attention attributes and method signatures
has_proj = any(
    hasattr(module, attr)
    for attr in ("k_proj", "v_proj", "qkv_proj", "in_proj_weight", "out_proj")
)
has_forward = hasattr(module, "forward")
# Optionally, check for typical attention input signatures
return has_proj and has_forward
`[batch_size, num_heads, seq_len - residual_length, head_dim]`.

# TODO: Triggered by adding kv_cache_scheme in ...
The TODO comment is incomplete and doesn't provide meaningful information. It should either be completed with specific details or removed.
# TODO: Triggered by adding kv_cache_scheme in ...
Please note that the KV cache does not affect tuning, as we only use the forward pass during tuning. So should the first change be moving the arg from init to save_quantized?
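A rough illustration of that suggestion; the exact AutoRound signatures here are assumptions based on the library's existing API, not the PR's final behavior:

```python
from auto_round import AutoRound

# Tuning stays unchanged; FP8 KV would only be requested at export time.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./qmodel", format="auto_round", enable_fp8_kv=True)
```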
Signed-off-by: yiliu30 <[email protected]>