@ecamartins ecamartins commented Nov 21, 2025

Proposed changes

In CK Tile Stream-K, multiple workgroups may contribute to a given tile in the C tensor. When the accumulation strategy uses atomics, round-off error can occur if the accumulator type differs from the C type. To compute an error tolerance for test validation, the Stream-K Tile Partitioner provides a function, estimate_num_wgs_per_tile, that estimates the number of workgroups per tile. Because this function only provides an estimate, it may underestimate in some cases. This underestimation caused the error tolerance computed by calculate_rtol_atol to be too low for the DP + 2 Tile SK test cases, which led to a regression in some Stream-K tests. Note that these validation failures may not always appear, because atomics make the accumulation order non-deterministic.
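As a rough illustration of why a tolerance sized for one workgroup can be too tight when two workgroups actually contribute, the sketch below compares the two paths for a single output element. It is not CK Tile code: `round_to_bf16`, `single_wg_result`, `two_wg_result`, and `abs_error` are hypothetical helpers, bfloat16 is assumed as the C type with float as the accumulator type, and the atomic accumulation is modeled as two sequential adds performed in the C type.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: round a finite float to the nearest bfloat16-representable
// value (round-to-nearest-even) and return it as a float. NaN/Inf are not handled.
inline float round_to_bf16(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    const std::uint32_t lsb = (bits >> 16) & 1u; // lowest bit that survives
    bits += 0x7FFFu + lsb;                       // round to nearest, ties to even
    bits &= 0xFFFF0000u;                         // drop the low 16 bits
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// One workgroup owns the tile: accumulate in float, convert to the C type once.
inline float single_wg_result(float partial_a, float partial_b)
{
    return round_to_bf16(partial_a + partial_b);
}

// Two workgroups share the tile (assumed model): each partial is converted to the
// C type, then added into C, and every add is rounded to the C type again.
inline float two_wg_result(float partial_a, float partial_b)
{
    float c = 0.0f;
    c = round_to_bf16(c + round_to_bf16(partial_a));
    c = round_to_bf16(c + round_to_bf16(partial_b));
    return c;
}

// Absolute error of a result against the sum computed in double precision.
inline double abs_error(float result, float partial_a, float partial_b)
{
    const double exact = static_cast<double>(partial_a) + static_cast<double>(partial_b);
    return std::fabs(static_cast<double>(result) - exact);
}
```

With the partial sums 1.005859375f and 3.0087890625f, the exact sum is 4.0146484375. The single-workgroup path returns 4.0 (error of about 0.0146), while the two-workgroup path returns 4.03125 (error of about 0.0166). A tolerance derived under the assumption of one workgroup per tile can therefore reject a result that is as accurate as the two-workgroup path allows.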

Thus, this change updates the estimate_num_wgs_per_tile function to explicitly return 2 for DP+2TSK instances, so that the computed error tolerance is large enough to avoid test failures due to round-off error.

We tested locally on gfx90a, gfx942, and gfx950; all tests pass. We found some unrelated issues on gfx908, so we have disabled our tests on gfx908 for now and will be creating a ticket to investigate the issues.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which helps the maintainers understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

When multiple workgroups contribute to a tile using atomics, there may
be round-off error in cases where the accumulator type is not the same
as the C type. To compute an error tolerance for test validation, the
Stream-K Tile Partitioner has a function called
estimate_num_wgs_per_tile to estimate the number of workgroups per tile.
That said, this function only provides an estimate. In some cases for
DP+2TSK, the function returns 1 rather than the more accurate value of
2.

Thus, this change updates the estimate_num_wgs_per_tile function to
explicitly return the value of 2 for DP+2TSK cases, so that the error
tolerance is large enough to avoid test failures due to round-off
error.
@ecamartins ecamartins merged commit 02ab76c into develop Nov 22, 2025
18 checks passed
@ecamartins ecamartins deleted the emimarti/ck_tile/dp_2tsk_validation_fix branch November 22, 2025 03:29