paul-gibbons

Description

The debug tool in TE lets the user set a list of non-overlapping (start, end) pairs within which logging occurs at the specified frequency. This MR addresses two bugs in the start_end_list functionality.
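
For reviewers unfamiliar with the option, here is a minimal sketch of the windowing behaviour described above. It is not TE's implementation: the helper name `should_log`, the inclusive treatment of `end`, and counting the frequency from iteration 0 are all assumptions made for illustration.

```python
from typing import List, Tuple


def should_log(iteration: int, freq: int, start_end_list: List[Tuple[int, int]]) -> bool:
    """Hypothetical helper: log only inside a (start, end) window, every `freq` iterations.

    Assumptions (not confirmed by the PR): `end` is inclusive and the
    frequency grid is anchored at iteration 0.
    """
    in_window = any(start <= iteration <= end for start, end in start_end_list)
    return in_window and iteration % freq == 0


# Two non-overlapping windows, logging every 5 iterations.
windows = [(0, 10), (100, 110)]
logged = [i for i in range(120) if should_log(i, 5, windows)]
```

After the last window ends there is no further iteration to log at, which is the situation the second fix in this MR deals with.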

  1. start_end_list was passed to stats_buffer directly as a List, which raised an unhashable type error at the first logging interval. The list is now converted to nested tuples in LogTensorStats and LogFP8TensorStats, which resolves the issue.
TransformerEngine/transformer_engine/debug/features/utils/stats_buffer.py", line 193, in try_add_buffer
[rank0]:     if (layer_name, tensor_name, options) in self.buffers:
[rank0]:        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: TypeError: unhashable type: 'list'
  2. Added handling in stats_buffer for the case where next_iter=None when using start_end_list. Previously this raised a type error when checking iteration >= next_iter in stats_buffer.py, as shown below.
"/pg/dev/code/TransformerEngine/transformer_engine/debug/features/utils/stats_buffer.py", line 224, in log_stats
[rank0]:     if not self._if_run_reduction():
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/pg/dev/code/TransformerEngine/transformer_engine/debug/features/utils/stats_buffer.py", line 184, in _if_run_reduction
[rank0]:     if iteration >= next_iter:
[rank0]:        ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: TypeError: '>=' not supported between instances of 'int' and 'NoneType'
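
The two fixes can be sketched in isolation as follows. This is an illustrative sketch under stated assumptions, not the code in this MR: `to_hashable` and `should_reduce` are hypothetical names, the shape of `options` is invented for the example, and treating `next_iter=None` as "no reduction scheduled" is an assumed reading of the fix.

```python
from typing import Any, Hashable, Optional


def to_hashable(value: Any) -> Hashable:
    """Hypothetical helper: recursively turn lists into tuples so the value can be a dict key."""
    if isinstance(value, (list, tuple)):
        return tuple(to_hashable(v) for v in value)
    return value


def should_reduce(iteration: int, next_iter: Optional[int]) -> bool:
    """Hypothetical guard: None is assumed to mean no further logging iteration is scheduled."""
    return next_iter is not None and iteration >= next_iter


# Bug 1: a key containing a list cannot be looked up in a dict.
try:
    hash(("layer1", "activation", [[0, 10], [100, 110]]))
    list_key_is_hashable = True
except TypeError:
    list_key_is_hashable = False

# With nested tuples the same lookup works.
buffers = {}
options = to_hashable([[0, 10], [100, 110]])  # illustrative shape only
key = ("layer1", "activation", options)
buffers[key] = "buffer"
found = key in buffers

# Bug 2: comparing an int with None raises TypeError; the guard avoids the comparison.
past_last_window = should_reduce(200, None)
```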

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@pggPL pggPL self-requested a review October 9, 2025 07:09