v0.1.10 #2258
LeiWang1999 announced in Announcements
This release focuses on broader backend support, new GPU instructions, compiler
pipeline improvements, and release/build stability.
Highlights
LDS, LDS transpose reads, INT8 MFMA, MXFP4 FP4 E2M1, and RDNA gfx1151 target
support.
copies, TMA gather4 / scatter4, and T.copy_cluster for TMA multicast and
SM-to-SM cluster copy.
GPU benchmarking, and do_not_specialize support.
lowering/codegen paths into backend-specific modules.
including split TVM DLL handling.
Compiler / Runtime Improvements
bind-free pipeline annotations, guarded TMA pipeline fixes, and bind-scope
preservation.
vectorization for bf16/fp16 reductions.
Bug Fixes
placement, 1D TMA store layout inference, quarter swizzle, and invalid
T.tma_copy SIMT fallback.
filtering, and Roller autotuner behavior on RDNA3 WMMA targets.
Docs / Examples
examples, DeepSeek-V4 operator examples, and LayerNorm example.
documentation.
Compatibility Notes
What's Changed
[Bugfix] Enable .shared::cta in TMA copy paths only on CUDA 12.8+ by @ColmaLiu in #2087
Allow variable barrier id in T.sync_threads() by @bucket-xv in #2197
[Enhance] Reject default scalar params and support do_not_specialize for autotune by @Rachmanino in #2084
[BugFix] Add missing quarter swizzle and disallow T.tma_copy SIMT fallback by @Rachmanino in #2242
New Contributors
Full Changelog: v0.1.9...v0.1.10
This discussion was created from the release v0.1.10.