v0.1.10 · tile-ai tilelang · Discussion #2258 · GitHub

LeiWang1999
May 25, 2026
Maintainer

This release focuses on broader backend support, new GPU instructions, compiler
pipeline improvements, and release/build stability.

Highlights

Added major AMD support: RDNA3/RDNA3.5 WMMA, gfx950/CDNA4 copy.async, 160 KB
LDS, LDS transpose reads, INT8 MFMA, MXFP4 (FP4 E2M1), and RDNA gfx1151 target
support.
Added CUDA/Blackwell features: MXFP8 block-scaled GEMM, FP4 TensorMap TMA
copies, TMA gather4 / scatter4, and T.copy_cluster for TMA multicast and
SM-to-SM cluster copy.
Added native SM75 MMA GEMM support for FP16, INT8, and INT4.
Added initial Metal GEMM support using simdgroup_matrix MMA.
Added T.tfloat32 dtype support and expanded TCGEN5 F8/F6/F4 dtype plumbing.
Improved autotuning with pipelined compilation, grouped compilation,
multi-GPU benchmarking, and do_not_specialize support.
Refactored backend structure by splitting CUDA, ROCm, Metal, CPU, and WebGPU
lowering/codegen paths into backend-specific modules.
Migrated IR usage toward tirx.
Added PyPI release publishing workflow and improved Windows support,
including split TVM DLL handling.
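
For readers unfamiliar with the "MXFP4 (FP4 E2M1)" naming above, the following is a minimal sketch of how such values decode, assuming the layout described in the OCP Microscaling (MX) format specification: a 4-bit element with 1 sign bit, 2 exponent bits (bias 1), and 1 mantissa bit, multiplied by a shared E8M0 power-of-two block scale (one scale per 32 elements in that specification). The function names are hypothetical and this is not TileLang's implementation.

```python
# Illustrative decoder for MX-style FP4 (E2M1) elements with a shared block scale.
# Assumption: bit layouts follow the OCP Microscaling (MX) format description.
# All names here are hypothetical helpers, not TileLang APIs.

E2M1_EXP_BIAS = 1    # exponent bias of the 4-bit element format
E8M0_EXP_BIAS = 127  # exponent bias of the 8-bit shared scale


def decode_e2m1(code):
    """Decode a 4-bit E2M1 code (0..15) into a Python float."""
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 code must fit in 4 bits")
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:
        # Subnormal range: no implicit leading one, so only 0 and 0.5 exist.
        magnitude = 0.5 * man
    else:
        magnitude = 2.0 ** (exp - E2M1_EXP_BIAS) * (1.0 + 0.5 * man)
    return sign * magnitude


def decode_e8m0(scale_code):
    """Decode an 8-bit E8M0 scale: a pure power of two, with 0xFF meaning NaN."""
    if not 0 <= scale_code <= 0xFF:
        raise ValueError("E8M0 code must fit in 8 bits")
    if scale_code == 0xFF:
        return float("nan")
    return 2.0 ** (scale_code - E8M0_EXP_BIAS)


def decode_block(codes, scale_code):
    """Apply one shared scale to every element code in a block."""
    scale = decode_e8m0(scale_code)
    return [decode_e2m1(c) * scale for c in codes]


if __name__ == "__main__":
    # The eight non-negative magnitudes representable in E2M1.
    print([decode_e2m1(c) for c in range(8)])
    # A scale code of 128 corresponds to a factor of 2.
    print(decode_block([0b0010, 0b0111, 0b1011], 128))
```

The element format carries very little range on its own, so the shared scale supplies the dynamic range for the whole block.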

Compiler / Runtime Improvements

Improved software pipeline handling, including scalar bind replay, scalar
bind-free pipeline annotations, guarded TMA pipeline fixes, and bind-scope
preservation.
Added TL_DISABLE_SHARED_MEMORY_REUSE pass config.
Improved reduction codegen with batched AllReduce and packed add2
vectorization for bf16/fp16 reductions.
Preserved dynamic shared memory aliases in CUDA IR.
Added variable barrier ID support in T.sync_threads().
Cleaned up compiler temp files by default.

Bug Fixes

Fixed multiple TMA issues: Blackwell 1024-byte alignment, descriptor init
placement, 1D TMA store layout inference, quarter swizzle, and invalid
T.tma_copy SIMT fallback.
Fixed SM90 WGMMA B_type typo and updated SM75 kNPerWarp.
Fixed T.gemm() on SM75 (Turing) GPUs and buffer region indexing on SM70.
Fixed ROCm FP4 packed buffer map key and several HIP codegen issues.
Fixed sparse INT8 default metadata dtype, IntrinInfo repr, CUPTI cache flush
filtering, and Roller autotuner behavior on RDNA3 WMMA targets.
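
As a rough illustration of the 1024-byte alignment mentioned in the TMA fixes above, the sketch below rounds byte offsets up to a power-of-two boundary when laying out buffers back to back. What exactly TileLang aligns is not stated in these notes; the sketch assumes byte offsets of buffers inside a single arena, and the helper names are hypothetical.

```python
# Illustrative only: round buffer offsets up to a 1024-byte boundary.
# Assumption: buffers are packed into one byte arena; this is not TileLang code.

TMA_ALIGN_BYTES = 1024  # alignment value named in the Blackwell fix


def align_up(offset, alignment=TMA_ALIGN_BYTES):
    """Return the smallest multiple of `alignment` that is >= `offset`."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (offset + alignment - 1) & ~(alignment - 1)


def place_buffers(sizes, alignment=TMA_ALIGN_BYTES):
    """Assign an aligned start offset to each buffer; return (offsets, total bytes)."""
    offsets = []
    cursor = 0
    for size in sizes:
        cursor = align_up(cursor, alignment)
        offsets.append(cursor)
        cursor += size
    return offsets, cursor


if __name__ == "__main__":
    print(place_buffers([100, 2048, 16]))
```

The padding between buffers is the cost of the stricter alignment: a 100-byte buffer followed by another buffer leaves 924 bytes unused in this layout.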

Docs / Examples

Added software pipeline and cluster TMA programming guides.
Added MXFP8 block-scaled grouped GEMM examples, HISA sparse attention indexer
examples, DeepSeek-V4 operator examples, and LayerNorm example.
Migrated eligible examples to eager style and refreshed target/build
documentation.

Compatibility Notes

Dropped Python 3.9 support; TileLang now requires Python >= 3.10.
Bumped apache-tvm-ffi requirement to >=0.1.10.
Source/build docs now cover Linux and Windows paths.
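
The two version floors above can be checked before upgrading. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and the version parser is deliberately simplified (it reads leading numeric components only, so a pre-release such as 0.1.10rc1 is treated like 0.1.10).

```python
# Pre-upgrade check for the compatibility notes: Python >= 3.10 and
# apache-tvm-ffi >= 0.1.10. Helper names are hypothetical; stdlib only.
import sys
from importlib import metadata

MIN_PYTHON = (3, 10)
MIN_TVM_FFI = (0, 1, 10)


def parse_version(text):
    """Parse leading numeric components of 'X.Y.Z...' into a tuple of ints."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def python_ok(version_info=sys.version_info):
    """True if the interpreter meets the minimum Python version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def tvm_ffi_ok(dist_name="apache-tvm-ffi"):
    """True/False if the distribution is installed, None if it is not found."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
    return parse_version(installed) >= MIN_TVM_FFI


if __name__ == "__main__":
    print("Python >= 3.10:", python_ok())
    print("apache-tvm-ffi >= 0.1.10:", tvm_ffi_ok())
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.1.9" sorts after "0.1.10".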

What's Changed

[AMD][Radeon] Add the Support of RDNA3/RDNA3.5(gfx11) WMMA by @jiawei-real in #2044
[codex] Remove dead transform pass leftovers by @LeiWang1999 in #2083
[Bugfix] Enable .shared::cta in TMA copy paths only on CUDA 12.8+ by @ColmaLiu in #2087
[AMD][gfx950] Add ds_read_tr16_b64 / ds_read_tr8_b64 support for gfx950 LDS transpose reads by @zhangnju in #2085
[AMD][Gfx950] Add the support of 160K LDS and copy.async by @zhangnju in #2058
[BugFix] Relax loop wait and adjust trailing drain behavior in async pipeline tests by @Rachmanino in #2092
[Feature] Block-scaled GEMM support for MXFP8 on Blackwell by @Rachmanino in #1945
[Host CodeGen][Refactor] Cleanup namespace and remove useless C templates by @SiriusNEO in #2091
Add opt-out for prelower semantic checks for DeepSeek V4 Flash on ARM64 by @foraxe in #2094
[Example] Add HISA: hierarchical sparse attention indexer by @xuyufei-a in #2069
[Language] Small cleanup and notes for alloc global by @SiriusNEO in #2100
[Enhancement] Optimize hopper fp8 deepgemm tile size by @Rachmanino in #2103
[CUDA][SM100] Include cuda_fp6.h when emitting FP6 types by @TerminusAkivili in #2102
feat: support cdna4 v_mfma_i32_16x16x64_i8 & v_mfma_i32_32x32x32_i8 by @Paran0idy in #2097
[AMD] [gfx950]Fix multiple HIP codegen bugs to support TileKernel by @zhangnju in #2099
[Language][UX] User-friendly error report when incorrectly indexing buffer by @SiriusNEO in #2104
[TMA] Support FP4 TensorMap TMA copies by @LeiWang1999 in #2107
[Example] Add MXFP8 blockscaled grouped gemm examples with transB support by @Rachmanino in #2098
[Feature] Batched AllReduce for better T.reduce performance by @kurisu6912 in #1976
fix: add missing TvmLogDebugSettings::ParseSpec and VerboseEnabledImpl for TVM_LOG_CUSTOMIZE builds by @kurisu6912 in #2109
[Refactor][Build] Separate CMakeLists into different backends by @SiriusNEO in #2114
[Enhancement][CUDA][SM100] Report unsupported FP6 vector types earlier by @TerminusAkivili in #2117
[AMD][CI issue] add gfx950 guard to fix the CI issues by @zhangnju in #2105
[BugFix] Fix redundant runtime bounds checks for BufferLoad indices in LegalizeSafeMemoryAccess by @SiriusNEO in #2122
[Fix] Unable to allocate shared memory buffer from tail by @Denverjin in #2106
[FIX] Fix kernel file suffix for cutedsl when only target is set by @ur4t in #2128
Change disable_out_of_bound_warning default to True by @kurisu6912 in #2131
[Typo] Fix typos in comments and example README by @yurekami in #2133
[codex] Fix 1D TMA store layout inference by @LeiWang1999 in #2137
[Fix][Build] Disable Cython PEP-489 multi-phase init for the cython wrapper by @yurekami in #2135
fix: TMA alignment to 1024 bytes on Blackwell by @kasper0406 in #2134
[CI] [pre-commit.ci] autoupdate by @pre-commit-ci[bot] in #2149
[TMA] Fix TMA descriptor init placement by @LeiWang1999 in #2151
[Refactor] Refactor register annotation lowering by @Rachmanino in #2088
[Feature][Fix] Extend TCGEN5 F8F6F4 dtype plumbing by @TerminusAkivili in #2126
[Refactor][Backend] Split tl.copy lowering by backend by @LeiWang1999 in #2138
[codex] Split GEMM implementations by backend by @LeiWang1999 in #2153
[Refactor][CodeGen] Refactor CodeGen part for multi-backend decoupling by @SiriusNEO in #2121
[docs] fix TMEM description by @yiakwy-xpu-ml-framework-team in #2152
[docs] update tma description by @yiakwy-xpu-ml-framework-team in #2154
[Feature] Add full Windows support and fix related cross-platform issues by @sepcnt in #2093
[Examples] Add examples for operators in DeepSeek-V4 by @Rachmanino in #2148
[Refactor][Backend] Split remaining TileOps by backend by @LeiWang1999 in #2156
[Examples] Remove duplicated sparse TensorCore examples by @LeiWang1999 in #2162
[Backend] Share common GPU tile op lowerers by @LeiWang1999 in #2163
[Refactor] Move backend stubs out of codegen by @LeiWang1999 in #2164
[Release] Fix scikit-build version provider scope by @LeiWang1999 in #2167
[Refactor] Move backend-specific GEMM implementations and transforms into backend directories by @LeiWang1999 in #2165
[Refactor] Refactor multiple TensorCoreIntrinEmitter to provide atom-level mma control interface by @Rachmanino in #2161
[BugFix] Fix T.gemm() on SM75 (Turing) GPUs (#1992) by @Chennesxu in #2173
Fix float4 storage dtype torch mapping by @zihaomu in #2174
[Build] Fix cross platform CMake and add messages when enabling backends by @SiriusNEO in #2183
[Autotune] Add pipeline, grouped compilation, and multi-GPU benchmark support by @Wazrrr in #2159
[WIP] Handle CuTeDSL FP4 torch dtype by @zihaomu in #2187
Add RDNA gfx1151 ROCm target support by @lhl in #2127
[BugFix] Consider non-local store in external call and SIMT producer for warp specialize by @Rachmanino in #2166
[ROCm] Try to fix ROCm CI error by @zihaomu in #2179
Fix SM70 buffer region indexing by @cklxx in #2191
[Example] Add layernorm example in tilelang by @ighoshsubho in #2168
[Compat] Bump __nv_fp8_e8m0 guard from CUDA 12.6 to 12.8 by @GoldenStain in #2212
[NFC] Align stale fallback comment with CUDA 12.8 guard by @GoldenStain in #2215
[BugFix] Vendor HIP headers and build fat CUDA+ROCm linux wheels by @benenzhu in #2195
[Release] Fix typing issue cause release job failed by @oraluben in #2213
Allow variable barrier id in T.sync_threads() by @bucket-xv in #2197
[Python] Drop Python 3.9 support by @LeiWang1999 in #2218
[Fix][AMD] Fix Roller autotuner for RDNA3 WMMA targets by @lhl in #2208
[Perf] Enable fast math in sparse MLA example by @Rachmanino in #2219
[Backend] Refactor gemm_sp by @botbw in #2048
feat: auto-vectorize bf16/fp16 reduce with packed add2 intrinsics by @kurisu6912 in #2112
[Pipeline] Fix guarded TMA pipeline handling by @LeiWang1999 in #2224
[CuTeDSL] Add PDL codegen and launcher support by @JayceSu98 in #2220
[Enhancement]Support mixed-sign ramp indices in LegalizeNegativeIndex by @TerminusAkivili in #2225
[Transform] Fix CPU while fallback thread lowering by @LeiWang1999 in #2227
[CUDA] Add native SM75 MMA GEMM support for FP16, INT8 and INT4 by @Tokimorphling in #2198
[Feature] Add T.copy_cluster to support TMA multicast and SM-to-SM cluster copy by @He-Jingkai in #1908
[Enhance] Reject default scalar params and support do_not_specialize for autotune by @Rachmanino in #2084
[CUDA][TMA] Add TMA tile::gather4 / tile::scatter4 support by @ighoshsubho in #2129
[TIR][IR] Update to use tirx by @LeiWang1999 in #2216
[ROCm] Match CUDA path debug-info and temp-file plumbing by @yyccli in #2230
[Transform] Preserve bind scope when splitting if statements by @LeiWang1999 in #2232
[Transform] Refactor TIR statement traversal helpers by @LeiWang1999 in #2231
[Transform] Preserve WS prelude liveness ordering by @LeiWang1999 in #2233
[Pipeline] Replay scalar binds in pipelined stages by @LeiWang1999 in #2234
[BugFix] Fix sparse int8 default metadata dtype by @TerminusAkivili in #2229
[BugFix] Fix SM90 WGMMA B_type typo and update SM75 kNPerWarp by @Chennesxu in #2236
[Pipeline] Support scalar bind-free pipeline annotations by @LeiWang1999 in #2237
[CI] Temporarily disable ROCm CI by @LeiWang1999 in #2241
Add TL_DISABLE_SHARED_MEMORY_REUSE pass config by @kurisu6912 in #2228
[Feature] Introduce T.tfloat32 data type support by @ColmaLiu in #2032
[BugFix] Add missing quarter swizzle and disallow T.tma_copy SIMT fallback by @Rachmanino in #2242
[CUDA][IR] Preserve dynamic shared memory aliases by @LeiWang1999 in #2240
[Example] Migrate eligible examples to eager style by @ColmaLiu in #2010
[CuTeDSL] Integrate host codegen call sites by @JayceSu98 in #2221
[Metal] Add Metal GEMM support with simdgroup_matrix MMA by @oraluben in #1869
[tilelang] Fix CUPTI cache flush filtering by @LeiWang1999 in #2244
[BugFix] Fix IntrinInfo repr by @zihaomu in #2175
[CI] Add PyPI release publishing by @LeiWang1999 in #2246
[ROCm] Expose HIP kernel n_regs / n_spills / n_max_threads on JITKernel by @benenzhu in #2211
Use split TVM DLLs on Windows by @sepcnt in #2247
[AMD][CDNA4] Add MXFP4 (FP4 E2M1) support for gfx950 by @zhangnju in #2132
[ENV] Clean up compiler temp files by default by @LeiWang1999 in #2254
[Release] Bump version to 0.1.10 by @LeiWang1999 in #2255
[Bugfix] Fix ROCm FP4 packed buffer map key by @LeiWang1999 in #2256

New Contributors

@jiawei-real made their first contribution in [AMD][Radeon] Add the Support of RDNA3/RDNA3.5(gfx11) WMMA #2044
@foraxe made their first contribution in Add opt-out for prelower semantic checks for DeepSeek V4 Flash on ARM64 #2094
@xuyufei-a made their first contribution in [Example] Add HISA: hierarchical sparse attention indexer #2069
@Denverjin made their first contribution in [Fix] Unable to allocate shared memory buffer from tail #2106
@ur4t made their first contribution in [FIX] Fix kernel file suffix for cutedsl when only target is set #2128
@yurekami made their first contribution in [Typo] Fix typos in comments and example README #2133
@kasper0406 made their first contribution in fix: TMA alignment to 1024 bytes on Blackwell #2134
@yiakwy-xpu-ml-framework-team made their first contribution in [docs] fix TMEM description #2152
@Chennesxu made their first contribution in [BugFix] Fix T.gemm() on SM75 (Turing) GPUs (#1992) #2173
@zihaomu made their first contribution in Fix float4 storage dtype torch mapping #2174
@Wazrrr made their first contribution in [Autotune] Add pipeline, grouped compilation, and multi-GPU benchmark support #2159
@lhl made their first contribution in Add RDNA gfx1151 ROCm target support #2127
@cklxx made their first contribution in Fix SM70 buffer region indexing #2191
@ighoshsubho made their first contribution in [Example] Add layernorm example in tilelang #2168
@JayceSu98 made their first contribution in [CuTeDSL] Add PDL codegen and launcher support #2220
@Tokimorphling made their first contribution in [CUDA] Add native SM75 MMA GEMM support for FP16, INT8 and INT4 #2198
@He-Jingkai made their first contribution in [Feature] Add T.copy_cluster to support TMA multicast and SM-to-SM cluster copy #1908
@yyccli made their first contribution in [ROCm] Match CUDA path debug-info and temp-file plumbing #2230

Full Changelog: v0.1.9...v0.1.10

This discussion was created from the release v0.1.10.
