## Problem
The tile load/store paths in `tile_shared_t::copy_from_global()` and `copy_to_global()` have performance gaps that affect structured types (`vec`, `mat`, structs) and N-D tiles:
- **No coalescing for structured types:** The scalar fallback copies one `T` element per thread per loop iteration. For types like `mat66` (144 bytes), the compiler emits a tight loop of 36 scalar load/store pairs without unrolling, destroying memory-level parallelism. Casting to `float*` and copying at float granularity would give coalesced access and allow the compiler to unroll.
- **Vectorized path limited to 2D cases:** The existing `float4` vectorized path only handles 2D tile loads; it could be generalized to N-D.
- **No `aligned` API parameter for loads/stores:** Users should be able to skip runtime alignment checks, essentially guaranteeing that a tile can be loaded on the vectorized hot path.
- **Expensive coordinate math in the scalar fallback:** Every element requires `coord_from_linear()`, which performs an integer division and modulo per dimension. For N-D tiles this overhead is significant relative to the actual memory transfer. In the 1D case, and for full tiles (where the tile spans the entire inner dimensions), coordinate math is unnecessary but still performed.
## Benchmarks (RTX 5090, `mat66` shared tiles, load + store)
| Path | L2-cached (36 MB) | DRAM (144 MB) |
|---|---|---|
| Vectorized (float4) | 2132 GB/s | 672 GB/s |
| Scalar fallback | 334 GB/s | 323 GB/s |
| Speedup | 6.4x | 2.1x |
## Proposed Changes
### 1. Automatic coalescing for all types
When `sizeof(T) % 4 == 0` (true for all `vec`/`mat` scalar types and most structs), reinterpret both global and shared memory as `float*` and copy at float granularity. Consecutive threads then access consecutive floats (coalesced), and `#pragma unroll` can eliminate the struct-copy loop.
### 2. Vectorized N-D tile loads/stores for all types
Ensure the vectorized path is selected automatically for all types (not just scalars) whenever the alignment and contiguity checks pass. No per-type special handling is needed: `float4` is just a 128-bit transfer unit for copying bytes.
### 3. `aligned=True` for `tile_load` and `tile_store`
Add an `aligned` parameter to both `tile_load` and `tile_store`, allowing users to skip runtime alignment checks on loads and stores when they know their data meets the requirements.
### 4. Optimized scalar fallback math
Replace `coord_from_linear()` (an integer div/mod per dimension per element) with incremental coordinate advancement: compute the initial coordinate once from the thread index, then advance by the block stride with carry propagation. For 1D tiles and for N-D full tiles (where the tile spans the entire inner dimensions), skip coordinate math entirely and use flat linear indexing.