
[REQ] Optimize tile load/store for structured types and N-D tiles #1236

@daedalus5

Description


Problem

The tile load/store paths in tile_shared_t::copy_from_global() and copy_to_global() have performance gaps that affect structured types (vec, mat, structs) and N-D tiles:

  1. No coalescing for structured types: The scalar fallback copies one T element per thread per loop iteration. For types like mat66 (144 bytes), the compiler emits a tight loop of 36 scalar load/store pairs without unrolling, destroying memory-level parallelism. Casting to float* and copying at float granularity would give coalesced access and allow the compiler to unroll.

  2. Vectorized path limited to 2D cases: The existing float4 vectorized path handles only 2D tile loads; it could be generalized to N-D.

  3. No aligned API parameter for loads/stores: Users should be able to assert alignment up front and skip the runtime checks, guaranteeing that a tile load or store takes the vectorized hot path.

  4. Expensive coordinate math in scalar fallback: Every element requires coord_from_linear(), which uses integer division and modulo per dimension. For N-D tiles this overhead is significant relative to the actual memory transfer. In the 1D case and for full tiles (tile spans entire inner dimensions), coordinate math is unnecessary but still performed.

Benchmarks (RTX 5090, mat66 shared tiles, load+store)

| Path | L2-cached (36 MB) | DRAM (144 MB) |
| --- | --- | --- |
| Vectorized (float4) | 2132 GB/s | 672 GB/s |
| Scalar fallback | 334 GB/s | 323 GB/s |
| Speedup | 6.4x | 2.1x |

Proposed Changes

1. Automatic coalescing for all types

When sizeof(T) % 4 == 0 (true for all vec/mat scalar types and most structs), reinterpret both global and shared memory as float* and copy at float granularity. Consecutive threads access consecutive floats (coalesced), and #pragma unroll can eliminate the struct-copy loop.
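As an illustration, the proposed float-granularity copy can be sketched on the host. The type and function names below are hypothetical (Warp's generated code differs), and the serial thread loop stands in for a CUDA thread block:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative stand-in for a 144-byte structured type (e.g. mat66: 36 floats).
struct Mat66 { float m[36]; };
static_assert(sizeof(Mat66) % 4 == 0, "eligible for float-granularity copy");

// Reinterpret both buffers as float* and copy one float per (simulated)
// thread per iteration. On the GPU, consecutive threads would touch
// consecutive floats, giving coalesced access instead of 36 scalar
// load/store pairs per struct per thread.
void copy_float_granularity(Mat66* dst, const Mat66* src,
                            int count, int num_threads) {
    float* d = reinterpret_cast<float*>(dst);
    const float* s = reinterpret_cast<const float*>(src);
    int total = count * (int)(sizeof(Mat66) / sizeof(float));
    for (int tid = 0; tid < num_threads; ++tid)        // simulated thread block
        for (int i = tid; i < total; i += num_threads) // grid-stride over floats
            d[i] = s[i];
}
```

Because the inner loop now moves uniform 4-byte elements with a fixed stride, `#pragma unroll` can apply where the per-struct member copy could not.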

2. Vectorized N-D tile loads/stores for all types

Ensure the vectorized path is automatically selected for all types (not just scalars) whenever alignment and contiguity checks pass. No special handling is needed per type — float4 is just a 128-bit transfer unit for copying bytes.
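A minimal host-side sketch of treating float4 as a plain 128-bit transfer unit for arbitrary bytes (the `Float4` type and `copy_vec4` helper are illustrative, not Warp's actual internals):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 16-byte transfer unit mirroring CUDA's float4.
struct alignas(16) Float4 { float x, y, z, w; };

// Copy raw bytes in 128-bit units with no per-type handling: float4 is just
// a transfer container. The caller must guarantee 16-byte alignment and a
// byte count that is a multiple of 16 (the alignment/contiguity checks).
void copy_vec4(void* dst, const void* src, size_t bytes) {
    assert(bytes % sizeof(Float4) == 0);
    Float4* d = static_cast<Float4*>(dst);
    const Float4* s = static_cast<const Float4*>(src);
    for (size_t i = 0; i < bytes / sizeof(Float4); ++i)
        d[i] = s[i];
}
```

Since the unit is opaque bytes, the same path serves scalars, vec/mat types, and structs alike once the checks pass.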

3. aligned=True for tile_load and tile_store

Add the aligned parameter to tile_load and to tile_store, allowing users to skip runtime alignment checks on stores when they know their data meets the requirements.
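One plausible way to plumb the flag through the generated C++ is a compile-time boolean that elides the runtime pointer check entirely; the template and names below are a hypothetical sketch, not the actual implementation:

```cpp
#include <cassert>
#include <cstdint>

// When Aligned is true, the user has guaranteed 16-byte alignment, so the
// runtime check is compiled out and the vectorized path is taken
// unconditionally. When false, fall back to checking the pointer.
template <bool Aligned>
bool use_vectorized_path(const void* ptr) {
    if (Aligned)
        return true;  // check skipped entirely at compile time
    return (reinterpret_cast<uintptr_t>(ptr) % 16) == 0;
}
```

On the Python side this would surface as the proposed aligned keyword on tile_load and tile_store, with the stated contract that passing it on misaligned data is undefined.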

4. Optimized scalar fallback math

Replace coord_from_linear() (integer div/mod per dimension per element) with incremental coordinate advancement: compute the initial coordinate once from the thread index, then advance by the block stride with carry propagation. For 1D tiles and N-D full tiles (tile spans entire inner dimensions), skip coordinate math entirely and use flat linear indexing.
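The difference between the two indexing schemes can be sketched in plain C++ (the 3-D shape and function names are illustrative):

```cpp
#include <cassert>

// Reference: coordinate from linear index via div/mod per dimension --
// the expensive path the proposal replaces.
void coord_from_linear(int linear, const int shape[3], int coord[3]) {
    for (int d = 2; d >= 0; --d) {
        coord[d] = linear % shape[d];  // one div + one mod per dimension
        linear /= shape[d];
    }
}

// Proposed replacement: compute the starting coordinate once, then advance
// by a fixed stride with carry propagation -- additions and compares only.
void advance_coord(int coord[3], const int shape[3], int step) {
    coord[2] += step;
    for (int d = 2; d > 0; --d) {       // propagate carries toward outer dims
        while (coord[d] >= shape[d]) {  // a large step may carry repeatedly
            coord[d] -= shape[d];
            coord[d - 1] += 1;
        }
    }
}
```

On the GPU the step would be the block stride, so each thread pays a few additions and compares per element instead of a div/mod pair per dimension; the 1D and full-tile cases drop even this and index linearly.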

Metadata

Labels

feature request (Request for something to be added)