## Problem
The tile load/store paths in `tile_shared_t::copy_from_global()` and `copy_to_global()` have performance gaps that affect structured types (`vec`, `mat`, structs) and N-D tiles:
- **No coalescing for structured types:** The scalar fallback copies one `T` element per thread per loop iteration. For types like `mat66` (144 bytes), the compiler emits a tight loop of 36 scalar load/store pairs without unrolling, destroying memory-level parallelism. Casting to `float*` and copying at float granularity would give coalesced access and allow the compiler to unroll.
- **Vectorized path limited to 2D cases:** The existing `float4` vectorized path only handles 2D tile loads; it could be generalized to N-D.
- **No `aligned` API parameter for loads/stores:** Users should be able to skip runtime alignment checks, essentially guaranteeing that a tile can be loaded on the vectorized hot path.
- **Expensive coordinate math in the scalar fallback:** Every element requires `coord_from_linear()`, which performs an integer division and modulo per dimension. For N-D tiles this overhead is significant relative to the actual memory transfer. In the 1D case, and for full tiles (where the tile spans the entire inner dimensions), coordinate math is unnecessary but still performed.
## Benchmarks (RTX 5090, `mat66` shared tiles, load + store)
| Path | L2-cached (36 MB) | DRAM (144 MB) |
|---|---|---|
| Vectorized (float4) | 2132 GB/s | 672 GB/s |
| Scalar fallback | 334 GB/s | 323 GB/s |
| Speedup | 6.4x | 2.1x |
## Proposed Changes
### 1. Automatic coalescing for all types
When `sizeof(T) % 4 == 0` (true for all `vec`/`mat` scalar types and most structs), reinterpret both global and shared memory as `float*` and copy at float granularity. Consecutive threads then access consecutive floats (coalesced), and `#pragma unroll` can eliminate the struct-copy loop.
### 2. Vectorized N-D tile loads/stores for all types
Ensure the vectorized path is selected automatically for all types (not just scalars) whenever the alignment and contiguity checks pass. No per-type special handling is needed: `float4` is just a 128-bit transfer unit for copying bytes.
### 3. `aligned=True` for `tile_load` and `tile_store`
Add an `aligned` parameter to both `tile_load` and `tile_store`, allowing users to skip runtime alignment checks on loads and stores when they know their data meets the requirements.
### 4. Optimized scalar fallback math
Replace `coord_from_linear()` (an integer div/mod per dimension per element) with incremental coordinate advancement: compute the initial coordinate once from the thread index, then advance by the block stride with carry propagation. For 1D tiles and for N-D full tiles (where the tile spans the entire inner dimensions), skip coordinate math entirely and use flat linear indexing.