ganeshnj
diff --git a/README.md b/README.md
Lines changed: 3 additions & 2 deletions
diff --git a/bindings/python_bindings.cpp b/bindings/python_bindings.cpp
Lines changed: 14 additions & 0 deletions
diff --git a/docs/15-shared-memory.md b/docs/15-shared-memory.md
Lines changed: 121 additions & 0 deletions
diff --git a/examples/tiled_matvec_test.py b/examples/tiled_matvec_test.py
Lines changed: 155 additions & 0 deletions
@@ -132,8 +132,9 @@ Expected: ~3x fewer kernel launches, ~3x speedup.
 
 #### Layer 4 — Tiled matmul with shared memory (C++ MLIR change)
 
-- [ ] Tiled matmul with shared memory + barriers — reuse each weight row across multiple threads via shared memory; impactful at n_embd >= 64
-- [ ] Requires adding `gpu.barrier` and shared memory ops to the MLIR pipeline
+- [x] Shared memory primitives — `tt.sync()`, `tt.shared_store(idx, val)`, `tt.shared_load(idx)` — [`examples/tiled_matvec_test.py`](examples/tiled_matvec_test.py), [`docs/15-shared-memory.md`](docs/15-shared-memory.md)
+- [x] Tiled 2-row-per-block matvec demo using shared memory for x vector reuse
+- [ ] Full tiled GEMM (requires `tt.for_range` loop support → `scf.for` in MLIR)
 
 #### Layer 5 — Flash Attention (algorithmic, longer sequences)
 
 
@@ -141,6 +141,20 @@ PYBIND11_MODULE(_tiny_ton_core, m) {
            [](tinyton::IRBuilder &self, PyValue cond, int64_t skip) {
              self.emitBranchZero(cond.val, skip);
            })
+      .def("emit_sync", [](tinyton::IRBuilder &self) { self.emitSync(); })
+      .def("emit_shared_store",
+           [](tinyton::IRBuilder &self, PyValue idx, PyValue val,
+              int64_t bufferSize) {
+             self.emitSharedStore(idx.val, val.val, bufferSize);
+           },
+           py::arg("idx"), py::arg("val"), py::arg("buffer_size"))
+      .def("emit_shared_load",
+           [](tinyton::IRBuilder &self, PyValue idx, int64_t bufferSize,
+              const std::string &dtype) {
+             auto et = tinyton::elementTypeFromString(dtype);
+             return PyValue{self.emitSharedLoad(idx.val, bufferSize, et)};
+           },
+           py::arg("idx"), py::arg("buffer_size"), py::arg("dtype") = "f32")
       .def("emit_ret", [](tinyton::IRBuilder &self) { self.emitRet(); })
       .def("dump_mlir",
            [](tinyton::IRBuilder &self) {
 
@@ -0,0 +1,121 @@
+# Shared Memory: tt.sync / tt.shared_store / tt.shared_load
+
+## The problem
+
+In the current `linear_kernel`, each block computes one output row of `y = W @ x`.
+Every block loads the full `x` vector from global memory independently:
+
+```
+Block 0: load x[0..N-1] from global → dot with W[0,:]
+Block 1: load x[0..N-1] from global → dot with W[1,:]
+Block 2: load x[0..N-1] from global → dot with W[2,:]
+...
+```
+
+The same `x` data is read `out_features` times from global memory. On real
+GPUs the L2 cache usually absorbs this for small vectors, but for larger
+vectors the redundant reads become a bottleneck.
+
+Shared memory addresses this by loading `x` once per block and letting all
+threads in that block reuse it from fast on-chip storage.
+
+## New primitives
+
+```python
+tt.sync()                    # barrier — all threads in the block wait here
+tt.shared_store(idx, val)    # write val to shared memory at position idx
+val = tt.shared_load(idx)    # read from shared memory at position idx
+```
+
+Shared memory is **per-block**: each block has its own buffer, sized
+automatically to `BLOCK` (from `tt.arange(0, BLOCK)`).
+
+## Execution model
+
+```
+Thread 0: load x[0] from global → shared_store(0, x[0])
+Thread 1: load x[1] from global → shared_store(1, x[1])
+...
+Thread N-1: load x[N-1] from global → shared_store(N-1, x[N-1])
+
+          ╔═══════════╗
+          ║  tt.sync() ║  ← all threads wait here
+          ╚═══════════╝
+
+Thread 0: x_sh = shared_load(0..N-1)  → compute dot with W row
+Thread 1: x_sh = shared_load(0..N-1)  → compute dot with W row
+```
+
+## Example: tiled 2-row-per-block matvec
+
+Instead of 1 output row per block, each block computes 2 rows. The `x` vector
+is loaded into shared memory once and reused for both dot products:
+
+```python
+@tt.jit
+def tiled_linear_kernel(W_ptr, x_ptr, y_ptr, in_features, BLOCK: tt.constexpr):
+    pid  = tt.program_id(0)
+    tid  = tt.arange(0, BLOCK)
+    mask = tid < in_features
+
+    x_val = tt.load(x_ptr + tid, mask=mask)
+    tt.shared_store(tid, x_val)
+    tt.sync()
+    x_sh = tt.shared_load(tid)
+
+    w0   = tt.load(W_ptr + (pid * 2)     * in_features + tid, mask=mask)
+    w1   = tt.load(W_ptr + (pid * 2 + 1) * in_features + tid, mask=mask)
+    dot0 = tt.reduce_sum(w0 * x_sh)
+    dot1 = tt.reduce_sum(w1 * x_sh)
+    tt.store(y_ptr + pid * 2,     dot0)
+    tt.store(y_ptr + pid * 2 + 1, dot1)
+```
+
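+Launch with `grid = (out_features // 2,)`: half as many blocks, each doing twice the work. `out_features` must be even, because the kernel writes rows `2 * pid` and `2 * pid + 1` unconditionally.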
+
+## MLIR lowering
+
+| Python | TinyTon IR | GPU dialect |
+|---|---|---|
+| `tt.sync()` | `tinyton.sync` | `gpu.barrier` |
+| `tt.shared_store(idx, val)` | `tinyton.shared_store %idx, %val size 64` | `memref.store` to workgroup memref |
+| `tt.shared_load(idx)` | `tinyton.shared_load %idx size 64` | `memref.load` from workgroup memref |
+
+The `size` attribute is the buffer size, baked in at compile time from
+`block_size` (captured via `tt.arange`). The GPU lowering allocates a
+`memref<size x f32, #gpu.address_space<workgroup>>` as a second workgroup
+attribution (separate from the 32-element buffer used by `reduce_sum`/
+`reduce_max`).
+
+## Simulator
+
+The simulator maintains a 256-element `sharedMem` vector per block. The new instructions
+reuse existing opcodes and are distinguished from `LDR`, `STR`, and `RET` by flag bits:
+
+- `SHMEM_STR`: opcode 0x8 with rd=1 (global STR has rd=0)
+- `SHMEM_LDR`: opcode 0x7 with rt=1 (global LDR has rt=0)
+- `SYNC`: opcode 0xF with imm=1 (RET has imm=0); pauses all threads and
+  resumes them in the next phase, matching GPU barrier semantics.
+
+## Files changed
+
+| File | Change |
+|---|---|
+| `include/tiny-ton/Dialect/TinyTon/TinyTonOps.td` | `SyncOp`, `SharedStoreOp`, `SharedLoadOp` |
+| `include/tiny-ton/IR/Builder.h` | `emitSync`, `emitSharedStore`, `emitSharedLoad` |
+| `lib/IR/Builder.cpp` | Implementation |
+| `lib/Conversion/TinyTonToGPU.cpp` | Pre-scan for buffer size, second workgroup memref, lowering |
+| `lib/Compiler/CodeGen.cpp` | Simulator instruction encoding |
+| `lib/Runtime/Simulator.cpp` | `sharedMem` buffer, `StepResult::Sync`, flag-based dispatch |
+| `bindings/python_bindings.cpp` | Python bindings |
+| `python/tiny_ton/jit.py` | `_BUILTINS`, `_eval_call` handlers |
+| `python/tiny_ton/__init__.py` | Stubs |
+| `examples/tiled_matvec_test.py` | Round-trip + tiled matvec tests |
+| `docs/15-shared-memory.md` | This design doc |
+
+## What this does NOT include
+
+Full tiled GEMM requires iterating over K tiles inside the kernel via
+`tt.for_range`, which generates `scf.for` in MLIR. That is a separate
+future addition. This change provides the shared memory building blocks
+(barrier, store, load) that tiled GEMM needs.
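
The store, barrier, load sequence described in `docs/15-shared-memory.md` can be modelled in plain Python without tiny-ton. The sketch below is only an illustration under stated assumptions: `run_block` and `tiled_matvec` are hypothetical names, each thread is a generator that pauses at the barrier, and the scheduler resumes all threads in a second phase, which is one way to read the "pauses all threads and resumes them in the next phase" behaviour of `SYNC`. It is not the simulator's actual implementation.

```python
# Hypothetical pure-Python model of one block; these names are not tiny-ton APIs.

def run_block(pid, W, x, y, in_features, block):
    shared = [0.0] * block                      # per-block buffer, sized to BLOCK
    partial = [[0.0] * block, [0.0] * block]    # per-thread products for the two rows

    def thread(tid):
        active = tid < in_features              # mask = tid < in_features
        # Phase 1: masked global load, then shared_store(tid, x_val).
        shared[tid] = x[tid] if active else 0.0
        yield                                   # tt.sync(): pause at the barrier
        # Phase 2: shared_load(tid), then multiply with both weight rows.
        x_sh = shared[tid]
        for r in range(2):
            w = W[2 * pid + r][tid] if active else 0.0
            partial[r][tid] = w * x_sh

    threads = [thread(tid) for tid in range(block)]
    for t in threads:
        next(t)            # run every thread up to the barrier
    for t in threads:
        next(t, None)      # resume every thread; the generator then finishes
    # reduce_sum over the block, one result per output row.
    y[2 * pid] = sum(partial[0])
    y[2 * pid + 1] = sum(partial[1])


def tiled_matvec(W, x, block):
    out_features, in_features = len(W), len(x)
    assert out_features % 2 == 0 and block >= in_features
    y = [0.0] * out_features
    for pid in range(out_features // 2):        # grid = (out_features // 2,)
        run_block(pid, W, x, y, in_features, block)
    return y


W = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
x = [1, 2, 3]
result = tiled_matvec(W, x, block=4)
print(result)
```

In this model `x` is copied into `shared` once per block and read back for both rows, so the number of global reads of `x` is halved relative to one row per block.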
@@ -0,0 +1,155 @@
+"""Test shared memory ops: tt.sync, tt.shared_store, tt.shared_load.
+
+Verifies:
+  - Basic shared memory round-trip: store then load
+  - Tiled 2-row-per-block matvec using shared memory for x vector reuse
+  - Correctness against NumPy for various matrix sizes
+"""
+
+import sys
+import os
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'python'))
+
+import numpy as np
+import tiny_ton as tt
+
+
+def compare(name, got, expected, atol=1e-4):
+    ok = np.allclose(got, expected, atol=atol)
+    print(f'  {name}: {"PASS" if ok else "FAIL"}')
+    if not ok:
+        print(f'    got:      {got}')
+        print(f'    expected: {expected}')
+    return ok
+
+
+# ---------------------------------------------------------------------------
+# Kernel 1: shared memory round-trip (store → sync → load)
+# ---------------------------------------------------------------------------
+
+@tt.jit
+def shmem_roundtrip(src, dst, N, BLOCK: tt.constexpr):
+    tid  = tt.arange(0, BLOCK)
+    mask = tid < N
+    val  = tt.load(src + tid, mask=mask)
+    tt.shared_store(tid, val)
+    tt.sync()
+    out  = tt.shared_load(tid)
+    tt.store(dst + tid, out, mask=mask)
+
+
+# ---------------------------------------------------------------------------
+# Kernel 2: single-row linear (baseline, no shared memory)
+# ---------------------------------------------------------------------------
+
+@tt.jit
+def linear_kernel(W_ptr, x_ptr, y_ptr, in_features, BLOCK: tt.constexpr):
+    pid = tt.program_id(0)
+    tid = tt.arange(0, BLOCK)
+    mask = tid < in_features
+    w = tt.load(W_ptr + pid * in_features + tid, mask=mask)
+    x = tt.load(x_ptr + tid, mask=mask)
+    dot = tt.reduce_sum(w * x)
+    tt.store(y_ptr + pid, dot)
+
+
+# ---------------------------------------------------------------------------
+# Kernel 3: tiled 2-row-per-block linear (shared memory for x reuse)
+# ---------------------------------------------------------------------------
+
+@tt.jit
+def tiled_linear_kernel(W_ptr, x_ptr, y_ptr, in_features, BLOCK: tt.constexpr):
+    pid  = tt.program_id(0)
+    tid  = tt.arange(0, BLOCK)
+    mask = tid < in_features
+
+    x_val = tt.load(x_ptr + tid, mask=mask)
+    tt.shared_store(tid, x_val)
+    tt.sync()
+    x_sh = tt.shared_load(tid)
+
+    w0   = tt.load(W_ptr + (pid * 2)     * in_features + tid, mask=mask)
+    w1   = tt.load(W_ptr + (pid * 2 + 1) * in_features + tid, mask=mask)
+    dot0 = tt.reduce_sum(w0 * x_sh)
+    dot1 = tt.reduce_sum(w1 * x_sh)
+    tt.store(y_ptr + pid * 2,     dot0)
+    tt.store(y_ptr + pid * 2 + 1, dot1)
+
+
+# ---------------------------------------------------------------------------
+# Tests
+# ---------------------------------------------------------------------------
+
+def test_shmem_roundtrip():
+    print('--- shared memory round-trip ---')
+    all_ok = True
+    for N in [4, 16, 27]:
+        x = np.random.randn(N).astype(np.float32)
+        out = np.zeros(N, dtype=np.float32)
+        shmem_roundtrip[(1,)](x.copy(), out, N, N)
+        ok = compare(f'N={N}', out, x)
+        all_ok = all_ok and ok
+    return all_ok
+
+
+def test_single_row_linear():
+    print('--- single-row linear (baseline) ---')
+    all_ok = True
+    for out_features, in_features in [(4, 4), (8, 16), (6, 27)]:
+        W = np.random.randn(out_features, in_features).astype(np.float32)
+        x = np.random.randn(in_features).astype(np.float32)
+        expected = W @ x
+        y = np.zeros(out_features, dtype=np.float32)
+        BLOCK = max(in_features, 4)
+        linear_kernel[(out_features,)](
+            W.flatten().copy(), x.copy(), y, in_features, BLOCK)
+        ok = compare(f'{out_features}x{in_features}', y, expected)
+        all_ok = all_ok and ok
+    return all_ok
+
+
+def test_tiled_linear():
+    print('--- tiled 2-row-per-block linear (shared memory) ---')
+    all_ok = True
+    for out_features, in_features in [(4, 4), (8, 16), (6, 27)]:
+        W = np.random.randn(out_features, in_features).astype(np.float32)
+        x = np.random.randn(in_features).astype(np.float32)
+        expected = W @ x
+        y = np.zeros(out_features, dtype=np.float32)
+        BLOCK = max(in_features, 4)
+        n_blocks = out_features // 2
+        tiled_linear_kernel[(n_blocks,)](
+            W.flatten().copy(), x.copy(), y, in_features, BLOCK)
+        ok = compare(f'{out_features}x{in_features} tiled', y, expected)
+        all_ok = all_ok and ok
+    return all_ok
+
+
+def test_tiled_vs_baseline():
+    print('--- tiled vs baseline match ---')
+    W = np.random.randn(8, 16).astype(np.float32)
+    x = np.random.randn(16).astype(np.float32)
+
+    y_baseline = np.zeros(8, dtype=np.float32)
+    linear_kernel[(8,)](W.flatten().copy(), x.copy(), y_baseline, 16, 16)
+
+    y_tiled = np.zeros(8, dtype=np.float32)
+    tiled_linear_kernel[(4,)](W.flatten().copy(), x.copy(), y_tiled, 16, 16)
+
+    return compare('8x16 baseline==tiled', y_tiled, y_baseline)
+
+
+if __name__ == '__main__':
+    np.random.seed(42)
+    results = [
+        test_shmem_roundtrip(),
+        test_single_row_linear(),
+        test_tiled_linear(),
+        test_tiled_vs_baseline(),
+    ]
+    print()
+    if all(results):
+        print('All tests PASSED')
+    else:
+        print('SOME TESTS FAILED')
+        sys.exit(1)
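
The test cases above all use an even `out_features`, so `n_blocks = out_features // 2` covers every row. For an odd row count, one possible host-side workaround is to pad `W` with a zero row before launching and drop the extra output afterwards. The sketch below assumes that zero-padding is acceptable; `pad_rows_to_even`, `two_rows_per_block`, and `matvec_any_rows` are hypothetical helpers, and `two_rows_per_block` is a pure-Python stand-in for the kernel launch rather than a call into tiny-ton.

```python
# Hypothetical host-side helpers; two_rows_per_block stands in for the kernel launch.

def pad_rows_to_even(W, in_features):
    """Return a copy of W with a zero row appended if needed, plus the grid size."""
    rows = [list(row) for row in W]
    if len(rows) % 2 == 1:
        rows.append([0.0] * in_features)
    return rows, len(rows) // 2


def two_rows_per_block(W, x, y, n_blocks):
    """Reference behaviour: block pid writes rows 2*pid and 2*pid + 1."""
    for pid in range(n_blocks):
        for r in (2 * pid, 2 * pid + 1):
            y[r] = sum(w * v for w, v in zip(W[r], x))


def matvec_any_rows(W, x):
    out_features = len(W)
    padded, n_blocks = pad_rows_to_even(W, len(x))
    y = [0.0] * len(padded)          # output buffer sized for the padded matrix
    two_rows_per_block(padded, x, y, n_blocks)
    return y[:out_features]          # discard the padding row's result


odd_result = matvec_any_rows([[1, 2], [3, 4], [5, 6]], [1, 1])
print(odd_result)
```

An alternative would be a row-bounds mask inside the kernel; whether the current `tt.load`/`tt.store` masks can express that is not something this diff shows.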