A production-ready, safety-first GEMM (General Matrix Multiply) implementation achieving 171.9 GFLOPS on an Intel i9-14900 (single core), a 163× speedup over a naive implementation. Designed for embedded systems, numerical computing, and performance-critical applications where reliability matters as much as speed.
Intel i9-14900 (Single Core)

```
Naive:     2036.722 ms  (1.1 GFLOPS)
Optimized:   12.495 ms  (171.9 GFLOPS)
Speedup:    163.00× faster
```

Benchmark Results (512×512×512)

```
Naive:     52.0 ms  (5.2 GFLOPS)
Optimized:  1.6 ms  (163.8 GFLOPS)
Speedup:   31.74×
```
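For reference, the GFLOPS figures follow from the standard GEMM operation count of 2·M·K·N floating-point operations divided by wall time; e.g., for the 512³ naive case:

```
GFLOPS = 2·M·K·N / (time · 10⁹)
2·512³ = 268,435,456 FLOPs
Naive:  268.4e6 FLOPs / 0.0520 s ≈ 5.2 GFLOPS
```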
Achieves consistent performance across irregular sizes and non-square matrices through adaptive blocking and aspect-ratio-aware optimization.
- No `alignas` stack arrays — eliminates segfaults from misaligned stacks
- No masked stores — replaced with safe scalar loops (2-5% overhead on edge cases; see the sketch after this list)
- Unaligned loads/stores for all temporary buffers — defensive, portable
- Debug assertions for critical invariants (compiled out in release builds)
- Zero undefined behavior — validated against UBSan/ASan
- Comprehensive test suite — 5 test suites + performance benchmark
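To illustrate the masked-store replacement above, here is a minimal sketch (hypothetical helper, not the library's actual code) of the policy: full 8-wide AVX2 vectors in the loop body, then a plain scalar loop for the tail instead of `_mm256_maskstore_ps`:

```c
#include <immintrin.h>

/* Hypothetical sketch: scaled accumulate dst += alpha * src without masked
 * stores. Full 8-float vectors use AVX2/FMA; the tail is a scalar loop. */
static void axpy_edge_safe(float *dst, const float *src, float alpha, int n) {
    const __m256 va = _mm256_set1_ps(alpha);
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 s = _mm256_loadu_ps(src + i);   /* unaligned loads: defensive */
        __m256 d = _mm256_loadu_ps(dst + i);
        _mm256_storeu_ps(dst + i, _mm256_fmadd_ps(va, s, d));
    }
    for (; i < n; ++i)                         /* safe scalar tail */
        dst[i] += alpha * src[i];
}
```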
- Multi-tier architecture with specialized paths for small, medium, and large matrices
- SIMD-optimized packing (1.5-2× faster than scalar)
- K-loop unrolling with interleaved FMAs to break dependency chains
- Software pipelining to hide memory latency
- Pre-computed tile counts — no division in hot paths
- Adaptive blocking based on matrix aspect ratios (tall/wide/deep)
- Beta pre-scaling — eliminates redundant operations (sketched after this list)
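The beta pre-scaling item deserves a sketch: C is scaled by beta once before the blocked loops run, so the hot micro-kernels only ever accumulate. A minimal illustration (hypothetical helper name, assuming IEEE-754 floats):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: scale C by beta once up front so kernels never
 * touch beta again. memset is valid for 0.0f on IEEE-754 targets. */
static void prescale_c(float *C, size_t count, float beta) {
    if (beta == 1.0f) return;                 /* nothing to do */
    if (beta == 0.0f) { memset(C, 0, count * sizeof(float)); return; }
    for (size_t i = 0; i < count; i++)
        C[i] *= beta;
}
```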
- 6 test executables covering all code paths
- Cross-platform (Windows/Linux, MSVC/GCC/Clang)
- AVX2/FMA3 optimized with SSE2 fallback paths
- Static and dynamic memory modes
- AddressSanitizer and Valgrind integration
- Configurable validation levels (0/1/2)
```bash
# Clone repository
git clone https://github.com/yourusername/gemm.git
cd gemm

# Build in Release mode (recommended for performance)
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# Run all tests
cd build && ctest --output-on-failure

# Run benchmark
./test_benchmark
```

Requirements:

- CPU: AVX2 + FMA3 (Intel Haswell/AMD Excavator or newer)
- Compiler: GCC 7+, Clang 8+, or MSVC 2019+
- CMake: 3.16 or newer
- OS: Linux, Windows, macOS
```bash
# Build with validation (debug builds)
cmake -B build -DCMAKE_BUILD_TYPE=Debug -DGEMM_VALIDATION_LEVEL=2

# Build with AddressSanitizer
cmake -B build -DENABLE_ASAN=ON

# Build with Valgrind support (Linux)
cmake -B build -DENABLE_VALGRIND=ON

# Build without validation (release builds)
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGEMM_VALIDATION_LEVEL=0
```

| Level | Description | Use Case |
|---|---|---|
| 0 | No validation | Production/Release builds |
| 1 | Basic assertions | Development builds |
| 2 | Full instrumentation | Debug/Testing |
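A compile-time switch like this is conventionally implemented with preprocessor gating. The sketch below is illustrative (the macro names are assumptions, not the library's actual API):

```c
#include <assert.h>

/* Hypothetical sketch of how GEMM_VALIDATION_LEVEL could gate checks. */
#if GEMM_VALIDATION_LEVEL >= 1
#  define GEMM_ASSERT(cond) assert(cond)       /* level 1: basic assertions */
#else
#  define GEMM_ASSERT(cond) ((void)0)          /* level 0: compiled out     */
#endif

#if GEMM_VALIDATION_LEVEL >= 2
#  define GEMM_CHECK_RANGE(i, n) GEMM_ASSERT((i) >= 0 && (i) < (n))  /* level 2 */
#else
#  define GEMM_CHECK_RANGE(i, n) ((void)0)
#endif
```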
The project includes 6 test executables covering all functionality:
- `./test_gemm_small` — Tests Tier 1 fixed-size kernels (4×4, 6×6, 8×8, etc.)
- `./test_gemm_planning` — Tests adaptive blocking and kernel selection
- `./test_gemm_large` — Unit tests for all micro-kernels (1×8, 4×8, 8×8, 8×6, etc.)
- `./test_gemm_validated` — Full instrumentation with validation level 2
- `./test_gemm_execute` — Integration tests for gemm_large.c orchestration
- `./test_benchmark` — Compares the optimized implementation against the naive reference
- `./test_all` — Runs all test suites in sequence

The build system provides convenient targets:
```bash
# Run all tests via CTest
make run_tests

# Run individual test suites
make run_small      # Small kernels (Tier 1)
make run_planning   # Planning module
make run_kernels    # Individual kernels
make run_validated  # Validated tests
make run_execute    # Execution pipeline
make run_unified    # All tests in sequence

# Performance benchmark
make run_benchmark

# Valgrind integration (if enabled)
make valgrind_execute
make valgrind_validated
```

```bash
# Run all tests
ctest --output-on-failure
# Run specific test
ctest -R SmallKernels
ctest -R Planning
ctest -R KernelTests
ctest -R ValidatedKernels
ctest -R ExecutePipeline
ctest -R AllTests
# Verbose output
ctest -V
```

```
┌─────────────────────────────────────────────────────────┐
│ Tier 1: Small Fixed-Size Kernels │
│ • 4×4, 6×6, 8×8 (square, fixed K) │
│ • 8×4, 4×8, 8×6, 6×8 (rectangular, variable K) │
│ • K-outer loops with pre-scaled operands │
│ • Handles M,N ≤ 16, K ≤ 64, FLOPs ≤ 8192 │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Tier 2: Blocked Execution (NC→KC→MC) │
│ • Maximizes L2 cache reuse for B panels │
│ • SIMD-optimized packing with alpha pre-scaling │
│ • Pre-selected kernels for full tiles │
│ • Adaptive blocking (64-512 based on aspect ratio) │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Micro-Kernels (AVX2/FMA3) │
│ • 1×8, 4×8, 8×8, 8×6, 8×16, 16×8, 16×6 variants │
│ • K-unroll by 2 with interleaved computation │
│ • Register pressure: ≤16 YMM (carefully managed) │
│ • Separate ADD/STORE variants for beta handling │
└─────────────────────────────────────────────────────────┘
```
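The last bullet (separate ADD/STORE variants) pairs with beta pre-scaling: a kernel either overwrites its C tile or accumulates into it, and never multiplies by beta itself. A hedged sketch of the two epilogues (illustrative names, not the actual kernels):

```c
#include <immintrin.h>

/* Illustrative epilogues: STORE overwrites C (beta == 0), ADD accumulates
 * into C that was already pre-scaled by beta. */
static inline void epilogue_store(float *c, __m256 acc) {
    _mm256_storeu_ps(c, acc);                                    /* C  = A*B */
}
static inline void epilogue_add(float *c, __m256 acc) {
    _mm256_storeu_ps(c, _mm256_add_ps(_mm256_loadu_ps(c), acc)); /* C += A*B */
}
```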
K-Loop Unrolling (2×)

```c
// Before: one sequential dependency chain
for (k = 0; k < K; k++) { acc += A[k] * B[k]; }

// After: two accumulators, interleaved FMAs (better ILP)
for (k = 0; k < K; k += 2) {
    acc0 = fma(A[k+0], B[k+0], acc0);
    acc1 = fma(A[k+1], B[k+1], acc1);  // independent of acc0's chain
}
acc = acc0 + acc1;
```
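In intrinsics form, the two-accumulator idea looks roughly like this (a self-contained sketch, not the library's actual micro-kernel; assumes n is a multiple of 16):

```c
#include <immintrin.h>

/* Sketch: dot product with two independent FMA chains. */
static float dot_two_chains(const float *a, const float *b, int n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    for (int k = 0; k < n; k += 16) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + k),
                               _mm256_loadu_ps(b + k), acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + k + 8),
                               _mm256_loadu_ps(b + k + 8), acc1);
    }
    __m256 s = _mm256_add_ps(acc0, acc1);          /* merge the chains */
    __m128 r = _mm_add_ps(_mm256_castps256_ps128(s),
                          _mm256_extractf128_ps(s, 1));
    r = _mm_hadd_ps(r, r);
    r = _mm_hadd_ps(r, r);
    return _mm_cvtss_f32(r);
}
```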
Software Pipelining

```c
// Load the next iteration's operands while computing the current one
a0 = load(A + k);
a1 = load(A + k + 1);  // load ahead
b0 = load(B + k);
b1 = load(B + k + 1);  // load ahead
// ... compute with both already loaded
```
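A concrete rotated-loop version of the same idea (a self-contained sketch under the assumption that K is a multiple of 8; not the real kernel):

```c
#include <immintrin.h>

/* Illustrative rotated loop: loads for step k are issued while the FMA
 * for the previous step executes, hiding load latency. */
static __m256 dot8_pipelined(const float *A, const float *B, int K) {
    __m256 acc   = _mm256_setzero_ps();
    __m256 a_cur = _mm256_loadu_ps(A);
    __m256 b_cur = _mm256_loadu_ps(B);
    for (int k = 8; k < K; k += 8) {
        __m256 a_next = _mm256_loadu_ps(A + k);    /* load ahead */
        __m256 b_next = _mm256_loadu_ps(B + k);
        acc = _mm256_fmadd_ps(a_cur, b_cur, acc);  /* compute current */
        a_cur = a_next;
        b_cur = b_next;
    }
    return _mm256_fmadd_ps(a_cur, b_cur, acc);     /* drain the pipeline */
}
```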
SIMD Packing for A

```c
// Gather 8 rows in one operation
__m256 v = _mm256_set_ps(
    src[7*K], src[6*K], /* ... */, src[0*K]
);
```
Pre-Scaled Alpha

```c
// Traditional: 16K alpha multiplies for a 4×4 tile
for (i,j,k) C[i,j] += alpha * A[i,k] * B[k,j];

// Optimized: 4K multiplies (alpha absorbed into B once)
B_scaled = alpha * B;
for (i,j,k) C[i,j] += A[i,k] * B_scaled[k,j];
```

Basic usage:

```c
#include "gemm.h"
// Allocate aligned matrices
float *A = gemm_aligned_alloc(64, M * K * sizeof(float));
float *B = gemm_aligned_alloc(64, K * N * sizeof(float));
float *C = gemm_aligned_alloc(64, M * N * sizeof(float));
// Compute: C = alpha*A*B + beta*C
gemm_auto(C, A, B, M, K, N, alpha, beta);
// Cleanup
gemm_aligned_free(A);
gemm_aligned_free(B);
gemm_aligned_free(C);
```

```c
// Create plan once
gemm_plan_t *plan = gemm_plan_create(M, K, N);
// Execute multiple times
for (int iter = 0; iter < 1000; iter++) {
gemm_execute_plan(plan, C, A, B, alpha, beta);
}
gemm_plan_destroy(plan);
```

```c
// Static workspace (faster, limited size)
gemm_static(C, A, B, M, K, N, alpha, beta);
// Dynamic allocation (handles any size)
gemm_dynamic(C, A, B, M, K, N, alpha, beta);
// Check static workspace limit
if (gemm_fits_static(M, K, N)) {
// Can use static mode
}
```

Each micro-kernel is carefully designed to stay within AVX2's 16 YMM register limit:
| Kernel | Accumulators | Temps | Total | Status |
|---|---|---|---|---|
| 8×8 | 8 | 3 | 11 | ✅ Safe |
| 8×16 | 16 | 4 | 20 | Split (composite) |
| 16×8 | 16 | 4 | 20 | Split (composite) |
Composite kernels automatically split into multiple calls to avoid register spilling.
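A hedged sketch of what "split into multiple calls" means in practice (the function names and signature are illustrative, not the library's API):

```c
/* Hypothetical composite: a 16x8 tile as two 8x8 micro-kernel calls, so
 * each call stays within AVX2's 16 YMM registers. */
typedef void (*kernel_8x8_fn)(float *C, const float *A, const float *B,
                              int K, int ldc, int lda);

static void kernel_16x8_composite(kernel_8x8_fn k8x8, float *C,
                                  const float *A, const float *B,
                                  int K, int ldc, int lda) {
    k8x8(C,           A,           B, K, ldc, lda);  /* rows 0..7  */
    k8x8(C + 8 * ldc, A + 8 * lda, B, K, ldc, lda);  /* rows 8..15 */
}
```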
The planner selects block sizes based on matrix shape:
```c
// Tall matrices (M >> N): small NC, large MC
if ((double)M / N > 3.0) {
    MC = 256; KC = 128; NC = 128;
}
// Wide matrices (N >> M): small MC, large NC
else if ((double)M / N < 0.33) {
    MC = 64; KC = 128; NC = 512;
}
// Deep matrices (K >> M,N): large KC
else if ((double)K / N > 4.0) {
    MC = 64; KC = 512; NC = 128;
}
```

NC→KC→MC Loop Order maximizes L2 reuse:
- Pack B once per KC×NC tile
- Reuse packed B across all MC tiles
- Minimize memory traffic (pack overhead ~5%)
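The same structure in code, as a minimal self-contained sketch (a scalar inner loop stands in for the packed SIMD micro-kernels; packing and workspace management are omitted):

```c
#include <stddef.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Row-major C += A*B with the NC -> KC -> MC blocking order. */
static void gemm_blocked_sketch(float *C, const float *A, const float *B,
                                int M, int K, int N,
                                int MC, int KC, int NC) {
    for (int jc = 0; jc < N; jc += NC)           /* NC: column panel of B/C */
        for (int pc = 0; pc < K; pc += KC)       /* KC: depth slice         */
            for (int ic = 0; ic < M; ic += MC)   /* MC: row panel of A/C    */
                for (int j = jc; j < MIN(jc + NC, N); j++)
                    for (int p = pc; p < MIN(pc + KC, K); p++)
                        for (int i = ic; i < MIN(ic + MC, M); i++)
                            C[(size_t)i * N + j] += A[(size_t)i * K + p]
                                                  * B[(size_t)p * N + j];
}
```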
All kernels validated against:
- Naive reference implementation (correctness)
- AddressSanitizer (heap/stack overflow)
- UndefinedBehaviorSanitizer (UB detection)
- Valgrind (memory leaks)
```bash
# Debug build with full validation
cmake -B build-debug -DCMAKE_BUILD_TYPE=Debug \
    -DGEMM_VALIDATION_LEVEL=2 -DENABLE_ASAN=ON
cmake --build build-debug

# Run validated tests with ASan
cd build-debug
./test_gemm_validated

# Release build for benchmarking
cmake -B build-release -DCMAKE_BUILD_TYPE=Release \
    -DGEMM_VALIDATION_LEVEL=0
cmake --build build-release

# Run benchmark
cd build-release
./test_benchmark

# Memory leak check (Linux)
cmake -B build-valgrind -DENABLE_VALGRIND=ON
cmake --build build-valgrind
cd build-valgrind && make valgrind_execute
```

```
gemm/
├── src/gemm_2/ # Core library
│ ├── gemm.h # Main API
│ ├── gemm_kernels_avx2.h # Micro-kernels
│ ├── gemm_large.c # Tier 2 execution
│ ├── gemm_small.c # Tier 1 kernels
│ ├── gemm_planning.c # Adaptive planner
│ └── gemm_simd_ops.h # SIMD operations
├── tests/ # Test suite
│ ├── test_gemm_small.c # Tier 1 tests
│ ├── test_planning.c # Planner tests
│ ├── test_gemm_large.c # Kernel tests
│ ├── test_gemm_validated.c # Validated tests
│ ├── test_gemm_execute.c # Pipeline tests
│ ├── test_benchmark.c # Performance
│ └── CMakeLists.txt # Build system
└── README.md
```
- `pack_A_panel_simd` buffer overflow — if `ib > actual_mr`, it writes out of bounds. Workaround: the planner ensures `ib ≤ MR`. Fix pending in the next release.
- No AVX-512 support — currently limited to AVX2; AVX-512 kernels would achieve ~300 GFLOPS.
Contributions welcome! Please:
- Run the full test suite (`make run_tests`)
- Verify with AddressSanitizer (`-DENABLE_ASAN=ON`)
- Ensure validation level 2 passes
- Check for performance regressions with `test_benchmark`
MIT License - See LICENSE file for details.
If you use this code in research, please cite:
```bibtex
@software{gemm_safety_hardened,
  title  = {Safety-Hardened GEMM Implementation},
  author = {TUGBARS},
  year   = {2025},
  note   = {tuned for 14900K}
}
```

TUGBARS - Embedded systems engineer specializing in numerical optimization and safety-critical code.
Benchmarked on Intel i9-14900, single-threaded. Performance may vary based on CPU architecture, memory bandwidth, and compiler optimizations.