
opensource4you/flashinfer-contest-playground


Create high-performance GPU kernels for state-of-the-art LLM architectures on NVIDIA Blackwell GPUs with humans and/or AI agents.


NVIDIA            Modal            MLSys            FlashInfer            FlashInfer-Bench


FlashInfer-Bench is our official framework to evaluate your AI-generated kernels.

Updates

  • 2026.02.05: The full dataset of definitions and workloads is released on HuggingFace

Competition Tracks

The competition features three tracks, each targeting a critical LLM operation:

Track Description
fused_moe Fused Mixture-of-Experts kernel for efficient expert routing and computation
sparse_attention Sparse attention mechanisms for long-context inference
gated_delta_net Gated delta network operations for efficient state updates

Fork this template once per track you want to compete in (separate repos for each track).

Getting Started

1. Clone this Repository

git clone https://github.com/opensource4you/flashinfer-contest-playground.git
cd flashinfer-contest-playground

2. Install Dependencies

We use uv to manage the environment.

Since the latest version of flashinfer-bench isn't yet published to PyPI, we have to build it from source:

Clone flashinfer-bench from GitHub in this directory:

git clone https://github.com/flashinfer-ai/flashinfer-bench.git

Build a virtual environment with uv, and install dependencies:

uv venv
uv sync

You can check whether flashinfer-bench and modal are installed with uv pip show:

uv pip show flashinfer-bench  # If you build from source, it should be >=0.1.2
uv pip show modal

3. Download the TraceSet

We provide kernel definitions and workloads in FlashInfer-Trace format. Clone the competition dataset from HuggingFace:

git lfs install
git clone https://huggingface.co/datasets/flashinfer-ai/mlsys26-contest

Set the environment variable:

export FIB_DATASET_PATH=/path/to/flashinfer-trace

Default path if you cloned the TraceSet into this directory:

export FIB_DATASET_PATH=$(pwd)/mlsys26-contest

Verify that FIB_DATASET_PATH is set:

echo $FIB_DATASET_PATH
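As a quick sanity check, you can also verify the variable from Python before running the benchmark scripts. This is an illustrative helper, not part of FlashInfer-Bench; here it points FIB_DATASET_PATH at a temporary directory purely so the snippet is self-contained (in practice the variable is already exported in your shell):

```python
import os
import tempfile

# Stand-in for the `export FIB_DATASET_PATH=...` step above; in real use,
# the variable is inherited from your shell environment.
os.environ["FIB_DATASET_PATH"] = tempfile.mkdtemp()

# The actual check: the variable must be set and point at an existing directory.
dataset_path = os.environ.get("FIB_DATASET_PATH")
assert dataset_path, "FIB_DATASET_PATH is not set"
assert os.path.isdir(dataset_path), f"dataset path does not exist: {dataset_path}"
```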

4. Configure Your Solution

Edit config.toml to set your track and team info:

[solution]
name = "my-team-solution-v1"      # Solution name
# Track: moe | dsa_sparse_attention | dsa_topk_indexer | gdn_decode | gdn_prefill
definition = "moe"          
author = "team-name"              # Team/author name

[build]
language = "triton"               # triton | cuda
entry_point = "kernel.py::my_kernel"            # Path::Function

5. Implement Your Kernel

For Triton: Edit solution/triton/kernel.py with your implementation.

For CUDA: Edit solution/cuda/kernel.cu and solution/cuda/binding.py with your implementation.

Development Workflow

Pack Your Solution

Generate solution.json from your source files:

uv run scripts/pack_solution.py

Run Local Benchmarks

Test your solution on your local GPU:

uv run scripts/run_local.py

Requires: Local CUDA-capable GPU and FIB_DATASET_PATH environment variable.

Run Cloud Benchmarks (Modal)

Test your solution on NVIDIA B200 GPUs via Modal:

One-time setup:

uv run modal setup
uv run modal volume create flashinfer-trace
uv run modal volume put flashinfer-trace /path/to/flashinfer-trace

Run benchmark:

uv run modal run scripts/run_modal.py

Submission

To submit your solution for evaluation:

  1. Ensure your implementation is complete and tested
  2. Run uv run scripts/pack_solution.py to generate solution.json
  3. Commit and push your changes
  4. Tag your commit for evaluation (e.g., git tag submission-v1)

Project Structure

flashinfer-bench-starter-kit/
├── README.md                    # This file
├── config.toml                  # Track configuration (edit this)
├── solution/                    # Solution source files
│   ├── triton/                  # Triton implementation
│   │   └── kernel.py           # Your Triton kernel
│   └── cuda/                    # CUDA implementation
│       ├── kernel.cu           # Your CUDA kernel
│       └── binding.py          # TVM FFI bindings
├── scripts/                     # Utility scripts
│   ├── run_local.py            # Local benchmark runner
│   ├── run_modal.py            # Modal cloud benchmark runner
│   └── pack_solution.py        # Pack source files into solution.json
└── images/                      # Sponsor logos

Additional Resources

FlashInfer Trace Viewer

FlashInfer Trace consists of multiple JSON objects (definitions, workloads, solutions, and traces), which can contain large code blocks. To easily visualize and inspect these objects, you can use the FlashInfer Trace Viewer. Simply paste any FlashInfer Trace JSON into the viewer to get a friendly, structured view of its contents.

Solution Handling API

from flashinfer_bench import BuildSpec
from flashinfer_bench.agents import pack_solution_from_files, extract_solution_to_files

# Pack source files into a Solution object
spec = BuildSpec(
    language="triton",  # or "cuda"
    target_hardware=["cuda"],
    entry_point="my_kernel",
)
solution = pack_solution_from_files(
    path="./my_solution_dir",
    spec=spec,
    name="my_solution_v1",
    definition="fused_moe",
    author="your_name",
)

# Extract a Solution to files in a working directory
extract_solution_to_files(solution, "./output_dir")

Running Sanitizers

from flashinfer_bench.agents import flashinfer_bench_run_sanitizer

output = flashinfer_bench_run_sanitizer(
    solution=solution,
    workload=workload,
    sanitizer_types=["memcheck", "racecheck", "synccheck", "initcheck"],
    timeout=300,
)
print(output)

NCU Profiling

from flashinfer_bench.agents import flashinfer_bench_run_ncu

output = flashinfer_bench_run_ncu(
    solution=solution,
    workload=workload,
    set="detailed",
    page="details",
    timeout=120,
)
print(output)

List Available Tools

from flashinfer_bench.agents import get_all_tool_schemas

schemas = get_all_tool_schemas()
# Returns list of OpenAI-compatible function schemas

Notes

Destination Passing Style (DPS)

FlashInfer-Bench uses destination passing style (DPS) by default, where both inputs and outputs are passed as function parameters. DPS avoids measuring tensor allocation overhead, resulting in more accurate performance numbers. We recommend using DPS when possible, as it yields better benchmark results.

Important: Avoid using variadic input arguments in your kernel signatures, as they will fail the builder validation check.
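The difference between the two conventions can be sketched in plain Python (using lists in place of tensors; these function names are illustrative, not part of the FlashInfer-Bench API). Note that both use explicit, fixed parameter lists rather than *args, which would fail the builder validation mentioned above:

```python
def add_dps(x, y, out):
    """Destination passing style: the caller pre-allocates `out`,
    so the benchmark never times the allocation."""
    for i in range(len(x)):
        out[i] = x[i] + y[i]

def add_value_returning(x, y):
    """Value-returning style: the kernel allocates and returns the output,
    so allocation cost is included in the measurement."""
    return [a + b for a, b in zip(x, y)]

x, y = [1.0, 2.0], [3.0, 4.0]
out = [0.0, 0.0]          # allocation happens outside the timed region
add_dps(x, y, out)
assert out == add_value_returning(x, y)
```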

If your kernel uses value-returning style (i.e., returns output tensors instead of writing to pre-allocated ones), set destination_passing_style to false in your solution's spec:

{
  "name": "my_solution",
  "definition": "gdn_decode_qk4_v8_d128_k_last",
  "author": "my_name",
  "spec": {
    "language": "triton",
    "target_hardware": ["cuda"],
    "entry_point": "kernel.py::my_kernel",
    "dependencies": [],
    "destination_passing_style": false
  },
  "sources": [...]
}

Common error when DPS is mismatched:

Destination-passing style callable: expected xx parameters, but got xx

This can happen for two reasons: (1) your kernel's function signature has the wrong number of parameters, or (2) your kernel uses value-returning style while destination_passing_style is still true (the default). In the latter case, fix it by setting destination_passing_style to false.
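The arity check the error message suggests can be reproduced locally with the standard-library inspect module. This is a hedged sketch of how a builder might compare a callable's parameter count against what a DPS workload expects, not the actual FlashInfer-Bench validation code:

```python
import inspect

def my_kernel(x, y, out):
    """Example DPS callable: 2 inputs plus 1 pre-allocated output."""
    out[:] = [a + b for a, b in zip(x, y)]

# For a DPS callable, the harness expects inputs + outputs as parameters.
expected_params = 3
actual = len(inspect.signature(my_kernel).parameters)
if actual != expected_params:
    # Mirrors the shape of the error message quoted above.
    raise TypeError(
        f"Destination-passing style callable: expected {expected_params} "
        f"parameters, but got {actual}"
    )
print("parameter count OK:", actual)
```

Running the same check on a value-returning kernel (fewer parameters) would trigger the TypeError, which is the signal to either fix the signature or set destination_passing_style to false.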

CUDA Kernel Bindings

For CUDA kernel implementations, we recommend using TVM FFI for Python bindings. The flashinfer_bench.agents module provides TVM FFI agent instruction prompts to assist with development.

You can set the binding field in your solution's spec to specify the C++ binding type. Defaults to "tvm-ffi" if not specified. Supported values: "tvm-ffi", "torch".

{
  "name": "my_cuda_solution",
  "definition": "gdn_decode_qk4_v8_d128_k_last",
  "author": "my_name",
  "spec": {
    "language": "cuda",
    "target_hardware": ["cuda"],
    "entry_point": "kernel.cu::my_kernel",
    "dependencies": [],
    "binding": "torch"
  },
  "sources": [...]
}

About

A stable environment for os4y members who want to participate in the FlashInfer Contest.
