Task Validator Agent

What This Agent Does

The task_validator agent validates that tasks in AgentKernelArena are correctly configured, self-contained, and functional. It does not optimize kernels. Instead, it runs 10 automated checks on each task and produces a structured validation_report.yaml.

Use it to:

Audit existing tasks for quality and compliance before publishing to leaderboards.
Validate new tasks before merging them into the task suite.
Identify broken tasks (missing files, external dependencies, trivially-passing correctness checks, GPU hangs).

How to Use

1. Set `config.yaml`

agent:
  template: task_validator
tasks:
  - hip2hip/rmsnorm
  - cuda2hip/awq_gemm
  - triton2triton/triton_fused_moe
  # - all                     # validate every task
target_gpu_model: MI300
log_directory: logs
workspace_directory_prefix: workspace

2. Run

python3 main.py

3. Read Results

Each task workspace will contain a validation_report.yaml with per-check results. A validation_summary.yaml is written to the workspace root with aggregated statistics.

Agent Configuration

Edit agents/task_validator/agent_config.yaml:

backend: claude_code          # claude_code | codex | cursor
timeout_seconds: 600          # max time per task validation (set 0 to disable timeout)
python_path: /root/AgentKernelArena/.venv/bin/python3

Validation Checks

#	Check	What It Verifies
1	config_schema	All required fields exist in `config.yaml` with correct types
2	source_files_exist	Every file in `source_file_path` exists in the workspace
3	target_symbols_found	Every function in `target_kernel_functions` is defined in source files
4	compilation	`compile_command` runs successfully (exit code 0, within 120s timeout)
5	correctness	`correctness_command` runs successfully (exit code 0, within 180s timeout)
6	performance	`performance_command` runs successfully (if present, within 180s timeout)
7	correctness_implementation_review	The correctness check is meaningful (not trivially passing)
8	self_contained	No missing headers, imports, or references to external repos/paths
9	gpu_hang_check	No command hangs or times out
10	result_template_compatibility	Task output maps to the standard `task_result_template.yaml` schema

Overall Status

PASS — all checks passed
WARN — no failures, but at least one warning (e.g., questionable correctness implementation)
FAIL — at least one check failed

New Task Requirements

Every new task added to tasks/ must satisfy the following requirements to pass validation.

Required Directory Structure

tasks/<task_type>/<task_name>/
├── config.yaml                  # Task configuration (required)
├── scripts/
│   └── task_runner.py           # Validation runner (recommended pattern)
└── source/
    └── <kernel files>           # .cu, .hip, .py, etc.

Alternative structures (Makefile-based, test-file-based) are acceptable as long as all config references resolve.

Required `config.yaml` Fields

# List of source files containing kernel code (relative to task root)
source_file_path:
  - source/my_kernel.cu

# List of kernel function names that must be found in source files
target_kernel_functions:
  - my_kernel_function

# Command(s) to compile or build-check the task
compile_command:
  - python3 scripts/task_runner.py --mode compile

# Command(s) to run correctness validation
correctness_command:
  - python3 scripts/task_runner.py --mode correctness

# Task type: one of hip2hip, cuda2hip, triton2triton, torch2hip, instruction2triton, rocprim
task_type: cuda2hip

Optional `config.yaml` Fields

# Command(s) to run performance benchmarking
performance_command:
  - python3 scripts/task_runner.py --mode performance

# Override which result template to use (null = default)
task_result_template: null

# Prompt overrides for the optimization agent (null = auto-generated)
prompt:
  source_code: null
  instructions: null
  cheatsheet: null

Self-Containedness Rules

A task must be fully self-contained. This means:

No external repo dependencies. Do not reference paths like ../../vllm/, /opt/external/, or assume a cloned repo exists in the workspace. All source code the task needs must be inside the task directory.
No missing headers. Every #include "foo.h" in .cu/.hip files must resolve to a header that ships with the task (or is part of system/ROCm/CUDA includes).
No missing Python imports. Every import or from X import Y must resolve to either:
- Python standard library
- Packages available in the .venv environment (torch, numpy, triton, etc.)
- Local files within the task directory
No external data downloads. Test inputs must be generated inline (random tensors, synthetic data) or bundled as small files in the task directory.

Correctness Check Rules

The correctness check must be a real validation, not a trivial pass:

Compare against a reference. Use a CPU/NumPy reference implementation, known-good output tensors, or a PyTorch eager-mode baseline.
Use reasonable tolerances. For FP32: atol=1e-3, rtol=1e-3 typical. For FP16/BF16: atol=1e-2, rtol=1e-2 typical. For FP8/INT8: atol=1e-1 or custom per-task.
Test multiple shapes. Don't validate with a single input shape. Use at least 2-3 representative shapes covering small, medium, and large inputs.
Return non-zero exit code on failure. The correctness command must sys.exit(1) or raise an exception if validation fails.

Compilation Check Rules

The compile_command must actually compile or syntax-check the source code (not just search for text patterns).
Exit code 0 means success, non-zero means failure.
A build/compile_report.json with {"status": "ok"} or {"status": "fail", "error": "..."} is recommended.

Performance Check Rules (if applicable)

The performance_command should measure kernel execution time and report it in a parseable format.
Output should include baseline time and optimized time for speedup calculation.
A build/performance_report.json with timing data is recommended.
Recommended methodology: 10 warmup iterations + 100 measured iterations, and report the average measured runtime (speedup should be derived from averaged runtimes). The validator may mark performance as WARN if a task is functional but does not follow or clearly document this methodology.

Result Template Compatibility

The task's output flow (compile → correctness → performance) must produce results that can populate the standard task_result_template.yaml:

task_name: "<task_type>/<task_name>"
best_optimized_source_file_path:
  - <source files>
best_optimized_kernel_functions:
  - <kernel functions>
pass_compilation: true/false
compilation_error_message: null
pass_correctness: true/false
correctness_error_message: null
base_execution_time: 0.0          # in ms
best_optimized_execution_time: 0.0
speedup_ratio: 0.0
optimization_summary: ""

Checklist for New Task Authors

Before submitting a new task, verify:

Run the task_validator agent on your task to automatically verify all of the above.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Task Validator Agent

What This Agent Does

How to Use

1. Set `config.yaml`

2. Run

3. Read Results

Agent Configuration

Validation Checks

Overall Status

New Task Requirements

Required Directory Structure

Required `config.yaml` Fields

Optional `config.yaml` Fields

Self-Containedness Rules

Correctness Check Rules

Compilation Check Rules

Performance Check Rules (if applicable)

Result Template Compatibility

Checklist for New Task Authors

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Task Validator Agent

What This Agent Does

How to Use

1. Set config.yaml

2. Run

3. Read Results

Agent Configuration

Validation Checks

Overall Status

New Task Requirements

Required Directory Structure

Required config.yaml Fields

Optional config.yaml Fields

Self-Containedness Rules

Correctness Check Rules

Compilation Check Rules

Performance Check Rules (if applicable)

Result Template Compatibility

Checklist for New Task Authors

1. Set `config.yaml`

Required `config.yaml` Fields

Optional `config.yaml` Fields