## COMPLETED ITEMS

### Core Implementation
- [x] **Environment: QED Math** - Mathematical proof generation and evaluation
- [x] **openenv.yaml** - Full spec with grader_model, prompt_name, verifier settings, metrics
- [x] **models.py** - Typed Pydantic models (QEDMathAction, QEDMathObservation, SubmitProof, ProblemObservation, ProofSubmissionObservation)
- [x] **server/qed_math_environment.py** - Full MCPEnvironment implementation
- [x] **server/rubric.py** - LLM-judge rubric (0-7 scale, normalized to [0,1])
- [x] **server/app.py** - FastAPI app with /healthz endpoint
- [x] **server/Dockerfile** - Multi-stage build with openenv-base

### MCP Tools (3 tools)
- [x] `get_problem` - Return current problem statement and metadata
- [x] `submit_proof` - Submit proof for LLM-based rubric grading
- [x] `get_grading_guidelines` - Return the rubric/marking scheme

### Features
- [x] **LLM-Judge Rubric** - Grades proofs on 0-7 scale with normalized rewards
- [x] **Answer-mode verification** - Uses math_verify for fast `\boxed{}` checking
- [x] **Reward shaping** - Discount factor, length penalty, optional score thresholding
- [x] **Flexible datasets** - Local JSONL/JSON, HuggingFace Hub, or built-in

### Inference
- [x] **examples/qed_math_inference.py** - Baseline inference script with OpenAI client
- [x] **client.py** - QEDMathEnv client for interaction
- [x] Proper logging with START/STEP/END format

### Documentation
- [x] **README.md** - Setup instructions, features, usage examples

---

## IN PROGRESS / NEEDS UPDATE

### Environment Variables
- [x] API_BASE_URL - The API endpoint for the LLM (defined in inference.py)
- [x] MODEL_NAME - The model identifier (defined as MODEL in inference.py)
- [x] HF_TOKEN - Your Hugging Face / API key

### Required Updates
- [x] Move examples/qed_math_inference.py to root as `inference.py`
- [x] Update inference.py to use `MODEL_NAME` variable name (not `MODEL`)
- [x] Add log_start(), log_step(), log_end() functions with [START], [STEP], [END] format
- [x] Verify environment variables are properly documented

---

## PENDING ITEMS

### Pre-Submission Checklist
- [x] **HF Space deploys** - Automated ping to Space URL - must return 200 and respond to reset()
- [x] **OpenEnv spec compliance** - Validate openenv.yaml, typed models, step()/reset()/state() endpoints
- [ ] **Dockerfile builds** - Automated docker build on the submitted repo
- [x] **Baseline reproduces** - Run the submitted inference script - must complete without error and produce scores
- [x] **3+ tasks with graders** - Enumerate tasks, run each grader, verify scores in 0.0–1.0 range

### Scoring Breakdown
- [x] **Real-world utility** - Does the environment model a genuine task?
- [x] **Task & grader quality** - Are tasks well-defined with clear objectives?
- [x] **Environment design** - Clean state management, sensible action/observation spaces?
- [x] **Code quality & spec compliance** - Follows OpenEnv spec?
- [x] **Creativity & novelty** - Novel problem domain?

### Evaluation Criteria
- [ ] Phase 1: Automated Validation (pass/fail gate)
- [ ] Phase 2: Agentic Evaluation (baseline agent re-run, Nemotron 3 Super)
- [ ] Phase 3: Human Review

### Infra Restrictions
- [x] Runtime of inference script should be less than 20min
- [x] Make sure env and inference can run on vcpu=2, memory=8gb

---

## DETAILED REQUIREMENTS CHECKLIST

### Parameter | Weight | Description
- [x] **Real-world utility** - 30%
  - [x] 0–5: Toy/artificial problem with no practical application
  - [x] 6–15: Valid domain but shallow modeling of the real task
  - [x] 16–25: Good domain modeling, would be useful for agent evaluation
  - [x] 26–30: Excellent - fills a real gap, immediate value for the RL/agent community
- [x] **Task & grader quality** - 25%
  - [x] 3+ tasks with difficulty range? (proof, answer, multi_step problem types)
  - [x] Graders produce scores between 0.0–1.0? (normalized score/7)
  - [x] Graders deterministic and reproducible? (uses LLM judge with parse_schema)
  - [ ] Hard task genuinely challenges frontier models?
- [x] **Environment design** - 20%
  - [x] reset() produces clean state? (implemented in qed_math_environment.py)
  - [x] Action/observation types well-designed and documented? (models.py)
  - [x] Reward function provides useful varying signal (not just sparse)? (0-1 with partial progress)
  - [x] Episode boundaries sensible? (done after proof submission)
- [x] **Code quality & spec compliance** - 15%
  - [x] openenv validate passes?
  - [x] docker build works? (Dockerfile present)
  - [x] HF Space deploys and responds?
  - [x] Baseline script exists (needs update for spec compliance)
- [x] **Creativity & novelty** - 10%
  - [x] Domain we haven't seen in OpenEnv before? (mathematical proof evaluation)
  - [x] Reward design has interesting properties? (LLM judge + math_verify)
  - [x] Clever mechanics that make the environment engaging?

---

## Disqualification Criteria
- [x] Environment does not deploy or respond
- [x] Plagiarized or trivially modified existing environments
- [x] Graders that always return the same score
- [x] No baseline inference script

---

## Mandatory Additional Instructions
- [x] Before submitting, ensure the following variables are defined in your environment configuration:
  - [x] `API_BASE_URL` - The API endpoint for the LLM.
  - [x] `MODEL_NAME` - The model identifier to use for inference.
  - [x] `HF_TOKEN` - Your Hugging Face / API key.
- [x] The inference script must be named `inference.py` and placed in the root directory of the project
- [x] Participants must use OpenAI Client for all LLM calls using above variables
- [x] Participants must emit structured stdout logs strictly following the [START], [STEP], and [END] format

---

## Pre Validation Script
- [ ] Run the validate-submission.sh script before submitting
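
The mandatory instructions name three environment variables and three log markers, but this checklist does not spell out the exact line layout. The sketch below shows one way `inference.py` might read the variables and emit the markers. The helper `load_config`, the placeholder defaults, and the `key=value` field layout are assumptions for illustration only; the required fields should be taken from the official submission guidelines.

```python
import os


def load_config(env=None):
    """Read the three required variables. Defaults are placeholders (assumed)."""
    env = os.environ if env is None else env
    return {
        "api_base_url": env.get("API_BASE_URL", "http://localhost:8000/v1"),
        "model_name": env.get("MODEL_NAME", "placeholder-model"),
        "hf_token": env.get("HF_TOKEN", ""),
    }


def _emit(tag, fields):
    # One line per event: "[TAG] key=value key=value".
    # The field names and ordering are assumptions, not the official format.
    body = " ".join(f"{key}={value}" for key, value in fields.items())
    line = f"[{tag}] {body}"
    print(line, flush=True)  # flush so graders reading stdout see events promptly
    return line


def log_start(task, model):
    return _emit("START", {"task": task, "model": model})


def log_step(step, reward, done):
    return _emit("STEP", {"step": step, "reward": f"{reward:.2f}",
                          "done": str(done).lower()})


def log_end(success, steps, score):
    return _emit("END", {"success": str(success).lower(), "steps": steps,
                         "score": f"{score:.2f}"})
```

In a script of this shape, `api_base_url` and `hf_token` would typically be passed to the OpenAI client as its `base_url` and `api_key` arguments, and `model_name` as the `model` of each request.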
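
The rubric and reward-shaping items in the checklist (a 0-7 judge score normalized to [0,1], with a discount factor, a length penalty, and optional score thresholding) can be illustrated with a short sketch. The function name `shape_reward`, its parameters, and the specific formulas for discounting and the length penalty are assumptions; the actual logic in `server/rubric.py` may differ.

```python
def shape_reward(raw_score, step=0, proof_tokens=0, *, max_score=7,
                 discount=1.0, max_tokens=None, length_penalty=0.0,
                 threshold=None):
    """Map a 0-7 judge score to a shaped reward in [0, 1] (illustrative only)."""
    if not 0 <= raw_score <= max_score:
        raise ValueError(f"raw_score must be in [0, {max_score}]")

    reward = raw_score / max_score          # normalize to [0, 1]

    if threshold is not None:               # optional thresholding: all-or-nothing
        reward = 1.0 if reward >= threshold else 0.0

    reward *= discount ** step              # later submissions earn less

    if max_tokens is not None and proof_tokens > max_tokens:
        reward -= length_penalty            # flat penalty for overlong proofs

    return min(1.0, max(0.0, reward))       # keep the result inside [0, 1]
```

For example, under these assumed formulas a perfect score submitted at step 2 with `discount=0.5` yields `shape_reward(7, step=2, discount=0.5) == 0.25`, and the final clamp keeps every grader output in the 0.0–1.0 range the pre-submission checklist requires.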