OpenEnv Environment Submission - Issue Tracking #1

## COMPLETED ITEMS

### Core Implementation
- [x] **Environment: QED Math** - Mathematical proof generation and evaluation
- [x] **openenv.yaml** - Full spec with grader_model, prompt_name, verifier settings, metrics
- [x] **models.py** - Typed Pydantic models (QEDMathAction, QEDMathObservation, SubmitProof, ProblemObservation, ProofSubmissionObservation)
- [x] **server/qed_math_environment.py** - Full MCPEnvironment implementation
- [x] **server/rubric.py** - LLM-judge rubric (0-7 scale, normalized to [0,1])
- [x] **server/app.py** - FastAPI app with /healthz endpoint
- [x] **server/Dockerfile** - Multi-stage build with openenv-base

### MCP Tools (3 tools)
- [x] `get_problem` - Return current problem statement and metadata
- [x] `submit_proof` - Submit proof for LLM-based rubric grading
- [x] `get_grading_guidelines` - Return the rubric/marking scheme

### Features
- [x] **LLM-Judge Rubric** - Grades proofs on 0-7 scale with normalized rewards
- [x] **Answer-mode verification** - Uses math_verify for fast \boxed{} checking
- [x] **Reward shaping** - Discount factor, length penalty, optional score thresholding
- [x] **Flexible datasets** - Local JSONL/JSON, HuggingFace Hub, or built-in
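
The rubric normalization and reward shaping listed above could be combined roughly as follows. This is a minimal sketch, not the code in `server/rubric.py`: `ShapingConfig`, `shaped_reward`, and every field name and default are hypothetical, and the order in which discount, length penalty, and thresholding are applied is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShapingConfig:
    # All names and defaults here are hypothetical, not taken from openenv.yaml.
    max_score: int = 7                 # top of the judge's 0-7 scale
    discount: float = 1.0              # per-step discount factor (1.0 disables it)
    length_penalty: float = 0.0        # reward removed per token over the budget
    token_budget: int = 4096           # proof length allowed before the penalty applies
    threshold: Optional[float] = None  # if set, rewards below it become 0.0


def shaped_reward(raw_score: int, step: int, num_tokens: int, cfg: ShapingConfig) -> float:
    """Map a 0-7 judge score to a shaped reward clamped to [0, 1]."""
    if not 0 <= raw_score <= cfg.max_score:
        raise ValueError(f"score {raw_score} outside 0-{cfg.max_score}")
    reward = raw_score / cfg.max_score        # normalize to [0, 1]
    reward *= cfg.discount ** step            # later submissions earn less
    overflow = max(0, num_tokens - cfg.token_budget)
    reward -= cfg.length_penalty * overflow   # penalize overlong proofs
    reward = min(1.0, max(0.0, reward))       # keep the grader range intact
    if cfg.threshold is not None and reward < cfg.threshold:
        return 0.0                            # optional score thresholding
    return reward
```

Clamping after the penalty keeps every reward inside the 0.0–1.0 range that the pre-submission checks expect, whatever penalty settings are chosen.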

### Inference
- [x] **examples/qed_math_inference.py** - Baseline inference script with OpenAI client
- [x] **client.py** - QEDMathEnv client for interaction
- [x] Proper logging with START/STEP/END format

### Documentation
- [x] **README.md** - Setup instructions, features, usage examples

---

## IN PROGRESS / NEEDS UPDATE

### Environment Variables
- [x] API_BASE_URL - The API endpoint for the LLM (defined in inference.py)
- [x] MODEL_NAME - The model identifier (originally defined as `MODEL` in inference.py; renamed under Required Updates)
- [x] HF_TOKEN - Your Hugging Face / API key

### Required Updates
- [x] Move examples/qed_math_inference.py to root as `inference.py`
- [x] Update inference.py to use `MODEL_NAME` variable name (not `MODEL`)
- [x] Add log_start(), log_step(), log_end() functions with [START], [STEP], [END] format
- [x] Verify environment variables are properly documented
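
The three logging helpers named above might look like the sketch below. Only the function names and the `[START]`, `[STEP]`, `[END]` tags come from this checklist; the JSON payload and every field in it (`task`, `model`, `action`, `reward`, `done`, `success`, `steps`, `score`) are assumptions, so the exact line layout should be checked against the official format before relying on it.

```python
import json
from typing import Any


def _emit(tag: str, **fields: Any) -> str:
    # One line per event: a bracketed tag followed by a JSON payload.
    # The payload layout is an assumption; match the official spec if it differs.
    line = f"[{tag}] {json.dumps(fields, sort_keys=True)}"
    print(line, flush=True)  # flush so the log order survives buffering
    return line


def log_start(task: str, model: str) -> str:
    return _emit("START", task=task, model=model)


def log_step(step: int, action: str, reward: float, done: bool) -> str:
    return _emit("STEP", step=step, action=action, reward=round(reward, 4), done=done)


def log_end(success: bool, steps: int, score: float) -> str:
    return _emit("END", success=success, steps=steps, score=round(score, 4))
```

Routing every event through one helper keeps the three tags formatted identically, which matters when an automated validator parses stdout line by line.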

---

## PENDING ITEMS

### Pre-Submission Checklist
- [x] **HF Space deploys** - Automated ping to Space URL - must return 200 and respond to reset()
- [x] **OpenEnv spec compliance** - Validate openenv.yaml, typed models, step()/reset()/state() endpoints
- [ ] **Dockerfile builds** - Automated docker build on the submitted repo
- [x] **Baseline reproduces** - Run the submitted inference script - must complete without error and produce scores
- [x] **3+ tasks with graders** - Enumerate tasks, run each grader, verify scores in 0.0–1.0 range
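
The grader check in the last item can be rehearsed locally before submission. The following is a minimal sketch under stated assumptions: the `Grader` signature (submission text in, score out), the `check_graders` helper, and the idea of probing with a few fixed submissions are all hypothetical, not part of the OpenEnv validator.

```python
from typing import Callable, Dict, List, Sequence

Grader = Callable[[str], float]  # hypothetical signature: submission -> score


def check_graders(graders: Dict[str, Grader], probes: Sequence[str]) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems: List[str] = []
    if len(graders) < 3:
        problems.append(f"only {len(graders)} task(s); at least 3 are required")
    if len(probes) < 2:
        problems.append("need at least 2 probe submissions to detect constant graders")
    for name, grade in graders.items():
        scores = [grade(p) for p in probes]
        bad = [s for s in scores if not 0.0 <= s <= 1.0]
        if bad:
            problems.append(f"{name}: scores outside [0.0, 1.0]: {bad}")
        if len(probes) >= 2 and len(set(scores)) == 1:
            problems.append(f"{name}: constant score {scores[0]} on all probes")
    return problems
```

Probing each grader with at least one weak and one strong submission also guards against the disqualification case of a grader that always returns the same score.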

### Scoring Breakdown
- [x] **Real-world utility** - Does the environment model a genuine task?
- [x] **Task & grader quality** - Are tasks well-defined with clear objectives?
- [x] **Environment design** - Clean state management, sensible action/observation spaces?
- [x] **Code quality & spec compliance** - Follows OpenEnv spec?
- [x] **Creativity & novelty** - Novel problem domain?

### Evaluation Criteria
- [ ] Phase 1: Automated Validation (pass/fail gate)
- [ ] Phase 2: Agentic Evaluation (baseline agent re-run, Nemotron 3 Super)
- [ ] Phase 3: Human Review

### Infra Restrictions
- [x] The inference script must finish in under 20 minutes
- [x] The environment and the inference script must both run on a machine with 2 vCPUs and 8 GB of memory

---

## DETAILED REQUIREMENTS CHECKLIST

### Parameter | Weight | Description

- [x] **Real-world utility** - 30%
  - [x] 0–5: Toy/artificial problem with no practical application
  - [x] 6–15: Valid domain but shallow modeling of the real task
  - [x] 16–25: Good domain modeling, would be useful for agent evaluation
  - [x] 26–30: Excellent - fills a real gap, immediate value for the RL/agent community

- [x] **Task & grader quality** - 25%
  - [x] 3+ tasks with difficulty range? (proof, answer, multi_step problem types)
  - [x] Graders produce scores between 0.0–1.0? (normalized score/7)
  - [x] Graders deterministic and reproducible? (uses LLM judge with parse_schema)
  - [ ] Hard task genuinely challenges frontier models?

- [x] **Environment design** - 20%
  - [x] reset() produces clean state? (implemented in qed_math_environment.py)
  - [x] Action/observation types well-designed and documented? (models.py)
  - [x] Reward function provides useful varying signal (not just sparse)? (0-1 with partial progress)
  - [x] Episode boundaries sensible? (done after proof submission)

- [x] **Code quality & spec compliance** - 15%
  - [x] openenv validate passes?
  - [x] docker build works? (Dockerfile present)
  - [x] HF Space deploys and responds?
  - [x] Baseline script exists? (`inference.py` in the repository root, updated for spec compliance)

- [x] **Creativity & novelty** - 10%
  - [x] Domain we haven't seen in OpenEnv before? (mathematical proof evaluation)
  - [x] Reward design has interesting properties? (LLM judge + math_verify)
  - [x] Clever mechanics that make the environment engaging?

---

## Disqualification Criteria

- [x] Environment does not deploy or respond
- [x] Plagiarized or trivially modified existing environments
- [x] Graders that always return the same score
- [x] No baseline inference script

---

## Mandatory Additional Instructions

- [x] Before submitting, ensure the following variables are defined in your environment configuration:
  - [x] `API_BASE_URL` - The API endpoint for the LLM.
  - [x] `MODEL_NAME` - The model identifier to use for inference.
  - [x] `HF_TOKEN` - Your Hugging Face / API key.

- [x] The inference script must be named `inference.py` and placed in the root directory of the project

- [x] Participants must use the OpenAI client for all LLM calls, configured with the variables above

- [x] Participants must emit structured stdout logs strictly following the [START], [STEP], and [END] format
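
One way to satisfy the environment-variable requirement is to read all three values at startup and fail early if any is missing. This is a sketch only: `load_llm_config` and `REQUIRED_VARS` are hypothetical names, and using `HF_TOKEN` as the client's API key is an assumption based on its description above.

```python
import os
from typing import Dict, Mapping

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def load_llm_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the three required variables, failing early if any is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "base_url": env["API_BASE_URL"],
        "model": env["MODEL_NAME"],
        "api_key": env["HF_TOKEN"],  # assumption: the token doubles as the API key
    }
```

The returned values map directly onto the OpenAI Python client: `OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])` builds the client, and `cfg["model"]` goes into the `model` argument of `client.chat.completions.create`.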

---

## Pre Validation Script

- [ ] Run the validate-submission.sh script before submitting
