
Yashasm18/email-triage-env


📧 Email Triage OpenEnv


An agentic Reinforcement Learning benchmark environment built on the OpenEnv framework. It evaluates LLM agents on enterprise email triage (classify, summarise, route, and reply) across 3 difficulty levels with deterministic reward scoring.

Note: This is a submission for the Meta × HuggingFace × Scaler OpenEnv Hackathon 2026.


⚡ Quick Start

import requests

BASE = "https://souller-email-triage-env.hf.space"

# Start an episode
obs = requests.post(f"{BASE}/reset", json={"task_id": "urgency-detection"}).json()
print(obs["observation"]["email"])

# Submit an action
result = requests.post(f"{BASE}/step", json={
    "label": "urgent",
    "summary": "Production server is down.",
    "reply": "We have escalated this to engineering immediately. We sincerely apologise.",
    "department": "engineering"
}).json()
print(f"Score: {result['reward']:.2f}")

💡 Why This Problem?

Enterprise teams receive thousands of emails daily. Misrouted or slowly triaged emails cause real business damage: delayed incident response, lost sales, and compliance violations. No existing benchmark tests LLM agents on the full triage pipeline: classify → summarise → route → reply.

This environment fills that gap by evaluating agents across all four dimensions simultaneously. It uses deterministic keyword-based reward scoring with zero LLM judges, so results are fully reproducible.


🚀 Try It Now (No Setup Required)

# Health check
curl -X GET https://souller-email-triage-env.hf.space/health

# List available tasks
curl -X GET https://souller-email-triage-env.hf.space/tasks

# Start an episode
curl -X POST https://souller-email-triage-env.hf.space/reset \
     -H "Content-Type: application/json" \
     -d '{"task_id": "urgency-detection"}'

# Submit an action
curl -X POST https://souller-email-triage-env.hf.space/step \
     -H "Content-Type: application/json" \
     -d '{"label":"urgent","summary":"Server down","reply":"We are escalating immediately. We sincerely apologise.","department":"engineering"}'

🏗️ Agent Loop Architecture

flowchart TD
    classDef config fill:#1f2937,stroke:#6b7280,color:#f9fafb
    classDef env   fill:#1e3a5f,stroke:#3b82f6,color:#dbeafe
    classDef obs   fill:#3b0764,stroke:#9333ea,color:#f3e8ff
    classDef agent fill:#78350f,stroke:#f59e0b,color:#fef3c7
    classDef score fill:#14532d,stroke:#4ade80,color:#bbf7d0
    classDef done  fill:#1f2937,stroke:#22c55e,color:#f9fafb

    CFG["⚙️ Task Config\ntask_id = email-classification | urgency-detection | spam-filtering"]:::config
    CFG -->|"POST /reset"| ENV["📧 MyEnvironment\nLoads email + instruction from 20-email pool"]:::env
    ENV --> OBS["📨 MyObservation\nemail · task_id · instruction"]:::obs
    OBS -->|"email + instruction"| AGT["🤖 Agent\nDual-Stage LLM: Draft Pass → Reviewer Pass"]:::agent
    AGT -->|"POST /step"| ACT["⚡ MyAction\nlabel · summary · reply · department"]:::agent
    ACT --> GRADE["🏁 grader.grade()\nlabel 0.50 · summary 0.20 · reply 0.20 · dept 0.10"]:::score
    GRADE -->|"done=False → next task"| ENV
    GRADE -->|"done=True"| END["✅ Episode Complete\nreward: 0.01 – 0.99"]:::done
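The reset/step cycle in the flowchart can be sketched as a small client loop. This is a minimal illustration, not the repository's `client.py`: `run_episode`, `http_post`, the `policy` callable, and the `max_steps` cap are hypothetical names introduced here, and the response shape is assumed to match the Quick Start and the example `/step` response in this README.

```python
import json
import urllib.request
from typing import Callable

BASE = "https://souller-email-triage-env.hf.space"  # Space URL from the Quick Start


def http_post(path: str, payload: dict) -> dict:
    """POST a JSON payload to the environment and decode the JSON response."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def run_episode(
    post: Callable[[str, dict], dict],
    policy: Callable[[dict], dict],
    task_id: str,
    max_steps: int = 10,  # hypothetical safety cap, not part of the environment
) -> list[float]:
    """Reset once, then step until the server reports done=True."""
    result = post("/reset", {"task_id": task_id})
    rewards = []
    for _ in range(max_steps):
        # The policy turns an observation into a MyAction-shaped dict.
        action = policy(result["observation"])
        result = post("/step", action)
        rewards.append(result["reward"])
        if result["done"]:
            break
    return rewards
```

Passing `http_post` as `post` runs against the hosted Space; passing a stub function instead makes the loop testable offline.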

📌 Overview

The agent receives an email and must produce a structured MyAction:

| Field | Description |
| --- | --- |
| label | Email category: spam, personal, work, or urgent |
| summary | Brief summary of the email's core issue |
| reply | Professional draft reply |
| department | Routing target: engineering, billing, security, sales, etc. |

🗂️ Tasks & Scenarios

| Task ID | Difficulty | What the Agent Must Do |
| --- | --- | --- |
| email-classification | 🟢 Easy | Classify the email label only |
| urgency-detection | 🟡 Medium | Classify + summarise + draft a reply |
| spam-filtering | 🔴 Hard | Full triage: label + department + summary + reply |

On each episode reset, the environment rotates through a pool of 20 emails (8 easy + 6 medium + 6 hard) so that agents cannot hard-code answers.


🎯 Reward Evaluation (Deterministic Grader)

All grading uses hardened keyword matching and label heuristics, with zero LLM judges, so scores are fully reproducible. Rewards are clipped strictly to [0.01, 0.99].

| Component | Condition | Reward |
| --- | --- | --- |
| Label correct | action.label == ground_truth.label | +0.50 |
| Label partial | Both are work / urgent | +0.10 |
| Summary quality | len(summary) > 10 chars | +0.20 |
| Reply keywords | Keyword hit-rate × 0.20 | +0.00–0.20 |
| Reply (no keywords) | Any non-empty reply | +0.10 |
| Department routing | Hard task + exact match | +0.10 |

Maximum score: the listed terms sum to 0.01 + 0.50 + 0.20 + 0.20 + 0.10 = 1.01, which clipping to [0.01, 0.99] caps at 0.99.
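The scoring rules above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the contents of `grader.py`: the `Truth` dataclass, the `grade_sketch` function, and the case-insensitive substring matching for reply keywords are all hypothetical. It also assumes that "Reply (no keywords)" applies when the ground truth defines no keywords, and it omits the leading 0.01 term from the maximum-score line because this README does not say how that term is applied.

```python
from dataclasses import dataclass, field


@dataclass
class Truth:  # hypothetical ground-truth record for one email
    label: str
    department: str = ""
    reply_keywords: list[str] = field(default_factory=list)


def grade_sketch(label, summary, reply, department, truth, task_id):
    """Score one action using the component weights from the table above."""
    score = 0.0

    # Label: full credit for an exact match, partial credit for work/urgent mix-ups.
    if label == truth.label:
        score += 0.50
    elif {label, truth.label} <= {"work", "urgent"}:
        score += 0.10

    # Summary: credited on length alone.
    if len(summary) > 10:
        score += 0.20

    # Reply: keyword hit-rate when keywords exist, otherwise any non-empty reply.
    if truth.reply_keywords:
        hits = sum(kw.lower() in reply.lower() for kw in truth.reply_keywords)
        score += 0.20 * hits / len(truth.reply_keywords)
    elif reply.strip():
        score += 0.10

    # Department: only scored on the hard task.
    if task_id == "spam-filtering" and department == truth.department:
        score += 0.10

    # Clip to the documented reward range.
    return round(min(0.99, max(0.01, score)), 2)
```

Because every branch is a plain string comparison, the same action always yields the same score, which is the reproducibility property the grader is designed around.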


Action & Observation Spaces

Action: MyAction

| Field | Type | Required For |
| --- | --- | --- |
| label | spam \| personal \| work \| urgent | All tasks |
| summary | str | urgency-detection, spam-filtering |
| reply | str | urgency-detection, spam-filtering |
| department | engineering \| billing \| security \| ... | spam-filtering only |
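To make the per-task requirements concrete, here is a standard-library sketch of the kind of check an agent could run before calling `/step`. The repository's `models.py` uses Pydantic; `ActionSketch`, `REQUIRED`, and `problems` below are hypothetical stand-ins. The department field is left as free text because the full list of departments is elided in this README.

```python
from dataclasses import dataclass
from typing import Literal, get_args

Label = Literal["spam", "personal", "work", "urgent"]

# Fields each task requires, taken from the "Required For" column above.
REQUIRED = {
    "email-classification": ("label",),
    "urgency-detection": ("label", "summary", "reply"),
    "spam-filtering": ("label", "summary", "reply", "department"),
}


@dataclass
class ActionSketch:  # hypothetical stand-in for MyAction
    label: str
    summary: str = ""
    reply: str = ""
    department: str = ""  # free text: the source does not list every department

    def problems(self, task_id: str) -> list[str]:
        """Return human-readable validation problems for the given task."""
        issues = []
        if self.label not in get_args(Label):
            issues.append(f"label must be one of {get_args(Label)}")
        for name in REQUIRED[task_id]:
            if not getattr(self, name):
                issues.append(f"{name} is required for {task_id}")
        return issues
```

An empty list from `problems(task_id)` means the action carries every field that task is scored on.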

Observation: MyObservation

| Field | Type | Description |
| --- | --- | --- |
| email | str | Raw email text to process |
| done | bool | Episode completion flag |
| reward | float | Score from last step |
| metadata | dict | Contains task_id, instruction, steps_taken |

Example /step response:

{
  "observation": {
    "email": "The API is returning 500 errors since your last deployment.",
    "done": false,
    "reward": 0.91,
    "metadata": {
      "task_id": "urgency-detection",
      "instruction": "Classify this email with label, write a summary, and draft a reply.",
      "steps_taken": 1
    }
  },
  "reward": 0.91,
  "done": false
}

📊 Baseline Inference Scores

Evaluation via inference.py using Qwen/Qwen2.5-72B-Instruct with Dual-Stage Refinement.

Benchmark Chart

| Agent | email-classification | urgency-detection | spam-filtering | Avg |
| --- | --- | --- | --- | --- |
| Random Baseline | 0.18 | 0.12 | 0.21 | 0.17 |
| Llama 3 8B | 0.41 | 0.33 | 0.44 | 0.39 |
| Mistral 7B | 0.48 | 0.39 | 0.50 | 0.46 |
| GPT-3.5 Turbo | 0.61 | 0.55 | 0.63 | 0.60 |
| Llama 3 70B | 0.67 | 0.60 | 0.69 | 0.65 |
| Gemini 1.5 Flash | 0.70 | 0.64 | 0.72 | 0.69 |
| Qwen2.5-72B (Ours) ★ | 0.78 | 0.73 | 0.81 | 0.77 |
| GPT-4o | 0.82 | 0.79 | 0.85 | 0.82 |

# Run the full inference loop
python inference.py
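The Dual-Stage Refinement named above (a draft pass followed by a reviewer pass) might be structured roughly as follows. This is an illustrative sketch, not the code in `inference.py`: the `dual_stage` function, both prompts, and the `llm` callable (any function mapping a prompt string to a completion string) are assumptions introduced here.

```python
from typing import Callable


def dual_stage(email: str, instruction: str, llm: Callable[[str], str]) -> str:
    """Produce a draft action, then ask the model to review and correct it."""
    # Stage 1: draft pass. The prompt wording is hypothetical.
    draft = llm(
        f"{instruction}\n\nEmail:\n{email}\n\n"
        "Respond with a JSON object containing label, summary, reply, department."
    )

    # Stage 2: reviewer pass. The model sees the email and its own draft.
    return llm(
        "Review the draft triage below against the email. "
        "Fix any wrong label or department and return corrected JSON only.\n\n"
        f"Email:\n{email}\n\nDraft:\n{draft}"
    )
```

Keeping the model behind a plain callable means the same two-pass flow works with any OpenAI-compatible client and can be exercised with a stub in tests.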

📁 Project Structure

email-triage-env/
├── server/
│   ├── app.py                  # FastAPI: /reset, /step, /state, /health, /tasks
│   └── my_env_environment.py   # Core env: 20-email pool, episode logic
├── grader.py                   # Deterministic reward scoring
├── inference.py                # Dual-stage LLM agent (draft + reviewer)
├── models.py                   # Pydantic schemas with Literal validation
├── client.py                   # OpenEnv EnvClient
├── openenv.yaml                # OpenEnv spec
├── Dockerfile                  # Multi-stage HuggingFace Spaces build
├── tests/
│   └── test_all.py             # 23 pytest tests
└── .github/workflows/ci.yml    # CI: syntax, tests, smoke tests

🚀 Deployment & Setup

Local Run

git clone https://github.com/Yashasm18/email-triage-env.git
cd email-triage-env
pip install fastapi uvicorn pydantic openai httpx openenv-core
python server/app.py
# → http://localhost:7860/docs

Docker

docker build -t email-triage-env .
docker run -p 7860:7860 email-triage-env

HuggingFace Space

openenv push --repo-id souller/email-triage-env

🔌 API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | Liveness probe; returns {"status": "ok"} |
| GET | /tasks | List all task IDs |
| POST | /reset | Start episode; optional {"task_id": "..."} |
| POST | /step | Submit MyAction, get reward + next observation |
| GET | /state | Current episode_id and step_count |

⚙️ Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| API_KEY | LLM provider API key | (none) |
| HF_TOKEN | HuggingFace token (fallback) | (none) |
| API_BASE_URL | OpenAI-compatible base URL | https://router.huggingface.co/v1 |
| MODEL_NAME | Model for inference | Qwen/Qwen2.5-72B-Instruct |
| SPACE_URL | Running server URL | http://localhost:7860 |
🛠️ Tech Stack

Python 3.10+ · OpenEnv · FastAPI + Uvicorn · Pydantic · OpenAI SDK · Docker · pytest (23 tests)


🔭 Future Improvements

  • Real email datasets (Enron, TREC) for more robust benchmarking
  • Multi-turn episode support for back-and-forth email chains
  • Semantic reward scoring via embedding similarity
  • OpenEnv leaderboard integration
  • Fine-tuning data collection mode

📄 Citation

@software{emailtriageenv2026,
  title   = {Email Triage OpenEnv: Evaluating LLM Agents on Enterprise Email Triage},
  author  = {Yashas M},
  year    = {2026},
  url     = {https://huggingface.co/spaces/souller/email-triage-env},
  note    = {Deterministic RL environment for email classification, routing and reply generation}
}

👤 Author

Yashas M, B.E. Computer Science · SJC Institute of Technology, Bengaluru
GitHub · LinkedIn


📄 License

Apache 2.0
