Consolidates: FINAL_SUMMARY.md, IMPLEMENTATION_COMPLETE.md, IMPLEMENTATION_STATUS.md, INFERENCE_UPDATE.md, PIPELINE.md, QUICKSTART.md, ROUND2_COMPLETE.md, ROUND2_STATUS.md, ROUND2_TRAINING_READY.md, server/ROUND2_FINAL_STATUS.md, and the top-level ROUND2_ARCHITECTURE.md / ROUND2_IMPLEMENTATION_PLAN.md / ROUND2_QUICK_REFERENCE.md / OpenEnv-Complete.md. The HF-Space README.md is kept (it contains the YAML frontmatter Spaces needs). The per-directory dashboard/README.md describes only the local dashboard and stays with it.
An OpenEnv-compatible RL environment. An LLM agent must identify the 10 members of a coordinated fake-account ring hidden inside a synthetic social network. Round 2 makes detection platform-adaptive:
- Each episode belongs to a platform (Instagram, Snapchat, X, LinkedIn, Reddit, … any name).
- A PlatformPolicy is compiled from real transparency-report text via a Bayesian threshold formula and cached per platform.
- The high-signal account fields (photo_reuse_score, bio_template_score, ip_cluster_id) start hidden and are revealed only by explicit tool actions.
- Reward shape, FP penalty, grader score, and the moderation-decision package are all derived from the compiled policy rather than hardcoded.
A separate shared evaluation runner drives episodes deterministically and consults the LLM at exactly two decision points per suspicious account; six thin model shims plug in HF-router or Bedrock models against that runner.
| Area | Round 1 | Round 2 |
|---|---|---|
| Platform | — | platform field per episode; any name supported (env defaults to seed-parity Instagram/Snapchat) |
| Policy | hardcoded thresholds | PlatformPolicy compiled dynamically from transparency reports (Bayesian θ*) with 30-day cache freshness and sanity checks |
| Signals | all visible at INSPECT | photo_reuse_score, bio_template_score, ip_cluster_id start at 0.0 / "" and are revealed only by tool actions |
| Visible accounts | populated only on INSPECT | populated for every visible account from reset; tool reveals propagate immediately |
| Per-step reward | null for non-terminal steps | float delta of self._score returned every step |
| Actions | inspect, investigate_network, flag, unflag, submit | + get_policy, reverse_image_search, analyze_bio, check_ip |
| Reward shaping | terminal only | + +0.20 first-action GET_POLICY bonus, redundant-tool penalties, no-evidence flag deny |
| Submit response | {observation, done, reward, message} | + top-level decision_package and grader_score |
| Eval | one monolithic qwen_test_judge_eval.py per model | shared _round2_runner.py + 6 thin shims, two LLM decision points per account |
Platform assignment is deterministic in the env: seed % 2 == 0 → Instagram, else Snapchat. The eval runner remaps seeds so any requested platform actually fires (--platform Instagram forces even seeds, --platform Snapchat forces odd).
This is the spine of Round 2. Every other component reads from this pipeline.
(ONE-TIME / OFFLINE)
transparency-report URLs policy_cache/
──────────────────────── ─────────────
│ │
▼ ▲
Tavily search │
query: "{platform} fake account content │
policy enforcement 2024 2025" │
│ │
▼ │
Groq Llama-3.1-8B extraction │
→ {base_rate π, fn_cost_signal, fp_cost_signal, │
harm_weight, primary_signal, confidence} │
│ │
▼ │
sanitize_pi() — clamp [0.0005, 0.05] │
(>0.05 ⇒ "enforcement rate misread", clamp + warn) │
│ │
▼ │
compute_threshold(π, fn_signal, fp_signal, hw) │
──────────────────────────────────────────────── │
C_fn = FN_COST_MAP[fn_signal] │
C_fp = FP_COST_MAP[fp_signal] │
θ_raw = C_fn·π / [C_fn·π + C_fp·(1−π)] │
θ* = clamp(θ_raw / harm_weight, 0.01, 0.95) │
fp_penalty_weight = C_fp │
│ │
▼ │
PlatformPolicy(threshold=θ*, base_rate=π, │
fn/fp_cost_signal, harm_weight, │
primary_enforcement_signal, │
fp_penalty_weight=C_fp, │
confidence, sources, used_fallback) ────┘
│
▼ sanity_check_policy() — surfaces warnings
▼ (high θ*, suspicious π, low confidence, bad signal name)
▼
cached to policy_cache/{platform}.json
│
=====================│=====================
│ (PER EPISODE — RUNTIME)
▼
client.reset(task, seed)
env.platform = "Instagram"
env._policy = get_policy("Instagram") ◄── reads cached JSON
│ (recompiles if >30 days old)
▼
deterministic step 0: GET_POLICY (free, +0.20 first-action bonus)
message: "Policy compiled: Platform: Instagram |
Threshold: 0.369 | Primary Signal: photo_reuse | FP Penalty: 0.1x | …"
│
▼
runner._policy_from_message() → policy dict {threshold, primary_signal, fp_weight}
│
▼
per suspicious account, sorted by risk_score desc:
INSPECT (deterministic)
INVESTIGATE_NETWORK if risk ≥ 0.80 (deterministic, once)
┌─ DP1 (LLM) ─────────────────────────────┐
│ prompt includes platform, primary_signal,│
│ θ*, revealed-vs-None signals, budget │
│ → "reverse_image_search" / "analyze_bio" │
│ / "check_ip" / "done" │
└──────────────────────────────────────────┘
↓ (loop until "done" or signals sufficient)
┌─ DP2 (LLM) ─────────────────────────────┐
│ prompt includes revealed signals, │
│ θ*, fp_penalty=C_fp, running tp/fp count │
│ → "flag" / "skip" │
└──────────────────────────────────────────┘
│
▼
SUBMIT (deterministic)
reward = tp·1.0 − fp·C_fp − fn·0.3 + bonuses − penalties
▲
└── platform-specific via fp_penalty_weight
grader_score and decision_package surfaced at top level of /step response.
Two views of the same policy:
- θ* is in the prompt at DP1/DP2 → the LLM conditions on it.
- C_fp (= fp_penalty_weight) is in the terminal reward → the LLM is graded against it.
Both come from the same compile-time computation; they cannot drift apart.
θ_raw = C_fn · π / [C_fn · π + C_fp · (1 − π)]
θ* = clamp(θ_raw / harm_weight, 0.01, 0.95)
fp_penalty_weight = C_fp
Action rule the threshold serves: FLAG if risk_score ≥ θ*.
θ_raw is the share of expected cost coming from missed fakes. Higher C_fn or higher base rate → higher θ_raw → lower threshold (the agent should flag more aggressively when misses are expensive).
harm_weight > 1 strict (lowers θ*); harm_weight < 1 lenient (raises θ*).
History note. The original spec used θ_raw = C_fp·(1−π) / [C_fp·(1−π) + C_fn·π] — the complementary probability. With small π that formula collapses to ≈ 1 for every platform (π is the bottleneck, not the costs). Audit on 2026-04-25 confirmed this was a formula-direction error; the orientation above is correct for our action rule.
FN_COST_MAP = {"low": 0.5, "medium": 1.0, "high": 2.0, "critical": 4.0}
FP_COST_MAP = {"low": 0.1, "medium": 0.5, "high": 1.5}

Signals are extracted from policy text by an LLM and constrained to these keys (defaults: high / medium if absent or invalid).
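Putting the cost maps and the threshold formula together, the compile-time math can be sketched as follows. This is a hypothetical reconstruction that mirrors the names in this document (compute_threshold, the two cost maps); the real server/policy_compiler.py may differ in detail.

```python
# Sketch of the Bayesian threshold computation described above.
# Map keys and clamp bounds are taken from this document; the function
# shape is an assumption, not a verified copy of server/policy_compiler.py.
FN_COST_MAP = {"low": 0.5, "medium": 1.0, "high": 2.0, "critical": 4.0}
FP_COST_MAP = {"low": 0.1, "medium": 0.5, "high": 1.5}

def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))

def compute_threshold(pi: float, fn_signal: str, fp_signal: str,
                      harm_weight: float) -> tuple[float, float]:
    """Return (theta_star, fp_penalty_weight) for one platform.

    theta_raw is the share of expected cost coming from missed fakes;
    harm_weight > 1 lowers the threshold (stricter platform)."""
    c_fn = FN_COST_MAP.get(fn_signal, FN_COST_MAP["high"])    # invalid -> high
    c_fp = FP_COST_MAP.get(fp_signal, FP_COST_MAP["medium"])  # invalid -> medium
    theta_raw = (c_fn * pi) / (c_fn * pi + c_fp * (1 - pi))
    theta_star = clamp(theta_raw / harm_weight, 0.01, 0.95)
    return theta_star, c_fp  # c_fp doubles as fp_penalty_weight
```

Plugging in the Instagram row from the compiled-policy table below (π=0.030, fn=critical, fp=low, hw=1.5) reproduces θ* = 0.369, which matches the threshold quoted in the GET_POLICY message.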
| Field | Source | Sanitization |
|---|---|---|
| base_rate (π) | LLM extraction from transparency report | sanitize_pi: clamp to [0.0005, 0.05]; >0.05 logs "likely enforcement rate misread, clamped". The prompt also instructs the LLM to return 0.005 if it sees an enforcement rate or no prevalence figure. |
| fn_cost_signal | LLM extraction | invalid → high |
| fp_cost_signal | LLM extraction | invalid → medium |
| harm_weight | LLM extraction | non-numeric → 1.0 |
| primary_enforcement_signal | LLM extraction | None / blank / non-string → photo_reuse |
| confidence | LLM extraction | non-numeric → 0.0 |
query = f"{platform} fake account content policy enforcement 2024 2025"

The previous query was Meta/Instagram-specific; the generic form works for any platform name. Domain filtering (is_high_signal_source) was removed for the same reason — it gated to meta.com/snap.com domains.
- Cached at policy_cache/{platform_lowercase}.json.
- Entries older than CACHE_TTL_DAYS = 30 are treated as stale and recompiled.
- compile_policy(platform, use_cache=True) is the runtime entry; the --use-cache flag controls CLI behavior (default: re-compile when invoked from the CLI).
- FALLBACK_POLICIES provides hardcoded params for Instagram / Snapchat. Any other platform falls back to GENERIC_FALLBACK (π=0.005, fn=high, fp=medium, hw=1.0).
- Fallback policies set used_fallback=True (a new field on PlatformPolicy).
- The threshold value in fallbacks is computed via the same formula — there is no hardcoded threshold in the policy compiler anymore.
After every compile, the compiler prints warnings for any of:
| Trigger | Meaning |
|---|---|
| θ* > 0.90 | agent will almost never flag — check fn_cost extraction |
| θ* < 0.005 | agent will flag nearly everything — check fp_cost extraction |
| base_rate > 0.05 | likely enforcement-rate misread |
| confidence < 0.60 | low extraction quality; consider falling back |
| primary_signal ∉ {photo_reuse, bio_template, ip_cluster, behavior} | not a known tool action |
Sanity check does not block compilation; it surfaces issues so an operator can review before running eval.
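The trigger table above maps directly onto a warning-collecting function. This is an illustrative sketch using the thresholds from the table; the real sanity_check_policy may phrase its warnings differently:

```python
# Illustrative sanity check; trigger thresholds come from the table above,
# wording and return type are assumptions.
KNOWN_SIGNALS = {"photo_reuse", "bio_template", "ip_cluster", "behavior"}

def sanity_check_policy(policy: dict) -> list[str]:
    """Return human-readable warnings; never blocks compilation."""
    warnings = []
    if policy["threshold"] > 0.90:
        warnings.append("theta* > 0.90: agent will almost never flag")
    if policy["threshold"] < 0.005:
        warnings.append("theta* < 0.005: agent will flag nearly everything")
    if policy["base_rate"] > 0.05:
        warnings.append("base_rate > 0.05: likely enforcement-rate misread")
    if policy["confidence"] < 0.60:
        warnings.append("confidence < 0.60: low extraction quality")
    if policy["primary_enforcement_signal"] not in KNOWN_SIGNALS:
        warnings.append("primary_signal is not a known tool action")
    return warnings
```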
python -m server.policy_compiler --platform <Name> # always recompile
python -m server.policy_compiler --platform <Name> --use-cache

| Platform | π | fn_signal | fp_signal | hw | θ* | C_fp | confidence | used_fallback |
|---|---|---|---|---|---|---|---|---|
| X | 0.005 | high | low | 1.0 | 0.091 | 0.10 | 0.80 | False |
| Instagram | 0.030 | critical | low | 1.5 | 0.369 | 0.10 | 0.80 | False |
| Snapchat | 0.005 | low | low | 1.0 | 0.025 | 0.10 | 0.50 ⚠ | False |
| LinkedIn | 0.005 | critical | low | 1.0 | 0.167 | 0.10 | 0.80 | False |
| Reddit | 0.005 | low | low | 1.0 | 0.025 | 0.10 | 0.50 ⚠ | False |
Snapchat and Reddit currently raise the low confidence sanity warning — extraction is noisy on those transparency reports. Consider forcing the fallback path before training on them.
5. Hidden-signal architecture
Episode JSON stores hidden signals at episode level, not per account:
{
"episode_id": "easy_042_Instagram",
"platform": "Instagram",
"hidden_signals": {
"photo_reuse": {"acc_0001": 0.87, ...},
"bio_template": {"acc_0001": 0.72, ...},
"ip_cluster": {"acc_0001": "ip_gang_42", ...}
}
}

account.features start with photo_reuse_score = 0.0, bio_template_score = 0.0, ip_cluster_id = "". Tool handlers copy from ep["hidden_signals"] into account.features and refresh the cached profile so subsequent observations carry the revealed value.
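The copy-on-reveal step can be sketched as below. The hidden_signals layout follows the episode JSON above; the dict-based account representation is a simplified stand-in for the real AccountProfile objects:

```python
# Illustrative tool-reveal handler: copy a hidden episode-level signal into
# the account's visible features. The dict shapes are simplified stand-ins
# for the real episode/account objects in server/environment.py.
def reveal_photo_reuse(ep: dict, accounts: dict, acc_id: str) -> float:
    """Copy the hidden photo_reuse value into the account's visible features.

    Accounts absent from hidden_signals read as 0.0, i.e. 'no reuse found'."""
    score = ep["hidden_signals"]["photo_reuse"].get(acc_id, 0.0)
    accounts[acc_id]["features"]["photo_reuse_score"] = score
    return score
```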
Known limitation. generator.py accepts a platform arg but currently produces identical hidden-signal distributions for every platform. Platform conditioning is therefore purely prompt-side — the LLM learns to read θ* and C_fp from the prompt and reward, not to recognize platform-specific data shapes. Parametrizing the generator by platform is a separate follow-up.
Stateless risk functions (kept from Round 1): compute_node_risk, compute_behavior_risk, compute_graph_risk, compute_hub_legitimacy, compute_fake_risk.
Round 2 additions:
- compute_weighted_fake_risk(..., primary_signal) boosts the platform's primary signal (node risk +0.15 for content signals; behavior risk +0.15 for ip_cluster).
- classify_risk(fake_risk, threshold) accepts the platform threshold.
- grader_score(tp, fp, fn, steps, max_steps, threshold, fp_penalty_weight) adds 0.05 × (1 − threshold) to reward stricter platforms.
Win conditions (unchanged from Round 1): easy/medium recall ≥ 0.8, precision ≥ 0.7; hard recall ≥ 0.9, precision ≥ 0.8.
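The primary-signal boost in compute_weighted_fake_risk can be illustrated in isolation. This is a simplified stand-in: it only shows which risk component gets the +0.15 boost; the actual blend of node and behavior risk lives in server/scoring.py:

```python
# Simplified stand-in for the primary-signal boost described above.
# Which component the real blend feeds into afterwards is not shown here.
CONTENT_SIGNALS = {"photo_reuse", "bio_template"}

def apply_primary_signal_boost(node_risk: float, behavior_risk: float,
                               primary_signal: str) -> tuple[float, float]:
    """Boost the risk component matching the platform's primary signal."""
    if primary_signal in CONTENT_SIGNALS:
        node_risk = min(1.0, node_risk + 0.15)       # content signal -> node risk
    elif primary_signal == "ip_cluster":
        behavior_risk = min(1.0, behavior_risk + 0.15)  # infra signal -> behavior risk
    return node_risk, behavior_risk
```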
| Action | Step cost | Score delta | Reveals | Notes |
|---|---|---|---|---|
| GET_POLICY | 0 | +0.20 once (first action) | — (returns PlatformPolicy summary in message) | Free; bonus only fires on _action_count == 1 |
| INSPECT | 1 | −0.01 | full profile, edges | needed before any DP1/DP2 logic |
| REVERSE_IMAGE_SEARCH | 1 | −0.01 (−0.05 if redundant) | photo_reuse_score | sets account.features.photo_reuse_score |
| ANALYZE_BIO | 1 | −0.01 (−0.05 if redundant) | bio_template_score | sets account.features.bio_template_score |
| CHECK_IP | 2 | −0.02 (−0.10 if redundant) | ip_cluster_id + cluster-size message | heaviest tool; only worth it for shared_ip ≥ 5 |
| INVESTIGATE_NETWORK | 2 | −0.02 | 2-hop expansion + SUSPECT cascade | unchanged from Round 1 |
| FLAG | 0 | −0.15 if no evidence (deny) | dual SUSPECT cascade (follow-graph + IP) | "no evidence" = not inspected AND no tool used on the account |
| UNFLAG | 0 | 0 | — | unchanged |
| SUBMIT | 0 | terminal-reward formula (§ 8) | end episode | also surfaces decision_package and grader_score at top level |
All tool handlers validate acc_id in self._accounts, refresh the cached profile, and force _do_submit(forced=True) if max steps were consumed.
Per-step delta is now visible on every /step response: it is round(self._score - self._last_score, 4). terminal_reward overrides the delta on the SUBMIT step so the caller sees the full episode reward there.
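The delta bookkeeping is small enough to sketch in full (a minimal stand-in; attribute names mirror the _score / _last_score fields the document describes):

```python
# Minimal sketch of per-step reward delta tracking; attribute names follow
# the document's _score / _last_score, the class itself is hypothetical.
class RewardTracker:
    def __init__(self) -> None:
        self._score = 0.0
        self._last_score = 0.0

    def step_delta(self, new_score: float) -> float:
        """Return the visible per-step reward: round(score - last_score, 4)."""
        self._last_score, self._score = self._score, new_score
        return round(self._score - self._last_score, 4)
```

On the SUBMIT step the environment would return terminal_reward instead of this delta, per the paragraph above.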
+0.20 GET_POLICY as first action (once per episode)
-0.01 per inspect / reverse_image_search / analyze_bio (time cost)
-0.02 per check_ip / investigate_network (time cost)
-0.05 per redundant reverse_image_search / analyze_bio
-0.10 per redundant check_ip
-0.15 blind FLAG (no inspect, no tool used on account) ← deny + penalty
reward = tp · 1.0
− fp · self._policy.fp_penalty_weight (= C_fp; varies per platform)
− fn · 0.3
+ 5.0 if recall ≥ win_recall AND precision ≥ win_precision
+ 3.0 if tp == 10 (perfect recall)
+ 2.0 if partial win (recall met, precision missed)
+ 1.0 if SUBMIT with ≥ 50% steps remaining
+ 2.0 if Instagram and precision ≥ 0.95
+ 2.0 if Snapchat and recall ≥ 0.95
− 1.0 × evasion_count (hard task only)
− 2.0 if forced SUBMIT (ran out of steps)
− 0.15 × |unsupported_flags| (flags with no revealed signals at submit time)
Note: fp_penalty_weight is platform-specific and is the principal lever the policy compiler pulls. The same FP behavior costs 15× more on a platform whose fp_cost_signal compiles to high (1.5) than on one that compiles to low (0.1).
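The terminal formula above can be reconstructed as a function. This is a hedged sketch: the term weights come straight from the list, but the gating details (how the env detects a forced submit, the hard-task-only evasion penalty, the exact "≥ 50% steps remaining" test) are simplified relative to server/environment.py:

```python
# Hedged reconstruction of the terminal-reward formula; gating details are
# simplified assumptions, only the weights are taken from the document.
def terminal_reward(tp, fp, fn, fp_penalty_weight,
                    recall, precision, win_recall, win_precision,
                    steps_used, max_steps, platform,
                    evasion_count=0, forced=False, unsupported_flags=0):
    r = tp * 1.0 - fp * fp_penalty_weight - fn * 0.3
    if recall >= win_recall and precision >= win_precision:
        r += 5.0                       # full win
    elif recall >= win_recall:
        r += 2.0                       # partial win: recall met, precision missed
    if tp == 10:
        r += 3.0                       # perfect recall of the 10-member ring
    if steps_used <= max_steps / 2:
        r += 1.0                       # submitted with >= 50% steps remaining
    if platform == "Instagram" and precision >= 0.95:
        r += 2.0
    if platform == "Snapchat" and recall >= 0.95:
        r += 2.0
    r -= 1.0 * evasion_count           # hard task only in the real env
    if forced:
        r -= 2.0                       # ran out of steps
    r -= 0.15 * unsupported_flags      # flags with no revealed signals
    return r
```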
- FakeGangAction: action_type: ActionType, account_id: Optional[str]
- FakeGangObservation: done, reward (per-step delta or terminal), visible_accounts[AccountProfile] (now populated for every visible id), visible_account_ids, flagged_ids, inspected_ids, graph_edges, steps_remaining, evasion_triggered, evasion_count, task, message, suspect_ids, platform
- FakeGangState: episode_id, step_count, task, score_so_far, evasion_count, network_size, gang_size, episode_seed, platform
- PlatformPolicy: platform, threshold, base_rate, fn_cost_signal, fp_cost_signal, harm_weight, primary_enforcement_signal, fp_penalty_weight, sources, confidence, compiled_at, used_fallback
{
"observation": { ... },
"done": <bool>,
"reward": <float | null>,
"message": "...",
"decision_package": { ... } | null, // populated after SUBMIT
"grader_score": <float> | null // populated after SUBMIT, sourced from decision_package
}

decision_package (after SUBMIT) carries:
- platform, flagged_accounts[], recommended_action ∈ {queue_for_review, temporary_hold, scheduled_ban, batch_takedown}
- evidence_summary: flagged, revealed_photo_reuse, revealed_bio_template, revealed_ip_cluster, unsupported_flags[]
- policy_rationale: textual explanation including θ*, primary signal, FP penalty, observed precision/recall
- tp, fp, fn, precision, recall, reward, grader_score
The terminal message also embeds the keywords flagged_accounts, evidence_summary, policy_rationale, grader_score for callers that grep the message string.
| Endpoint | Method | Notes |
|---|---|---|
| /health | GET | {"status":"healthy"} |
| /reset | POST | {task, seed, episode_id} → StepResponse |
| /step | POST | FakeGangAction body → StepResponse (per-step reward delta + decision_package + grader_score on SUBMIT) |
| /state | GET | Current FakeGangState |
| /tasks | GET | Task list + Round 2 action_schema (9 actions) |
| /grader | GET | Normalized [0,1] score; requires SUBMIT first |
| /metadata | GET | HF Spaces metadata |
| /schema | GET | Pydantic JSON schemas |
| /mcp | POST | MCP JSON-RPC for tools/list |
| /baseline | POST | Runs rule-based baseline on all 3 tasks |
| / | GET | Gradio playground |
openenv.yaml action schema mirrors all nine action types (Round 1 five plus the four Round 2 tools).
reset(task, seed)
↓
GET_POLICY (step 0) ← always; bonus +0.20
↓
loop over visible accounts sorted by (suspect_flag, risk_score) desc:
INSPECT if not yet inspected
INVESTIGATE_NETWORK if risk_score ≥ 0.80 (once per account, ≥5 steps left)
DP1 loop (LLM) — pick a tool or "done"
reverse_image_search | analyze_bio | check_ip | done
stops on "done", missing budget, or photo + bio both revealed
DP2 (LLM) — flag-or-skip
flag → env.step(FLAG)
skip → leave alone, move to next account
↓
SUBMIT
Stops early when done is signaled, steps_remaining ≤ 1, or max_accounts_per_episode = 15 accounts have been processed.
DP1 — tool selection prompt includes:
- platform, primary_signal, θ*
- account_id, risk_score, hub_legitimacy
- Each revealed signal value or None
- steps_remaining, tool costs
DP2 — flag decision prompt includes:
- All revealed signals for the account
- θ*, fp_penalty = C_fp
- Running flagged / 10, steps_remaining
Each prompt asks for exactly one token so parsing is robust. Invalid completions are counted in dp1_invalid / dp2_invalid for QA.
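A one-token completion makes parsing almost trivial. The sketch below shows one way to validate a completion against the two token sets (the token sets come from this document; the normalization details are assumptions, not the runner's exact code):

```python
# Illustrative one-token parser for DP1/DP2 completions; token sets come
# from the document, the normalization (lowercasing, quote stripping) is
# an assumption about the runner's behavior.
DP1_TOKENS = {"reverse_image_search", "analyze_bio", "check_ip", "done"}
DP2_TOKENS = {"flag", "skip"}

def parse_decision(completion: str, valid: set) -> "str | None":
    """Return the token if the completion's first word is valid, else None.

    Callers would count None results in dp1_invalid / dp2_invalid."""
    words = completion.strip().lower().split()
    token = words[0].strip('".,') if words else ""
    return token if token in valid else None
```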
eval-models/results/{model}_{platform}_results.jsonl — one line per episode:
{
"model": "Bedrock/qwen.qwen3-next-80b-a3b",
"platform": "Instagram",
"task": "easy", "seed": 0,
"episode_id": "easy_000_Instagram",
"threshold": 0.369, "primary_signal": "photo_reuse",
"steps_taken": 14, "inspected": 5,
"tool_calls": {"reverse_image_search": 5, "analyze_bio": 4, "check_ip": 1,
"get_policy": 1, "investigate_network": 1},
"flagged": 7,
"dp1_calls": 12, "dp2_calls": 5, "dp1_invalid": 0, "dp2_invalid": 0,
"reward": 4.32, "grader_score": 0.71,
"final_message": "...", "wall_seconds": 23.4
}

from _round2_runner import run_evaluation
run_evaluation(
model_name="qwen-72b",
call_llm=lambda prompt: ..., # injectable adapter
platform="Instagram",
base_url="http://localhost:7860",
tasks=["easy", "medium", "hard"],
seeds=[0, 1, 2],
)

The runner remaps requested seeds to the env's parity rule so --platform Instagram actually runs Instagram episodes (even seeds) and --platform Snapchat runs Snapchat (odd). Other platform names pass seeds through unmodified (the env then falls back to its parity default for that seed).
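One possible remapping consistent with that description is shown below. Only the parity outcome is documented; the exact arithmetic (shifting odd seeds up by one rather than, say, doubling) is an assumption for illustration:

```python
# Hypothetical seed remap: env assigns Instagram to even seeds and Snapchat
# to odd seeds, so requested seeds are nudged to the right parity. The
# shift-by-one scheme is an assumption; only the parity rule is documented.
def remap_seed(seed: int, platform: "str | None") -> int:
    if platform == "Instagram":
        return seed if seed % 2 == 0 else seed + 1
    if platform == "Snapchat":
        return seed if seed % 2 == 1 else seed + 1
    return seed  # other platforms: pass through, env applies its parity default
```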
The runner unconditionally inserts the project root at sys.path[0] and evicts any cached models / client modules so a stale copy in ~/.local/lib/python3.12/site-packages cannot win. If your shim raises ActionType has no attribute GET_POLICY, that means the safety insert was skipped — verify you are running today's runner.
Each shim is ~30 lines. It declares the model identifiers and delegates to the runner via _llm_adapters.make_caller:
| Shim | HF model | Bedrock model |
|---|---|---|
| qwen | Qwen/Qwen2.5-72B-Instruct | qwen.qwen3-next-80b-a3b |
| gemma | google.gemma-3-12b-it | same |
| deepseek | deepseek.v3.2 | same |
| llama | meta.llama4-scout-17b-instruct-v1:0 | same |
| mistral | mistral.ministral-3-8b-instruct | same |
| nvidia | nvidia.nemotron-super-3-120b | same |
_llm_adapters.py exposes make_hf_caller(model), make_bedrock_caller(model_id), and a unified make_caller(backend, hf_model, bedrock_model). Both backends strip <think>...</think> reasoning blocks and retry up to 3× with exponential backoff.
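The reasoning-block stripping both backends perform can be sketched with a regex (assuming the blocks are well-formed and non-nested; the real adapter may handle edge cases differently):

```python
# Sketch of <think>...</think> stripping as described for _llm_adapters.py.
# Assumes well-formed, non-nested reasoning blocks.
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_think(completion: str) -> str:
    """Remove reasoning blocks and surrounding whitespace before parsing."""
    return THINK_RE.sub("", completion).strip()
```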
# HF router (needs HF_TOKEN)
python eval-models/qwen_test_judge_eval.py --url http://localhost:7860 --platform Instagram
# AWS Bedrock (needs AWS_* env vars)
python eval-models/qwen_test_judge_eval.py --bedrock --url http://localhost:7860 --platform Snapchat \
    --tasks easy medium --seeds 0 1 2

Source of truth (read first):
- reference.md — this file
- models.py — data schemas (PlatformPolicy.used_fallback is new)
- server/policy_compiler.py — Bayesian θ*, sanity check, generic Tavily, 30-day cache
- server/environment.py — reset/step/state, tool handlers, per-step reward delta, no-evidence flag deny, decision package
- server/app.py — StepResponse with top-level decision_package and grader_score
- server/scoring.py — risk/grader math
- server/generator.py — episode generation, hidden_signals
- eval-models/_round2_runner.py — deterministic loop + DP1/DP2
- eval-models/_llm_adapters.py — HF + Bedrock callers
- eval-models/{model}_test_judge_eval.py — six thin shims
- openenv.yaml — action schema mirrors all 9 actions
- check.sh — 12-step Round 2 system check (server side)
Operational:
- policy_cache/{platform}.json — compiled policies (delete to force recompile)
- episodes/{task}_{seed}.json — generated episodes (regenerate with python -m server.generator)
- eval-models/results/{model}_{platform}_results.jsonl — per-episode eval logs
Round 1 still functional:
- agent/train.py, agent/policy.py, agent/memory.py, agent/reflection.py, agent/hybrid_policy.py
- inference.py, bedrock_model.py, client.py
- validate.py, test_round2.py
# 1. Install
cd fake_gang_env
uv sync # or: pip install -r requirements.txt
# 2. Compile / refresh platform policies (one-time, then per ≥30 days)
python -m server.policy_compiler --platform Instagram
python -m server.policy_compiler --platform Snapchat
python -m server.policy_compiler --platform X
python -m server.policy_compiler --platform LinkedIn
# 3. (Re)generate episodes
python -m server.generator
# 4. Start the env server
python -m uvicorn server.app:app --port 7860
# 5. End-to-end system check (12 verifications)
bash check.sh
# 6. Run a model shim against the live server
export HF_TOKEN=... # or AWS_*
python eval-models/qwen_test_judge_eval.py \
--url http://localhost:7860 \
--platform Instagram \
--tasks easy medium hard \
--seeds 0 1 2
# Logs: eval-models/results/Qwen_Qwen2.5-72B-Instruct_instagram_results.jsonl

Docker:
docker build -f server/Dockerfile -t graphstrike .
docker run -p 7860:7860 -v $(pwd)/memory:/app/memory -v $(pwd)/runs:/app/runs graphstrike

Twelve numbered checks against a running server at http://localhost:7860:
| # | Check | Pass criterion |
|---|---|---|
| 1–4 | health, /tasks, /reset, /step GET_POLICY | endpoints respond; action schema lists 9 types; threshold appears in message |
| 5 | INSPECT first visible account | profile returned; account_id is real (extracted from visible_account_ids before inspect) |
| 6 | REVERSE_IMAGE_SEARCH | photo_reuse_score > 0 for that account in observation.visible_accounts[*] |
| 7 | ANALYZE_BIO | bio_template_score > 0 |
| 8 | CHECK_IP | message reports cluster, shared_ip_count populated |
| 9 | GET_POLICY first-action bonus | per-step reward ≥ 0.15 |
| 10 | redundant tool penalty | second reverse_image_search reward < first |
| 11 | blind FLAG penalty | flag without prior inspect/tool → reward ≤ −0.10 |
| 12 | full episode | submit response carries the four decision-package keywords + non-null grader_score |
CHECK 5–8 read from observation.visible_accounts[*] rather than a non-existent top-level profile field — the prior version of check.sh had that bug.
| # | File | Symptom | Root cause | Fix |
|---|---|---|---|---|
| 1 | server/environment.py | reward: null on every non-terminal step | _make_observation only set terminal_reward | Track _last_score; return score - _last_score as per-step delta |
| 2 | server/environment.py | visible_accounts: [] until INSPECT | observation included only _profiled | Build a profile for every _visible_id (cached for inspected, fresh otherwise). Tool reveals propagate because _build_profile reads from account.features, which the tool handlers update. |
| 3 | server/environment.py | Tool reveals invisible to caller | covered by Bug 2 | — |
| 4 | server/environment.py | GET_POLICY +0.20 not visible | accumulated into _score but _make_observation never returned it | covered by Bug 1 |
| 5 | server/environment.py, server/app.py | submit response missing decision-package keywords + grader_score | message lacked the literal keywords; StepResponse only had four fields | enrich submit message; add decision_package and grader_score as top-level fields on StepResponse |
| 6 | server/environment.py | blind FLAG (no inspect, no tool) returned 0 reward | submit-time unsupported_flags only fires at SUBMIT | _do_flag now denies blind flags immediately with −0.15 |
| 7 | eval-models/_round2_runner.py | ActionType has no attribute GET_POLICY when running shims | if _PARENT not in sys.path guard skipped the insert because the path was already present at lower priority; site-packages models.py won | Insert _PARENT at index 0 unconditionally; evict cached models/client from sys.modules |
| 8 | check.sh | acc_000 hardcoded; profile field read from wrong path | script bugs | extract real account_id from observation.visible_account_ids before CHECK 5; read profiles from observation.visible_accounts[*] |
| — | server/policy_compiler.py | θ* always ≈ 0.95 | formula direction inverted (computed FP-cost share, not FN-cost share) | θ_raw = C_fn·π / [C_fn·π + C_fp·(1−π)] |
| — | server/policy_compiler.py | enforcement-rate misreads (e.g. Snap π=0.262) | LLM confusion between "% removed" and "% prevalence" | sanitize_pi clamp [0.0005, 0.05] + warning; extraction prompt explicitly disambiguates |
| — | server/policy_compiler.py | crash on Pydantic validation when LLM returned None for primary_enforcement_signal | strict typing | coerce None / blank to photo_reuse; same for confidence |
After running python -m server.policy_compiler --platform <Name>:
| Property | Acceptable range | Action if outside |
|---|---|---|
| threshold | [0.005, 0.90] | review — likely cost-signal extraction issue |
| base_rate | [0.0005, 0.05] | review — likely enforcement-rate misread |
| confidence | ≥ 0.60 | force fallback or improve sources |
| primary_signal | one of {photo_reuse, bio_template, ip_cluster, behavior} | coerced to photo_reuse |
| used_fallback | match expectation | ensure Tavily/Groq keys are set if False expected |
Cross-platform ordering is not an invariant. Any platform may land anywhere on the [0.01, 0.95] θ* scale depending on its actual policy.
- Platform-specific episode generation — generate_episode accepts a platform arg but produces identical hidden-signal distributions. Parametrize π, signal strengths, and evasion behavior per platform for richer training data.
- TRL/GRPO trainer wrapper — the runner produces (prompt, completion) pairs at DP1/DP2 and per-step rewards. Threading these into a TRL DataCollator is the next step (training-side scope, not part of this readiness pass).
- Force-fallback flag on the CLI — a convenient way to ignore Tavily and use hardcoded params when the sanity check raises low-confidence warnings.
- hybrid_policy.py platform-aware upgrade — the Round-1 rule engine still uses fixed _THRESHOLDS; it could read env._policy.threshold. Low priority since agent/train.py and the eval runner are independent.
- Dashboard — dashboard/DASHBOARD_SPEC.md describes a React + D3 demo; not required.
- Hidden signals at episode level, not account level — easier to track revelation, cleaner rollback between episodes.
- Platform assignment by seed parity (env) — reproducible without extra RNG state; the eval runner remaps seeds when --platform is requested.
- Bayesian θ* — principled, explainable, varies sensibly when policy text changes. Action rule: FLAG if risk ≥ θ*.
- Asymmetric tool costs — CHECK_IP costs 2× to force the agent to use cheap signals first.
- Cached policies + 30-day TTL — hackathon-demo viable without network; live recompile on staleness.
- Two LLM decision points — keeps the LLM's job focused (tool-pick + flag/skip) and makes (prompt, completion, reward) tuples cleanly attributable for future RL training.
- Top-level decision_package + grader_score — callers shouldn't have to grep the message string for the four submission fields.
- bash check.sh — 12-step end-to-end against a running server (Round 2 system check).
- test_round2.py — 9-stage Python test against server/environment.py.
- validate.py — 24 HTTP validator checks against a running server.
- eval-models/{model}_test_judge_eval.py — judge model vs. environment scoring with the two-decision-point loop.
All four were verified against the current tree on 2026-04-25.