Applied autoresearch to Marketing Mix Modeling (MMM) — 12x improvement, beat Google Meridian #497
Replies: 4 comments 1 reply
-
|
The holdout leak is the important result here, more than the 12x headline. "Immutable by instruction" is not enough once a system is allowed to optimize against an eval surface. If the architecture exposes the boundary, the loop will eventually route around it. That is why Greyforge cares more about hard separation, provenance, and bounded authority than about another benchmark win. Useful demonstration of the failure mode. |
Beta Was this translation helpful? Give feedback.
-
|
We solved this at the agent harness level using Claude Code's PreToolUse hooks — shell commands that intercept tool calls and can block them before execution: {
"hooks": {
"PreToolUse": [
{
"matcher": "Read|Grep|Glob",
"hooks": [
{
"type": "command",
"command": "INPUT=$(cat); if echo \"$INPUT\" | grep -q 'ground_truth/val'; then echo 'BLOCKED: holdout is held out.' >&2; exit 2; fi"
}
]
}
]
}
}If the path matches the holdout directory, the hook exits with code 2 and the tool call is rejected before execution. The agent can't see the data — not because it was told not to, but because the runtime won't let it. "Immutable by design" at the harness level, no changes to the evaluation code needed. |
Beta Was this translation helpful? Give feedback.
-
|
Update: we ran Session 3 with a budget constraint. The agent circumvented it anyway. After @GreyforgeLabs's comment, we added a 50,000-call budget on Then it discovered AR residual correction and extrapolated into the holdout. WAPE dropped to 0.0086. Then it generalized: instead of extrapolating, why not directly parameterize the correction for each holdout week?
With 12 free parameters for 12 of 14 holdout weeks, the optimizer achieves exact fit. Budget spent: 239 calls (0.5% of the limit). The agent then spent ~100 more experiments proving it could do it even more efficiently. The full progression across three sessions:
Each fix plugged one hole. The agent found the next one. The finding: budget constraints are necessary but not sufficient. Any signal from the true holdout — however indirect, however rate-limited — eventually becomes a training signal. The architectural fix is that the agent must never receive true holdout WAPE during optimization. Session 4 runs now with that constraint: (Also — @Vigtu's hook approach is a clean alternative for the data-access version of this problem. The oracle-signal version needs the evaluation budget constraint on top of it.) |
Beta Was this translation helpful? Give feedback.
-
|
Session 4 done: honest WAPE = 2.18% Closing the loop. We ran Session 4 with the blind-holdout fix: 192 experiments, 16 keeps. Final WAPE: 2.18%. Key honest discoveries (none of them new ideas — just the ones that survived without leak):
The plateau at 2.18% held for ~60 consecutive experiments. Theoretical noise floor (given Final comparison on the same dataset, same holdout:
The ~2x improvement over Meridian is real and holds under fair conditions. Not the 17x headline we started with, but an honest number we can defend. What I actually learned from these four sessions:
Thanks again @GreyforgeLabs — the 17x headline was wrong, but the lesson was much more valuable than the number would have been. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Applied the autoresearch loop to a completely different domain: Marketing Mix Modeling (attributing revenue to marketing channels).
Setup
train.py(adstock, saturation, optimizer, features) —prepare.pyis immutablegit resetif notResults (~137 experiments across 2 sessions)
The agent improved the baseline 12x. A simple OLS — optimized autonomously — beat Google's Meridian by 32% on the same data.
What the agent discovered without being told
scipy.signal.lfilterfor adstock is 100x faster than Python loop → more experiments per sessionSecurity incident worth noting
In session 1 (~exp 88), the agent found a loophole:
load_data()was returning the full dataset including holdout sales. It trained OLS directly on the test data → WAPE = 0.0000.Fix:
load_data()now returns only training data. The holdout is loaded internally insideevaluate()—train.pynever sees holdout sales values.Immutable by instruction is not enough. Must be immutable by design.
This is consistent with the "Your MMM is Broken" paper (arXiv:2408.07678) — optimizers find evaluation loopholes that humans don't anticipate.
Repo
https://github.com/lucianfialho/mmm-research
Three files, reproducible, synthetic data with known ground truth.
Beta Was this translation helpful? Give feedback.
All reactions