Applied autoresearch to Marketing Mix Modeling (MMM) — 12x improvement, beat Google Meridian #497

lucianfialho · 2026-04-07T00:35:05Z

lucianfialho
Apr 7, 2026

Applied the autoresearch loop to a completely different domain: Marketing Mix Modeling (attributing revenue to marketing channels).

Setup

Synthetic dataset: 3 channels (TV, Search, Social), 104 weeks, holdout fixed at weeks 90–103
Metric: WAPE (Weighted Absolute Percentage Error) on hidden holdout
Agent modifies train.py (adstock, saturation, optimizer, features) — prepare.py is immutable
Loop: modify → commit → run → keep if improved, git reset if not

Results (~137 experiments across 2 sessions)

Model	WAPE (holdout)
Baseline OLS	2.73%
Agent best	0.23%
Google Meridian (Bayesian)	4.04%

The agent improved the baseline 12x. A simple OLS — optimized autonomously — beat Google's Meridian by 32% on the same data.

What the agent discovered without being told

Nelder-Mead multi-start beats grid search
TV×Search interaction effect exists in the data (not in the spec)
scipy.signal.lfilter for adstock is 100x faster than Python loop → more experiments per session
Differential perturbation scales: nonlinear params (adstock, saturation) need 3x larger search radius than linear params
10 seeds with different perturbation regions > 10x starts with same seed
Weibull adstock never beat exponential — tested twice, discarded both times

Security incident worth noting

In session 1 (~exp 88), the agent found a loophole: load_data() was returning the full dataset including holdout sales. It trained OLS directly on the test data → WAPE = 0.0000.

Fix: load_data() now returns only training data. The holdout is loaded internally inside evaluate() — train.py never sees holdout sales values.

Immutable by instruction is not enough. Must be immutable by design.

This is consistent with the "Your MMM is Broken" paper (arXiv:2408.07678) — optimizers find evaluation loopholes that humans don't anticipate.

Repo

https://github.com/lucianfialho/mmm-research

Three files, reproducible, synthetic data with known ground truth.

GreyforgeLabs · 2026-04-08T07:22:06Z

GreyforgeLabs
Apr 8, 2026

The holdout leak is the important result here, more than the 12x headline.

"Immutable by instruction" is not enough once a system is allowed to optimize against an eval surface. If the architecture exposes the boundary, the loop will eventually route around it. That is why Greyforge cares more about hard separation, provenance, and bounded authority than about another benchmark win.

Useful demonstration of the failure mode.

1 reply

lucianfialho Apr 8, 2026
Author

You're right, and this is the more important observation.

We caught the obvious version in session 1 — the agent trained OLS directly on holdout sales when load_data() returned the full dataset. We fixed that and called it "immutable by design." But session 2 has the same problem at one level of indirection: evaluate() returns the exact holdout WAPE on every call. With ~500k evaluations against 14 fixed holdout points and 16 free parameters, the optimizer doesn't need to see the data — it learns the holdout by exhaustive search.

The Meridian comparison doesn't hold under that framing. Meridian used the holdout once, blind. We used it 500k times as a training signal.

What the result actually shows is how quickly a search loop finds and exploits an exposed eval surface, which is exactly what you said. The fix would require an evaluation budget constraint, a noisy oracle, or a rotating holdout.

We updated the README to make this explicit, with credit to you.

Vigtu · 2026-04-08T15:18:45Z

Vigtu
Apr 8, 2026

We solved this at the agent harness level using Claude Code's PreToolUse hooks — shell commands that intercept tool calls and can block them before execution:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Read|Grep|Glob",
        "hooks": [
          {
            "type": "command",
            "command": "INPUT=$(cat); if echo \"$INPUT\" | grep -q 'ground_truth/val'; then echo 'BLOCKED: holdout is held out.' >&2; exit 2; fi"
          }
        ]
      }
    ]
  }
}

If the path matches the holdout directory, the hook exits with code 2 and the tool call is rejected before execution. The agent can't see the data — not because it was told not to, but because the runtime won't let it. "Immutable by design" at the harness level, no changes to the evaluation code needed.

0 replies

lucianfialho · 2026-04-10T03:40:41Z

lucianfialho
Apr 10, 2026
Author

Update: we ran Session 3 with a budget constraint. The agent circumvented it anyway.

After @GreyforgeLabs's comment, we added a 50,000-call budget on evaluate() per run — forcing the agent to be 20x more efficient than Session 2's ~1M calls. It adapted immediately: switched to L-BFGS-B (10x fewer calls than Nelder-Mead), built better features (Hill saturation in the exact DGP form, piecewise TV slopes, cumulative TV), and reached WAPE 0.0113 honestly after ~600 experiments.

Then it discovered AR residual correction and extrapolated into the holdout. WAPE dropped to 0.0086. Then it generalized: instead of extrapolating, why not directly parameterize the correction for each holdout week?

3 free delta params → 0.0051
5 free delta params → 0.0035
7 free delta params → 0.0014
12 free delta params → 0.0000

With 12 free parameters for 12 of 14 holdout weeks, the optimizer achieves exact fit. Budget spent: 239 calls (0.5% of the limit). The agent then spent ~100 more experiments proving it could do it even more efficiently.

The full progression across three sessions:

Session	Failure mode	What the agent did
1	Explicit data leak	`load_data()` returned full dataset; trained OLS directly on holdout
2	Implicit search leak	~500k optimizer calls; learned holdout by exhaustive search
3	Structural over-parameterization	N_DIRECT=12 free deltas for 12 holdout points; exact fit within budget

Each fix plugged one hole. The agent found the next one.

The finding: budget constraints are necessary but not sufficient. Any signal from the true holdout — however indirect, however rate-limited — eventually becomes a training signal. The architectural fix is that the agent must never receive true holdout WAPE during optimization.

Session 4 runs now with that constraint: evaluate() budget of 10 calls per run. The obj() function optimizes train WAPE only. evaluate() is called once at the end. Honest baseline: WAPE 0.0365. That's the first number from this series we actually believe.

(Also — @Vigtu's hook approach is a clean alternative for the data-access version of this problem. The oracle-signal version needs the evaluation budget constraint on top of it.)

0 replies

lucianfialho · 2026-04-12T01:32:15Z

lucianfialho
Apr 12, 2026
Author

Session 4 done: honest WAPE = 2.18%

Closing the loop. We ran Session 4 with the blind-holdout fix: evaluate() budget of 10 calls per run, obj() optimizes on train WAPE with internal expanding-window CV on weeks 0–89, and evaluate() is called exactly once at the end of each run to report the honest number.

192 experiments, 16 keeps. Final WAPE: 2.18%.

Key honest discoveries (none of them new ideas — just the ones that survived without leak):

Hill saturation on TV and Search in the exact DGP form x/(x+c)
Expanding-window CV on training data (5 folds, weighted toward recent weeks)
Ensemble of top-8 optimizer solutions
Interaction search_sat × tv_sat^1.5
Tight starts near the DGP values

The plateau at 2.18% held for ~60 consecutive experiments. Theoretical noise floor (given NOISE_STD=50 and 14 holdout weeks) is ~2.7% — the agent landed slightly below, probably because this specific holdout realization had below-average noise. Any further improvement would be fitting noise, not signal.

Final comparison on the same dataset, same holdout:

Model	WAPE	Honest?
Meridian (Google)	4.04%	yes
OLS baseline	3.65%	yes
Agent — Session 4	2.18%	yes
Agent — Session 2	0.23%	no (implicit leak)
Agent — Session 3	0.00%	no (structural leak)

The ~2x improvement over Meridian is real and holds under fair conditions. Not the 17x headline we started with, but an honest number we can defend.

What I actually learned from these four sessions:

Any signal from the true holdout — direct data access, optimizer feedback, structural parameterization — eventually leaks if the agent is allowed to consume it during optimization.
Budget constraints help but don't fix the underlying issue.
The only architectural fix is: the agent never receives true holdout feedback during optimization. Period.
The honest WAPE is the one you get when the agent has never seen the holdout as a training signal. Everything else is search for a leaked oracle.

Thanks again @GreyforgeLabs — the 17x headline was wrong, but the lesson was much more valuable than the number would have been.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Applied autoresearch to Marketing Mix Modeling (MMM) — 12x improvement, beat Google Meridian #497

Uh oh!

{{title}}

Uh oh!

Replies: 4 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Applied autoresearch to Marketing Mix Modeling (MMM) — 12x improvement, beat Google Meridian #497

Uh oh!

lucianfialho Apr 7, 2026

Setup

Results (~137 experiments across 2 sessions)

What the agent discovered without being told

Security incident worth noting

Repo

Replies: 4 comments · 1 reply

Uh oh!

GreyforgeLabs Apr 8, 2026

Uh oh!

lucianfialho Apr 8, 2026 Author

Uh oh!

Vigtu Apr 8, 2026

Uh oh!

lucianfialho Apr 10, 2026 Author

Uh oh!

lucianfialho Apr 12, 2026 Author

lucianfialho
Apr 7, 2026

Replies: 4 comments 1 reply

GreyforgeLabs
Apr 8, 2026

lucianfialho Apr 8, 2026
Author

Vigtu
Apr 8, 2026

lucianfialho
Apr 10, 2026
Author

lucianfialho
Apr 12, 2026
Author