Skip to content

test(benchmark): pure-simulation calibration controller comparison framework#2007

Draft
cl445 wants to merge 9 commits into
KartoffelToby:developfrom
cl445:feature/calibration-benchmark
Draft

test(benchmark): pure-simulation calibration controller comparison framework#2007
cl445 wants to merge 9 commits into
KartoffelToby:developfrom
cl445:feature/calibration-benchmark

Conversation

@cl445
Copy link
Copy Markdown
Collaborator

@cl445 cl445 commented May 27, 2026

Summary

Adds tests/benchmark/ — a pure-simulation framework for comparing every Better Thermostat calibration mode against reproducible thermal-dynamics scenarios. No Home Assistant runtime, no hardware, no external data — every result is a deterministic function of code + seeds.

The framework reports an oracle-normalised 0..1 score matrix across four dimensions (overall, comfort, actuator, energy), each with a per-dimension σ across scenarios so the mean alone can't hide bimodal behaviour. A weighted overall collapses the three dimensions under a configurable user profile (balanced, comfort_first, longevity_first, energy_first).

Quick start: python -m tests.benchmark.runner

What's in here

  • Plant models: RC2 / RC3 lumped-parameter thermal plants, multi-TRV (N parallel radiators sharing one room), 13 plant profiles
  • Scenarios: 37 reproducible setpoint/disturbance scenarios + a synthetic AR(1) weather generator (Chicago-like, Denver-like presets)
  • Production controllers — adapters for every BT calibration mode:
    • mpc, pid, tpi — the three smart controllers
    • heating_power ("Time Based" UI) — EMA-learned heating rate + valve heuristic from utils/helpers.py
    • default ("Target Temperature Based") — offset-calibrated TRV-internal P-loop on the room sensor
    • aggressive ("Fix Calibration") — DEFAULT + −2.5 K boost while heating
    • no_calibration — TRV-internal P-loop on the TRV's own body sensor (no offset compensation)
  • Baselines: BangBang, LinearP, IdealOracle (sanity floor and ceiling)
  • Indirect-TRV wrapper: models Tado/Bosch/Tuya/Sonoff offset-mode behaviour around each of the four smart controllers
  • Metrics & scoring: comfort (overshoot/settling/SSE), actuator (valve travel), energy (integral valve usage), each oracle-normalised so 1.0 = oracle-equivalent
  • Reporter: console matrices for single-TRV, multi-TRV, and cross-plant comparisons, with per-metric σ across scenarios
  • Plant-fit demo (plant_fit/): derives PlantParams from an HA recorder-style CSV; ships with a deterministic synthetic dataset (no real-data leak)
  • README: quick-start, interpretation guide, threshold tables, anchor points, multi-TRV & indirect-TRV sections, limitations

Current results

python -m tests.benchmark.runner — single-TRV, balanced profile:

  controller          overall      σ  comfort      σ  actuator      σ   energy      σ    n
 *ideal_oracle          0.963  0.109    0.954  0.141     0.987  0.079    0.951  0.172   37
  pid                   0.818  0.121    0.946  0.123     0.525  0.373    0.938  0.160   37
  tpi                   0.804  0.161    0.747  0.220     0.825  0.277    0.917  0.157   37
  pid+indirect_trv      0.783  0.120    0.920  0.146     0.455  0.360    0.932  0.161  148
  heating_power         0.743  0.150    0.656  0.221     0.782  0.286    0.901  0.155   37
  tpi+indirect_trv      0.724  0.160    0.774  0.229     0.512  0.378    0.916  0.160  148
  linear_p              0.710  0.148    0.588  0.199     0.822  0.296    0.847  0.151   37
  mpc                   0.707  0.112    0.906  0.121     0.250  0.354    0.892  0.156   37
  mpc+indirect_trv      0.701  0.123    0.894  0.140     0.258  0.366    0.885  0.160  148
  default               0.697  0.152    0.579  0.195     0.851  0.304    0.760  0.148   37
  aggressive            0.685  0.070    0.948  0.116     0.093  0.272    0.913  0.160   37
  heating_power+indirect_trv  0.677  0.158  0.712  0.232  0.465  0.365   0.908  0.159  148
  bangbang              0.604  0.120    0.840  0.126     0.129  0.301    0.729  0.175   37
  no_calibration        0.564  0.133    0.546  0.201     0.855  0.290    0.173  0.106   37

Single-TRV story:

  • pid leads overall (strong comfort, weak actuator); tpi is best on actuator among smart controllers
  • aggressive posts the second-best comfort score in the field (0.948) but pays the worst actuator score after BangBang (0.093) — the +2.5 K boost forces near-full-open during every transient
  • no_calibration has the worst energy score by a wide margin (0.173) — the TRV body stays warmer than the room, so its internal P-loop closes the valve while the room is still cold (chronic under-heating)
  • default lands close to linear_p (it essentially is one — proportional on the room sensor with the TRV's internal gain)
  • heating_power ("Time Based") sits mid-pack — solid actuator and energy, weaker comfort because the heuristic doesn't actively track the setpoint

--multi-trv — same scenarios through three parallel radiators + BT's distribute_valve_percent. The ranking inverts: smooth-output controllers (linear_p, tpi) distribute cleanly across radiators; aggressive ones (pid, mpc, bangbang) divergence-amplify through the distribution stage. bangbang collapses to half its single-TRV score.

--plant-sweep — cross-plant overall, balanced profile, realistic anchor + four DOE envelope classes:

  controller         plant=realist plant=doe_sfd plant=doe_sfd plant=doe_sfd plant=doe_mid      mean      ±σ
  ideal_oracle               1.000         1.000         1.000         1.000         1.000     1.000   0.000
  tpi                        0.880         0.941         0.911         0.863         0.862     0.892   0.030
  linear_p                   0.812         0.878         0.849         0.825         0.821     0.837   0.024
  pid                        0.725         0.787         0.841         0.899         0.861     0.823   0.061
  mpc                        0.747         0.766         0.828         0.878         0.844     0.812   0.049
  bangbang                   0.363         0.517         0.684         0.876         0.751     0.638   0.180

tpi is the most plant-robust production controller (σ across plants = 0.030); pid/mpc swing 0.06–0.07 between best and worst envelopes; bangbang's σ of 0.18 makes it the by-far worst on cold-leaky envelopes and ironically competitive on heavy-mass plants.

Coverage

242 unit tests, ~97 % production coverage. ruff check, ruff format --check, pyright and mypy all run clean over the benchmark.

Scope

This is purely additive — it touches nothing under custom_components/. The benchmark consumes the production calibration modules (mpc, pid, tpi) and the heating-power constants from utils/const.py; those are read-only call sites.

One small fix outside the benchmark: tests/unit/test_sensor.py::test_cleanup_stale_empty_entry_removed used the legacy asyncio.get_event_loop().run_until_complete() pattern, which raises RuntimeError on Python 3.13 unless an earlier test set up a loop. Adding the benchmark tests changes pytest's collection order and exposes the latent issue — converted to the standard @pytest.mark.asyncio / await idiom.

Known limitations

The framework deliberately does not model:

  • Multi-zone thermal coupling between separate rooms
  • Occupancy / presence schedules beyond the diurnal sinus already in the scenario library
  • HVAC topologies other than radiator + boiler / heat-pump
  • Cross-session learning state (each scenario starts cold; heating_power defaults to 0.02 °C/min, a representative settled-in value)
  • The heat_loss cooling-rate estimator from utils/thermal_learning.py (not consumed by any active calibration mode)

These are plant-architecture extensions rather than missing adapters and would be appropriate follow-up work.

Test plan

  • python -m pytest tests/benchmark/tests -p no:homeassistant passes
  • python -m tests.benchmark.runner produces a score matrix for all controllers
  • python -m tests.benchmark.runner --plant-sweep produces cross-plant summary
  • python -m tests.benchmark.runner --multi-trv produces the three multi-TRV matrices
  • python -m tests.benchmark.plant_fit.fit_plant runs the in-memory demo

Summary by CodeRabbit

Release Notes

  • Tests
    • Added comprehensive benchmark testing suite for heating system controller calibration, including thermal plant simulation models, scenario definitions, and performance metrics computation.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 27, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a full calibration benchmark: simulators (plant/sensor/actuator), adapters (PID/TPI/MPC/passive/oracle/heating-power/indirect TRV), scenarios, synthetic weather, metrics/scoring/reporting, CLI runner (single/multi‑TRV), plant-fit demo, README, and comprehensive tests.

Changes

Calibration Benchmark Suite

Layer / File(s) Summary
Thermal models and I/O devices (plant, multi-TRV plant, sensor, actuator)
tests/benchmark/plant.py, tests/benchmark/multi_trv_plant.py, tests/benchmark/sensor.py, tests/benchmark/actuator.py
Implements RC2/RC3 TwoStatePlant with optional pipe delay, multi-TRV plant with profiles, deterministic Sensor, and Actuator with shaping profiles.
Controller adapters and wrappers (base, baselines, passive, PID/TPI/MPC, heating-power, indirect TRV)
tests/benchmark/adapters/*
Defines adapter protocol and adapters: baselines, passive modes, PID/TPI/MPC wrappers, heating power learner, and indirect TRV wrapper with vendor presets.
Scenarios, schedules, and synthetic weather
tests/benchmark/scenarios.py, tests/benchmark/schedules.py, tests/benchmark/weather/*
Adds schedule builders, synthetic climate generator/presets, and many ScenarioConfig instances.
Metrics, scoring, and text reporting
tests/benchmark/metrics.py, tests/benchmark/scoring.py, tests/benchmark/reporter.py
Computes metrics, profile-weighted scores, and renders score matrices and plant-sweep tables.
Benchmark runners (single-TRV and multi-TRV)
tests/benchmark/runner.py, tests/benchmark/multi_trv_runner.py
CLI runner executes scenarios across adapters; multi-TRV runner distributes valve across radiators and aggregates results.
Plant-fit demo and synthetic dataset generator
tests/benchmark/plant_fit/*
Generates deterministic synthetic datasets and fits plant parameters from CSV; includes CLI and ignore file.
Documentation and package inits
tests/benchmark/README.md, tests/benchmark/__init__.py, tests/benchmark/adapters/__init__.py
README with usage/interpretation, and package/module docstrings.
Tests: actuator, sensor, plant (RC2/RC3), pipe delay
tests/benchmark/tests/test_actuator_and_sensor.py, tests/benchmark/tests/test_plant.py, tests/benchmark/tests/test_rc3_and_equal_percentage.py, tests/benchmark/tests/test_sensor_lag_and_pipe_delay.py
Validates actuator shaping, sensor behavior, RC2/RC3 dynamics, and pipe-command delay.
Tests: adapters (smoke, behavior, heating-power)
tests/benchmark/tests/test_adapters.py, tests/benchmark/tests/test_heating_power_adapter.py
Adapter smoke tests and detailed HeatingPowerAdapter learning/reset/export behavior.
Tests: metrics, scoring, reporter formatting
tests/benchmark/tests/test_metrics.py, tests/benchmark/tests/test_scoring.py, tests/benchmark/tests/test_reporter.py
Checks metric math, scoring transforms, and formatted table rendering/ordering.
Tests: CLI, smoke, sensitivity, multi-TRV
tests/benchmark/tests/test_runner_cli.py, tests/benchmark/tests/test_runner_smoke.py, tests/benchmark/tests/test_sensitivity.py, tests/benchmark/tests/test_multi_trv.py
Covers CLI flows, MPC smoke, plant sensitivity/stabilisation effects, and multi-TRV runs.
Tests: schedules, weather synthetic, synthetic dataset and benchmark tests init
tests/benchmark/tests/test_schedules.py, tests/benchmark/tests/test_weather_synthetic.py, tests/benchmark/tests/test_synthetic_dataset.py, tests/benchmark/tests/__init__.py
Validates schedules and weather generators; runs dataset generation and fit roundtrip.
Misc: existing unit test async style update
tests/unit/test_sensor.py
Converts a coroutine run to native async test awaiting cleanup helper.

Sequence Diagram(s)

sequenceDiagram
  rect rgba(66, 135, 245, 0.5)
  participant CLI as Runner CLI
  participant AD as Adapter
  participant PL as Plant/Facade
  participant SE as Sensor
  participant MT as Metrics
  participant SC as Scoring
  participant RP as Reporter
  end
  CLI->>AD: step(ctx)
  AD->>PL: valve_percent
  PL-->>SE: true temps
  SE-->>CLI: sampled temps
  CLI->>MT: TimeSeries
  MT-->>CLI: MetricValues
  CLI->>SC: metrics + oracle + profile
  SC-->>CLI: scores
  CLI->>RP: scored results
  RP-->>CLI: score matrix text
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested labels

enhancement

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

@cl445 cl445 force-pushed the feature/calibration-benchmark branch 3 times, most recently from 55c720c to 90e2b42 Compare May 29, 2026 08:57
cl445 added 8 commits June 1, 2026 12:02
Lumped-parameter thermal plant that the benchmark drives controllers
against. The plant has two states (room + radiator) by default and an
optional third (wall mass) for RC3 envelopes; an explicit Euler
integrator on the caller's step size is sufficient for typical
``tau_rad ≥ 5 min`` settings. Profiles cover small/standard/large
residential rooms, an underfloor variant, four DOE Reference Building
envelope classes, a heat-pump low-supply-temperature setup, and the
realistic combined profile used by ``--plant-sweep``.

Sensor model: sampled-output with optional thermal lag (first-order),
EMA smoothing, additive noise, dropout, calibration bias and drift,
and Zigbee-style sample jitter. ``REALISTIC_SENSOR_PARAMS`` ships as a
preset matched to typical Zigbee TRVs.

Actuator model: linear, threshold (deadband), exponential, and
equal-percentage flow profiles plus optional hysteresis, integer
quantisation, and a generic deadband. Equal-percentage approximates
real residential TRV valve dynamics.

Multi-TRV plant: ``MultiTrvPlant`` shares one room state across N
parallel radiators with per-TRV gain, sensor offset, and deadband.
Used by ``multi_trv_runner.py`` to evaluate single-output controllers
plus BT's ``distribute_valve_percent`` distribution heuristic.

Modelling choices follow the standard lumped-RC literature
(Bacher & Madsen 2011; ISO 52016-1; Burford & Madsen 2019 for pipe
transport delay).

Tests cover plant stability and convergence, RC2 ↔ RC3 equivalence
in the wall-fast limit, equal-percentage shape, sensor thermal-lag
step response, and pipe-delay buffer correctness.
37 deterministic scenarios cover the operating regimes a residential
calibration controller meets in practice:

* Setpoint transitions (S01–S06, S25): small/large steps, frost-to-
  comfort cold-start, mid-heat changes, demand-response sequences.
* Outdoor dynamics (S07, S08, S22): cold step, slow warming ramp,
  diurnal cycle.
* Disturbances (S09–S11, S21): brief and long window openings, solar
  ramp, stochastic windows à la IEA Annex 79.
* Sensor pathologies (S12, S17, S18, S35): dropout, fixed bias, slow
  drift, asynchronous sampling jitter.
* Actuator pathologies (S19, S20): stiction and deadband.
* Plant variants (S05, S23, S29, S31, S33): slow radiator, boiler-
  limited supply, heat-pump sweet-spot, boiler cycling, UFH asymmetric
  setpoint cost.
* Lifecycle (S13, S14, S16, S24, S27, S30, S36): cold start, nightly
  setback, vacation, controller restart, pipe fill after idle, frost-
  drift recovery, mid-episode user override.
* Indirect TRV (S28): SP step on a Tado-style offset-mode TRV.
* Weather slices (S32, S37, S38): forecast-mismatch solar, week-long
  humid-continental and semi-arid winter slices.

``schedules.py`` factors out reusable shape generators (step,
constant, pulse, ramp, piecewise_step, sinus_diurnal, stochastic
windows, solar trapezoid).

``weather/synthetic.py`` produces ``(outdoor, solar)`` schedules from a
small statistical model — seasonal cosine + diurnal cosine + AR(1)
synoptic anomaly, plus an astronomical clear-sky envelope times an
AR(1)-cloud-fraction. Two climate presets (``CHICAGO_LIKE`` /
``DENVER_LIKE``) bracket distinct mid-latitude winter dynamics. No
external weather files; everything reproduces from a seed.

Scenario design draws on BOPTEST, ASHRAE Guideline 36, EN 16798-1
comfort categories, the DOE Reference Building envelope families, and
IEA EBC Annex 79 occupant-behaviour patterns.
``metrics.py`` computes from a recorded time series:

* Comfort: max overshoot, max undershoot, settling time, steady-state
  error, RMSE tracking, time above / below setpoint (K·h, BOPTEST
  ``tdis_tot`` split).
* Actuator: valve cycle count, total travel (Σ|Δu_pct|),
  sweet-spot residency (40–60 % modulation, heat-pump-relevant).
* Energy: integral valve usage.

``scoring.py`` turns the raw metrics into a continuous 0..1 score per
dimension, normalised against an ``IdealOracle`` reference on the
same scenario:

* ``comfort_score`` combines overshoot, settling, ss-error (0.4/0.4/0.2).
* ``actuator_score`` is total-travel-based — small high-frequency
  motions add up even when they don't reverse direction.
* ``energy_score`` is symmetric around the oracle's integral, so both
  over- and under-heating are penalised.

Four ``UserProfile`` weightings (``balanced``, ``comfort_first``,
``longevity_first``, ``energy_first``) recombine the three dimensions
into one ``overall`` value.

``reporter.py`` formats the results as a controller × dimension score
matrix (default), an optional per-scenario detail table, and a
cross-plant matrix used by ``--plant-sweep``. There is no PASS/FAIL
output — the scoring is continuous everywhere.

Tests cover boundary conditions of each scorer (oracle-equal returns
1.0, double-oracle integral returns 0.0, ``inf`` settling penalised,
profile weight invariants).
A controller plugs into the benchmark through ``adapters.base``:

  class ControllerAdapter(Protocol):
      name: str
      family: ControllerFamily
      def reset(self, prior: dict | None = None) -> None: ...
      def step(self, ctx: BenchmarkContext) -> BenchmarkOutput: ...

Production controllers (each ~30–50 LoC of thin glue around the
real BT module):

* ``mpc_adapter`` wraps ``compute_mpc`` from
  ``utils/calibration/mpc``; runs the controller against the
  benchmark's simulated time instead of wall clock.
* ``tpi_adapter`` wraps ``compute_tpi``.
* ``pid_adapter`` wraps ``compute_pid``.

Baselines (``adapters.baselines``):

* ``BangBangAdapter`` — hysteresis on/off at the setpoint.
* ``LinearPAdapter`` — proportional only, ``u = clip(Kp · err, 0, 100)``.
* ``IdealOracleAdapter`` — knows the plant model and computes the
  steady-state-inverting valve fraction. Used by the scorer as the
  per-scenario reference; also serves as the stabilisation controller
  before each scenario.

Indirect-TRV wrapper (``adapters.indirect_trv``) sits in front of any
inner adapter and converts its valve-percent decision into a TRV
setpoint subject to ``setpoint_step_K`` quantisation, an internal
hysteresis band, an internal P-loop, and optional command latency.
Four vendor presets — Tado, Bosch BTH-RA, Tuya TS0601, Sonoff TRVZB —
characterise the failure modes characteristic of offset-mode TRVs.

Tests construct each adapter, verify it produces well-formed output,
and that a temperature error drives demand for feedback-family
controllers.
``runner.py`` is the user-facing entry point. Each invocation:

1. Pre-computes ``IdealOracle`` metrics per (scenario, plant) for the
   scorer to normalise against.
2. Runs every requested controller × scenario × plant combination
   through ``_drive_adapter``, a small loop parametrised by a
   ``PlantFacade`` protocol so the same drive logic serves the
   single-TRV (``_SingleTrvFacade``) and multi-TRV
   (``_MultiTrvFacade``) plants.
3. Stabilises the plant with the Oracle for ``--stabilisation-min``
   simulated minutes before the test adapter takes over.
4. Honours a per-scenario ``controller_restart_t_s`` that simulates
   an HA restart mid-flight: snapshot + cold reset of the adapter.
5. Renders the score matrix (default) or, with ``--plant-sweep``,
   per-plant matrices plus a cross-plant summary.

CLI:

  python -m tests.benchmark.runner
  python -m tests.benchmark.runner --controller mpc --scenario S01_setpoint_step_small
  python -m tests.benchmark.runner --profile longevity_first --per-scenario
  python -m tests.benchmark.runner --plant-sweep

Flags: ``--controller`` (repeatable), ``--scenario`` (repeatable),
``--step-s`` (simulator step), ``--plant`` (repeatable; ``all`` for
the full sweep), ``--stabilisation-min``, ``--profile``,
``--per-scenario``, ``--plant-sweep``.

``multi_trv_runner.py`` adds the corresponding multi-TRV wrapper —
single-output controllers run on an aggregated equivalent plant; BT's
``distribute_valve_percent`` heuristic dispatches the valve command
across the parallel radiators.

Tests cover the runner smoke path, the plant-profile registry, and
the time-scale-aware duration scaling for slow plants.
Demonstrates how to estimate ``PlantParams`` from a Home Assistant
``recorder`` temperature export.

``generate_synthetic_data.generate_dataset`` produces a 60-day, two-
room, one-outdoor hourly dataset entirely in memory from a fixed seed
— deterministic, no committed CSV. The ground-truth ``PlantParams``
are documented inline so the fitter has a known answer to recover.

``fit_plant.main`` runs the fitting methodology:

1. Find natural cooling windows (monotonically dropping room temp).
2. Fit ``T(t) = T_out + (T0 − T_out) exp(−t / tau)`` per window.
3. Aggregate ``tau_room_min`` as median across windows.
4. Detect heating windows and estimate a coarse heating rate.

By default the fitter consumes the synthetic dataset in memory. Pass
a CSV path to fit against real recorder data (the long-format
``timestamp,value,entity_id`` that ``recorder``'s statistics table
exports). ``generate_synthetic_data.main(out_path)`` dumps the same
dataset to CSV for inspecting the recorder format.

The roundtrip test asserts that a freshly generated dataset produces
sensible-magnitude tau values for both rooms under the fitter's
algorithm — guarding against drift in either side of the pipeline.

The committed CSV is intentionally absent (``.gitignore``); it
regenerates on demand.
Five-section README that gets a reader from "what is this?" to
running the benchmark in under a minute:

* **Quick start** — one command + example score matrix output, with
  a one-paragraph explanation of what the numbers mean.
* **Common runs** — a table of the six most useful flag combinations
  (default, focused single-controller-and-scenario, profile reweight,
  per-scenario detail, multi-TRV, plant-sweep).
* **Score dimensions** — three bullets defining comfort, actuator
  and energy, plus how the four user profiles recombine them.
* **Adding a controller / scenario** — five-line recipes pointing
  at ``adapters/base.py`` and ``schedules.py``.
* **Module layout** — annotated directory tree as the architecture
  map; one line per module describing its job.

Plus a "plant-fit demo" pointer for users who want to fit
``PlantParams`` from their own HA recorder data.

No motivation walltext, no bibliography, no DESIGN.md narrative —
the code's docstrings and ``--help`` are the canonical references;
this file just opens the door.
The test used the broken ``asyncio.get_event_loop().run_until_complete()``
pattern. In Python 3.13 that raises ``RuntimeError: There is no current
event loop in thread 'MainThread'`` unless an earlier test happened to
set up a loop in the same session.

Convert to the ``@pytest.mark.asyncio`` / ``await`` idiom that the rest
of the test suite uses.
@cl445 cl445 force-pushed the feature/calibration-benchmark branch from 90e2b42 to c88eac7 Compare June 1, 2026 10:06
@cl445
Copy link
Copy Markdown
Collaborator Author

cl445 commented Jun 1, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 1, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@KartoffelToby KartoffelToby added this to the 1.9.0 milestone Jun 1, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 33

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/benchmark/adapters/base.py`:
- Around line 32-43: BenchmarkOutput currently only documents that exactly one
of valve_percent, setpoint_offset_K, or duty_cycle_pct must be set but doesn't
enforce it; add a __post_init__ method on the BenchmarkOutput dataclass that
counts how many of those three fields are not None and raises a ValueError if
the count is not exactly 1, including the actual values or which fields were set
in the error message; because the class is frozen, do not attempt to mutate
fields in __post_init__, only validate and raise on violation.

In `@tests/benchmark/adapters/baselines.py`:
- Around line 166-179: The oracle feedback P-term is using the noisy sensor
value ctx.current_temp_C; change it to use the plant truth ctx.raw_room_temp_C
when computing error_K so the perfect-knowledge adapter remains an upper bound.
Locate the block that computes error_K, u_fb_pct and returns BenchmarkOutput
(referenced by variables error_K, u_fb_pct, u_ff_pct and the BenchmarkOutput
call) and replace ctx.current_temp_C with ctx.raw_room_temp_C, keeping the
clamping/rounding logic unchanged.

In `@tests/benchmark/adapters/heating_power_adapter.py`:
- Around line 127-155: When handling the transition into heating in the block
guarded by "if is_heating and not self._was_heating" ensure any pending cycle is
finalized before resetting cycle state: detect an unfinished cycle (e.g.,
self._cycle_peak_temp or self._cycle_peak_t is set) and call
self._finalize_cycle() first, then proceed to set self._cycle_start_temp,
self._cycle_start_t and clear peak fields; this guarantees the previous cycle
updates heating_power rather than being overwritten.

In `@tests/benchmark/adapters/indirect_trv.py`:
- Around line 129-133: The reset(prior) implementation currently forwards the
entire exported snapshot to self.inner.reset and then always clears wrapper
state; change it to accept the same shape export_state() produces by extracting
the child snapshot under the "inner" key (e.g. child_prior = prior.get("inner")
if prior else None) and pass that to self.inner.reset(child_prior), and only
clear or restore the TRV-layer cache (_last_quantised_setpoint_C and
_pending_setpoints) from top-level keys in prior when present (fall back to
clearing when prior is None or keys missing); apply the same extraction/restore
pattern to the other reset implementations that mirror this behavior (the ones
near the later duplicated blocks).

In `@tests/benchmark/adapters/mpc_adapter.py`:
- Around line 34-43: The constructor currently uses a fixed default key which
causes different MpcAdapter instances to collide in the module-global
_MPC_STATES; change __init__ in MpcAdapter so when key is not explicitly
provided it generates a per-instance unique key (e.g., using uuid4 or combining
id(self) with a timestamp/monotonic counter), assign that to self._key, and then
pop only that key from mpc_mod._MPC_STATES; apply the same unique-key behavior
to the other constructor/initialization paths referenced (the alternative init
blocks around lines 51-56 and 63-75) so every instance isolates its MPC state.

In `@tests/benchmark/adapters/pid_adapter.py`:
- Around line 27-36: The default key "bench:trv:t0" causes instances of
PidAdapter to clobber each other's entries in pid_mod._PID_STATES; change the
constructor signature to accept key: str | None = None and when key is None
generate a unique key (e.g. using uuid4() or a timestamp+pid) and assign it to
self._key so each PidAdapter (and its reset() method) uses a distinct
pid_mod._PID_STATES key; update any references to the previous literal key in
__init__ and reset() to use self._key and only share a fixed key when the caller
explicitly passes one.

In `@tests/benchmark/adapters/tpi_adapter.py`:
- Around line 31-39: The adapter uses the fixed default key "bench:trv:t0" so
multiple TpiAdapter instances share the same entry in tpi_mod._TPI_STATES;
change the ctor to generate a unique default key when key is not provided (e.g.,
use uuid.uuid4().hex or similar) and keep explicit keys unchanged, update the
places that currently reference the literal default (the __init__ where
self._key is set and where tpi_mod._TPI_STATES.pop(...) is called) and also
adjust the other occurrences noted (the blocks around the current lines 47-52
and 59-60) to use the instance's self._key so each TpiAdapter uses its own TPI
state. Ensure to import uuid if you use it.

In `@tests/benchmark/metrics.py`:
- Around line 161-169: The integration currently uses the full interval dt_h
between series.t_s[i-1] and series.t_s[i], which overcounts if transient_start_s
falls inside that interval; update the loop that computes dt_h/err (using
series.t_s, T_room_C, T_setpoint_C) to clip the first interval by computing
start_t = max(series.t_s[i-1], transient_start_s) and then dt_h = (series.t_s[i]
- start_t) / 3600.0 (skip if dt_h <= 0), then compute err = series.T_room_C[i] -
series.T_setpoint_C[i] and accumulate into above_K_h / below_K_h as before.
- Around line 15-22: Add a TimeSeries.__post_init__ that validates the four
sequences (t_s, T_room_C, T_setpoint_C, valve_pct) all have identical lengths
and that t_s is strictly increasing (every t_s[i+1] > t_s[i]); if not, raise a
ValueError with a clear message. Because TimeSeries is frozen, perform checks
without mutating the instance (e.g., convert sequences to lists locally for
length and monotonic checks) and reference the attributes by name (t_s,
T_room_C, T_setpoint_C, valve_pct) so bad benchmark inputs fail fast.

In `@tests/benchmark/multi_trv_plant.py`:
- Around line 56-61: The configuration validation is missing for
deadband_pcts_per_trv which can cause IndexError in step(); in the __init__
where other list-length checks occur (see gain_heaters, coupling_rad_room,
trv_sensor_offsets_K), add a check that either len(params.deadband_pcts_per_trv)
== 0 or len(params.deadband_pcts_per_trv) == params.n_trvs and raise a
ValueError with a clear message if not; this will ensure step() can safely index
deadband_pcts_per_trv[i] for each TRV.

In `@tests/benchmark/multi_trv_runner.py`:
- Around line 127-137: run_multi_trv_scenario currently ignores per-scenario
overrides and always uses the function default stabilisation_min; change it to
honor scenario.stabilisation_min when provided by computing an effective value
(e.g. use scenario.stabilisation_min if not None else the stabilisation_min
parameter) and pass that effective value into _stabilise_multi_trv(plant,
scenario, ..., step_s); reference the function run_multi_trv_scenario and the
call to _stabilise_multi_trv to locate where to apply this change.

In `@tests/benchmark/plant_fit/fit_plant.py`:
- Around line 183-186: The median calculation currently sets median =
values_sorted[n // 2], which biases toward the upper middle for even n; update
the code that computes median (the variables values_sorted, n, median) to use a
proper median: either call statistics.median(values_sorted) from the stdlib or
compute the average of the two middle elements ((values_sorted[(n-1)//2] +
values_sorted[n//2]) / 2) when n is even so the summary is unbiased for even
window counts.
- Around line 211-220: The CSV mode uses hard-coded entity IDs (rooms dict and
outdoor_id) after calling _load_csv(path); change this to detect or accept real
recorder entity IDs: when path is provided, parse the CSV headers (keys returned
by _load_csv) and either (a) auto-map room sensors by matching a temperature
pattern (e.g., names containing "temperature" or configurable regex) to populate
rooms and outdoor_id dynamically, or (b) allow the user to pass a mapping
argument to overrideIDs. Update the code paths that reference rooms and
outdoor_id (the block that builds rooms and checks outdoor_id membership) to use
the discovered/mapped IDs so real recorder exports no longer fail; ensure the
same fix is applied to the corresponding later block that uses these variables.

In `@tests/benchmark/README.md`:
- Around line 16-28: The README.md has three minor markdownlint issues: add a
language specifier to the fenced example output block (replace ``` with ```text
or ```), add a blank line before the "Indirect TRVs" heading to ensure proper
heading separation, and add a language specifier to the module layout fenced
block (use ```text); locate the offending blocks by searching for the example
output block shown under the score matrix, the "Indirect TRVs" heading, and the
module layout block near the tests/benchmark tree, then update those three
fenced code blocks to use ```text and insert one blank line immediately before
the "Indirect TRVs" heading.

In `@tests/benchmark/reporter.py`:
- Around line 187-191: The ranking currently converts missing (controller,
plant) cells to NaN (via cell.get) and then uses _mean/_stdev which ignore NaNs,
allowing controllers with partial coverage to be ranked with inflated stats;
update the loop that builds vals/cross_mean/cross_std to compute a coverage
count (e.g., count non-NaN entries from vals), only compute cross_mean/cross_std
when coverage == len(plant_names) (otherwise set cross_mean to a sentinel that
sorts after complete rows or store coverage separately), and adjust the sort
(controllers.sort(...)) to first prefer full-coverage controllers and then sort
by -cross_mean; include the coverage metric in the data structures (e.g.,
cross_coverage) so incomplete rows are deterministically ordered after complete
ones.
- Around line 47-73: score_results drops the block label when creating
ScoredResult causing distinct scenarios from different plant overrides to
collide; fix by adding a block_label (or label) field to the ScoredResult model
and populate it in score_results (use the existing loop variable label) when
constructing ScoredResult, keep compute_scores/metrics unchanged, and then
update any consumers (e.g., render_per_scenario) to use the composite identity
(block_label + scenario) or group by the new field when rendering so same-named
scenarios from different blocks do not overwrite each other.

In `@tests/benchmark/runner.py`:
- Around line 345-351: Add an explicit guard at the start of run_scenario to
reject non-positive step_s (raise ValueError) before any simulation logic runs:
validate step_s > 0 and also mirror the same check for the other public entry
point referenced in the comment (the function covering lines 431-455) so callers
cannot pass step_s <= 0; this prevents division-by-zero in _stabilise_plant()
and the time never-advancing hang in _drive_adapter().
- Around line 521-531: When args.plant_sweep is set, the code currently does
list comprehension that indexes ALL_SCENARIOS[s] (in the scenarios = [s for s in
scenarios if ALL_SCENARIOS[s].plant is PROFILE_STANDARD]) before validating that
each s exists in ALL_SCENARIOS; change this so you first validate scenario names
(ensure every s in scenarios is a key in ALL_SCENARIOS and exit with code 2 for
unknown names) or alter the comprehension to guard existence (e.g., test s in
ALL_SCENARIOS before accessing ALL_SCENARIOS[s]) so that unknown scenarios no
longer raise KeyError when --plant-sweep is used; update the flow around
args.plant_sweep and the scenarios variable so validation occurs before any
indexing into ALL_SCENARIOS.

In `@tests/benchmark/scenarios.py`:
- Around line 31-40: The short docstrings for the dataclasses InitialConditions
and ScenarioConfig should be converted to NumPy-style: replace the one-line
triple-quoted summaries with a summary followed by an "Attributes" or
"Parameters" section listing each field (e.g., T_room_C: float, T_rad_C: float)
with brief descriptions and types; update the docstrings for any similar blocks
noted (the ones around the other occurrences referenced) to the same NumPy-style
format so all new scenario types/helpers follow the project's docstring
convention.
- Around line 208-218: The SensorParams for S12_SENSOR_DROPOUT is missing the
start timestamp so the 10-minute outage is not encoded; update the SensorParams
passed into S12_SENSOR_DROPOUT to include the dropout start (e.g., add
dropout_from_t_s or dropout_start_t_s = 60 * 60.0 depending on the project’s
parameter name) so the dropout covers 60*60.0 to 70*60.0, or alternatively
change the description to reflect the actual behavior if you intend no start
field.

In `@tests/benchmark/schedules.py`:
- Around line 16-17: Convert the simple docstrings for the public schedule
helper functions (e.g., step and the other schedule helpers in
tests/benchmark/schedules.py) to NumPy-style docstrings: start with a one-line
summary, add a extended description if needed, then a Parameters section listing
each parameter with type and description (t_threshold_s: float, before: float,
after: float, etc.), a Returns section describing the return type
Callable[[float], float], and an optional Examples section showing typical
usage; update all helper functions flagged in the review so they follow this
NumPy-style layout consistently.
- Around line 38-61: The ramp function currently divides by (t_end_s -
t_start_s) which fails or yields undefined behavior for zero/negative spans; in
the ramp function validate the inputs early (for example in ramp) and if t_end_s
<= t_start_s raise a ValueError with a clear message (e.g., "t_end_s must be
greater than t_start_s") so zero- or negative-length windows are rejected;
reference the ramp function and the parameters t_start_s and t_end_s when adding
the check and update the docstring to state this precondition.
- Around line 97-125: stochastic_windows currently allows count <= 0 and can
create overlapping events because durations are sampled from [min_duration_s,
max_duration_s] without regard to remaining slot space; add an upfront
validation that count > 0 (raise ValueError otherwise), compute slot_s =
duration_s / count as before, then for each slot after setting t_start compute
slot_remaining = (i + 1) * slot_s - t_start and ensure dur does not exceed that
remaining budget (e.g., if slot_remaining <= 0 set dur = 0 else sample dur =
rng.uniform(min_duration_s, max_duration_s) and then dur = min(dur,
slot_remaining)); keep using events.append((t_start, t_start + dur)) and leave
schedule unchanged so windows remain non-overlapping and short-slot cases no
longer spill into the next slot.

In `@tests/benchmark/scoring.py`:
- Around line 35-37: Update the public dataclass and function docstrings in
tests/benchmark/scoring.py to use NumPy-style format: for each public symbol
such as DimensionScores (and the other public dataclasses/functions referenced
in the review) replace the current docstring with a NumPy-style docstring
including a short summary line, a Parameters section (with types) for function
args or Attributes for dataclasses, a Returns section for return values (if
any), and an Examples section when useful; ensure triple-quoted docstrings
follow the repository convention and apply this change consistently to the
docstrings around the cited locations (lines ~47-51, 59, 83, 94-98, 115-121,
132-138, 150).
- Around line 82-90: The current _settling_ratio(metric, oracle) treats oracle
<= 0.0 as a neutral ratio (1.0) which hides long candidate settling times;
change that branch to compare metric against an absolute settling threshold
instead of returning 1.0. Specifically, when oracle <= 0.0 (or oracle is
non-positive), compute the ratio as metric / ABS_SETTLING_THRESHOLD (use or add
a constant named ABS_SETTLING_THRESHOLD) and clamp the result to the same max
penalty used for infinite metric (e.g., 10.0) so very large metrics still get
penalized; update the same logic in both occurrences of _settling_ratio (the
function and the duplicated block around lines 102-105).

In `@tests/benchmark/sensor.py`:
- Around line 93-99: The read method currently returns early on dropout,
freezing internal lag state; instead call self._apply_thermal_lag(t_s, T_true_C)
before checking params.dropout_until_t_s so the sensor's internal lag state
advances during the outage, but do not apply bias or drift when returning None;
i.e., move/perform the _apply_thermal_lag call prior to the dropout check in
read and only apply params.bias_K and params.drift_K_per_h when actually
returning a measurement.

In `@tests/benchmark/tests/test_actuator_and_sensor.py`:
- Around line 106-117: The test contains a tautological assertion "assert
s._next_sample_interval_s != 30.0 or True" that never fails; replace it with a
real deterministic check: after creating SensorParams and calling s.read(0.0,
20.0) in test_sensor_jitter_changes_next_interval, either assert
s._next_sample_interval_s != 30.0 (remove the "or True") or, better, assert
equality to the known deterministic value produced by the Sensor RNG (compute or
hard-code the expected numeric interval) so the test actually verifies the
jitter path; reference SensorParams, Sensor, s.read and
s._next_sample_interval_s when making the change.

In `@tests/benchmark/tests/test_indirect_trv.py`:
- Around line 102-115: The test only checks _pending_setpoints length and can
pass even if setpoints are applied immediately; update
test_command_latency_delays_setpoint_change to assert the delayed application:
after creating IndirectTrvParams with command_latency_steps, call
adapter.step(_ctx(...)) repeatedly and for the first command_latency_steps steps
assert the adapter's effective/quantised setpoint (use the adapter method or
property that returns the current applied setpoint—e.g., a quantised_setpoint
getter or the wrapped PID adapter's output) does NOT reflect the new target,
then on the (command_latency_steps+1)th step assert the applied setpoint has
changed to the expected value; continue to also assert the _pending_setpoints
drains as before.
- Around line 157-164: The test test_zero_error_yields_zero_valve currently only
checks range; change the assertion to require near-zero output so regressions
are caught: after creating IndirectTrvAdapter(PidAdapter(), TADO_PARAMS) and
calling adapter.step(_ctx(target=20.0, current=20.0)), replace the loose range
check on out.valve_percent with a strict near-zero assertion (e.g. use
pytest.approx(0.0) or compare abs(out.valve_percent - 0.0) < small_epsilon) to
ensure the internal PID produces effectively 0.0% valve.

In `@tests/benchmark/tests/test_rc3_and_equal_percentage.py`:
- Around line 101-113: Replace the weak sanity check with a direct comparison
between the tau_wall_min==0 fallback path and a known RC2 reference: instantiate
two plants (using TwoStatePlant) — one configured to run the PROFILE_STANDARD
code path but with tau_wall_min forced to 0 (the fallback under test) and
another instantiated as the explicit RC2 reference profile — run identical
step(...) sequences on both, and assert that their trajectories (e.g.,
state.T_room_C and state.T_rad_C after each step or at final time) match within
a tight tolerance; update test_rc2_path_unchanged_when_tau_wall_zero to use
these TwoStatePlant instances and compare step-by-step or final values rather
than a broad range.

In `@tests/benchmark/tests/test_sensor_lag_and_pipe_delay.py`:
- Around line 71-90: The test currently compares plant_a (constructed from
PROFILE_STANDARD) with plant_b (constructed from a different PlantParams),
conflating parameter differences with delay; change the setup so both
TwoStatePlant instances use identical PlantParams except for the
valve_command_delay_s value (e.g., derive delayed by copying PROFILE_STANDARD
and only override valve_command_delay_s=120.0, or set plant_a to use delayed but
with valve_command_delay_s=0.0), keeping PlantState the same for both; update
plant_a/plant_b construction to reference those matching params so the assertion
isolates only the pipe-delay behavior.

In `@tests/benchmark/tests/test_synthetic_dataset.py`:
- Around line 20-29: The helper _fit_one currently pre-filters taus to 30..6000
before the test can assert bounds and ordering; change _fit_one (in
tests/benchmark/tests/test_synthetic_dataset.py) to return all non-None results
from fit_plant._fit_tau_room (i.e., collect taus where
fit_plant._fit_tau_room(w) is not None but do not filter by 30.0 <= tau <=
6000.0), and update the test_fit_recovers_sensible_tau_on_synthetic_data() to
assert that every returned tau lies in 30..6000 and that the kitchen tau is
greater than the living room tau. Ensure references to _fit_one,
fit_plant._fit_tau_room, and the test name are used to locate the changes.

In `@tests/benchmark/weather/synthetic.py`:
- Line 12: The diurnal cosine term is inverted so it produces a minimum at
~15:00 and a maximum overnight; flip the phase/sign of the cosine term used for
the diurnal cycle (i.e., multiply by -1 or add π to the phase) so the waveform
peaks at ~15:00 and has its trough at ~05:00, and update the corresponding
docstring(s) that describe the diurnal cycle to match the corrected phase; apply
the same fix to the other instance of the diurnal cosine term later in the file
(the repeated block around the synthetic weather diurnal calculation).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1507cbea-5ba3-4233-ba47-41db3a36936e

📥 Commits

Reviewing files that changed from the base of the PR and between 53910d8 and 5a244cc.

📒 Files selected for processing (49)
  • tests/benchmark/README.md
  • tests/benchmark/__init__.py
  • tests/benchmark/actuator.py
  • tests/benchmark/adapters/__init__.py
  • tests/benchmark/adapters/base.py
  • tests/benchmark/adapters/baselines.py
  • tests/benchmark/adapters/heating_power_adapter.py
  • tests/benchmark/adapters/indirect_trv.py
  • tests/benchmark/adapters/mpc_adapter.py
  • tests/benchmark/adapters/passive_modes.py
  • tests/benchmark/adapters/pid_adapter.py
  • tests/benchmark/adapters/tpi_adapter.py
  • tests/benchmark/metrics.py
  • tests/benchmark/multi_trv_plant.py
  • tests/benchmark/multi_trv_runner.py
  • tests/benchmark/plant.py
  • tests/benchmark/plant_fit/.gitignore
  • tests/benchmark/plant_fit/__init__.py
  • tests/benchmark/plant_fit/fit_plant.py
  • tests/benchmark/plant_fit/generate_synthetic_data.py
  • tests/benchmark/reporter.py
  • tests/benchmark/runner.py
  • tests/benchmark/scenarios.py
  • tests/benchmark/schedules.py
  • tests/benchmark/scoring.py
  • tests/benchmark/sensor.py
  • tests/benchmark/tests/__init__.py
  • tests/benchmark/tests/test_actuator_and_sensor.py
  • tests/benchmark/tests/test_adapters.py
  • tests/benchmark/tests/test_fit_plant.py
  • tests/benchmark/tests/test_heating_power_adapter.py
  • tests/benchmark/tests/test_indirect_trv.py
  • tests/benchmark/tests/test_metrics.py
  • tests/benchmark/tests/test_multi_trv.py
  • tests/benchmark/tests/test_passive_modes.py
  • tests/benchmark/tests/test_plant.py
  • tests/benchmark/tests/test_rc3_and_equal_percentage.py
  • tests/benchmark/tests/test_reporter.py
  • tests/benchmark/tests/test_runner_cli.py
  • tests/benchmark/tests/test_runner_smoke.py
  • tests/benchmark/tests/test_schedules.py
  • tests/benchmark/tests/test_scoring.py
  • tests/benchmark/tests/test_sensitivity.py
  • tests/benchmark/tests/test_sensor_lag_and_pipe_delay.py
  • tests/benchmark/tests/test_synthetic_dataset.py
  • tests/benchmark/tests/test_weather_synthetic.py
  • tests/benchmark/weather/__init__.py
  • tests/benchmark/weather/synthetic.py
  • tests/unit/test_sensor.py

Comment on lines +32 to +43
@dataclass(frozen=True)
class BenchmarkOutput:
"""What the controller produced at one step.

Exactly one of ``valve_percent`` / ``setpoint_offset_K`` / ``duty_cycle_pct``
should be set, matching the controller's :attr:`family`.
"""

valve_percent: float | None = None
setpoint_offset_K: float | None = None
duty_cycle_pct: float | None = None
diagnostics: dict[str, Any] = field(default_factory=dict)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Enforce BenchmarkOutput's one-field contract.

Right now the "exactly one of these fields should be set" rule is only documented. If an adapter accidentally sets two output families or none, that ambiguity will leak into the runner/reporting layer silently. Add a __post_init__ guard so bad adapters fail at the contract boundary.

Proposed fix
 `@dataclass`(frozen=True)
 class BenchmarkOutput:
     """What the controller produced at one step.
 
     Exactly one of ``valve_percent`` / ``setpoint_offset_K`` / ``duty_cycle_pct``
     should be set, matching the controller's :attr:`family`.
     """
 
     valve_percent: float | None = None
     setpoint_offset_K: float | None = None
     duty_cycle_pct: float | None = None
     diagnostics: dict[str, Any] = field(default_factory=dict)
+
+    def __post_init__(self) -> None:
+        """Validate that exactly one controller output is populated.
+
+        Raises
+        ------
+        ValueError
+            If zero or multiple output families are populated.
+        """
+        populated = sum(
+            value is not None
+            for value in (
+                self.valve_percent,
+                self.setpoint_offset_K,
+                self.duty_cycle_pct,
+            )
+        )
+        if populated != 1:
+            raise ValueError(
+                "BenchmarkOutput requires exactly one of "
+                "valve_percent, setpoint_offset_K, duty_cycle_pct"
+            )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/benchmark/adapters/base.py` around lines 32 - 43, BenchmarkOutput
currently only documents that exactly one of valve_percent, setpoint_offset_K,
or duty_cycle_pct must be set but doesn't enforce it; add a __post_init__ method
on the BenchmarkOutput dataclass that counts how many of those three fields are
not None and raises a ValueError if the count is not exactly 1, including the
actual values or which fields were set in the error message; because the class
is frozen, do not attempt to mutate fields in __post_init__, only validate and
raise on violation.

Comment on lines +166 to +179
error_K = sp - ctx.current_temp_C
u_fb_pct = max(
-self._feedback_clamp,
min(self._feedback_clamp, error_K * self._feedback_gain),
)

valve = max(0.0, min(100.0, u_ff_pct + u_fb_pct))
return BenchmarkOutput(
valve_percent=valve,
diagnostics={
"u_ff_pct": round(u_ff_pct, 2),
"u_fb_pct": round(u_fb_pct, 2),
"error_K": round(error_K, 3),
},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use plant truth for the oracle feedback term.

BenchmarkContext.raw_room_temp_C exists specifically for cheating adapters, and this class advertises itself as the perfect-knowledge upper bound / stabilisation driver. Driving the P term from current_temp_C bakes sensor lag/noise into the oracle, which can depress the normalisation ceiling and skew oracle-normalised scores. Feed back on ctx.raw_room_temp_C here so the oracle remains a real upper bound.

Proposed fix
-        error_K = sp - ctx.current_temp_C
+        error_K = sp - ctx.raw_room_temp_C
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
error_K = sp - ctx.current_temp_C
u_fb_pct = max(
-self._feedback_clamp,
min(self._feedback_clamp, error_K * self._feedback_gain),
)
valve = max(0.0, min(100.0, u_ff_pct + u_fb_pct))
return BenchmarkOutput(
valve_percent=valve,
diagnostics={
"u_ff_pct": round(u_ff_pct, 2),
"u_fb_pct": round(u_fb_pct, 2),
"error_K": round(error_K, 3),
},
error_K = sp - ctx.raw_room_temp_C
u_fb_pct = max(
-self._feedback_clamp,
min(self._feedback_clamp, error_K * self._feedback_gain),
)
valve = max(0.0, min(100.0, u_ff_pct + u_fb_pct))
return BenchmarkOutput(
valve_percent=valve,
diagnostics={
"u_ff_pct": round(u_ff_pct, 2),
"u_fb_pct": round(u_fb_pct, 2),
"error_K": round(error_K, 3),
},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/benchmark/adapters/baselines.py` around lines 166 - 179, The oracle
feedback P-term is using the noisy sensor value ctx.current_temp_C; change it to
use the plant truth ctx.raw_room_temp_C when computing error_K so the
perfect-knowledge adapter remains an upper bound. Locate the block that computes
error_K, u_fb_pct and returns BenchmarkOutput (referenced by variables error_K,
u_fb_pct, u_ff_pct and the BenchmarkOutput call) and replace ctx.current_temp_C
with ctx.raw_room_temp_C, keeping the clamping/rounding logic unchanged.

Comment on lines +127 to +155
if is_heating and not self._was_heating:
self._cycle_start_temp = ctx.current_temp_C
self._cycle_start_t = ctx.t
self._cycle_peak_temp = None
self._cycle_peak_t = None
elif not is_heating and self._was_heating:
self._cycle_peak_temp = ctx.current_temp_C
self._cycle_peak_t = ctx.t
elif (
not is_heating
and self._cycle_peak_temp is not None
and ctx.current_temp_C > self._cycle_peak_temp
):
self._cycle_peak_temp = ctx.current_temp_C
self._cycle_peak_t = ctx.t

if (
self._cycle_start_temp is not None
and self._cycle_start_t is not None
and self._cycle_peak_temp is not None
and self._cycle_peak_t is not None
):
elapsed_since_peak_min = (ctx.t - self._cycle_peak_t) / 60.0
if (
ctx.current_temp_C < self._cycle_peak_temp
or elapsed_since_peak_min > _FINALIZE_TIMEOUT_MIN
):
self._finalize_cycle()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Finalize the pending cycle before opening a new one.

When the valve closes, _cycle_peak_* starts tracking the post-heat peak. If demand resumes before the room cools below that peak, Lines 127-131 immediately overwrite the cycle state and the finished cycle is lost without updating heating_power. That under-learns in exactly the bounce/recovery cases this benchmark is supposed to compare.

Suggested fix
     def _update_learner(self, ctx: BenchmarkContext, is_heating: bool) -> None:
@@
-        if is_heating and not self._was_heating:
+        if (
+            is_heating
+            and not self._was_heating
+            and self._cycle_start_temp is not None
+            and self._cycle_start_t is not None
+            and self._cycle_peak_temp is not None
+            and self._cycle_peak_t is not None
+        ):
+            self._finalize_cycle()
+
+        if is_heating and not self._was_heating:
             self._cycle_start_temp = ctx.current_temp_C
             self._cycle_start_t = ctx.t
             self._cycle_peak_temp = None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/benchmark/adapters/heating_power_adapter.py` around lines 127 - 155,
When handling the transition into heating in the block guarded by "if is_heating
and not self._was_heating" ensure any pending cycle is finalized before
resetting cycle state: detect an unfinished cycle (e.g., self._cycle_peak_temp
or self._cycle_peak_t is set) and call self._finalize_cycle() first, then
proceed to set self._cycle_start_temp, self._cycle_start_t and clear peak
fields; this guarantees the previous cycle updates heating_power rather than
being overwritten.

Comment on lines +129 to +133
def reset(self, prior: dict[str, Any] | None = None) -> None:
"""Reset wrapped controller and the TRV layer."""
self.inner.reset(prior)
self._last_quantised_setpoint_C = None
self._pending_setpoints = []
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make reset(prior) consume the same snapshot shape that export_state() produces.

export_state() nests the child state under "inner" and exposes wrapper cache separately, but reset() passes the whole dict straight into self.inner.reset() and then unconditionally clears the TRV-layer state. adapter.reset(adapter.export_state()) therefore drops both the inner learned state and the wrapper's hysteresis/latency state instead of restoring it.

Suggested fix
     def reset(self, prior: dict[str, Any] | None = None) -> None:
         """Reset wrapped controller and the TRV layer."""
-        self.inner.reset(prior)
-        self._last_quantised_setpoint_C = None
-        self._pending_setpoints = []
+        inner_prior = prior.get("inner") if isinstance(prior, dict) else None
+        self.inner.reset(inner_prior)
+        self._last_quantised_setpoint_C = (
+            prior.get("last_quantised_setpoint_C") if isinstance(prior, dict) else None
+        )
+        pending = prior.get("pending_setpoints") if isinstance(prior, dict) else None
+        self._pending_setpoints = list(pending) if isinstance(pending, list) else []
@@
     def export_state(self) -> dict[str, Any]:
         """Expose inner state plus TRV-layer cache."""
         return {
             "inner": self.inner.export_state(),
             "last_quantised_setpoint_C": self._last_quantised_setpoint_C,
+            "pending_setpoints": list(self._pending_setpoints),
         }

Also applies to: 176-180, 198-203

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/benchmark/adapters/indirect_trv.py` around lines 129 - 133, The
reset(prior) implementation currently forwards the entire exported snapshot to
self.inner.reset and then always clears wrapper state; change it to accept the
same shape export_state() produces by extracting the child snapshot under the
"inner" key (e.g. child_prior = prior.get("inner") if prior else None) and pass
that to self.inner.reset(child_prior), and only clear or restore the TRV-layer
cache (_last_quantised_setpoint_C and _pending_setpoints) from top-level keys in
prior when present (fall back to clearing when prior is None or keys missing);
apply the same extraction/restore pattern to the other reset implementations
that mirror this behavior (the ones near the later duplicated blocks).

Comment on lines +34 to +43
def __init__(
self, params: MpcParams | None = None, key: str = "bench:trv:t0"
) -> None:
self._params = params if params is not None else MpcParams()
self._state: _MpcState = _MpcState()
self._key = key
self._sim_time_s: float = 0.0
self._original_time = mpc_mod.time
# All benchmark adapters share the global _MPC_STATES dict; isolate ours.
mpc_mod._MPC_STATES.pop(self._key, None)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use per-instance default keys for MPC state isolation.

MpcAdapter also fronts a module-global cache (_MPC_STATES) but gives every instance the same default key. Two MpcAdapter() objects will therefore share and delete the same MPC state entry, so resets and warm state leak across benchmark runs.

Suggested fix
+from itertools import count
+
+_KEY_COUNTER = count()
+
 class MpcAdapter:
@@
-    def __init__(
-        self, params: MpcParams | None = None, key: str = "bench:trv:t0"
-    ) -> None:
+    def __init__(
+        self, params: MpcParams | None = None, key: str | None = None
+    ) -> None:
         self._params = params if params is not None else MpcParams()
         self._state: _MpcState = _MpcState()
-        self._key = key
+        self._key = key or f"bench:trv:{next(_KEY_COUNTER)}"

Also applies to: 51-56, 63-75

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/benchmark/adapters/mpc_adapter.py` around lines 34 - 43, The
constructor currently uses a fixed default key which causes different MpcAdapter
instances to collide in the module-global _MPC_STATES; change __init__ in
MpcAdapter so when key is not explicitly provided it generates a per-instance
unique key (e.g., using uuid4 or combining id(self) with a timestamp/monotonic
counter), assign that to self._key, and then pop only that key from
mpc_mod._MPC_STATES; apply the same unique-key behavior to the other
constructor/initialization paths referenced (the alternative init blocks around
lines 51-56 and 63-75) so every instance isolates its MPC state.

Comment on lines +157 to +164
def test_zero_error_yields_zero_valve():
"""Zero error yields zero valve."""
adapter = IndirectTrvAdapter(PidAdapter(), TADO_PARAMS)
# Inner PID with target == current → zero demand; quantised setpoint
# lands at/near room temp, so internal P-loop produces ~0 %.
out = adapter.step(_ctx(target=20.0, current=20.0))
assert out.valve_percent is not None
assert 0.0 <= out.valve_percent <= 100.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Tighten the zero-demand assertion.

The test name/docstring says zero error should yield zero valve, but the assertion only checks that the output is in range. This would still pass if the wrapper opened fully. Assert pytest.approx(0.0) (or a small epsilon) so the regression is actually caught.

Proposed fix
 def test_zero_error_yields_zero_valve():
     """Zero error yields zero valve."""
     adapter = IndirectTrvAdapter(PidAdapter(), TADO_PARAMS)
     # Inner PID with target == current → zero demand; quantised setpoint
     # lands at/near room temp, so internal P-loop produces ~0 %.
     out = adapter.step(_ctx(target=20.0, current=20.0))
     assert out.valve_percent is not None
-    assert 0.0 <= out.valve_percent <= 100.0
+    assert out.valve_percent == pytest.approx(0.0, abs=1e-6)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/benchmark/tests/test_indirect_trv.py` around lines 157 - 164, The test
test_zero_error_yields_zero_valve currently only checks range; change the
assertion to require near-zero output so regressions are caught: after creating
IndirectTrvAdapter(PidAdapter(), TADO_PARAMS) and calling
adapter.step(_ctx(target=20.0, current=20.0)), replace the loose range check on
out.valve_percent with a strict near-zero assertion (e.g. use pytest.approx(0.0)
or compare abs(out.valve_percent - 0.0) < small_epsilon) to ensure the internal
PID produces effectively 0.0% valve.

Comment on lines +101 to +113
def test_rc2_path_unchanged_when_tau_wall_zero():
"""A plant with tau_wall_min == 0 produces identical trajectories to before."""
plant = TwoStatePlant(
PROFILE_STANDARD, # RC2 default
PlantState(T_room_C=20.0, T_rad_C=20.0),
)
for _ in range(60):
plant.step(30.0, 0.5, 5.0)
# Don't assert exact value — just sanity check that RC2 still warms.
assert 20.0 < plant.state.T_room_C < 35.0
# T_wall_C still tracks T_room_C in RC2 mode because step() does not
# update it. That's fine; nothing reads it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Compare the tau_wall_min=0 path against an actual RC2 reference.

This test never exercises a distinct tau_wall_min == 0 fallback against a baseline; it just runs PROFILE_STANDARD once and checks a broad temperature range. A material regression in the RC2 code path would still pass here.

Suggested direction
-def test_rc2_path_unchanged_when_tau_wall_zero():
-    """A plant with tau_wall_min == 0 produces identical trajectories to before."""
-    plant = TwoStatePlant(
-        PROFILE_STANDARD,  # RC2 default
-        PlantState(T_room_C=20.0, T_rad_C=20.0),
-    )
-    for _ in range(60):
-        plant.step(30.0, 0.5, 5.0)
-    # Don't assert exact value — just sanity check that RC2 still warms.
-    assert 20.0 < plant.state.T_room_C < 35.0
+def test_rc2_path_unchanged_when_tau_wall_zero():
+    """A plant with tau_wall_min == 0 produces identical trajectories to before."""
+    reference = TwoStatePlant(
+        PROFILE_STANDARD,
+        PlantState(T_room_C=20.0, T_rad_C=20.0),
+    )
+    tau_zero = TwoStatePlant(
+        # build the same params with tau_wall_min forced to 0
+        ...,
+        PlantState(T_room_C=20.0, T_rad_C=20.0),
+    )
+    for _ in range(60):
+        reference.step(30.0, 0.5, 5.0)
+        tau_zero.step(30.0, 0.5, 5.0)
+        assert reference.state.T_room_C == tau_zero.state.T_room_C
+        assert reference.state.T_rad_C == tau_zero.state.T_rad_C
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/benchmark/tests/test_rc3_and_equal_percentage.py` around lines 101 -
113, Replace the weak sanity check with a direct comparison between the
tau_wall_min==0 fallback path and a known RC2 reference: instantiate two plants
(using TwoStatePlant) — one configured to run the PROFILE_STANDARD code path but
with tau_wall_min forced to 0 (the fallback under test) and another instantiated
as the explicit RC2 reference profile — run identical step(...) sequences on
both, and assert that their trajectories (e.g., state.T_room_C and state.T_rad_C
after each step or at final time) match within a tight tolerance; update
test_rc2_path_unchanged_when_tau_wall_zero to use these TwoStatePlant instances
and compare step-by-step or final values rather than a broad range.

Comment on lines +71 to +90
delayed = PlantParams(
tau_room_min=60.0,
tau_rad_min=10.0,
gain_heater=2.0,
T_water_C=65.0,
valve_command_delay_s=120.0,
)
plant_a = TwoStatePlant(PROFILE_STANDARD, PlantState(T_room_C=18.0, T_rad_C=18.0))
plant_b = TwoStatePlant(delayed, PlantState(T_room_C=18.0, T_rad_C=18.0))

# Prime plant_b's delay buffer with u=0.
for _ in range(4): # 2 min of u=0
plant_b.step(30.0, 0.0, 5.0)
# Now command u=1.0. The next 2 minutes (4 steps of 30 s) the
# radiator should still see u=0 from the buffer; only after that
# the u=1.0 starts arriving.
plant_a.step(30.0, 1.0, 5.0) # plant_a sees u=1.0 immediately
plant_b.step(30.0, 1.0, 5.0) # plant_b serves the head of the buffer (=0)
# plant_a's radiator should have warmed, plant_b's not at all yet.
assert plant_a.state.T_rad_C > plant_b.state.T_rad_C
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the same plant parameters for the delayed and control paths.

plant_a comes from PROFILE_STANDARD, while plant_b uses a custom PlantParams. That means this assertion is measuring both delay and different thermal constants/gains, so it doesn't isolate the pipe-delay behavior you want to prove.

Suggested fix
     delayed = PlantParams(
         tau_room_min=60.0,
         tau_rad_min=10.0,
         gain_heater=2.0,
         T_water_C=65.0,
         valve_command_delay_s=120.0,
     )
-    plant_a = TwoStatePlant(PROFILE_STANDARD, PlantState(T_room_C=18.0, T_rad_C=18.0))
+    no_delay = PlantParams(
+        tau_room_min=delayed.tau_room_min,
+        tau_rad_min=delayed.tau_rad_min,
+        gain_heater=delayed.gain_heater,
+        T_water_C=delayed.T_water_C,
+        valve_command_delay_s=0.0,
+    )
+    plant_a = TwoStatePlant(no_delay, PlantState(T_room_C=18.0, T_rad_C=18.0))
     plant_b = TwoStatePlant(delayed, PlantState(T_room_C=18.0, T_rad_C=18.0))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/benchmark/tests/test_sensor_lag_and_pipe_delay.py` around lines 71 -
90, The test currently compares plant_a (constructed from PROFILE_STANDARD) with
plant_b (constructed from a different PlantParams), conflating parameter
differences with delay; change the setup so both TwoStatePlant instances use
identical PlantParams except for the valve_command_delay_s value (e.g., derive
delayed by copying PROFILE_STANDARD and only override
valve_command_delay_s=120.0, or set plant_a to use delayed but with
valve_command_delay_s=0.0), keeping PlantState the same for both; update
plant_a/plant_b construction to reference those matching params so the assertion
isolates only the pipe-delay behavior.

Comment on lines +20 to +29
def _fit_one(path: str, sensor_id: str) -> tuple[int, list[float]]:
data = fit_plant._load_csv(path)
aligned = fit_plant._align(data[sensor_id], data[_OUTDOOR_ID])
windows = fit_plant._find_cooling_windows(aligned)
taus = [
tau
for w in windows
if (tau := fit_plant._fit_tau_room(w)) is not None and 30.0 <= tau <= 6000.0
]
return len(windows), taus
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't pre-filter failing tau values out of the regression test.

_fit_one() drops any tau outside 30..6000 before test_fit_recovers_sensible_tau_on_synthetic_data() inspects the results, so the later range check cannot catch out-of-bounds fits. The test also says the kitchen should fit slower than the living room, but never asserts that relationship. Return all non-None taus here and enforce the full range and ordering in the test body.

Suggested fix
 def _fit_one(path: str, sensor_id: str) -> tuple[int, list[float]]:
     data = fit_plant._load_csv(path)
     aligned = fit_plant._align(data[sensor_id], data[_OUTDOOR_ID])
     windows = fit_plant._find_cooling_windows(aligned)
     taus = [
         tau
         for w in windows
-        if (tau := fit_plant._fit_tau_room(w)) is not None and 30.0 <= tau <= 6000.0
+        if (tau := fit_plant._fit_tau_room(w)) is not None
     ]
     return len(windows), taus
@@
         assert n_living >= 20 and n_kitchen >= 20
         assert taus_living and taus_kitchen
         for tau in taus_living + taus_kitchen:
             assert 60.0 <= tau <= 6000.0
+        assert sum(taus_kitchen) / len(taus_kitchen) > sum(taus_living) / len(taus_living)

Also applies to: 54-74

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/benchmark/tests/test_synthetic_dataset.py` around lines 20 - 29, The
helper _fit_one currently pre-filters taus to 30..6000 before the test can
assert bounds and ordering; change _fit_one (in
tests/benchmark/tests/test_synthetic_dataset.py) to return all non-None results
from fit_plant._fit_tau_room (i.e., collect taus where
fit_plant._fit_tau_room(w) is not None but do not filter by 30.0 <= tau <=
6000.0), and update the test_fit_recovers_sensible_tau_on_synthetic_data() to
assert that every returned tau lies in 30..6000 and that the kitchen tau is
greater than the living room tau. Ensure references to _fit_one,
fit_plant._fit_tau_room, and the test name are used to locate the changes.

The model:

* Seasonal mean: cosine over the year, minimum at day 14 (mid-January).
* Diurnal cycle: cosine over the day, minimum at 05:00, maximum at 15:00.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix the diurnal phase; it's currently inverted.

This term makes the outdoor schedule coldest at about 15:00 and warmest overnight, which reverses the daily load pattern the benchmark is supposed to model. Please align the sign/phase with the intended warm-afternoon behavior and update the docstring so it matches the actual waveform.

Suggested fix
-* Diurnal cycle: cosine over the day, minimum at 05:00, maximum at 15:00.
+* Diurnal cycle: cosine over the day, maximum at 15:00 and minimum 12 hours earlier.
@@
-        diurnal = -params.diurnal_amp_C * math.cos(
+        diurnal = params.diurnal_amp_C * math.cos(
             2.0 * math.pi * (hour_of_day - 15.0) / 24.0
         )

Also applies to: 182-184

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/benchmark/weather/synthetic.py` at line 12, The diurnal cosine term is
inverted so it produces a minimum at ~15:00 and a maximum overnight; flip the
phase/sign of the cosine term used for the diurnal cycle (i.e., multiply by -1
or add π to the phase) so the waveform peaks at ~15:00 and has its trough at
~05:00, and update the corresponding docstring(s) that describe the diurnal
cycle to match the corrected phase; apply the same fix to the other instance of
the diurnal cosine term later in the file (the repeated block around the
synthetic weather diurnal calculation).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants