Commit e28981d

Polish comparison defaults, trust notes, and reporting metadata
1 parent d4ece14 commit e28981d

9 files changed

Lines changed: 193 additions & 7 deletions

README.md

Lines changed: 18 additions & 0 deletions
@@ -10,6 +10,10 @@
 - `p_value` (two-sided centered paired bootstrap test for `H0: delta = 0`),
 - `is_significant` + `significance_rule="centered_paired_bootstrap_p_value_lt_alpha"`.
 - The comparison output also includes trust/stability diagnostics (`ESS`, `ESS/N`, replay overlap, weight tails, clip/switch share, warning flags).
+- The high-level summary now includes explicit recommended-default metadata and structured note groups:
+  - `recommended_defaults` (recommended default modes),
+  - `info_notes`, `diagnostic_warnings`, `inference_warnings`, `trust_notes`,
+  - the resulting `trust_level` and a short `recommendation`.
 - All main OPE estimators have detailed docstrings in Russian (arguments, return values, interpretation).

 ## Installation
@@ -57,6 +61,15 @@ pip install -e .

 `compare_policies(...).to_dict()` and `diagnostics` return `propensity_source` and `propensity_column` (where applicable).

+### Recommended defaults (safe-by-default guidance)
+
+Official defaults for the general scenario:
+- preferred estimator: `dr`;
+- if the logged propensity is available and valid: `propensity_source="auto"` (prefers the logged path);
+- if the logged propensity is unavailable or invalid: fall back to the estimated propensity path;
+- `use_crossfit=True` is usually recommended for `dm/dr/sndr/switch_dr` when bias hardening matters;
+- `trust_level in {"caution", "elevated_concern"}` signals that the result should be interpreted more cautiously.
+
 ### Nuisance model diagnostics

 The high-level summary now includes a `nuisance_diagnostics` block:
@@ -304,3 +317,8 @@ jupyter nbconvert --to notebook --execute examples/tutorial.ipynb --inplace
 - `p_value` (centered paired bootstrap approximation for `H0: delta = 0`),
 - `inference_method`,
 - `alpha`.
+- Additionally, as part of the API polish:
+  - structured note/warning fields: `info_notes`, `diagnostic_warnings`, `inference_warnings`, `trust_notes`;
+  - `trust_level` + `recommendation`;
+  - `recommended_defaults` for an explicit safe-by-default workflow.
+- The `notes` field is kept for backward compatibility as the union of the structured groups.
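
A minimal sketch of how a consumer might read the structured fields listed above from a `to_dict()` payload. The helper name `summarize_trust` and the example payload values are hypothetical, and the sketch works on a plain dict rather than calling the library:

```python
from typing import Any

# Structured note groups named in the README diff above.
NOTE_GROUPS = ("info_notes", "diagnostic_warnings", "inference_warnings", "trust_notes")


def summarize_trust(result: dict[str, Any]) -> str:
    """Render the trust-related fields of a comparison dict as plain text."""
    parts = [f"trust_level={result.get('trust_level', 'ok')}"]
    for key in NOTE_GROUPS:
        values = result.get(key) or []
        if values:
            parts.append(f"{key}: {', '.join(values)}")
    recommendation = result.get("recommendation")
    if recommendation:
        parts.append(f"recommendation: {recommendation}")
    return "\n".join(parts)


# Hypothetical payload; real values come from compare_policies(...).to_dict().
example = {
    "trust_level": "caution",
    "info_notes": [],
    "diagnostic_warnings": ["low_ess_ratio"],
    "trust_notes": ["diagnostics_warnings_present_review_weight_overlap_metrics"],
    "recommendation": "Review diagnostics and inference warnings before making product decisions.",
}
print(summarize_trust(example))
```

Reading the separate groups instead of the combined `notes` field keeps informational messages apart from warnings that should affect how the result is interpreted.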

docs/architecture.md

Lines changed: 13 additions & 0 deletions
@@ -36,6 +36,8 @@
 7. **Comparison result**
    Summary of `V_A`, `V_B`, `delta`, CI, `p-value`, and diagnostics for comparing A vs B.
    Official orchestration path: `policyscope.comparison.compare_policies(...)`.
+   The summary additionally normalizes the note/warning groups:
+   `info_notes`, `diagnostic_warnings`, `inference_warnings`, `trust_notes`, and the aggregated `trust_level`.

 8. **Scalar target metric (core abstraction)**
    The basic unit of evaluation is a single scalar reward metric. Multiple metrics are supported as repeated evaluation runs over different target columns, not as a native vector-valued reward.
@@ -135,3 +137,14 @@ The harness supports comparing methods (`replay`, `i
 - cross-fit mode: diagnostics are marked as OOF (fold-aware provenance).

 This layer does not change the estimator formulas; it supports trust-quality interpretation of results.
+
+
+## 11) Recommended defaults (API-polish)
+
+To make the high-level API opinionated and safer by default, the comparison metadata records these recommendations:
+- `preferred_estimator_general_use = "dr"`;
+- `preferred_propensity_mode_when_logged_available = "auto"`;
+- `preferred_propensity_fallback_when_logged_unavailable = "estimated"`;
+- a cross-fit recommendation for `dm/dr/sndr/switch_dr`.
+
+This is a guidance and metadata layer; the math of the implemented estimators is unchanged.
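
As an illustration of the recommendations above, the following sketch turns them into keyword arguments a caller could pass on. The constant values are copied from this commit; the helper `default_call_kwargs` is hypothetical, and whether `compare_policies` accepts exactly these keyword values in every mode is an assumption based on the README text:

```python
# Values mirror the constants added in src/policyscope/comparison.py.
RECOMMENDED_ESTIMATOR = "dr"
RECOMMENDED_PROPENSITY_SOURCE_WITH_LOGGED = "auto"
RECOMMENDED_PROPENSITY_SOURCE_FALLBACK = "estimated"
RECOMMENDED_CROSSFIT_ESTIMATORS = frozenset({"dm", "dr", "sndr", "switch_dr"})


def default_call_kwargs(
    estimator: str = RECOMMENDED_ESTIMATOR,
    logged_propensity_valid: bool = True,
) -> dict:
    """Hypothetical helper: pick safe-by-default arguments for a comparison call."""
    if logged_propensity_valid:
        source = RECOMMENDED_PROPENSITY_SOURCE_WITH_LOGGED
    else:
        source = RECOMMENDED_PROPENSITY_SOURCE_FALLBACK
    return {
        "estimator": estimator,
        "propensity_source": source,
        # Cross-fitting is only recommended for the model-based estimators.
        "use_crossfit": estimator in RECOMMENDED_CROSSFIT_ESTIMATORS,
    }


print(default_call_kwargs())
print(default_call_kwargs(estimator="ips", logged_propensity_valid=False))
```

Keeping the defaults as module-level constants means documentation, metadata, and callers can refer to one source of truth instead of repeating string literals.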

docs/validation_harness.md

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@
 - `Delta_CI` coverage (if the CI is computed);
 - frequency of the significance decision (`is_significant`);
 - diagnostics fields (e.g. `weight_ess_ratio`, `weight_p99`);
+- trust metadata (`trust_level`, structured warnings/notes), to show when conclusions should be considered less reliable;
 - provenance (`propensity_source_used`, `propensity_column_used`);
 - nuisance-quality summaries (e.g. behavior log-loss, outcome log-loss/RMSE) for comparing modes.
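
One way the new `trust_level` column could be summarized across validation runs is as a share per level. This is only a sketch: `trust_level_shares` is a hypothetical helper, and the rows are plain dicts standing in for harness output rather than the real `ValidationRunRow` objects:

```python
from collections import Counter


def trust_level_shares(rows: list[dict]) -> dict[str, float]:
    """Return the fraction of runs per trust level; missing values count as 'unknown'."""
    counts = Counter(row.get("trust_level") or "unknown" for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {level: n / total for level, n in counts.items()}


# Hypothetical rows for illustration only.
rows = [
    {"trust_level": "ok"},
    {"trust_level": "ok"},
    {"trust_level": "caution"},
    {"trust_level": None},
]
print(trust_level_shares(rows))
```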

src/policyscope/__init__.py

Lines changed: 14 additions & 1 deletion
@@ -10,8 +10,21 @@

 import logging

+from policyscope.comparison import (
+    RECOMMENDED_CROSSFIT_ESTIMATORS,
+    RECOMMENDED_ESTIMATOR,
+    RECOMMENDED_PROPENSITY_SOURCE_FALLBACK,
+    RECOMMENDED_PROPENSITY_SOURCE_WITH_LOGGED,
+)
+
 logging.basicConfig(level=logging.INFO, format="%(message)s")

-__all__ = ["__version__"]
+__all__ = [
+    "__version__",
+    "RECOMMENDED_ESTIMATOR",
+    "RECOMMENDED_PROPENSITY_SOURCE_WITH_LOGGED",
+    "RECOMMENDED_PROPENSITY_SOURCE_FALLBACK",
+    "RECOMMENDED_CROSSFIT_ESTIMATORS",
+]

 __version__ = "0.1.0"

src/policyscope/comparison.py

Lines changed: 93 additions & 2 deletions
@@ -20,6 +20,11 @@
     resolve_behavior_predictions,
 )

+RECOMMENDED_ESTIMATOR = "dr"
+RECOMMENDED_PROPENSITY_SOURCE_WITH_LOGGED = "auto"
+RECOMMENDED_PROPENSITY_SOURCE_FALLBACK = "estimated"
+RECOMMENDED_CROSSFIT_ESTIMATORS = frozenset({"dm", "dr", "sndr", "switch_dr"})
+

 @dataclass(frozen=True)
 class PolicyValueResult:
@@ -47,6 +52,12 @@ class PolicyComparisonSummary:
     inference_warnings: tuple[str, ...] = field(default_factory=tuple)
     diagnostics: PolicyDiagnostics | None = None
     notes: tuple[str, ...] = field(default_factory=tuple)
+    info_notes: tuple[str, ...] = field(default_factory=tuple)
+    diagnostic_warnings: tuple[str, ...] = field(default_factory=tuple)
+    trust_notes: tuple[str, ...] = field(default_factory=tuple)
+    trust_level: str = "ok"
+    recommendation: Optional[str] = None
+    recommended_defaults: dict[str, object] = field(default_factory=dict)
     propensity_source: Optional[str] = None
     propensity_column: Optional[str] = None
     nuisance_diagnostics: Optional[NuisanceDiagnostics] = None
@@ -60,7 +71,15 @@ def to_dict(self) -> dict:
             "Delta": self.delta,
             "diagnostics": self.diagnostics.to_dict() if self.diagnostics is not None else {},
             "notes": list(self.notes),
+            "info_notes": list(self.info_notes),
+            "diagnostic_warnings": list(self.diagnostic_warnings),
+            "trust_notes": list(self.trust_notes),
+            "trust_level": self.trust_level,
         }
+        if self.recommendation is not None:
+            out["recommendation"] = self.recommendation
+        if self.recommended_defaults:
+            out["recommended_defaults"] = self.recommended_defaults
         if self.v_a_ci is not None:
             out["V_A_CI"] = self.v_a_ci
         if self.v_b_ci is not None:
@@ -90,6 +109,49 @@ def to_dict(self) -> dict:
         return out


+def _recommended_defaults(estimator: str) -> dict[str, object]:
+    return {
+        "preferred_estimator_general_use": RECOMMENDED_ESTIMATOR,
+        "preferred_propensity_mode_when_logged_available": RECOMMENDED_PROPENSITY_SOURCE_WITH_LOGGED,
+        "preferred_propensity_fallback_when_logged_unavailable": RECOMMENDED_PROPENSITY_SOURCE_FALLBACK,
+        "crossfit_recommended_for_estimator": estimator in RECOMMENDED_CROSSFIT_ESTIMATORS,
+    }
+
+
+def _build_trust_metadata(
+    *,
+    estimator: str,
+    use_crossfit: bool,
+    propensity_notes: tuple[str, ...],
+    diagnostic_warnings: tuple[str, ...],
+    inference_warnings: tuple[str, ...],
+) -> tuple[tuple[str, ...], tuple[str, ...], str, Optional[str]]:
+    info_notes = list(dict.fromkeys(propensity_notes))
+    trust_notes: list[str] = []
+    risk_score = 0
+    if diagnostic_warnings:
+        risk_score += len(diagnostic_warnings)
+        trust_notes.append("diagnostics_warnings_present_review_weight_overlap_metrics")
+    if inference_warnings:
+        risk_score += len(inference_warnings)
+        trust_notes.append("inference_warnings_present_ci_and_p_value_less_stable")
+    if estimator in RECOMMENDED_CROSSFIT_ESTIMATORS and not use_crossfit:
+        info_notes.append("crossfit_optional_recommendation_for_bias_hardening")
+    if any(w in {"low_ess_ratio", "heavy_weight_tail", "extreme_max_weight"} for w in diagnostic_warnings):
+        risk_score += 1
+        trust_notes.append("trust_elevated_concern_unstable_importance_weights")
+
+    trust_level = "ok"
+    recommendation = None
+    if risk_score >= 3:
+        trust_level = "elevated_concern"
+        recommendation = "Treat comparison as directional; improve overlap/weights or collect more representative logs."
+    elif risk_score > 0:
+        trust_level = "caution"
+        recommendation = "Review diagnostics and inference warnings before making product decisions."
+    return tuple(info_notes), tuple(trust_notes), trust_level, recommendation
+
+
 @dataclass(frozen=True)
 class MultiMetricComparisonResult:
     estimator: str
@@ -263,14 +325,29 @@ def point_on(part: pd.DataFrame) -> float:
     )

     if not with_ci:
+        diag_warnings = tuple(diag.warnings)
+        info_notes, trust_notes, trust_level, recommendation = _build_trust_metadata(
+            estimator=estimator,
+            use_crossfit=use_crossfit,
+            propensity_notes=propensity_notes,
+            diagnostic_warnings=diag_warnings,
+            inference_warnings=tuple(),
+        )
+        notes = tuple(dict.fromkeys(info_notes + diag_warnings + trust_notes))
         return PolicyComparisonSummary(
             estimator=estimator,
             target=target,
             v_a=float(v_a),
             v_b=float(v_b),
             delta=float(v_b - v_a),
             diagnostics=diag,
-            notes=propensity_notes + tuple(diag.warnings),
+            notes=notes,
+            info_notes=info_notes,
+            diagnostic_warnings=diag_warnings,
+            trust_notes=trust_notes,
+            trust_level=trust_level,
+            recommendation=recommendation,
+            recommended_defaults=_recommended_defaults(estimator),
             propensity_source=diag.propensity_source or resolved_source,
             propensity_column=diag.propensity_column or resolved_propensity_col,
             nuisance_diagnostics=nuisance_diag,
@@ -291,7 +368,15 @@ def estimator_pair(part: pd.DataFrame):
     inference_warnings = tuple(inf.get("inference_warnings", []))
     if fallback_triggered["value"]:
         inference_warnings = inference_warnings + (external_nuisance_bootstrap_warning,)
-    notes = propensity_notes + tuple(diag.warnings) + inference_warnings
+    diag_warnings = tuple(diag.warnings)
+    info_notes, trust_notes, trust_level, recommendation = _build_trust_metadata(
+        estimator=estimator,
+        use_crossfit=use_crossfit,
+        propensity_notes=propensity_notes,
+        diagnostic_warnings=diag_warnings,
+        inference_warnings=inference_warnings,
+    )
+    notes = tuple(dict.fromkeys(info_notes + diag_warnings + inference_warnings + trust_notes))
     return PolicyComparisonSummary(
         estimator=estimator,
         target=target,
@@ -310,6 +395,12 @@ def estimator_pair(part: pd.DataFrame):
         inference_warnings=inference_warnings,
         diagnostics=diag,
         notes=notes,
+        info_notes=info_notes,
+        diagnostic_warnings=diag_warnings,
+        trust_notes=trust_notes,
+        trust_level=trust_level,
+        recommendation=recommendation,
+        recommended_defaults=_recommended_defaults(estimator),
         propensity_source=diag.propensity_source or resolved_source,
         propensity_column=diag.propensity_column or resolved_propensity_col,
         nuisance_diagnostics=nuisance_diag,
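
To make the scoring rule in `_build_trust_metadata` easier to follow, here is a standalone restatement of just the risk-score part, written for illustration and not the library function itself. The function name `trust_level_for` and the warning names `some_other_warning` and `bootstrap_warning` are hypothetical; the three weight-related warning names and the thresholds are taken from the diff:

```python
# Warnings that add one extra risk point because they indicate unstable importance weights.
WEIGHT_WARNINGS = {"low_ess_ratio", "heavy_weight_tail", "extreme_max_weight"}


def trust_level_for(diagnostic_warnings=(), inference_warnings=()) -> str:
    """Map warning tuples to 'ok', 'caution', or 'elevated_concern'."""
    # One point per warning of either kind.
    risk_score = len(diagnostic_warnings) + len(inference_warnings)
    # One extra point if any weight-stability warning is present.
    if any(w in WEIGHT_WARNINGS for w in diagnostic_warnings):
        risk_score += 1
    if risk_score >= 3:
        return "elevated_concern"
    if risk_score > 0:
        return "caution"
    return "ok"


print(trust_level_for())                                      # no warnings
print(trust_level_for(("low_ess_ratio",)))                    # score 1 + 1 = 2
print(trust_level_for(("low_ess_ratio", "heavy_weight_tail")))  # score 2 + 1 = 3
```

Under this rule a single weight-related warning yields `caution`, while two of them already reach `elevated_concern`, because the extra point is added once on top of the per-warning count.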

src/policyscope/report.py

Lines changed: 11 additions & 3 deletions
@@ -48,22 +48,30 @@ def decision_summary(res: Dict, metric_name: str, business_threshold: float = 0.
     V_A = res["V_A"]
     V_B = res["V_B"]
     D = res["Delta"]
+    alpha = float(res.get("alpha", 0.05))
+    ci_level = int(round((1.0 - alpha) * 100))
     A_lo, A_hi = res["V_A_CI"]
     B_lo, B_hi = res["V_B_CI"]
     D_lo, D_hi = res["Delta_CI"]

     lines = []
     lines.append(f"Метрика: {metric_name}")
-    lines.append(f"V(A) = {V_A:.6f} (95% CI: {A_lo:.6f} .. {A_hi:.6f})")
-    lines.append(f"V(B) = {V_B:.6f} (95% CI: {B_lo:.6f} .. {B_hi:.6f})")
-    lines.append(f"Delta (B−A) = {D:.6f} (95% CI: {D_lo:.6f} .. {D_hi:.6f})")
+    lines.append(f"V(A) = {V_A:.6f} ({ci_level}% CI: {A_lo:.6f} .. {A_hi:.6f})")
+    lines.append(f"V(B) = {V_B:.6f} ({ci_level}% CI: {B_lo:.6f} .. {B_hi:.6f})")
+    lines.append(f"Delta (B−A) = {D:.6f} ({ci_level}% CI: {D_lo:.6f} .. {D_hi:.6f})")

     if D_lo > business_threshold:
         lines.append(f"Решение: модель B лучше A, поскольку нижняя граница CI превышает порог {business_threshold}.")
     elif D_hi < -business_threshold:
         lines.append(f"Решение: модель A лучше B, поскольку верхняя граница CI ниже -{business_threshold}.")
     else:
         lines.append("Решение: статистически значимого отличия не обнаружено или эффект слишком мал.")
+    recommendation = res.get("recommendation")
+    trust_level = res.get("trust_level")
+    if trust_level is not None:
+        lines.append(f"Уровень доверия к оценке: {trust_level}.")
+    if recommendation:
+        lines.append(f"Рекомендация: {recommendation}")
     return "\n".join(lines)

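
The CI label in the report is now derived from `alpha` instead of being hard-coded to 95%. A small sketch of that arithmetic, using a hypothetical helper `ci_label` that is not part of the commit:

```python
def ci_label(alpha: float = 0.05) -> str:
    """Format the confidence-level label the same way decision_summary does."""
    # (1 - alpha) * 100, rounded to the nearest integer percent.
    ci_level = int(round((1.0 - alpha) * 100))
    return f"{ci_level}% CI"


print(ci_label())      # default alpha
print(ci_label(0.1))
print(ci_label(0.01))
```

Because the level is rounded to a whole percent, an `alpha` that does not map to an integer level is shown approximately in the label.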

src/policyscope/validation.py

Lines changed: 2 additions & 0 deletions
@@ -44,6 +44,7 @@ class ValidationRunRow:
     p_value: Optional[float]
     propensity_source_used: Optional[str]
     propensity_column_used: Optional[str]
+    trust_level: Optional[str]
     ess_ratio: Optional[float]
     weight_p99: Optional[float]
     behavior_log_loss: Optional[float]
@@ -191,6 +192,7 @@ def run_simulation_validation(
             p_value=summary.p_value,
             propensity_source_used=summary.propensity_source,
             propensity_column_used=summary.propensity_column,
+            trust_level=summary.trust_level,
             ess_ratio=diag.get("weight_ess_ratio"),
             weight_p99=diag.get("weight_p99"),
             behavior_log_loss=(

tests/test_bootstrap_report.py

Lines changed: 17 additions & 0 deletions
@@ -160,3 +160,20 @@ def test_decision_summary_outcomes():
     res_neu = {**base, "Delta": 0.0, "Delta_CI": (-0.03, 0.04)}
     txt_neu = decision_summary(res_neu, "metric", business_threshold=0.01)
     assert "статистически значимого отличия" in txt_neu
+
+
+def test_decision_summary_uses_alpha_from_result():
+    res = {
+        "V_A": 0.2,
+        "V_B": 0.25,
+        "Delta": 0.05,
+        "V_A_CI": (0.18, 0.22),
+        "V_B_CI": (0.20, 0.30),
+        "Delta_CI": (0.01, 0.09),
+        "alpha": 0.1,
+        "trust_level": "caution",
+        "recommendation": "check diagnostics",
+    }
+    txt = decision_summary(res, "metric", business_threshold=0.0)
+    assert "90% CI" in txt
+    assert "Уровень доверия к оценке: caution." in txt

tests/test_comparison.py

Lines changed: 24 additions & 1 deletion
@@ -46,6 +46,9 @@ def test_official_comparison_entrypoint_shape():
     assert "V_A_CI" in d and "Delta_CI" in d
     assert "diagnostics" in d and "weight_ess_ratio" in d["diagnostics"]
     assert 0.0 <= d["p_value"] <= 1.0
+    assert d["recommended_defaults"]["preferred_estimator_general_use"] == "dr"
+    assert "info_notes" in d and "diagnostic_warnings" in d and "trust_notes" in d
+    assert d["trust_level"] in {"ok", "caution", "elevated_concern"}


 def test_multi_target_repeated_scalar_evaluation():
@@ -223,10 +226,30 @@ def test_propensity_source_auto_fallback_and_metadata():
         propensity_col="missing_propensity",
     )
     assert summary.propensity_source == "estimated"
-    assert any("fallback" in n for n in summary.notes)
+    assert any("fallback" in n for n in summary.info_notes)
     assert summary.to_dict()["diagnostics"]["propensity_source"] == "estimated"


+def test_notes_are_structured_and_legacy_notes_remain_compatible():
+    logs, policyB = _prepare_env(114)
+    summary = compare_policies(
+        logs,
+        policyB,
+        estimator="dr",
+        target="accept",
+        feature_cols=["loyal", "age", "risk", "income"],
+        action_col="a_A",
+        with_ci=True,
+        n_boot=10,
+    )
+    assert isinstance(summary.info_notes, tuple)
+    assert isinstance(summary.diagnostic_warnings, tuple)
+    assert isinstance(summary.inference_warnings, tuple)
+    assert isinstance(summary.trust_notes, tuple)
+    # Legacy combined notes stays available for backward-compatible consumers.
+    assert set(summary.info_notes).issubset(set(summary.notes))
+
+
 def test_propensity_source_logged_requires_valid_column():
     logs, policyB = _prepare_env(110)
     logs = logs.copy()
