2 changes: 1 addition & 1 deletion .github/workflows/test.yaml
Original file line number Diff line number Diff line change
@@ -36,7 +36,7 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: ["3.7", "3.8", "3.9", "3.10"]
python-version: ["3.8", "3.9", "3.11", "3.12"]

steps:
- uses: actions/checkout@v2
88 changes: 88 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,88 @@
# Ambrosia

A/B testing framework for experiment design, group splitting, and results evaluation.
Supports both pandas and Spark DataFrames.

## Commands

```bash
make install # create .venv via Poetry (poetry install --all-extras)
make test # run pytest with coverage
make lint # isort + black + pylint + flake8 (checks only)
make autoformat # isort + black (fix in place)
make clean # remove .venv, build artifacts, reports/
```

Single test: `PYTHONPATH=. pytest tests/path/test_file.py::test_fn`

Line length: **120**.

## Architecture

### Three-stage pipeline

`Designer` → `Splitter` → `Tester` are independent, largely stateless classes.
No shared state between stages; each takes a DataFrame and parameters.
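The stage contract can be sketched in pure Python (class names follow the text above, but the signatures here are illustrative stand-ins, not Ambrosia's real API):

```python
# Illustrative sketch of the three independent stages; real Ambrosia
# signatures differ. Each stage takes data plus parameters and returns
# a result -- no state is shared between stages.
class Designer:
    def run(self, data, effect):
        # stand-in for a power-analysis-based group size estimate
        return {"group_size": max(10, int(len(data) * effect))}


class Splitter:
    def run(self, data, group_size):
        # stand-in split: first slice is group A, next slice is group B
        return data[:group_size], data[group_size : 2 * group_size]


class Tester:
    def run(self, group_a, group_b):
        mean = lambda xs: sum(xs) / len(xs)
        return {"effect": mean(group_b) - mean(group_a)}


data = list(range(100))
design = Designer().run(data, effect=0.2)
group_a, group_b = Splitter().run(data, design["group_size"])
result = Tester().run(group_a, group_b)
```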

### Pandas/Spark dispatch

Never subclass for pandas vs. Spark. Instead use `DataframeHandler` or the
free function `choose_on_table(alternatives, dataframe)` in
`ambrosia/tools/ab_abstract_component.py`:

```python
choose_on_table([pandas_func, spark_func], dataframe)
```

`DataframeHandler._handle_cases` / `_handle_on_table` wrap this pattern for
method dispatch in handlers (e.g. `TheoryHandler`, `EmpiricHandler`).
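A stripped-down model of this dispatch (the real helper checks pandas vs. Spark DataFrame types; the list-based type check below stands in purely for illustration):

```python
# Toy model of choose_on_table: pick the implementation matching the table
# type and call it. The real function in ambrosia/tools/ab_abstract_component.py
# distinguishes pd.DataFrame from a Spark DataFrame; a plain list stands in
# for the "pandas" case here.
def choose_on_table(alternatives, dataframe):
    pandas_like = isinstance(dataframe, list)  # stand-in type check
    chosen = alternatives[0] if pandas_like else alternatives[1]
    return chosen(dataframe)


pandas_func = lambda df: f"pandas path, {len(df)} rows"
spark_func = lambda df: "spark path"

result = choose_on_table([pandas_func, spark_func], [1, 2, 3])
```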

### ABMetaClass

`ABMetaClass(ABCMeta, YAMLObjectMetaclass)` in `ab_abstract_component.py`
resolves the metaclass conflict between `ABCMeta` and PyYAML's
`YAMLObjectMetaclass`. Any class that inherits from `ABToolAbstract` **and**
needs YAML serialization must set `metaclass=ABMetaClass`.

### ABToolAbstract._prepare_arguments()

Constructor args are "saved" defaults; `run()` args can override them at
call time. `_prepare_arguments` resolves the priority:
run-time arg → constructor arg → `ValueError` if both are None.

```python
chosen = _prepare_arguments({"alpha": [self._alpha, given_alpha]})
```

### Stat criteria strategy pattern

Hierarchy: `StatCriterion` (abstract, just `calculate_pvalue`) →
`ABStatCriterion` (adds `calculate_effect`, `calculate_conf_interval`,
`get_results`).

Concrete implementations in `ambrosia/tools/stat_criteria.py`:
`TtestIndCriterion`, `TtestRelCriterion`, `MannWhitneyCriterion`,
`WilcoxonCriterion`.

`Tester` dispatches by string alias via `AVAILABLE_AB_CRITERIA` dict — duck
typing, not isinstance checks. To add a criterion: subclass `ABStatCriterion`,
set `alias` and `implemented_effect_types` class attributes, register in the
dict.
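Registering a new criterion then looks roughly like this (a hypothetical sign-test stub; the real `ABStatCriterion` base class requires more methods than shown, and the toy statistic below is not a real p-value):

```python
AVAILABLE_AB_CRITERIA = {}  # stand-in for the registry dict in ambrosia


class SignCriterion:
    """Hypothetical criterion stub following the pattern described above."""

    alias = "sign"
    implemented_effect_types = ["absolute"]

    def calculate_pvalue(self, group_a, group_b):
        # toy statistic, not a real p-value: one minus the share of paired
        # observations where B beats A
        wins = sum(b > a for a, b in zip(group_a, group_b))
        return 1.0 - wins / len(group_a)


AVAILABLE_AB_CRITERIA[SignCriterion.alias] = SignCriterion

pvalue = AVAILABLE_AB_CRITERIA["sign"]().calculate_pvalue([1, 2, 3], [2, 3, 1])
```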

### Preprocessor chain

`Preprocessor` (pandas only) uses method chaining — each method returns
`self`. Each step appends a fitted `AbstractFittableTransformer` to
`self.transformers`. The transformer list supports serialization
(`store_transformations` / `load_transformations` → JSON) and replay
(`apply_transformations`) for consistent train/test preprocessing.
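The shape of the chain, in miniature (a toy stand-in; the real `Preprocessor` wraps `AbstractFittableTransformer` steps and JSON-serializes their fitted parameters):

```python
import json


class TinyPreprocessor:
    """Toy model of the chaining + replay pattern described above."""

    def __init__(self, data):
        self.data = list(data)
        self.transformers = []  # fitted steps, replayable on new data

    def scale(self, factor):
        self.data = [x * factor for x in self.data]
        self.transformers.append({"step": "scale", "factor": factor})
        return self  # returning self is what enables chaining

    def shift(self, offset):
        self.data = [x + offset for x in self.data]
        self.transformers.append({"step": "shift", "offset": offset})
        return self

    def store_transformations(self):
        return json.dumps(self.transformers)


prep = TinyPreprocessor([1.0, 2.0]).scale(10).shift(1)
```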

### Theoretical vs empirical design

Two design philosophies plug into the same `SimpleDesigner` interface:

- **Theoretical** (`TheoryHandler`): closed-form power/sample-size formulas
- **Empirical** (`EmpiricHandler`): bootstrap/simulation-based estimates

Both implement `size_design`, `effect_design`, `power_design` and dispatch
pandas vs. Spark internally via `DataframeHandler`.
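For the theoretical branch, the standard closed-form two-sample size formula gives a feel for what `TheoryHandler` computes (this is the textbook formula, not necessarily Ambrosia's exact implementation, which may apply corrections):

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(sigma, delta, alpha=0.05, power=0.8):
    """Textbook per-group size for a two-sided two-sample z/t test:
    n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta) ** 2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)
```

For example, detecting a 0.5-sigma shift at alpha=0.05 with 80% power needs 63 units per group.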
3 changes: 2 additions & 1 deletion ambrosia/preprocessing/__init__.py
@@ -21,7 +21,7 @@
from .ml_var_reducer import MLVarianceReducer
from .preprocessor import Preprocessor
from .robust import IQRPreprocessor, RobustPreprocessor
from .transformers import BoxCoxTransformer, LogTransformer
from .transformers import BoxCoxTransformer, LinearizationTransformer, LogTransformer

__all__ = [
"AggregatePreprocessor",
@@ -32,5 +32,6 @@
"RobustPreprocessor",
"IQRPreprocessor",
"BoxCoxTransformer",
"LinearizationTransformer",
"LogTransformer",
]
46 changes: 45 additions & 1 deletion ambrosia/preprocessing/preprocessor.py
@@ -34,7 +34,7 @@
from ambrosia.preprocessing.aggregate import AggregatePreprocessor
from ambrosia.preprocessing.cuped import Cuped, MultiCuped
from ambrosia.preprocessing.robust import IQRPreprocessor, RobustPreprocessor
from ambrosia.preprocessing.transformers import BoxCoxTransformer, LogTransformer
from ambrosia.preprocessing.transformers import BoxCoxTransformer, LinearizationTransformer, LogTransformer


class Preprocessor:
@@ -378,6 +378,50 @@ def multicuped(
self.transformers.append(transformer)
return self

def linearize(
self,
numerator: types.ColumnNameType,
denominator: types.ColumnNameType,
transformed_name: Optional[types.ColumnNameType] = None,
load_path: Optional[Path] = None,
) -> Preprocessor:
"""
Linearize a ratio metric for use in A/B testing.

Computes a per-unit linearized value that is approximately normally
distributed, enabling correct t-test usage for ratio metrics:

linearized_i = numerator_i - ratio * denominator_i

where ratio = mean(numerator) / mean(denominator) is estimated on
the data passed to this ``Preprocessor`` instance (reference / control data).

Parameters
----------
numerator : ColumnNameType
Column name of the ratio numerator (e.g. ``"revenue"``).
denominator : ColumnNameType
Column name of the ratio denominator (e.g. ``"orders"``).
transformed_name : ColumnNameType, optional
Name for the new linearized column. Defaults to
``"{numerator}_lin"``.
load_path : Path, optional
Path to a json file with pre-fitted parameters.

Returns
-------
self : Preprocessor
Instance object.
"""
transformer = LinearizationTransformer()
if load_path is None:
transformer.fit_transform(self.dataframe, numerator, denominator, transformed_name, inplace=True)
else:
transformer.load_params(load_path)
transformer.transform(self.dataframe, inplace=True)
self.transformers.append(transformer)
return self

def transformations(self) -> List:
"""
List of all transformations which were called.
133 changes: 132 additions & 1 deletion ambrosia/preprocessing/transformers.py
@@ -16,7 +16,7 @@
Module contains tools for metrics transformations during a
preprocessing task.
"""
from typing import Dict, Union
from typing import Dict, Optional, Union

import numpy as np
import pandas as pd
@@ -386,3 +386,134 @@ def inverse_transform(self, dataframe: pd.DataFrame, inplace: bool = False) -> U
transformed: pd.DataFrame = dataframe if inplace else dataframe.copy()
transformed[self.column_names] = np.exp(transformed[self.column_names].values)
return None if inplace else transformed


class LinearizationTransformer(AbstractFittableTransformer):
"""
Linearization transformer for ratio metrics.

Converts a ratio metric (numerator / denominator) into a per-unit linearized
metric that is approximately normally distributed, enabling correct t-test usage:

linearized_i = numerator_i - ratio * denominator_i

where ratio = mean(numerator) / mean(denominator), estimated on the reference
(control group / historical) data passed to fit().

Parameters
----------
numerator : str
Column name of the ratio numerator (e.g. "revenue").
denominator : str
Column name of the ratio denominator (e.g. "orders").
transformed_name : str, optional
Name for the new column. Defaults to ``"{numerator}_lin"``.

Examples
--------
>>> transformer = LinearizationTransformer()
>>> transformer.fit(control_df, "revenue", "orders", "arpu_lin")
>>> transformer.transform(experiment_df, inplace=True)
"""

def __str__(self) -> str:
return "Linearization transformation"

def __init__(self) -> None:
self.numerator: Optional[str] = None
self.denominator: Optional[str] = None
self.transformed_name: Optional[str] = None
self.ratio: Optional[float] = None
super().__init__()

def get_params_dict(self) -> Dict:
self._check_fitted()
return {
"numerator": self.numerator,
"denominator": self.denominator,
"transformed_name": self.transformed_name,
"ratio": self.ratio,
}

def load_params_dict(self, params: Dict) -> None:
for key in ("numerator", "denominator", "transformed_name", "ratio"):
if key not in params:
raise TypeError(f"params argument must contain: {key}")
setattr(self, key, params[key])
self.fitted = True

def fit(
self,
dataframe: pd.DataFrame,
numerator: str,
denominator: str,
transformed_name: Optional[str] = None,
):
"""
Estimate ratio = mean(numerator) / mean(denominator) on reference data.

Parameters
----------
dataframe : pd.DataFrame
Reference dataframe (typically control group or historical data).
numerator : str
Column name of the ratio numerator.
denominator : str
Column name of the ratio denominator.
transformed_name : str, optional
Name for the linearized column. Defaults to ``"{numerator}_lin"``.
"""
self._check_cols(dataframe, [numerator, denominator])
denom_mean = dataframe[denominator].mean()
if denom_mean == 0:
raise ValueError(f"Mean of denominator column '{denominator}' is zero; cannot compute ratio.")
self.numerator = numerator
self.denominator = denominator
self.transformed_name = transformed_name if transformed_name is not None else f"{numerator}_lin"
self.ratio = dataframe[numerator].mean() / denom_mean
self.fitted = True
return self

def transform(self, dataframe: pd.DataFrame, inplace: bool = False) -> Union[pd.DataFrame, None]:
"""
Apply linearization: transformed = numerator - ratio * denominator.

Parameters
----------
dataframe : pd.DataFrame
Dataframe to transform.
inplace : bool, default: ``False``
If ``True`` modifies dataframe in place, otherwise returns a copy.
"""
self._check_fitted()
self._check_cols(dataframe, [self.numerator, self.denominator])
df = dataframe if inplace else dataframe.copy()
df[self.transformed_name] = df[self.numerator] - self.ratio * df[self.denominator]
return None if inplace else df

def fit_transform(
self,
dataframe: pd.DataFrame,
numerator: str,
denominator: str,
transformed_name: Optional[str] = None,
inplace: bool = False,
) -> Union[pd.DataFrame, None]:
"""
Fit and transform in one step.

Parameters
----------
dataframe : pd.DataFrame
Reference dataframe for fitting and transformation.
numerator : str
Column name of the ratio numerator.
denominator : str
Column name of the ratio denominator.
transformed_name : str, optional
Name for the linearized column.
inplace : bool, default: ``False``
If ``True`` modifies dataframe in place.
"""
self.fit(dataframe, numerator, denominator, transformed_name)
return self.transform(dataframe, inplace)
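A quick pure-Python sanity check of the linearization formula (independent of the class above): when the ratio is fitted and applied on the same data, the linearized column sums to exactly zero, because `sum(numerator) - ratio * sum(denominator)` cancels by construction when both columns have equal length.

```python
# Numeric check of: linearized_i = numerator_i - ratio * denominator_i,
# with ratio = mean(numerator) / mean(denominator). Values are illustrative.
numerator = [10.0, 25.0, 30.0]   # e.g. revenue per user
denominator = [1.0, 2.0, 3.0]    # e.g. orders per user

ratio = (sum(numerator) / len(numerator)) / (sum(denominator) / len(denominator))
linearized = [n - ratio * d for n, d in zip(numerator, denominator)]
```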
19 changes: 16 additions & 3 deletions ambrosia/tester/handlers.py
@@ -51,14 +51,23 @@ class SparkCriteria(enum.Enum):

class TheoreticalTesterHandler:
def __init__(
self, group_a, group_b, column: str, alpha: np.ndarray, effect_type: str, criterion: StatCriterion, **kwargs
self,
group_a,
group_b,
column: str,
alpha: np.ndarray,
effect_type: str,
criterion: StatCriterion,
metric_func=None,
**kwargs,
):
self.group_a = group_a
self.group_b = group_b
self.column = column
self.alpha = alpha
self.effect_type = effect_type
self.criterion = criterion
self.metric_func = metric_func
self.kwargs = kwargs

def _correct_criterion(self, criterion: tp.Any) -> bool:
@@ -79,8 +88,12 @@ def get_criterion(self, criterion: str, data_example: types.SparkOrPandas):

def _set_kwargs(self):
if isinstance(self.group_a, pd.DataFrame):
self.group_a = self.group_a[self.column].values
self.group_b = self.group_b[self.column].values
if self.metric_func is not None:
self.group_a = np.asarray(self.metric_func(self.group_a))
self.group_b = np.asarray(self.metric_func(self.group_b))
else:
self.group_a = self.group_a[self.column].values
self.group_b = self.group_b[self.column].values
elif isinstance(self.group_a, types.SparkDataFrame):
self.kwargs["column"] = self.column
self.kwargs["alpha"] = self.alpha