[pipeline] Add polymorphic Aggregate API for custom aggregation operations #1289
Merged
@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this in D94108462. (Because this pull request was imported automatically, there will not be any future comments.)
Adds support for custom aggregation logic in SPDL pipelines via a new `Aggregator` abstract base class.
**Background:**
Previously, `Aggregate()` only supported fixed-size batching (e.g., `aggregate(3)` to buffer 3 items). There was no way to implement custom aggregation logic like size-based batching, time-windowed aggregation, or conditional grouping.
**New Feature:**
The `Aggregator` ABC enables custom aggregation with two methods:
- `accumulate(item)` - Called for each item; return aggregated result when ready, or `None` to continue buffering
- `flush()` - Called at stream end to emit remaining buffered items
The `drop_last` parameter controls end-of-stream behavior:
- `drop_last=False` (default): Calls `flush()` to emit remaining items
- `drop_last=True`: Skips `flush()`, dropping incomplete batches
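To make the `accumulate()`/`flush()` contract and the `drop_last` semantics concrete, here is a toy driver loop together with a minimal fixed-size aggregator. This is an illustration only; the class and the driver are hypothetical stand-ins, not SPDL's actual internals.

```python
class FixedSize:
    """Toy stand-in following the Aggregator protocol (hypothetical)."""

    def __init__(self, n: int):
        self.n = n
        self.buf: list = []

    def accumulate(self, item):
        self.buf.append(item)
        if len(self.buf) == self.n:
            out, self.buf = self.buf, []
            return out
        return None

    def flush(self):
        out, self.buf = self.buf, []
        return out or None


def run_aggregator(items, aggregator, drop_last=False):
    """Illustrative driver: how a pipeline might call an aggregator."""
    results = []
    for item in items:
        out = aggregator.accumulate(item)
        if out is not None:  # the aggregator decided a batch is ready
            results.append(out)
    if not drop_last:  # emit whatever is still buffered at stream end
        out = aggregator.flush()
        if out is not None:
            results.append(out)
    return results


print(run_aggregator(range(5), FixedSize(2)))                  # [[0, 1], [2, 3], [4]]
print(run_aggregator(range(5), FixedSize(2), drop_last=True))  # [[0, 1], [2, 3]]
```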
**Example:**
```python
from spdl.pipeline import PipelineBuilder
from spdl.pipeline.defs import Aggregator

class SizeBasedAggregator(Aggregator):
    '''Emit when total string length exceeds threshold.'''

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.buffer: list[str] = []
        self.size = 0

    def accumulate(self, item: str) -> str | None:
        self.buffer.append(item)
        self.size += len(item)
        if self.size >= self.threshold:
            result = "".join(self.buffer)
            self.buffer, self.size = [], 0
            return result
        return None

    def flush(self) -> str | None:
        if self.buffer:
            result = "".join(self.buffer)
            self.buffer, self.size = [], 0
            return result
        return None

# `data` is any iterable source of strings
pipeline = PipelineBuilder().add_source(data).aggregate(SizeBasedAggregator(100)).build()
```
**API (backward compatible):**
- `Aggregate(3)` - Fixed-size batching (unchanged)
- `Aggregate(my_aggregator)` - Custom aggregation (new)
- `PipelineBuilder.aggregate()` - Same polymorphic behavior
**Implementation:**
- New `Aggregator` ABC in `defs/_defs.py`
- `_Batch` class implements `Aggregator` for default fixed-size batching
- `_AggregatorWrapper` adapts any `Aggregator` to internal pipe interface
- `Aggregate()` uses pattern matching to handle `int | Aggregator`