[pipeline] Add polymorphic Aggregate API for custom aggregation operations #1289
Merged
@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this in D94108462. (Because this pull request was imported automatically, there will not be any future comments.)
Adds support for custom aggregation logic in SPDL pipelines via a new `Aggregator` abstract base class.
**Background:**
Previously, `Aggregate()` only supported fixed-size batching (e.g., `aggregate(3)` to buffer 3 items). There was no way to implement custom aggregation logic like size-based batching, time-windowed aggregation, or conditional grouping.
**New Feature:**
The `Aggregator` ABC enables custom aggregation with two methods:
- `accumulate(item)` - Called for each item; return aggregated result when ready, or `None` to continue buffering
- `flush()` - Called at stream end to emit remaining buffered items
The `drop_last` parameter controls end-of-stream behavior:
- `drop_last=False` (default): Calls `flush()` to emit remaining items
- `drop_last=True`: Skips `flush()`, dropping incomplete batches
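To make the `accumulate()`/`flush()` contract and the `drop_last` semantics concrete, here is a toy driver loop together with a minimal fixed-size aggregator. This is an illustration only; the class and the driver are hypothetical stand-ins, not SPDL's actual internals.

```python
class FixedSize:
    """Toy stand-in following the Aggregator protocol (hypothetical)."""

    def __init__(self, n: int):
        self.n = n
        self.buf: list = []

    def accumulate(self, item):
        self.buf.append(item)
        if len(self.buf) == self.n:
            out, self.buf = self.buf, []
            return out
        return None

    def flush(self):
        out, self.buf = self.buf, []
        return out or None


def run_aggregator(items, aggregator, drop_last=False):
    """Illustrative driver: how a pipeline might call an aggregator."""
    results = []
    for item in items:
        out = aggregator.accumulate(item)
        if out is not None:  # the aggregator decided a batch is ready
            results.append(out)
    if not drop_last:  # emit whatever is still buffered at stream end
        out = aggregator.flush()
        if out is not None:
            results.append(out)
    return results


print(run_aggregator(range(5), FixedSize(2)))                  # [[0, 1], [2, 3], [4]]
print(run_aggregator(range(5), FixedSize(2), drop_last=True))  # [[0, 1], [2, 3]]
```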
**Example:**
```python
from spdl.pipeline import PipelineBuilder
from spdl.pipeline.defs import Aggregator

class SizeBasedAggregator(Aggregator):
    '''Emit when total string length exceeds threshold.'''

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.buffer: list[str] = []
        self.size = 0

    def accumulate(self, item: str) -> str | None:
        self.buffer.append(item)
        self.size += len(item)
        if self.size >= self.threshold:
            result = "".join(self.buffer)
            self.buffer, self.size = [], 0
            return result
        return None

    def flush(self) -> str | None:
        if self.buffer:
            result = "".join(self.buffer)
            self.buffer, self.size = [], 0
            return result
        return None

# `data` is any iterable source of strings
pipeline = PipelineBuilder().add_source(data).aggregate(SizeBasedAggregator(100)).build()
```
**API (backward compatible):**
- `Aggregate(3)` - Fixed-size batching (unchanged)
- `Aggregate(my_aggregator)` - Custom aggregation (new)
- `PipelineBuilder.aggregate()` - Same polymorphic behavior
**Implementation:**
- New `Aggregator` ABC in `defs/_defs.py`
- `_Batch` class implements `Aggregator` for default fixed-size batching
- `_AggregatorWrapper` adapts any `Aggregator` to internal pipe interface
- `Aggregate()` uses pattern matching to handle `int | Aggregator`