fix: auto-detect Ray fanout stages#2056
Conversation
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Greptile SummaryThis PR eliminates boilerplate
Confidence Score: 5/5The change is safe to merge. The auto-detection is purely additive inference from the return annotation, and any stage that needs different behaviour can still override All affected stages already had No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["ray_stage_spec() called"] --> B["_process_returns_list()"]
B --> C{"get_type_hints(cls.process) succeeds?"}
C -- "Yes" --> D["return_annotation = hints['return']"]
C -- "No (NameError / TypeError / AttributeError)" --> E["return_annotation = cls.process.__annotations__['return']"]
D --> F["_annotation_includes_list(annotation)"]
E --> F
F --> G{"annotation is str?"}
G -- "Yes" --> H{"starts with 'list[' / 'List[' / 'typing.List['?"}
G -- "No" --> I{"get_origin(annotation) is list?"}
H -- "True" --> J["returns True"]
H -- "False" --> K["returns False"]
I -- "True" --> J
I -- "False" --> K
J --> L["ray_stage_spec returns {is_fanout_stage: True}"]
K --> M["ray_stage_spec returns {}"]
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A["ray_stage_spec() called"] --> B["_process_returns_list()"]
B --> C{"get_type_hints(cls.process) succeeds?"}
C -- "Yes" --> D["return_annotation = hints['return']"]
C -- "No (NameError / TypeError / AttributeError)" --> E["return_annotation = cls.process.__annotations__['return']"]
D --> F["_annotation_includes_list(annotation)"]
E --> F
F --> G{"annotation is str?"}
G -- "Yes" --> H{"starts with 'list[' / 'List[' / 'typing.List['?"}
G -- "No" --> I{"get_origin(annotation) is list?"}
H -- "True" --> J["returns True"]
H -- "False" --> K["returns False"]
I -- "True" --> J
I -- "False" --> K
J --> L["ray_stage_spec returns {is_fanout_stage: True}"]
K --> M["ray_stage_spec returns {}"]
Reviews (2): Last reviewed commit: "Merge remote-tracking branch 'upstream/m..." | Re-trigger Greptile |
| try: | ||
| return_annotation = get_type_hints(cls.process).get("return") | ||
| except (NameError, TypeError): | ||
| return_annotation = cls.process.__annotations__.get("return") |
There was a problem hiding this comment.
Adding
AttributeError to the except tuple ensures the fallback path is reached when get_type_hints encounters a dotted-attribute annotation referencing a missing symbol (e.g. some_module.MissingType). Without it, the exception propagates out of ray_stage_spec().
| try: | |
| return_annotation = get_type_hints(cls.process).get("return") | |
| except (NameError, TypeError): | |
| return_annotation = cls.process.__annotations__.get("return") | |
| try: | |
| return_annotation = get_type_hints(cls.process).get("return") | |
| except (NameError, TypeError, AttributeError): | |
| return_annotation = cls.process.__annotations__.get("return") |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
|
Hi @nightcityblade can you resolve the merge conflicts here? |
…-fanout # Conflicts: # nemo_curator/stages/audio/alm/alm_manifest_reader.py # nemo_curator/stages/base.py # nemo_curator/stages/deduplication/semantic/pairwise_io.py # nemo_curator/stages/file_partitioning.py # nemo_curator/stages/video/clipping/clip_extraction_stages.py # tests/stages/audio/alm/test_alm_manifest_reader.py # tests/stages/audio/io/test_convert.py # tests/stages/image/io/test_image_reader.py
|
Resolved the merge conflicts by merging the latest Also addressed the Greptile review feedback by catching Validation:
|
|
Hi @nightcityblade can you check if any of the changes in #2086 are relevant for this PR? Thanks in advance! |
Description
Closes #1613.
Automatically marks Ray Data stages as fanout stages when their
processreturn annotation is a concretelist[...]. This lets stages likeURLGenerationStagerely on the baseProcessingStagedefault instead of manually settingis_fanout_stage.Supersedes #2025 with the same branch after refreshing the DCO sign-offs.
Usage
Checklist
Tests:
python3 -m compileall nemo_curator/stages/base.py nemo_curator/stages/video/clipping/clip_extraction_stages.py tests/stages/video/clipping/test_clip_transcoding_stage.py