fix: auto-detect Ray fanout stages#2025
Conversation
Greptile SummaryThis PR auto-detects Ray Data fanout stages by inspecting the
Confidence Score: 5/5Safe to merge — all five manual ray_stage_spec overrides are covered by auto-detection, the previously incorrect bare-object early return in ClipTranscodingStage is fixed, and the string-annotation fallback is regression-tested. The auto-detection logic is narrow and conservative: it fires only for bare list[T] return annotations (not unions), which matches every removed override exactly. The ClipTranscodingStage early-exit bug is corrected and tested. The get_type_hints fallback for unresolvable string annotations is covered by a dedicated regression test. No existing behaviour is changed for stages that don't return lists. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["ray_stage_spec() called"] --> B["_process_returns_list()"]
B --> C{"get_type_hints(cls.process)\nsucceeds?"}
C -- Yes --> D["return_annotation = hints['return']"]
C -- "No (NameError/TypeError)" --> E["return_annotation = cls.process.__annotations__['return']"]
D --> F["_annotation_includes_list(annotation)"]
E --> F
F --> G{"isinstance(annotation, str)?"}
G -- Yes --> H{"annotation.strip().startswith\n('list[', 'List[', 'typing.List[')"}
G -- No --> I{"get_origin(annotation) is list?"}
H -- True --> J["return True → {'is_fanout_stage': True}"]
H -- False --> K["return False → {}"]
I -- True --> J
I -- False --> K
Reviews (5): Last reviewed commit: "fix: return clip fanout tasks consistent..." | Re-trigger Greptile |
| origin = get_origin(annotation) | ||
| if origin is list: | ||
| return True | ||
| if origin in (UnionType, Union) or isinstance(annotation, UnionType): |
There was a problem hiding this comment.
The
isinstance(annotation, UnionType) branch is unreachable: for any X | Y expression, get_origin(annotation) returns types.UnionType, so origin in (UnionType, Union) is already True before isinstance is evaluated. The isinstance guard can be removed without changing behaviour.
| if origin in (UnionType, Union) or isinstance(annotation, UnionType): | |
| if origin in (UnionType, Union): |
There was a problem hiding this comment.
Addressed in 947e029: removed the unreachable isinstance(annotation, UnionType) branch while refining the detection logic.
| try: | ||
| return_annotation = get_type_hints(cls.process).get("return") | ||
| except (NameError, TypeError): | ||
| return_annotation = cls.process.__annotations__.get("return") |
There was a problem hiding this comment.
String annotation fallback silently loses type information
When get_type_hints() raises NameError or TypeError (e.g. when from __future__ import annotations is active and a referenced type is not importable at inspection time), the fallback reads cls.process.__annotations__["return"] which is a raw string such as "list[OutputTask]". get_origin("list[OutputTask]") returns None, so _annotation_includes_list returns False and the stage is not auto-detected as a fanout stage even though it is one. The failure is silent — ray_stage_spec() simply returns {}. Stages that hit this path would need to manually override ray_stage_spec() as before.
There was a problem hiding this comment.
Addressed in 947e029: the fallback now handles raw string annotations like "list[OutputTask]" so list-returning stages are still detected when get_type_hints() cannot resolve the annotation. I added a regression test that forces that fallback path.
sarahyurick
left a comment
There was a problem hiding this comment.
Thanks @nightcityblade !
| @@ -77,11 +77,6 @@ def process(self, task: _EmptyTask) -> list[FileGroupTask]: | |||
| for i, url in enumerate(urls) | |||
| ] | |||
|
|
|||
| def ray_stage_spec(self) -> dict[str, Any]: | |||
| return { | |||
| "is_fanout_stage": True, | |||
There was a problem hiding this comment.
Can you do this for all stages to make sure that this PR works for all of them? And ensure that each existing stage has a pytest to check that it is being set?
There was a problem hiding this comment.
Thanks — I broadened the coverage beyond URLGenerationStage in the follow-up commit. Existing list-returning stages now have pytest coverage for the Ray fanout spec in common/text/audio/image tests (URLGenerationStage and FilePartitioning already had assertions; I added coverage for ALMManifestReaderStage, CreateInitialManifestFleursStage, AudioToDocumentStage, and ImageReaderStage). I also kept existing explicit ray_stage_spec overrides intact.
There was a problem hiding this comment.
Thanks, I see some test edits but let's remove the ray_stage_spec function from the actual stages and make sure the tests still pass. The ones I found:
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/text/download/base/url_generation.py (this file)
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/image/io/image_reader.py
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/audio/datasets/fleurs/create_initial_manifest.py
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/audio/common.py
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/video/clipping/clip_extraction_stages.py
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/audio/alm/pretrain/extraction.py
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/audio/segmentation/speaker_separation.py
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/file_partitioning.py
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/audio/datasets/readspeech/create_initial_manifest.py
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/audio/alm/pretrain/io.py
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/deduplication/semantic/pairwise_io.py
Do all of these work with this PR? Are there any that should be considered exceptions/explicitly keep ray_stage_spec for some reason?
|
|
||
| name = "MaybeFanoutProcessingStage" | ||
|
|
||
| def process(self, task: MockTask) -> MockTask | list[MockTask]: |
There was a problem hiding this comment.
I am undecided what should happen in this case. Maybe it should be up to the user?
There was a problem hiding this comment.
Good point. I changed the inference to be conservative: Task | list[Task] is no longer auto-marked as a Ray fanout stage. If a maybe-fanout stage wants Ray fanout behavior, it can opt in by overriding ray_stage_spec(). I added tests for both the conservative default and explicit opt-in.
|
Thanks, I pushed a focused follow-up in ec6a75f. What changed:
I intentionally did not remove the clip extraction override because that stage's Validation:
|
4e4581c to
ec6a75f
Compare
|
Followed up on the fanout-inference review and pushed one more cleanup commit. What changed:
On the remaining exceptions question: after this change, the stage-specific overrides I found are the ones that still carry non-fanout Ray behavior/config (for example actor-stage / backend-specific settings), rather than plain Validation:
I also tried to run the targeted pytest files, but the local environment here is missing |
|
Thanks, I pushed a small follow-up in effa59a to address the Greptile finding on the clip fanout contract.\n\nWhat changed:\n- Updated the no-clips early-return path in |
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
effa59a to
073b5cb
Compare
|
I refreshed the branch sign-offs and opened a replacement PR here: #2056. This branch now has the same focused 13-file diff with all local commits signed off. |
Description
Closes #1613.
Automatically marks Ray Data stages as fanout stages when their
processreturn annotation islist[...]or a union that includeslist[...]. This lets stages likeURLGenerationStagerely on the baseProcessingStagedefault instead of manually settingis_fanout_stage.Usage
Checklist
Tests:
uv run --python 3.12 ruff check nemo_curator/stages/base.py nemo_curator/stages/text/download/base/url_generation.py tests/stages/common/test_base.pyuv run --python 3.12 --with pytest --with ray --with loguru --with pandas --with pyarrow --with fsspec python -m pytest tests/stages/common/test_base.py tests/stages/text/download/base/test_url_generation.py(blocked on macOS by NeMo-Curator's Linux-only runtime check)