fix: validate pyarrow columns in stage input checks#2156
Conversation
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Greptile SummaryThis PR fixes a regression in
Confidence Score: 4/5The change is a well-scoped fix to a single method that was silently misclassifying PyArrow columns; existing call sites are unaffected and the new code path is protected by two new tests. The
Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["_has_data_attr(data, attr)"] --> B{hasattr data attr?}
B -- Yes --> C[return True]
B -- No --> D{hasattr data.column_names?}
D -- Yes --> E{"attr in data.column_names?"}
E -- Yes --> F[return True]
E -- No --> G[return False]
D -- No --> H{hasattr data.columns?}
H -- Yes --> I{"attr in data.columns?"}
I -- Yes --> J[return True]
I -- No --> K[return False]
H -- No --> L{hasattr data.schema.names?}
L -- Yes --> M{"attr in data.schema.names?"}
M -- Yes --> N[return True]
M -- No --> O[return False]
L -- No --> P[return False]
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A["_has_data_attr(data, attr)"] --> B{hasattr data attr?}
B -- Yes --> C[return True]
B -- No --> D{hasattr data.column_names?}
D -- Yes --> E{"attr in data.column_names?"}
E -- Yes --> F[return True]
E -- No --> G[return False]
D -- No --> H{hasattr data.columns?}
H -- Yes --> I{"attr in data.columns?"}
I -- Yes --> J[return True]
I -- No --> K[return False]
H -- No --> L{hasattr data.schema.names?}
L -- Yes --> M{"attr in data.schema.names?"}
M -- Yes --> N[return True]
M -- No --> O[return False]
L -- No --> P[return False]
Reviews (1): Last reviewed commit: "fix: validate pyarrow columns in stage i..." | Re-trigger Greptile |
| if hasattr(data, "columns"): | ||
| return attr in data.columns |
There was a problem hiding this comment.
columns branch is ambiguous for PyArrow-like types
For pa.Table, data.columns returns a list of ChunkedArray objects, not strings — so attr in data.columns would perform object-identity comparison and always return False for a plain string attribute name. This branch works today only because pa.Table also exposes column_names, which is checked first. If another table type (e.g. a future columnar format) has columns returning arrays but lacks column_names, this branch would silently misreport all columns as missing. A brief inline comment or guard (e.g. checking that the first element is a string) would make the intent and the ordering dependency explicit.
| class TestProcessingStageValidateInput: | ||
| def test_validate_input_accepts_pyarrow_columns(self): | ||
| stage = RequiredColumnStage() | ||
| batch = DocumentBatch( | ||
| task_id="batch", | ||
| dataset_name="dataset", | ||
| data=pa.table({"text": ["hello"]}), | ||
| ) | ||
|
|
||
| assert stage.validate_input(batch) is True | ||
|
|
||
| def test_validate_input_rejects_missing_pyarrow_columns(self): | ||
| stage = RequiredColumnStage() | ||
| batch = DocumentBatch( | ||
| task_id="batch", | ||
| dataset_name="dataset", | ||
| data=pa.table({"other": ["hello"]}), | ||
| ) | ||
|
|
||
| assert stage.validate_input(batch) is False |
There was a problem hiding this comment.
Missing pandas DataFrame coverage
DocumentBatch.data is typed as pa.Table | pd.DataFrame, and pd.DataFrame takes a different code path through _has_data_attr (it lacks column_names, so it falls through to the columns branch). The new test class only exercises the PyArrow path. Adding parallel test_validate_input_accepts_pandas_columns / test_validate_input_rejects_missing_pandas_columns cases would ensure that both union branches stay correct and guard against a future regression in the columns fallback.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Description
closes #2151
ProcessingStage.validate_input()to recognize column names on table-like task data instead of relying only onhasattr()DocumentBatchtests using PyArrow tablesUsage
Checklist
Validation run locally:
python3 -m ruff check nemo_curator/stages/base.py tests/stages/common/test_base.pypython3 -m pytest --noconftest tests/stages/common/test_base.py -q(blocked in this environment:pyarrowis not installed; the full test setup also requiresrayviatests/conftest.py)