fix: validate pyarrow columns in stage input checks by nightcityblade · Pull Request #2156 · NVIDIA-NeMo/Curator

nightcityblade · 2026-07-02T15:08:18Z

Description

Teach ProcessingStage.validate_input() to recognize column names on table-like task data instead of relying only on hasattr().
Cover the regression with focused DocumentBatch tests using PyArrow tables.

Usage

import pyarrow as pa

from nemo_curator.tasks import DocumentBatch

# `stage` is assumed to be a ProcessingStage instance that requires a "text" column.
batch = DocumentBatch(task_id="batch", dataset_name="dataset", data=pa.table({"text": ["hello"]}))
assert stage.validate_input(batch)

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Validation run locally:

python3 -m ruff check nemo_curator/stages/base.py tests/stages/common/test_base.py
python3 -m pytest --noconftest tests/stages/common/test_base.py -q (blocked in this environment: pyarrow is not installed; the full test setup also requires ray via tests/conftest.py)

Signed-off-by: nightcityblade <nightcityblade@gmail.com>

copy-pr-bot · 2026-07-02T15:08:23Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-07-02T15:12:51Z

Greptile Summary

This PR fixes a regression in ProcessingStage.validate_input() where PyArrow tables were always treated as missing every required column, because the original check used only hasattr() (which returns False for PyArrow column names). A new _has_data_attr static method probes four distinct column-discovery strategies in priority order, and two focused DocumentBatch tests confirm the fix.

Introduces _has_data_attr that tries hasattr, then column_names (PyArrow), then columns (pandas/cuDF/Polars), then schema.names, with early-return semantics at each step.
Adds TestProcessingStageValidateInput covering both the positive (column present) and negative (column absent) PyArrow paths, though the pd.DataFrame path of DocumentBatch is not yet tested.

Confidence Score: 4/5

The change is a well-scoped fix to a single method that was silently misclassifying PyArrow columns; existing call sites are unaffected and the new code path is protected by two new tests.

The columns branch has a subtle ordering dependency that works today because PyArrow hits column_names first, but could silently mishandle a future type that exposes columns as array objects without column_names. The pd.DataFrame path of DocumentBatch is also untested in the new test class.

nemo_curator/stages/base.py lines 168-169 (the columns branch) and tests/stages/common/test_base.py (missing pandas coverage).

Important Files Changed

Filename	Overview
nemo_curator/stages/base.py	Adds `_has_data_attr` static method to `ProcessingStage` that recognises PyArrow `column_names`, pandas-style `columns`, and `schema.names` in addition to plain `hasattr`; fixes the regression where PyArrow table columns were always treated as missing.
tests/stages/common/test_base.py	Adds `TestProcessingStageValidateInput` with two focused tests covering PyArrow column presence/absence through `validate_input`; does not yet cover the `pd.DataFrame` path of `DocumentBatch`.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["_has_data_attr(data, attr)"] --> B{hasattr data attr?}
    B -- Yes --> C[return True]
    B -- No --> D{hasattr data.column_names?}
    D -- Yes --> E{"attr in data.column_names?"}
    E -- Yes --> F[return True]
    E -- No --> G[return False]
    D -- No --> H{hasattr data.columns?}
    H -- Yes --> I{"attr in data.columns?"}
    I -- Yes --> J[return True]
    I -- No --> K[return False]
    H -- No --> L{hasattr data.schema.names?}
    L -- Yes --> M{"attr in data.schema.names?"}
    M -- Yes --> N[return True]
    M -- No --> O[return False]
    L -- No --> P[return False]
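
The decision order in the flowchart can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from the PR: the function name `has_data_attr` and the `FakeArrowTable` stand-in are hypothetical, and the stand-in only mimics the two PyArrow attributes discussed here so the example runs without PyArrow installed.

```python
from typing import Any


def has_data_attr(data: Any, attr: str) -> bool:
    """Return True if `attr` is an attribute or a column name of `data`."""
    # 1. Plain attribute lookup (the behavior before the fix).
    if hasattr(data, attr):
        return True
    # 2. PyArrow-style tables expose string names via `column_names`.
    if hasattr(data, "column_names"):
        return attr in data.column_names
    # 3. pandas-style frames expose labels via `columns`.
    if hasattr(data, "columns"):
        return attr in data.columns
    # 4. Last resort: a schema object that carries `names`.
    schema = getattr(data, "schema", None)
    if schema is not None and hasattr(schema, "names"):
        return attr in schema.names
    return False


class FakeArrowTable:
    """Hypothetical stand-in for a PyArrow-like table (not the real class)."""

    def __init__(self, names: list) -> None:
        self.column_names = list(names)
        # Mimics pa.Table.columns: array objects rather than strings.
        self.columns = [object() for _ in names]


table = FakeArrowTable(["text"])
print(has_data_attr(table, "text"))     # True
print(has_data_attr(table, "missing"))  # False
```

Each branch returns as soon as it finds a usable name source, so a later branch is consulted only when the earlier attribute is absent altogether.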


greptile-apps · 2026-07-02T15:12:55Z

+        if hasattr(data, "columns"):
+            return attr in data.columns


columns branch is ambiguous for PyArrow-like types

For pa.Table, data.columns returns a list of ChunkedArray objects, not strings, so attr in data.columns would compare the string against array objects rather than names and would not match a plain string attribute name. This branch works today only because pa.Table also exposes column_names, which is checked first. If another table type (e.g. a future columnar format) has columns returning arrays but lacks column_names, this branch would silently misreport all columns as missing. A brief inline comment or guard (e.g. checking that the first element is a string) would make the intent and the ordering dependency explicit.
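
One way to express the suggested guard is sketched below. It is an illustration only, not the code under review: `string_column_labels`, `has_column`, and the `SimpleNamespace` stand-ins are hypothetical names. The sketch also assumes string labels, so a frame with integer column labels would be skipped by this guard.

```python
from types import SimpleNamespace
from typing import Any, Optional


def string_column_labels(data: Any) -> Optional[list]:
    """Return `data.columns` as a list of labels, or None if unusable."""
    columns = getattr(data, "columns", None)
    if columns is None:
        return None
    labels = list(columns)
    # Guard: a non-string first entry suggests array objects, not names.
    if labels and not isinstance(labels[0], str):
        return None
    return labels


def has_column(data: Any, attr: str) -> bool:
    labels = string_column_labels(data)
    if labels is not None:
        return attr in labels
    # Fall through to schema names instead of reporting the column missing.
    schema = getattr(data, "schema", None)
    return attr in getattr(schema, "names", ())


# Hypothetical stand-ins: one with string labels, one with array-like columns.
frame_like = SimpleNamespace(columns=["text"])
array_like = SimpleNamespace(
    columns=[object()],
    schema=SimpleNamespace(names=["text"]),
)

print(has_column(frame_like, "text"))   # True
print(has_column(array_like, "text"))   # True
print(has_column(array_like, "other"))  # False
```

With the guard in place, a type whose `columns` holds arrays no longer depends on `column_names` being checked first.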

greptile-apps · 2026-07-02T15:12:56Z

+class TestProcessingStageValidateInput:
+    def test_validate_input_accepts_pyarrow_columns(self):
+        stage = RequiredColumnStage()
+        batch = DocumentBatch(
+            task_id="batch",
+            dataset_name="dataset",
+            data=pa.table({"text": ["hello"]}),
+        )
+
+        assert stage.validate_input(batch) is True
+
+    def test_validate_input_rejects_missing_pyarrow_columns(self):
+        stage = RequiredColumnStage()
+        batch = DocumentBatch(
+            task_id="batch",
+            dataset_name="dataset",
+            data=pa.table({"other": ["hello"]}),
+        )
+
+        assert stage.validate_input(batch) is False


Missing pandas DataFrame coverage

DocumentBatch.data is typed as pa.Table | pd.DataFrame, and pd.DataFrame takes a different code path through _has_data_attr: it lacks column_names, so a present column is typically resolved by the initial hasattr check (pandas exposes columns as attributes), while a missing column falls through to the columns branch. The new test class only exercises the PyArrow path. Adding parallel test_validate_input_accepts_pandas_columns / test_validate_input_rejects_missing_pandas_columns cases would ensure that both union branches stay correct and guard against a future regression in the columns fallback.
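
The shape of such parallel tests is sketched below against a local copy of the lookup order, so it runs without pandas or nemo_curator. Everything here is an assumption for illustration: `has_data_attr` is a hypothetical re-implementation, and `FakeFrame` is a stand-in that mimics only two pandas behaviors, labels in `columns` and attribute-style access to columns. Real tests would build a DocumentBatch from a pd.DataFrame and call validate_input.

```python
from typing import Any


def has_data_attr(data: Any, attr: str) -> bool:
    """Hypothetical local copy of the lookup order described in the review."""
    if hasattr(data, attr):
        return True
    if hasattr(data, "column_names"):
        return attr in data.column_names
    if hasattr(data, "columns"):
        return attr in data.columns
    schema = getattr(data, "schema", None)
    return attr in getattr(schema, "names", ())


class FakeFrame:
    """Hypothetical pandas-like frame (not the real pd.DataFrame)."""

    def __init__(self, columns: list) -> None:
        self.columns = list(columns)

    def __getattr__(self, name: str) -> Any:
        # Runs only when normal lookup fails; mimics column-as-attribute access.
        if name in self.__dict__.get("columns", ()):
            return name
        raise AttributeError(name)


def test_accepts_frame_columns() -> None:
    # Present column: resolved by the first hasattr check.
    assert has_data_attr(FakeFrame(["text"]), "text") is True


def test_rejects_missing_frame_columns() -> None:
    # Missing column: hasattr fails, then the `columns` branch returns False.
    assert has_data_attr(FakeFrame(["other"]), "text") is False


test_accepts_frame_columns()
test_rejects_missing_frame_columns()
```

The accepting and rejecting cases exercise different branches, which is why both are worth covering.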


fix: validate pyarrow columns in stage input checks

2eab4be

Signed-off-by: nightcityblade <nightcityblade@gmail.com>

nightcityblade requested a review from a team as a code owner July 2, 2026 15:08

nightcityblade requested review from oyilmaz-nvidia and removed request for a team July 2, 2026 15:08

github-actions Bot added the community-request label Jul 2, 2026

greptile-apps Bot reviewed Jul 2, 2026

nightcityblade wants to merge 1 commit into NVIDIA-NeMo:main from nightcityblade:fix/issue-2151
