fix: validate pyarrow columns in stage input checks #2156

Open
nightcityblade wants to merge 1 commit into NVIDIA-NeMo:main from nightcityblade:fix/issue-2151

@nightcityblade

Contributor

Description

closes #2151

  • teach ProcessingStage.validate_input() to recognize column names on table-like task data instead of relying only on hasattr()
  • cover the regression with focused DocumentBatch tests using PyArrow tables

Usage

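import pyarrow as pa

from nemo_curator.tasks import DocumentBatch

# `stage` is assumed to be an existing ProcessingStage instance whose required input columns include "text".
batch = DocumentBatch(task_id="batch", dataset_name="dataset", data=pa.table({"text": ["hello"]}))
assert stage.validate_input(batch)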

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Validation run locally:

  • python3 -m ruff check nemo_curator/stages/base.py tests/stages/common/test_base.py
  • python3 -m pytest --noconftest tests/stages/common/test_base.py -q (blocked in this environment: pyarrow is not installed; the full test setup also requires ray via tests/conftest.py)

Signed-off-by: nightcityblade <nightcityblade@gmail.com>
@nightcityblade nightcityblade requested a review from a team as a code owner July 2, 2026 15:08
@nightcityblade nightcityblade requested review from oyilmaz-nvidia and removed request for a team July 2, 2026 15:08
@copy-pr-bot

copy-pr-bot Bot commented Jul 2, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jul 2, 2026


Greptile Summary

This PR fixes a regression in ProcessingStage.validate_input() where PyArrow tables were always treated as missing every required column, because the original check used only hasattr() (which returns False for PyArrow column names). A new _has_data_attr static method probes four distinct column-discovery strategies in priority order, and two focused DocumentBatch tests confirm the fix.

  • Introduces _has_data_attr that tries hasattr, then column_names (PyArrow), then columns (pandas/cuDF/Polars), then schema.names, with early-return semantics at each step.
  • Adds TestProcessingStageValidateInput covering both the positive (column present) and negative (column absent) PyArrow paths, though the pd.DataFrame path of DocumentBatch is not yet tested.
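
The lookup order summarized above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `has_data_attr` and the two stand-in classes `ArrowLikeTable` and `FrameLike` are hypothetical, chosen so the example runs without PyArrow or pandas installed.

```python
from typing import Any, List


def has_data_attr(data: Any, attr: str) -> bool:
    """Sketch of the four-step lookup described in the summary (assumed shape)."""
    # 1. Plain attribute access, which was the only check before the fix.
    if hasattr(data, attr):
        return True
    # 2. PyArrow-style tables list their column names in `column_names`.
    if hasattr(data, "column_names"):
        return attr in data.column_names
    # 3. pandas-style frames list their column names in `columns`.
    if hasattr(data, "columns"):
        return attr in data.columns
    # 4. Last resort: a schema object that carries `names`.
    schema = getattr(data, "schema", None)
    if schema is not None and hasattr(schema, "names"):
        return attr in schema.names
    return False


class ArrowLikeTable:
    """Hypothetical stand-in: names are exposed only through `column_names`."""

    def __init__(self, names: List[str]) -> None:
        self.column_names = names


class FrameLike:
    """Hypothetical stand-in: names are exposed only through `columns`."""

    def __init__(self, names: List[str]) -> None:
        self.columns = names


print(has_data_attr(ArrowLikeTable(["text"]), "text"))
print(has_data_attr(FrameLike(["other"]), "text"))
```

Each branch returns as soon as it finds a name source, so a type that exposes several of these attributes is decided by the first one in the list.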

Confidence Score: 4/5

The change is a well-scoped fix to a single method that was silently misclassifying PyArrow columns; existing call sites are unaffected and the new code path is protected by two new tests.

The columns branch has a subtle ordering dependency that works today because PyArrow hits column_names first, but could silently mishandle a future type that exposes columns as array objects without column_names. The pd.DataFrame path of DocumentBatch is also untested in the new test class.

nemo_curator/stages/base.py lines 168-169 (the columns branch) and tests/stages/common/test_base.py (missing pandas coverage).

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_curator/stages/base.py | Adds _has_data_attr static method to ProcessingStage that recognises PyArrow column_names, pandas-style columns, and schema.names in addition to plain hasattr; fixes the regression where PyArrow table columns were always treated as missing. |
| tests/stages/common/test_base.py | Adds TestProcessingStageValidateInput with two focused tests covering PyArrow column presence/absence through validate_input; does not yet cover the pd.DataFrame path of DocumentBatch. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["_has_data_attr(data, attr)"] --> B{hasattr data attr?}
    B -- Yes --> C[return True]
    B -- No --> D{hasattr data.column_names?}
    D -- Yes --> E{"attr in data.column_names?"}
    E -- Yes --> F[return True]
    E -- No --> G[return False]
    D -- No --> H{hasattr data.columns?}
    H -- Yes --> I{"attr in data.columns?"}
    I -- Yes --> J[return True]
    I -- No --> K[return False]
    H -- No --> L{hasattr data.schema.names?}
    L -- Yes --> M{"attr in data.schema.names?"}
    M -- Yes --> N[return True]
    M -- No --> O[return False]
    L -- No --> P[return False]

Reviews (1): Last reviewed commit: "fix: validate pyarrow columns in stage i..." | Re-trigger Greptile

Comment on lines +168 to +169
if hasattr(data, "columns"):
    return attr in data.columns


P2 columns branch is ambiguous for PyArrow-like types

For pa.Table, data.columns returns a list of ChunkedArray objects, not strings, so attr in data.columns would compare the string against array objects rather than column names and would not match a plain string attribute name. This branch works today only because pa.Table also exposes column_names, which is checked first. If another table type (e.g. a future columnar format) has columns returning arrays but lacks column_names, this branch would silently misreport all columns as missing. A brief inline comment or guard (e.g. checking that the first element is a string) would make the intent and the ordering dependency explicit.
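
One way the suggested guard might look is sketched below. It is an assumption-laden illustration rather than a proposed patch: `names_from_columns`, `has_column`, and `ArrayColumnsTable` are hypothetical names, and the stand-in table only mimics a type whose `columns` holds array objects while its names live in `schema.names`. The sketch checks every entry rather than just the first one.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Optional


def names_from_columns(columns: Iterable[Any]) -> Optional[List[str]]:
    """Return the entries as names only if every entry is a string."""
    entries = list(columns)
    if all(isinstance(entry, str) for entry in entries):
        return entries
    return None


def has_column(data: Any, attr: str) -> bool:
    """Guarded `columns` branch with a fall-through to `schema.names`."""
    if hasattr(data, "columns"):
        names = names_from_columns(data.columns)
        if names is not None:
            # `columns` really holds names, so membership is meaningful.
            return attr in names
        # Otherwise `columns` holds array objects; keep looking elsewhere.
    schema = getattr(data, "schema", None)
    if schema is not None and hasattr(schema, "names"):
        return attr in schema.names
    return False


class ArrayColumnsTable:
    """Hypothetical table: `columns` holds arrays, names live in the schema."""

    def __init__(self, names: List[str]) -> None:
        self.columns = [object() for _ in names]
        self.schema = SimpleNamespace(names=list(names))


table = ArrayColumnsTable(["text"])
print("text" in table.columns)    # unguarded membership test misses the column
print(has_column(table, "text"))  # guarded version falls through to the schema
```

A string-only guard like this would also skip frames whose column labels are not strings, so it is a trade-off rather than a drop-in fix.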

Comment on lines +279 to +298
class TestProcessingStageValidateInput:
    def test_validate_input_accepts_pyarrow_columns(self):
        stage = RequiredColumnStage()
        batch = DocumentBatch(
            task_id="batch",
            dataset_name="dataset",
            data=pa.table({"text": ["hello"]}),
        )

        assert stage.validate_input(batch) is True

    def test_validate_input_rejects_missing_pyarrow_columns(self):
        stage = RequiredColumnStage()
        batch = DocumentBatch(
            task_id="batch",
            dataset_name="dataset",
            data=pa.table({"other": ["hello"]}),
        )

        assert stage.validate_input(batch) is False


P2 Missing pandas DataFrame coverage

DocumentBatch.data is typed as pa.Table | pd.DataFrame, and pd.DataFrame takes a different code path through _has_data_attr: it lacks column_names, so whenever the initial hasattr check does not match (for example, when the column is missing) it falls through to the columns branch. The new test class only exercises the PyArrow path. Adding parallel test_validate_input_accepts_pandas_columns / test_validate_input_rejects_missing_pandas_columns cases would ensure that both union branches stay correct and guard against a future regression in the columns fallback.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!


Development

Successfully merging this pull request may close these issues.

ProcessingStage input validation fails when the input task is DocumentBatch and the data type is pyarrow
