Skip to content

Add option to drop deduplication id field#2078

Open
nightcityblade wants to merge 3 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1580
Open

Add option to drop deduplication id field#2078
nightcityblade wants to merge 3 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1580

Conversation

@nightcityblade

Copy link
Copy Markdown
Contributor

Description

Closes #1580.

Adds an optional drop_id_field flag to the text duplicate removal stage/workflow so generated deduplication IDs can be removed from final outputs. The semantic deduplication workflow enables it by default when using generated IDs and no explicit output_fields are provided.

Usage

workflow = TextDuplicatesRemovalWorkflow(
    input_path="input",
    ids_to_remove_path="duplicates",
    output_path="output",
    drop_id_field=True,
)

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Testing

  • uv run ruff check nemo_curator/stages/text/deduplication/removal.py nemo_curator/stages/text/deduplication/removal_workflow.py nemo_curator/stages/text/deduplication/semantic.py tests/stages/text/deduplication/test_removal_workflow.py
  • uv run pytest tests/stages/text/deduplication/test_removal_workflow.py -q (blocked on macOS: ValueError: NeMo-Curator currently only supports Linux systems, while the current machine has a darwin system.)

Signed-off-by: nightcityblade <nightcityblade@gmail.com>
@nightcityblade nightcityblade requested a review from a team as a code owner June 16, 2026 03:16
@nightcityblade nightcityblade requested review from sarahyurick and removed request for a team June 16, 2026 03:16
@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds an optional drop_id_field flag to TextDuplicatesRemovalStage and TextDuplicatesRemovalWorkflow, and enables it automatically in TextSemanticDeduplicationWorkflow when the ID generator is used without a custom output_fields list — preventing the generated _curator_dedup_id from leaking into final outputs.

  • removal.py: Column is dropped immediately after the duplicate-filter step in process(), before the output DocumentBatch is constructed — ordering is correct.
  • removal_workflow.py: A __post_init__ guard raises ValueError when drop_id_field=True conflicts with the user explicitly listing id_field in output_fields, surfacing the misconfiguration early.
  • semantic.py: Uses self.use_id_generator and self.output_fields is None — safe guard that avoids a double-drop scenario when a caller sets output_fields explicitly.

Confidence Score: 5/5

Safe to merge — the change is narrowly scoped, defaults to off, and the new auto-enable path in the semantic workflow is guarded by two conditions.

The feature defaults to False, so existing callers are unaffected. The only new runtime path (dropping a column) runs after the id-based filter and before the writer, which is the correct position. The conflict validation catches the one dangerous misconfiguration early with a clear message. Tests cover the stage directly, the conflict guard, and the wiring through the workflow.

No files require special attention.

Important Files Changed

Filename Overview
nemo_curator/stages/text/deduplication/removal.py Adds drop_id_field: bool = False to TextDuplicatesRemovalStage and drops the column post-filter in process(). Logic is correct and ordering is safe (drop occurs after the id-based filter).
nemo_curator/stages/text/deduplication/removal_workflow.py Adds drop_id_field to the workflow dataclass and wires it through to the removal stage. Includes a post_init guard that raises ValueError when drop_id_field=True conflicts with id_field being explicitly listed in output_fields.
nemo_curator/stages/text/deduplication/semantic.py Wires drop_id_field=self.use_id_generator and self.output_fields is None into TextDuplicatesRemovalWorkflow — correct guard: only auto-drops when the ID was machine-generated and no custom output schema is set.
tests/stages/text/deduplication/test_removal_workflow.py Adds a stage-level unit test verifying the column is dropped and absent from output, a conflict-validation test, and assertions in the existing stage-wiring tests. Coverage is appropriate.
tutorials/text/deduplication/semantic/semantic_e2e.ipynb Updates a single Markdown cell in the tutorial notebook to document the new automatic drop-on-generate-ID behavior.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[TextSemanticDeduplicationWorkflow] -->|use_id_generator AND output_fields is None| B{drop_id_field = True}
    A -->|otherwise| C{drop_id_field = False}

    B --> D[TextDuplicatesRemovalWorkflow]
    C --> D

    D -->|__post_init__ check| E{drop_id_field=True AND id_field in output_fields?}
    E -->|Yes| F[raise ValueError]
    E -->|No| G[_generate_stages]

    G --> H[FilePartitioningStage]
    H --> I[ReaderStage]
    I --> J[TextDuplicatesRemovalStage]
    J -->|drop_id_field=True| K[df.drop columns id_field]
    K --> L[WriterStage]
    J -->|drop_id_field=False| L
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[TextSemanticDeduplicationWorkflow] -->|use_id_generator AND output_fields is None| B{drop_id_field = True}
    A -->|otherwise| C{drop_id_field = False}

    B --> D[TextDuplicatesRemovalWorkflow]
    C --> D

    D -->|__post_init__ check| E{drop_id_field=True AND id_field in output_fields?}
    E -->|Yes| F[raise ValueError]
    E -->|No| G[_generate_stages]

    G --> H[FilePartitioningStage]
    H --> I[ReaderStage]
    I --> J[TextDuplicatesRemovalStage]
    J -->|drop_id_field=True| K[df.drop columns id_field]
    K --> L[WriterStage]
    J -->|drop_id_field=False| L
Loading

Reviews (3): Last reviewed commit: "docs: align semantic dedup tutorial outp..." | Re-trigger Greptile

output_kwargs: dict[str, Any] | None = None
output_fields: list[str] | None = None
output_mode: Literal["ignore", "overwrite", "append", "error"] | None = None
drop_id_field: bool = False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Conflicting drop_id_field + output_fields not validated

When a caller sets drop_id_field=True and also includes id_field in output_fields, the removal stage will have already dropped that column by the time the writer stage tries to select it, producing a KeyError at runtime. The semantic workflow avoids this with an explicit self.output_fields is None guard, but the base TextDuplicatesRemovalWorkflow has no equivalent check. Adding a __post_init__ guard (after the id_generator warning) like if self.drop_id_field and self.output_fields and self.id_field in self.output_fields: raise ValueError(...) would surface this misconfiguration early with a clear message.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I think this is decent feedback. WDYT @nightcityblade ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — I added an early __post_init__ validation that raises a clear ValueError when drop_id_field=True conflicts with output_fields containing the id field, plus a focused unit test for that configuration.

Pushed: nightcityblade/Curator@2dbe71d
Validation: uv run ruff check nemo_curator/stages/text/deduplication/removal_workflow.py tests/stages/text/deduplication/test_removal_workflow.py and python3 -m py_compile ... passed locally. Targeted pytest collection is blocked on this macOS host by Curator’s import-time Linux-only guard.

output_kwargs: dict[str, Any] | None = None
output_fields: list[str] | None = None
output_mode: Literal["ignore", "overwrite", "append", "error"] | None = None
drop_id_field: bool = False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I think this is decent feedback. WDYT @nightcityblade ?


result = stage.process(task).to_pandas()

assert result.to_dict(orient="list") == {"text": ["keep"]}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misread this test at first and was confused why we were checking that this text was kept. Can you add another assertion that explicitly checks that CURATOR_DEDUP_ID_STR is not in the result?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added — the test now explicitly asserts that CURATOR_DEDUP_ID_STR is absent from the result columns, in addition to checking the remaining row contents.

Pushed in nightcityblade/Curator@2dbe71d.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label Jun 16, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci added waiting-on-maintainers Waiting on maintainers to respond and removed waiting-on-customer Waiting on the original author to respond labels Jun 17, 2026
@@ -374,6 +369,7 @@ def _run_duplicate_removal(self, executor: BaseExecutor) -> WorkflowRunResult |
output_kwargs=self.write_kwargs,
output_fields=self.output_fields,
output_mode="ignore",
drop_id_field=self.use_id_generator and self.output_fields is None,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I am on the fence about whether to silently drop the IDs generated by use_id_generator in this case (since output_fields default is none). I think it is fine but the information in the tutorial https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/text/deduplication/semantic/semantic_e2e.ipynb might need to be updated. Can you check?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked and updated the tutorial in 3ff725a.

The semantic dedup notebook previously said _curator_dedup_id would appear in the final deduplicated output when use_id_generator=True. That is now outdated with this PR's default drop_id_field behavior, so I revised the note to explain that the generated ID is dropped by default and can be preserved only by including it explicitly in output_fields.

Validation: the notebook still parses as JSON after the edit.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added waiting-on-customer Waiting on the original author to respond and removed waiting-on-maintainers Waiting on maintainers to respond labels Jun 26, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label Jun 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow the textRemovalWorkflow to skip writing the id columns/_curator_dedup_id column on removal

4 participants