Add option to drop deduplication id field by nightcityblade · Pull Request #2078 · NVIDIA-NeMo/Curator

nightcityblade · 2026-06-16T03:16:25Z

Description

Closes #1580.

Adds an optional drop_id_field flag to the text duplicate removal stage/workflow so generated deduplication IDs can be removed from final outputs. The semantic deduplication workflow enables it by default when using generated IDs and no explicit output_fields are provided.

Usage

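# Import path inferred from the file list in this PR (removal_workflow.py).
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

workflow = TextDuplicatesRemovalWorkflow(
    input_path="input",
    ids_to_remove_path="duplicates",
    output_path="output",
    drop_id_field=True,  # remove the deduplication ID column from the written output
)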

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Testing

uv run ruff check nemo_curator/stages/text/deduplication/removal.py nemo_curator/stages/text/deduplication/removal_workflow.py nemo_curator/stages/text/deduplication/semantic.py tests/stages/text/deduplication/test_removal_workflow.py
uv run pytest tests/stages/text/deduplication/test_removal_workflow.py -q (blocked on macOS: ValueError: NeMo-Curator currently only supports Linux systems, while the current machine has a darwin system.)

Signed-off-by: nightcityblade <nightcityblade@gmail.com>

copy-pr-bot · 2026-06-16T03:16:29Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-06-16T03:19:09Z

Greptile Summary

This PR adds an optional drop_id_field flag to TextDuplicatesRemovalStage and TextDuplicatesRemovalWorkflow, and enables it automatically in TextSemanticDeduplicationWorkflow when the ID generator is used without a custom output_fields list — preventing the generated _curator_dedup_id from leaking into final outputs.

removal.py: Column is dropped immediately after the duplicate-filter step in process(), before the output DocumentBatch is constructed — ordering is correct.
removal_workflow.py: A __post_init__ guard raises ValueError when drop_id_field=True conflicts with the user explicitly listing id_field in output_fields, surfacing the misconfiguration early.
semantic.py: Uses self.use_id_generator and self.output_fields is None — safe guard that avoids a double-drop scenario when a caller sets output_fields explicitly.
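
The ordering described for removal.py (filter on the ID first, drop the ID column second) can be illustrated with a minimal pandas sketch. This is not the code from the PR: `remove_duplicates` and its parameters are hypothetical names, the batch is modelled as a plain DataFrame rather than a DocumentBatch, and the `_curator_dedup_id` column name is taken from the summary above.

```python
import pandas as pd

# Column name taken from the summary above; the helper below is hypothetical.
ID_FIELD = "_curator_dedup_id"


def remove_duplicates(
    df: pd.DataFrame,
    ids_to_remove: set[int],
    id_field: str = ID_FIELD,
    drop_id_field: bool = False,
) -> pd.DataFrame:
    """Filter out rows whose ID is marked as a duplicate, then optionally drop the ID column."""
    # Step 1: the ID column must still exist here, because the filter reads it.
    kept = df[~df[id_field].isin(ids_to_remove)]
    # Step 2: only after filtering is it safe to remove the ID column.
    if drop_id_field:
        kept = kept.drop(columns=[id_field])
    return kept.reset_index(drop=True)


batch = pd.DataFrame({ID_FIELD: [0, 1], "text": ["keep", "duplicate"]})
result = remove_duplicates(batch, ids_to_remove={1}, drop_id_field=True)
```

Reversing the two steps would raise a KeyError in the filter, which is why the drop has to come after the ID-based filter and before the output batch is built.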

Confidence Score: 5/5

Safe to merge — the change is narrowly scoped, defaults to off, and the new auto-enable path in the semantic workflow is guarded by two conditions.

The feature defaults to False, so existing callers are unaffected. The only new runtime path (dropping a column) runs after the id-based filter and before the writer, which is the correct position. The conflict validation catches the one dangerous misconfiguration early with a clear message. Tests cover the stage directly, the conflict guard, and the wiring through the workflow.

No files require special attention.

Important Files Changed

Filename	Overview
nemo_curator/stages/text/deduplication/removal.py	Adds drop_id_field: bool = False to TextDuplicatesRemovalStage and drops the column post-filter in process(). Logic is correct and ordering is safe (drop occurs after the id-based filter).
nemo_curator/stages/text/deduplication/removal_workflow.py	Adds drop_id_field to the workflow dataclass and wires it through to the removal stage. Includes a __post_init__ guard that raises ValueError when drop_id_field=True conflicts with id_field being explicitly listed in output_fields.
nemo_curator/stages/text/deduplication/semantic.py	Wires drop_id_field=self.use_id_generator and self.output_fields is None into TextDuplicatesRemovalWorkflow — correct guard: only auto-drops when the ID was machine-generated and no custom output schema is set.
tests/stages/text/deduplication/test_removal_workflow.py	Adds a stage-level unit test verifying the column is dropped and absent from output, a conflict-validation test, and assertions in the existing stage-wiring tests. Coverage is appropriate.
tutorials/text/deduplication/semantic/semantic_e2e.ipynb	Updates a single Markdown cell in the tutorial notebook to document the new automatic drop-on-generate-ID behavior.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[TextSemanticDeduplicationWorkflow] -->|use_id_generator AND output_fields is None| B{drop_id_field = True}
    A -->|otherwise| C{drop_id_field = False}

    B --> D[TextDuplicatesRemovalWorkflow]
    C --> D

    D -->|__post_init__ check| E{drop_id_field=True AND id_field in output_fields?}
    E -->|Yes| F[raise ValueError]
    E -->|No| G[_generate_stages]

    G --> H[FilePartitioningStage]
    H --> I[ReaderStage]
    I --> J[TextDuplicatesRemovalStage]
    J -->|drop_id_field=True| K[df.drop columns id_field]
    K --> L[WriterStage]
    J -->|drop_id_field=False| L

Reviews (3): Last reviewed commit: "docs: align semantic dedup tutorial outp..."

greptile-apps · 2026-06-16T03:19:13Z

    output_kwargs: dict[str, Any] | None = None
    output_fields: list[str] | None = None
    output_mode: Literal["ignore", "overwrite", "append", "error"] | None = None
+    drop_id_field: bool = False


Conflicting drop_id_field + output_fields not validated

When a caller sets drop_id_field=True and also includes id_field in output_fields, the removal stage will have already dropped that column by the time the writer stage tries to select it, producing a KeyError at runtime. The semantic workflow avoids this with an explicit self.output_fields is None guard, but the base TextDuplicatesRemovalWorkflow has no equivalent check. Adding a __post_init__ guard (after the id_generator warning) like if self.drop_id_field and self.output_fields and self.id_field in self.output_fields: raise ValueError(...) would surface this misconfiguration early with a clear message.

+1 I think this is decent feedback. WDYT @nightcityblade ?

Good call — I added an early __post_init__ validation that raises a clear ValueError when drop_id_field=True conflicts with output_fields containing the id field, plus a focused unit test for that configuration.

Pushed: nightcityblade/Curator@2dbe71d
Validation: uv run ruff check nemo_curator/stages/text/deduplication/removal_workflow.py tests/stages/text/deduplication/test_removal_workflow.py and python3 -m py_compile ... passed locally. Targeted pytest collection is blocked on this macOS host by Curator’s import-time Linux-only guard.
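
For readers following the thread, the guard discussed above might look roughly like the sketch below. It is a simplified stand-in, not the workflow class from the PR: `RemovalConfig` is a hypothetical dataclass that carries only the three fields involved, the default `id_field` value is assumed from the summary, and the error message wording is illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RemovalConfig:
    """Hypothetical stand-in for the workflow dataclass; only the relevant fields."""

    id_field: str = "_curator_dedup_id"  # assumed default, taken from the summary
    output_fields: Optional[list[str]] = None
    drop_id_field: bool = False

    def __post_init__(self) -> None:
        # Reject the contradictory combination at construction time instead of
        # letting the writer fail later when it selects a column that was dropped.
        if self.drop_id_field and self.output_fields and self.id_field in self.output_fields:
            msg = (
                f"drop_id_field=True conflicts with output_fields containing "
                f"{self.id_field!r}; remove it from output_fields or set drop_id_field=False."
            )
            raise ValueError(msg)


# Valid: the ID is dropped and the requested output columns do not include it.
ok = RemovalConfig(drop_id_field=True, output_fields=["text"])

# Invalid: the ID is both dropped and requested in the output.
try:
    RemovalConfig(drop_id_field=True, output_fields=["text", "_curator_dedup_id"])
except ValueError as exc:
    conflict_message = str(exc)
```

Failing in `__post_init__` means the misconfiguration is reported when the workflow object is created, before any files are read.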

sarahyurick · 2026-06-16T17:07:16Z

    output_kwargs: dict[str, Any] | None = None
    output_fields: list[str] | None = None
    output_mode: Literal["ignore", "overwrite", "append", "error"] | None = None
+    drop_id_field: bool = False


+1 I think this is decent feedback. WDYT @nightcityblade ?

sarahyurick · 2026-06-16T17:09:10Z

+
+    result = stage.process(task).to_pandas()
+
+    assert result.to_dict(orient="list") == {"text": ["keep"]}


I misread this test at first and was confused why we were checking that this text was kept. Can you add another assertion that explicitly checks that CURATOR_DEDUP_ID_STR is not in the result?

Added — the test now explicitly asserts that CURATOR_DEDUP_ID_STR is absent from the result columns, in addition to checking the remaining row contents.

Pushed in nightcityblade/Curator@2dbe71d.

sarahyurick · 2026-06-26T18:04:27Z

@@ -374,6 +369,7 @@ def _run_duplicate_removal(self, executor: BaseExecutor) -> WorkflowRunResult |
            output_kwargs=self.write_kwargs,
            output_fields=self.output_fields,
            output_mode="ignore",
+            drop_id_field=self.use_id_generator and self.output_fields is None,


I guess I am on the fence about whether to silently drop the IDs generated by use_id_generator in this case (since output_fields default is none). I think it is fine but the information in the tutorial https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/text/deduplication/semantic/semantic_e2e.ipynb might need to be updated. Can you check?

Checked and updated the tutorial in 3ff725a.

The semantic dedup notebook previously said _curator_dedup_id would appear in the final deduplicated output when use_id_generator=True. That is now outdated with this PR's default drop_id_field behavior, so I revised the note to explain that the generated ID is dropped by default and can be preserved only by including it explicitly in output_fields.

Validation: the notebook still parses as JSON after the edit.
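
The auto-enable rule under discussion reduces to a single boolean expression. The sketch below evaluates it for the four input combinations; `should_drop_generated_id` is a hypothetical helper written for illustration, and only the expression inside it comes from the diff above.

```python
from typing import Optional


def should_drop_generated_id(use_id_generator: bool, output_fields: Optional[list[str]]) -> bool:
    """Mirror of the expression `use_id_generator and output_fields is None` from the diff."""
    return use_id_generator and output_fields is None


# Generated IDs and no explicit output schema: the ID column is dropped.
default_case = should_drop_generated_id(True, None)  # True

# Generated IDs, but the caller lists the ID explicitly: it is preserved.
explicit_keep = should_drop_generated_id(True, ["text", "_curator_dedup_id"])  # False

# Any explicit output_fields disables the automatic drop.
explicit_schema = should_drop_generated_id(True, ["text"])  # False

# IDs supplied by the caller's own data are never dropped automatically.
own_ids = should_drop_generated_id(False, None)  # False
```

Because the flag is only set when `output_fields` is None, this path cannot trigger the conflict check in the removal workflow.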

Add option to drop deduplication id field

efc4cbe

Signed-off-by: nightcityblade <nightcityblade@gmail.com>

nightcityblade requested a review from a team as a code owner June 16, 2026 03:16

nightcityblade requested review from sarahyurick and removed request for a team June 16, 2026 03:16

github-actions Bot added the community-request label Jun 16, 2026

greptile-apps Bot reviewed Jun 16, 2026

sarahyurick reviewed Jun 16, 2026

svcnvidia-nemo-ci added the waiting-on-customer label ("Waiting on the original author to respond") Jun 16, 2026

Validate dropped id output field conflict

2dbe71d

svcnvidia-nemo-ci added the waiting-on-maintainers label ("Waiting on maintainers to respond") and removed the waiting-on-customer label ("Waiting on the original author to respond") Jun 17, 2026

sarahyurick reviewed Jun 26, 2026

svcnvidia-nemo-ci added the waiting-on-customer label ("Waiting on the original author to respond") and removed the waiting-on-maintainers label ("Waiting on maintainers to respond") Jun 26, 2026

docs: align semantic dedup tutorial output

3ff725a

svcnvidia-nemo-ci removed the waiting-on-customer label ("Waiting on the original author to respond") Jun 27, 2026

nightcityblade wants to merge 3 commits into NVIDIA-NeMo:main from nightcityblade:fix/issue-1580

4 participants

