Add option to drop deduplication id field#2078
Conversation
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Greptile SummaryThis PR adds an optional
Confidence Score: 5/5Safe to merge — the change is narrowly scoped, defaults to off, and the new auto-enable path in the semantic workflow is guarded by two conditions. The feature defaults to False, so existing callers are unaffected. The only new runtime path (dropping a column) runs after the id-based filter and before the writer, which is the correct position. The conflict validation catches the one dangerous misconfiguration early with a clear message. Tests cover the stage directly, the conflict guard, and the wiring through the workflow. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[TextSemanticDeduplicationWorkflow] -->|use_id_generator AND output_fields is None| B{drop_id_field = True}
A -->|otherwise| C{drop_id_field = False}
B --> D[TextDuplicatesRemovalWorkflow]
C --> D
D -->|__post_init__ check| E{drop_id_field=True AND id_field in output_fields?}
E -->|Yes| F[raise ValueError]
E -->|No| G[_generate_stages]
G --> H[FilePartitioningStage]
H --> I[ReaderStage]
I --> J[TextDuplicatesRemovalStage]
J -->|drop_id_field=True| K[df.drop columns id_field]
K --> L[WriterStage]
J -->|drop_id_field=False| L
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[TextSemanticDeduplicationWorkflow] -->|use_id_generator AND output_fields is None| B{drop_id_field = True}
A -->|otherwise| C{drop_id_field = False}
B --> D[TextDuplicatesRemovalWorkflow]
C --> D
D -->|__post_init__ check| E{drop_id_field=True AND id_field in output_fields?}
E -->|Yes| F[raise ValueError]
E -->|No| G[_generate_stages]
G --> H[FilePartitioningStage]
H --> I[ReaderStage]
I --> J[TextDuplicatesRemovalStage]
J -->|drop_id_field=True| K[df.drop columns id_field]
K --> L[WriterStage]
J -->|drop_id_field=False| L
Reviews (3): Last reviewed commit: "docs: align semantic dedup tutorial outp..." | Re-trigger Greptile |
| output_kwargs: dict[str, Any] | None = None | ||
| output_fields: list[str] | None = None | ||
| output_mode: Literal["ignore", "overwrite", "append", "error"] | None = None | ||
| drop_id_field: bool = False |
There was a problem hiding this comment.
Conflicting
drop_id_field + output_fields not validated
When a caller sets drop_id_field=True and also includes id_field in output_fields, the removal stage will have already dropped that column by the time the writer stage tries to select it, producing a KeyError at runtime. The semantic workflow avoids this with an explicit self.output_fields is None guard, but the base TextDuplicatesRemovalWorkflow has no equivalent check. Adding a __post_init__ guard (after the id_generator warning) like if self.drop_id_field and self.output_fields and self.id_field in self.output_fields: raise ValueError(...) would surface this misconfiguration early with a clear message.
There was a problem hiding this comment.
+1 I think this is decent feedback. WDYT @nightcityblade ?
There was a problem hiding this comment.
Good call — I added an early __post_init__ validation that raises a clear ValueError when drop_id_field=True conflicts with output_fields containing the id field, plus a focused unit test for that configuration.
Pushed: nightcityblade/Curator@2dbe71d
Validation: uv run ruff check nemo_curator/stages/text/deduplication/removal_workflow.py tests/stages/text/deduplication/test_removal_workflow.py and python3 -m py_compile ... passed locally. Targeted pytest collection is blocked on this macOS host by Curator’s import-time Linux-only guard.
| output_kwargs: dict[str, Any] | None = None | ||
| output_fields: list[str] | None = None | ||
| output_mode: Literal["ignore", "overwrite", "append", "error"] | None = None | ||
| drop_id_field: bool = False |
There was a problem hiding this comment.
+1 I think this is decent feedback. WDYT @nightcityblade ?
|
|
||
| result = stage.process(task).to_pandas() | ||
|
|
||
| assert result.to_dict(orient="list") == {"text": ["keep"]} |
There was a problem hiding this comment.
I misread this test at first and was confused why we were checking that this text was kept. Can you add another assertion that explicitly checks that CURATOR_DEDUP_ID_STR is not in the result?
There was a problem hiding this comment.
Added — the test now explicitly asserts that CURATOR_DEDUP_ID_STR is absent from the result columns, in addition to checking the remaining row contents.
Pushed in nightcityblade/Curator@2dbe71d.
| @@ -374,6 +369,7 @@ def _run_duplicate_removal(self, executor: BaseExecutor) -> WorkflowRunResult | | |||
| output_kwargs=self.write_kwargs, | |||
| output_fields=self.output_fields, | |||
| output_mode="ignore", | |||
| drop_id_field=self.use_id_generator and self.output_fields is None, | |||
There was a problem hiding this comment.
I guess I am on the fence about whether to silently drop the IDs generated by use_id_generator in this case (since output_fields default is none). I think it is fine but the information in the tutorial https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/text/deduplication/semantic/semantic_e2e.ipynb might need to be updated. Can you check?
There was a problem hiding this comment.
Checked and updated the tutorial in 3ff725a.
The semantic dedup notebook previously said _curator_dedup_id would appear in the final deduplicated output when use_id_generator=True. That is now outdated with this PR's default drop_id_field behavior, so I revised the note to explain that the generated ID is dropped by default and can be preserved only by including it explicitly in output_fields.
Validation: the notebook still parses as JSON after the edit.
Description
Closes #1580.
Adds an optional
drop_id_fieldflag to the text duplicate removal stage/workflow so generated deduplication IDs can be removed from final outputs. The semantic deduplication workflow enables it by default when using generated IDs and no explicitoutput_fieldsare provided.Usage
Checklist
Testing
uv run ruff check nemo_curator/stages/text/deduplication/removal.py nemo_curator/stages/text/deduplication/removal_workflow.py nemo_curator/stages/text/deduplication/semantic.py tests/stages/text/deduplication/test_removal_workflow.pyuv run pytest tests/stages/text/deduplication/test_removal_workflow.py -q(blocked on macOS:ValueError: NeMo-Curator currently only supports Linux systems, while the current machine has a darwin system.)