-
Notifications
You must be signed in to change notification settings - Fork 292
Add option to drop deduplication id field #2078
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -369,6 +369,7 @@ def _run_duplicate_removal(self, executor: BaseExecutor) -> WorkflowRunResult | | |
| output_kwargs=self.write_kwargs, | ||
| output_fields=self.output_fields, | ||
| output_mode="ignore", | ||
| drop_id_field=self.use_id_generator and self.output_fields is None, | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I guess I am on the fence about whether to silently drop the IDs generated by There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Checked and updated the tutorial in The semantic dedup notebook previously said Validation: the notebook still parses as JSON after the edit. |
||
| ) | ||
|
|
||
| return workflow.run(executor=executor) | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -281,7 +281,39 @@ def test_initial_tasks_partitioning(self, test_config: "TestTextDuplicateRemoval | |
| assert workflow_output.get_metadata("num_duplicates_removed") == expected_removed | ||
|
|
||
|
|
||
|
|
||
| def test_removal_stage_can_drop_id_field(tmp_path: Path): | ||
| ids_to_remove_path = tmp_path / "ids_to_remove.parquet" | ||
| pd.DataFrame({"id": [1]}).to_parquet(ids_to_remove_path, index=False) | ||
| task = DocumentBatch( | ||
| task_id="task", | ||
| dataset_name="dataset", | ||
| data=pd.DataFrame({CURATOR_DEDUP_ID_STR: [1, 2], "text": ["drop", "keep"]}), | ||
| ) | ||
|
|
||
| stage = TextDuplicatesRemovalStage( | ||
| ids_to_remove_path=str(ids_to_remove_path), | ||
| id_field=CURATOR_DEDUP_ID_STR, | ||
| drop_id_field=True, | ||
| ) | ||
|
|
||
| result = stage.process(task).to_pandas() | ||
|
|
||
| assert result.to_dict(orient="list") == {"text": ["keep"]} | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I misread this test at first and was confused why we were checking that this text was kept. Can you add another assertion that explicitly checks that
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added — the test now explicitly asserts that Pushed in nightcityblade/Curator@2dbe71d. |
||
| assert CURATOR_DEDUP_ID_STR not in result.columns | ||
|
|
||
|
|
||
| class TestTextDuplicatesRemovalWorkflowGenerateStages: | ||
| def test_drop_id_field_conflicts_with_output_fields(self): | ||
| with pytest.raises(ValueError, match="Cannot drop id_field"): | ||
| TextDuplicatesRemovalWorkflow( | ||
| input_path="input_path", | ||
| ids_to_remove_path="ids_to_remove_path", | ||
| output_path="output_path", | ||
| output_fields=["text", CURATOR_DEDUP_ID_STR], | ||
| drop_id_field=True, | ||
| ) | ||
|
|
||
| def test_invalid_filetypes(self): | ||
| read_invalid_file_type_workflow = TextDuplicatesRemovalWorkflow( | ||
| input_path="input_path", | ||
|
|
@@ -340,6 +372,7 @@ def test_reader_stage(self, input_filetype: str, id_generator_path: str | None): | |
| assert stages[2].id_field == CURATOR_DEDUP_ID_STR | ||
| assert stages[2].duplicate_id_field == "id" | ||
| assert stages[2].read_kwargs == {} | ||
| assert not stages[2].drop_id_field | ||
|
|
||
| # test for writer stage (stages[3]) - default output_filetype is parquet | ||
| assert isinstance(stages[3], ParquetWriter) | ||
|
|
@@ -352,13 +385,15 @@ def test_writer_stage(self, output_filetype: str): | |
| output_path="output_path", | ||
| output_filetype=output_filetype, | ||
| id_generator_path=None, | ||
| drop_id_field=True, | ||
| ) | ||
| stages = workflow._generate_stages(initial_tasks=None) | ||
| assert len(stages) == 4 | ||
| assert isinstance(stages[0], FilePartitioningStage) | ||
| # reader stage | ||
| assert isinstance(stages[1], ParquetReaderStage) # Default input_filetype is parquet | ||
| assert isinstance(stages[2], TextDuplicatesRemovalStage) | ||
| assert stages[2].drop_id_field | ||
| expected_write_stage = ParquetWriter if output_filetype == "parquet" else JsonlWriter | ||
| assert isinstance(stages[3], expected_write_stage) | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drop_id_field+output_fieldsnot validatedWhen a caller sets
drop_id_field=Trueand also includesid_fieldinoutput_fields, the removal stage will have already dropped that column by the time the writer stage tries to select it, producing aKeyErrorat runtime. The semantic workflow avoids this with an explicitself.output_fields is Noneguard, but the baseTextDuplicatesRemovalWorkflowhas no equivalent check. Adding a__post_init__guard (after the id_generator warning) likeif self.drop_id_field and self.output_fields and self.id_field in self.output_fields: raise ValueError(...)would surface this misconfiguration early with a clear message.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 I think this is decent feedback. WDYT @nightcityblade ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call — I added an early
__post_init__validation that raises a clearValueErrorwhendrop_id_field=Trueconflicts withoutput_fieldscontaining the id field, plus a focused unit test for that configuration.Pushed: nightcityblade/Curator@2dbe71d
Validation:
uv run ruff check nemo_curator/stages/text/deduplication/removal_workflow.py tests/stages/text/deduplication/test_removal_workflow.pyandpython3 -m py_compile ...passed locally. Targeted pytest collection is blocked on this macOS host by Curator’s import-time Linux-only guard.