[codex] docs: clarify deduplication input discovery by lbliii · Pull Request #2130 · NVIDIA-NeMo/Curator

lbliii · 2026-06-29T19:51:17Z

What changed

- Document the Parquet and JSONL extension defaults shared by exact, fuzzy, semantic, and duplicate-removal workflows
- Explain `None`, empty-list, and explicit extension override behavior
- Document recursive discovery for a directory string and top-level discovery for listed directories
- Add runnable JSONL-default and custom Parquet-suffix examples
- Require matching discovery and partition settings between identification and removal when IDs are generated
- Correct obsolete duplicate-removal argument names in the touched examples
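
The override rules listed above can be modeled in a few lines. This is a minimal sketch, not Curator's implementation: `DEFAULT_EXTENSIONS`, `resolve_extensions`, and `matches` are hypothetical names, and the defaults and normalization rules (leading dot optional, case-insensitive) are taken from the review summary on this PR.

```python
from __future__ import annotations

# Defaults as described in this PR. The constant and function names are
# hypothetical and exist only for this illustration.
DEFAULT_EXTENSIONS = {
    "parquet": [".parquet"],
    "jsonl": [".jsonl", ".json"],
}


def resolve_extensions(
    input_filetype: str,
    input_file_extensions: list[str] | None = None,
) -> list[str]:
    """Return the lowercase, dot-prefixed suffixes used to filter input files."""
    # None and an empty list both fall back to the filetype defaults.
    if not input_file_extensions:
        return list(DEFAULT_EXTENSIONS[input_filetype])
    # A non-empty list overrides the defaults. Lowercase each entry and add
    # the leading dot when it is missing.
    return [
        ext.lower() if ext.startswith(".") else f".{ext.lower()}"
        for ext in input_file_extensions
    ]


def matches(filename: str, extensions: list[str]) -> bool:
    """Case-insensitive suffix check against the resolved extensions."""
    return filename.lower().endswith(tuple(extensions))
```

Under this model, `resolve_extensions("parquet", ["PQ"])` yields `[".pq"]`, so a file named `part-0.pq` is discovered while `part-0.parquet` is not; the reader is still chosen by `input_filetype`.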

Why

PR #2045 changed deduplication workflows to derive input_file_extensions from input_filetype, but the published Fern pages did not describe the defaults or override semantics. Users could also copy removal examples that referenced constructor arguments no longer present in TextDuplicatesRemovalWorkflow.

User impact

Users can now predict which files each workflow discovers, safely use custom suffixes such as .pq, and preserve generated-ID consistency between identification and removal.
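
To see why generated IDs depend on discovery settings, consider a toy model in which IDs follow the order of discovered files. This is an assumption made for illustration only: `assign_ids` is hypothetical and is not how Curator generates IDs. The point is just that any ID scheme tied to the discovered file set shifts when that set changes.

```python
def assign_ids(files: list[str]) -> dict[str, int]:
    """Toy ID scheme (hypothetical): number the files in sorted order."""
    return {name: idx for idx, name in enumerate(sorted(files))}


all_files = ["a.parquet", "b.pq", "c.parquet"]

# Identification runs with extensions [".parquet", ".pq"].
ident_files = [f for f in all_files if f.endswith((".parquet", ".pq"))]
# Removal runs with the default [".parquet"] only, so b.pq is skipped.
removal_files = [f for f in all_files if f.endswith(".parquet")]

ident_ids = assign_ids(ident_files)
removal_ids = assign_ids(removal_files)

# Files whose ID differs between the two runs.
mismatched = [f for f in removal_files if ident_ids[f] != removal_ids[f]]
print(mismatched)  # ['c.parquet']
```

In this toy model, dropping `b.pq` from the removal run renumbers `c.parquet`, so IDs recorded during identification would point at the wrong data. Keeping the extension and partition settings identical in both stages avoids this.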

Validation

- `npm run check` from `fern/`: 0 errors
- `fern docs broken-links`: no errors in changed pages; 22 existing errors remain in older API-reference pages
- `git diff --check`

Closes #2127
Parent tracking issue: #2118

Signed-off-by: Lawrence Lane <llane@nvidia.com>

copy-pr-bot · 2026-06-29T19:51:20Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-07-02T15:00:21Z

Greptile Summary

This PR adds documentation for the input_file_extensions parameter and file-discovery semantics shared by the exact, fuzzy, semantic, and duplicate-removal deduplication workflows, and fixes obsolete constructor argument names (input_id_field→id_field, ids_to_remove_duplicate_id_field→duplicate_id_field) in all touched examples.
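
For callers who still pass the obsolete names, the rename can be expressed as a small keyword-migration helper. This is a sketch under stated assumptions: `RENAMED_KWARGS` and `migrate_removal_kwargs` are hypothetical helpers, not part of Curator, and only the two old-to-new name pairs come from this PR.

```python
# Old -> new keyword names, taken from the corrections described in this PR.
RENAMED_KWARGS = {
    "input_id_field": "id_field",
    "ids_to_remove_duplicate_id_field": "duplicate_id_field",
}


def migrate_removal_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with obsolete argument names replaced."""
    migrated = {}
    for name, value in kwargs.items():
        new_name = RENAMED_KWARGS.get(name, name)
        # Refuse ambiguous input where both spellings are supplied.
        if new_name in migrated:
            raise ValueError(f"both old and new spellings supplied for {new_name!r}")
        migrated[new_name] = value
    return migrated
```

The migrated dictionary could then be unpacked into the workflow constructor; any keys not in the mapping pass through unchanged.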

- Adds a new "Input File Discovery" section to index.mdx covering extension defaults, None/empty-list/override behavior, and single-directory-recursive vs. listed-directory-top-level discovery rules, verified against FilePartitioningStage and _gather_file_records.
- Corrects TextDuplicatesRemovalWorkflow argument names across all five files to match the current dataclass signature in removal_workflow.py.
- Extends the semdedup.mdx parameter table with input_path, input_filetype, and input_file_extensions rows and adds input_filetype="jsonl" to the runnable example.
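
The difference between the two discovery modes can be sketched with `pathlib`. This is a simplified model, not the code in FilePartitioningStage: `discover_files` is a hypothetical function, and the handling of a string that points at a single file is an assumption.

```python
from __future__ import annotations

from pathlib import Path


def discover_files(input_path: str | list[str], extensions: list[str]) -> list[str]:
    """Model of the documented rules: one directory string is searched
    recursively; each directory in a list is searched at top level only."""
    suffixes = tuple(ext.lower() for ext in extensions)
    if isinstance(input_path, str):
        root = Path(input_path)
        # Single directory string: walk all subdirectories.
        # (Treating a single file path as itself is an assumption.)
        candidates = list(root.rglob("*")) if root.is_dir() else [root]
    else:
        candidates = []
        for entry in input_path:
            p = Path(entry)
            # Listed directory: top-level entries only. Listed file: itself.
            candidates.extend(p.iterdir() if p.is_dir() else [p])
    return sorted(
        str(p)
        for p in candidates
        if p.is_file() and p.name.lower().endswith(suffixes)
    )
```

With a layout such as `data/a.jsonl` and `data/sub/b.jsonl`, passing `"data"` would return both files, while passing `["data"]` would return only `data/a.jsonl`.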

Confidence Score: 5/5

Documentation-only change with no runtime impact; all documented behaviors match the current source code.

Every argument name correction and behavioral claim in the updated docs was verified against the Python source (removal_workflow.py, file_partitioning.py, file_utils.py, and the exact/fuzzy/semantic workflow modules). No code changes are included.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| fern/versions/main/pages/curate-text/process-data/deduplication/index.mdx | Adds a new "Input File Discovery" section documenting extension defaults, override semantics, and recursion rules; fixes obsolete argument names in the removal workflow example. |
| fern/versions/main/pages/curate-text/process-data/deduplication/exact.mdx | Corrects `input_id_field`→`id_field` and `ids_to_remove_duplicate_id_field`→`duplicate_id_field` in both removal examples; adds `input_file_extensions` row to config table and a link to the discovery section. |
| fern/versions/main/pages/curate-text/process-data/deduplication/fuzzy.mdx | Same argument-name corrections as exact.mdx; also adds an `input_file_extensions` row and a discovery cross-link to the fuzzy dedup config table. |
| fern/versions/main/pages/curate-text/process-data/deduplication/semdedup.mdx | Adds `input_path`, `input_filetype`, and `input_file_extensions` rows to the Key Parameters table and appends `input_filetype="jsonl"` to the runnable example. |
| fern/versions/main/pages/about/concepts/text/data-processing-concepts.mdx | Replaces the obsolete `ids_to_remove_duplicate_id_field` with the correct `duplicate_id_field` argument in the fuzzy-removal workflow example. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[input_path type?] -->|Single string directory| B[recurse_subdirectories=True]
    A -->|List of paths| C[recurse_subdirectories=False per path]
    B --> D{input_file_extensions?}
    C --> D
    D -->|None or empty list| E["Use input_filetype defaults<br/>parquet: [.parquet]<br/>jsonl: [.jsonl, .json]"]
    D -->|Non-empty list| F["Override with supplied extensions<br/>Leading dot optional, case-insensitive"]
    E --> G[FilePartitioningStage filters files]
    F --> G
    G --> H["Reader selected by input_filetype<br/>(e.g. .pq files read as Parquet)"]

Reviews (3): Last reviewed commit: "docs: fix remaining dedup removal parame..."

Signed-off-by: Lawrence Lane <llane@nvidia.com>

docs: clarify deduplication input discovery

0158884

Signed-off-by: Lawrence Lane <llane@nvidia.com>

lbliii mentioned this pull request Jun 30, 2026

[codex] publish 26.06 release notes and migration checklist #2143

Open

lbliii marked this pull request as ready for review July 2, 2026 14:53

lbliii requested a review from a team as a code owner July 2, 2026 14:53

lbliii requested review from praateekmahajan and removed request for a team July 2, 2026 14:53

lbliii added 2 commits July 2, 2026 11:18

Merge branch 'main' into codex/docs-dedup-input-defaults

c301112

docs: fix remaining dedup removal parameter

be25573

Signed-off-by: Lawrence Lane <llane@nvidia.com>

lbliii added this to the 25.11 milestone Jul 2, 2026

lbliii mentioned this pull request Jul 2, 2026

[Docs] Integrate 26.06 documentation PRs and stage the version train #2160

Open

7 tasks

[codex] docs: clarify deduplication input discovery #2130
lbliii wants to merge 3 commits into NVIDIA-NeMo:main from lbliii:codex/docs-dedup-input-defaults