[codex] docs: clarify deduplication input discovery #2130

Open
lbliii wants to merge 3 commits into NVIDIA-NeMo:main from lbliii:codex/docs-dedup-input-defaults

Conversation

lbliii (Contributor) commented Jun 29, 2026

What changed

  • document the Parquet and JSONL extension defaults shared by exact, fuzzy, semantic, and duplicate-removal workflows
  • explain None, empty-list, and explicit extension override behavior
  • document recursive discovery for a directory string and top-level discovery for listed directories
  • add runnable JSONL-default and custom Parquet-suffix examples
  • require matching discovery and partition settings between identification and removal when IDs are generated
  • correct obsolete duplicate-removal argument names in the touched examples
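The default and override rules in the list above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `DEFAULT_EXTENSIONS`, `resolve_extensions`, and `matches` are hypothetical names, and the default suffixes and the "leading dot optional, case-insensitive" rule are taken from the review summary further down this page.

```python
from __future__ import annotations

# Hypothetical defaults table, mirroring the defaults described in this PR.
DEFAULT_EXTENSIONS = {
    "parquet": [".parquet"],
    "jsonl": [".jsonl", ".json"],
}


def resolve_extensions(
    input_filetype: str,
    input_file_extensions: list[str] | None = None,
) -> list[str]:
    """Return the suffixes used for discovery (illustrative only)."""
    if not input_file_extensions:
        # None and an empty list both fall back to the filetype defaults.
        return list(DEFAULT_EXTENSIONS[input_filetype])
    # Explicit override: leading dot is optional, matching is case-insensitive.
    return ["." + ext.lower().lstrip(".") for ext in input_file_extensions]


def matches(filename: str, extensions: list[str]) -> bool:
    """Check a file name against the resolved suffixes, ignoring case."""
    return filename.lower().endswith(tuple(extensions))


print(resolve_extensions("jsonl"))            # defaults for JSONL input
print(resolve_extensions("parquet", ["PQ"]))  # custom Parquet suffix
```

Under this sketch, a `.pq` override replaces the defaults rather than extending them, so `part-0.parquet` would no longer match unless `.parquet` is listed as well.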

Why

PR #2045 changed deduplication workflows to derive input_file_extensions from input_filetype, but the published Fern pages did not describe the defaults or override semantics. Users could also copy removal examples that referenced constructor arguments no longer present in TextDuplicatesRemovalWorkflow.

User impact

Users can now predict which files each workflow discovers, safely use custom suffixes such as .pq, and preserve generated-ID consistency between identification and removal.
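A toy model may help show why generated IDs depend on matching discovery settings. It assumes IDs are assigned by row position across the discovered files; that is an assumption made purely for illustration, since this PR text does not describe the actual ID scheme. `assign_ids` and the file names are hypothetical.

```python
from __future__ import annotations


def assign_ids(rows_per_file: dict[str, int], discovered: list[str]) -> dict[tuple[str, int], int]:
    """Toy model: number rows sequentially across discovered files in sorted order."""
    ids: dict[tuple[str, int], int] = {}
    next_id = 0
    for name in sorted(discovered):
        for row in range(rows_per_file[name]):
            ids[(name, row)] = next_id
            next_id += 1
    return ids


rows_per_file = {"a.parquet": 2, "b.pq": 3, "c.parquet": 1}

# Identification ran with a .pq override, so all three files were discovered.
identification = assign_ids(rows_per_file, ["a.parquet", "b.pq", "c.parquet"])
# Removal ran with the defaults only, so b.pq was skipped.
removal = assign_ids(rows_per_file, ["a.parquet", "c.parquet"])

# The same row receives a different ID in each run.
print(identification[("c.parquet", 0)], removal[("c.parquet", 0)])  # prints "5 2"
```

In this model, the IDs flagged during identification would point at the wrong rows during removal, which is why both steps need the same discovery and partition settings.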

Validation

  • npm run check from fern/: 0 errors
  • fern docs broken-links: no errors in changed pages; 22 existing errors remain in older API-reference pages
  • git diff --check

Closes #2127
Parent tracking issue: #2118

Signed-off-by: Lawrence Lane <llane@nvidia.com>

copy-pr-bot (Bot) commented Jun 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@lbliii lbliii marked this pull request as ready for review July 2, 2026 14:53
@lbliii lbliii requested a review from a team as a code owner July 2, 2026 14:53
@lbliii lbliii requested review from praateekmahajan and removed request for a team July 2, 2026 14:53
greptile-apps (Bot, Contributor) commented Jul 2, 2026

Greptile Summary

This PR adds documentation for the input_file_extensions parameter and file-discovery semantics shared by the exact, fuzzy, semantic, and duplicate-removal deduplication workflows, and fixes obsolete constructor argument names (input_id_field → id_field, ids_to_remove_duplicate_id_field → duplicate_id_field) in all touched examples.

  • Adds a new "Input File Discovery" section to index.mdx covering extension defaults, None/empty-list/override behavior, and single-directory-recursive vs. listed-directory-top-level discovery rules, verified against FilePartitioningStage and _gather_file_records.
  • Corrects TextDuplicatesRemovalWorkflow argument names across all five files to match the current dataclass signature in removal_workflow.py.
  • Extends the semdedup.mdx parameter table with input_path, input_filetype, and input_file_extensions rows and adds input_filetype="jsonl" to the runnable example.
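For readers updating their own scripts, the two renames named in this summary can be applied mechanically. The helper below is a hypothetical sketch that only rewrites a plain dict of keyword arguments; it does not import or call the library. The field values (`doc_id`, `id`) and the `input_path` entry are placeholders.

```python
# Old -> new argument names, as listed in the summary above.
RENAMED_ARGS = {
    "input_id_field": "id_field",
    "ids_to_remove_duplicate_id_field": "duplicate_id_field",
}


def migrate_removal_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with obsolete argument names replaced."""
    migrated = {}
    for key, value in kwargs.items():
        new_key = RENAMED_ARGS.get(key, key)
        if new_key in migrated:
            # Both the old and the new spelling were supplied; refuse to guess.
            raise ValueError(f"argument {new_key!r} supplied more than once")
        migrated[new_key] = value
    return migrated


old_style = {
    "input_path": "data/",          # placeholder value
    "input_id_field": "doc_id",     # placeholder field name
    "ids_to_remove_duplicate_id_field": "id",
}
print(migrate_removal_kwargs(old_style))
```

The helper raises when both spellings of the same argument are present, so a half-migrated call site fails loudly instead of silently picking one value.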

Confidence Score: 5/5

Documentation-only change with no runtime impact; all documented behaviors match the current source code.

Every argument name correction and behavioral claim in the updated docs was verified against the Python source (removal_workflow.py, file_partitioning.py, file_utils.py, and the exact/fuzzy/semantic workflow modules). No code changes are included.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| fern/versions/main/pages/curate-text/process-data/deduplication/index.mdx | Adds a new "Input File Discovery" section documenting extension defaults, override semantics, and recursion rules; fixes obsolete argument names in the removal workflow example. |
| fern/versions/main/pages/curate-text/process-data/deduplication/exact.mdx | Corrects input_id_field → id_field and ids_to_remove_duplicate_id_field → duplicate_id_field in both removal examples; adds input_file_extensions row to config table and a link to the discovery section. |
| fern/versions/main/pages/curate-text/process-data/deduplication/fuzzy.mdx | Same argument-name corrections as exact.mdx; also adds an input_file_extensions row and a discovery cross-link to the fuzzy dedup config table. |
| fern/versions/main/pages/curate-text/process-data/deduplication/semdedup.mdx | Adds input_path, input_filetype, and input_file_extensions rows to the Key Parameters table and appends input_filetype="jsonl" to the runnable example. |
| fern/versions/main/pages/about/concepts/text/data-processing-concepts.mdx | Replaces the obsolete ids_to_remove_duplicate_id_field with the correct duplicate_id_field argument in the fuzzy-removal workflow example. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[input_path type?] -->|Single string directory| B[recurse_subdirectories=True]
    A -->|List of paths| C[recurse_subdirectories=False per path]
    B --> D{input_file_extensions?}
    C --> D
    D -->|None or empty list| E["Use input_filetype defaults<br/>parquet: [.parquet]<br/>jsonl: [.jsonl, .json]"]
    D -->|Non-empty list| F["Override with supplied extensions<br/>Leading dot optional, case-insensitive"]
    E --> G[FilePartitioningStage filters files]
    F --> G
    G --> H["Reader selected by input_filetype<br/>(e.g. .pq files read as Parquet)"]
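The recursion branch of the flowchart can be sketched with the standard library alone. This is an illustration of the documented rule (a single directory string recurses, listed paths are scanned at the top level only), not the code in FilePartitioningStage; `discover` and `demo` are hypothetical names, and the handling of a path that points directly at a file is an assumption.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def discover(input_path: str | list[str], extensions: tuple[str, ...]) -> list[str]:
    """Illustrative discovery: recurse for one directory string, top level for a list."""
    if isinstance(input_path, str):
        roots, recurse = [Path(input_path)], True
    else:
        roots, recurse = [Path(p) for p in input_path], False

    found: list[str] = []
    for root in roots:
        if root.is_file():
            candidates = [root]  # assumption: explicit file paths pass through the same filter
        else:
            candidates = root.rglob("*") if recurse else root.iterdir()
        found.extend(
            str(p) for p in candidates
            if p.is_file() and p.name.lower().endswith(extensions)
        )
    return sorted(found)


def demo() -> tuple[int, int]:
    """Build a small tree and count matches for the string and list forms."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "nested").mkdir()
        (root / "top.jsonl").touch()
        (root / "nested" / "deep.jsonl").touch()
        suffixes = (".jsonl", ".json")
        return len(discover(tmp, suffixes)), len(discover([tmp], suffixes))


print(demo())  # prints "(2, 1)": the list form skips nested/deep.jsonl
```

Wrapping the same directory in a list changes which files are found, so the string and list forms are not interchangeable when the data lives in subdirectories.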

Reviews (3): Last reviewed commit: "docs: fix remaining dedup removal parame..."

Development

Successfully merging this pull request may close these issues.

[Docs] Document filetype-aware input-extension defaults in deduplication workflows
