[codex] docs: clarify deduplication input discovery #2130
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Greptile Summary
This PR adds documentation for the
Confidence Score: 5/5
Documentation-only change with no runtime impact; all documented behaviors match the current source code. Every argument name correction and behavioral claim in the updated docs was verified against the Python source (removal_workflow.py, file_partitioning.py, file_utils.py, and the exact/fuzzy/semantic workflow modules). No code changes are included. No files require special attention.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[input_path type?] -->|Single string directory| B[recurse_subdirectories=True]
A -->|List of paths| C[recurse_subdirectories=False per path]
B --> D{input_file_extensions?}
C --> D
D -->|None or empty list| E["Use input_filetype defaults<br/>parquet: [.parquet]<br/>jsonl: [.jsonl, .json]"]
D -->|Non-empty list| F["Override with supplied extensions<br/>Leading dot optional, case-insensitive"]
E --> G[FilePartitioningStage filters files]
F --> G
G --> H["Reader selected by input_filetype<br/>(e.g. .pq files read as Parquet)"]
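The decision flow in the flowchart can be sketched in Python as follows. This is a minimal illustration of the documented behavior, not the library's implementation: the function names (`resolve_recursion`, `resolve_extensions`, `filter_files`) and the `DEFAULT_EXTENSIONS` mapping are hypothetical, and the sketch assumes that any single string `input_path` is treated as a directory.

```python
from __future__ import annotations

# Defaults as listed in the flowchart. The mapping name is hypothetical.
DEFAULT_EXTENSIONS = {
    "parquet": [".parquet"],
    "jsonl": [".jsonl", ".json"],
}


def resolve_recursion(input_path: str | list[str]) -> bool:
    """Single string directory: recurse. List of paths: no recursion per path."""
    return isinstance(input_path, str)


def resolve_extensions(
    input_filetype: str,
    input_file_extensions: list[str] | None = None,
) -> list[str]:
    """Return the extensions used for file discovery."""
    # None and an empty list both fall back to the filetype defaults.
    if not input_file_extensions:
        return list(DEFAULT_EXTENSIONS[input_filetype])
    # A non-empty list overrides the defaults; the leading dot is optional
    # and matching is case-insensitive, so normalize to ".ext" in lowercase.
    return ["." + ext.lower().lstrip(".") for ext in input_file_extensions]


def filter_files(paths: list[str], extensions: list[str]) -> list[str]:
    """Keep only the paths whose suffix matches one of the extensions."""
    allowed = tuple(extensions)
    return [p for p in paths if p.lower().endswith(allowed)]
```

Under these assumptions, `resolve_extensions("parquet", ["PQ"])` yields `[".pq"]`, so files named `part-0.pq` would be discovered and then read as Parquet, because the reader is chosen by `input_filetype` rather than by the file suffix.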
Reviews (3): Last reviewed commit: "docs: fix remaining dedup removal parame..."
What changed
None, empty-list, and explicit extension override behavior
Why
PR #2045 changed deduplication workflows to derive input_file_extensions from input_filetype, but the published Fern pages did not describe the defaults or override semantics. Users could also copy removal examples that referenced constructor arguments no longer present in TextDuplicatesRemovalWorkflow.
User impact
Users can now predict which files each workflow discovers, safely use custom suffixes such as .pq, and preserve generated-ID consistency between identification and removal.
Validation
npm run check from fern/: 0 errors
fern docs broken-links: no errors in changed pages; 22 existing errors remain in older API-reference pages
git diff --check
Closes #2127
Parent tracking issue: #2118