davidzyx
previously approved these changes
Dec 29, 2025
snova-bol
previously approved these changes
Dec 29, 2025
regex>=2025.2.10 requests>=2.32.4 torch>=2.6.0 transformers>=4.53.0 urllib3>=2.5.0 for the pre commit condition
acdf1c7
snova-mengz
previously approved these changes
Jan 1, 2026
snova-bol
previously approved these changes
Jan 14, 2026
1068b7a
snova-bol
approved these changes
Feb 3, 2026
Missing encoding specifications
File operations without encoding="utf-8" caused encoding errors on Windows
Added encoding="utf-8" to all file operations in:
__main__.py (categories file reading)
pipeline.py (metadata, category files, error logs)
utils.py (SHA256 file operations)
add_metadata_to_dataset.py (metadata file operations)

Counter logic bug
Now checks total_processed % 100 == 0 instead of (i != 0 and i % 100 == 0), ensuring all articles are counted accurately

Inaccurate total article count
estimate_total_num_articles() only counted the first file, leading to inaccurate progress tracking
Added count_exact_total_num_articles() that counts all non-empty lines across all files for exact totals

Unicode encoding errors in logging

Missing data utilization validation
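
To make the counter logic bug listed above concrete, here is a minimal sketch. It assumes, without confirmation from the PR, that i was a loop index that restarts for each input file while total_processed is a single running counter; both function names are hypothetical and are not taken from the changed files.

```python
def count_reports_per_file_index(files, every=100):
    """Old-style check: `i` restarts at 0 for each file (assumed behaviour)."""
    reports = 0
    for lines in files:
        for i, _ in enumerate(lines):
            if i != 0 and i % every == 0:
                reports += 1
    return reports


def count_reports_cumulative(files, every=100):
    """New-style check: one running counter shared across all files."""
    reports = 0
    total_processed = 0
    for lines in files:
        for _ in lines:
            total_processed += 1
            if total_processed % every == 0:
                reports += 1
    return reports


# Three hypothetical files of 60 articles each (180 articles in total).
files = [["article"] * 60 for _ in range(3)]
old_reports = count_reports_per_file_index(files)
new_reports = count_reports_cumulative(files)
```

Under that assumption the per-file index never reaches 100, so the old check never fires, while the cumulative counter fires once at article 100.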
Progress bar accuracy: Switched to manual mode with fraction-based updates to prevent percentages exceeding 100%
Cross-platform compatibility: Added Windows multiprocessing support, cross-platform file splitting (split_file_round_robin), and automatic worker reduction on Windows.
Encoding fixes: Added encoding="utf-8" to all file operations to prevent encoding errors on Windows and with non-ASCII data.
Data utilization validation: Added exact article counting (count_exact_total_num_articles) and validation reporting to ensure 100% data utilization with clear success/warning messages.
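
The exact count and the utilization check described above could look roughly like the following. This is a sketch, not the PR's count_exact_total_num_articles(): the helper names, the message format, and the one-article-per-non-empty-line layout are assumptions made for illustration.

```python
import os
import tempfile


def count_non_empty_lines(paths):
    """Count non-empty lines across every file, not just the first one."""
    total = 0
    for path in paths:
        with open(path, encoding="utf-8") as f:
            total += sum(1 for line in f if line.strip())
    return total


def utilization_report(processed, total):
    """Return a one-line success or warning message (hypothetical format)."""
    if total == 0:
        return "WARNING: no articles found"
    pct = 100.0 * processed / total
    status = "OK" if processed == total else "WARNING"
    return f"{status}: processed {processed}/{total} articles ({pct:.1f}%)"


# Two small hypothetical input files; the blank line must not be counted.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in [("a.jsonl", "one\n\ntwo\n"), ("b.jsonl", "três\n")]:
        path = os.path.join(tmp, name)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    total = count_non_empty_lines(paths)

full = utilization_report(3, total)
partial = utilization_report(2, total)
```

Counting every file costs one extra pass over the data, but it gives the progress bar and the final report a denominator that matches what is actually processed.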
Changes Made
data_prep/pipeline.py: Added count_exact_total_num_articles() for accurate counting
data_prep/data_prep.py:
__main__.py, utils.py, add_metadata_to_dataset.py: Added encoding="utf-8" to all file operations
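
The encoding change amounts to passing encoding="utf-8" on every open() call instead of relying on the platform default, which on Windows is often a legacy code page. A minimal sketch, using a pure-Python round-robin line split as the example; split_round_robin is a hypothetical stand-in and not the PR's split_file_round_robin:

```python
import os
import tempfile
from contextlib import ExitStack


def split_round_robin(src_path, out_dir, num_parts):
    """Write line k of src_path to part k % num_parts; return the part paths."""
    part_paths = [os.path.join(out_dir, f"part_{n}.txt") for n in range(num_parts)]
    with ExitStack() as stack:
        # Explicit UTF-8 on every handle, so behaviour does not depend on the OS locale.
        outs = [
            stack.enter_context(open(p, "w", encoding="utf-8", newline="\n"))
            for p in part_paths
        ]
        with open(src_path, encoding="utf-8") as src:
            for k, line in enumerate(src):
                outs[k % num_parts].write(line)
    return part_paths


# Hypothetical input with non-ASCII text, split into two parts.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "articles.txt")
    with open(src, "w", encoding="utf-8", newline="\n") as f:
        f.write("café\n東京\nnaïve\nzebra\nüber\n")
    parts = split_round_robin(src, tmp, 2)
    contents = []
    for p in parts:
        with open(p, encoding="utf-8") as f:
            contents.append(f.read().splitlines())
```

Because the split uses only the standard library, it behaves the same on Windows, Linux, and macOS.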