Add Windows Support and Fix Cross-Platform Compatibility Issues by rajratanmore-debug · Pull Request #140 · sambanova/generative_data_prep

rajratanmore-debug · 2025-12-23T12:26:58Z

Missing encoding specifications
- Problem: File operations without explicit encoding="utf-8" caused encoding errors on Windows
- Solution: Added encoding="utf-8" to all file operations in:
  - __main__.py (categories file reading)
  - pipeline.py (metadata, category files, error logs)
  - utils.py (SHA256 file operations)
  - add_metadata_to_dataset.py (metadata file operations)
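
For reviewers who have not hit this on Windows: when `open()` is called without `encoding`, Python uses the locale's preferred encoding, which on Windows is often a legacy code page rather than UTF-8. A minimal sketch of the pattern applied in the files above; the helper names and the sample payload are hypothetical and are not taken from the repository:

```python
import json
import tempfile
from pathlib import Path


def write_json_utf8(path: Path, payload: dict) -> None:
    """Write JSON with an explicit encoding instead of the platform default."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text readable in the file,
        # which is exactly the case that breaks under a legacy code page.
        json.dump(payload, f, ensure_ascii=False)


def read_json_utf8(path: Path) -> dict:
    """Read the file back with the same explicit encoding."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "metadata.json"  # hypothetical file name
        write_json_utf8(target, {"title": "naïve café 数据"})
        print(read_json_utf8(target))
```
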
Counter logic bug
- Problem: Article counter undercounted processed articles due to incorrect batch counting logic
- Solution: Fixed counter to use total_processed % 100 == 0 instead of (i != 0 and i % 100 == 0), ensuring all articles are counted accurately
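
A rough illustration of the counting pattern described above, not the actual code in data_prep.py. It assumes progress is reported in batches of 100 and that a remainder is flushed once the loop finishes; the function name and return shape are hypothetical. Driving the check from a running total avoids the off-by-one that a zero-based index can introduce (at `i == 100`, 101 articles have already been handled if the check runs after processing).

```python
from typing import Iterable, List

BATCH_SIZE = 100  # batch size taken from the PR description


def process_articles(articles: Iterable[str]) -> List[int]:
    """Return the progress increments reported while processing.

    Hypothetical stand-in for the worker loop: a running total drives the
    batch updates, and whatever is left over is reported after the loop.
    """
    updates: List[int] = []
    total_processed = 0
    for _article in articles:
        total_processed += 1  # count every article, including the first
        if total_processed % BATCH_SIZE == 0:
            updates.append(BATCH_SIZE)
    remainder = total_processed % BATCH_SIZE
    if remainder:
        updates.append(remainder)  # flush the partial final batch
    return updates
```

The sum of the reported increments always equals the number of articles seen, and an empty input reports nothing.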
Inaccurate total article count
- Problem: estimate_total_num_articles() only counted the first file, leading to inaccurate progress tracking
- Solution: Added count_exact_total_num_articles() that counts all non-empty lines across all files for exact totals
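
A sketch of what an exact count over all input files could look like, assuming one article per line. The function name `count_nonempty_lines` and its signature are hypothetical; the real `count_exact_total_num_articles()` may differ.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def count_nonempty_lines(paths: Iterable[Path]) -> int:
    """Count lines that contain something other than whitespace, across all files."""
    total = 0
    for path in paths:
        # Stream each file so memory use stays flat regardless of file size.
        with open(path, "r", encoding="utf-8") as f:
            total += sum(1 for line in f if line.strip())
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "a.jsonl"
        second = Path(tmp) / "b.jsonl"
        first.write_text('{"text": "one"}\n\n{"text": "two"}\n', encoding="utf-8")
        second.write_text('{"text": "three"}\n', encoding="utf-8")
        print(count_nonempty_lines([first, second]))
```

This trades an extra pass over the input for a total that the progress bar and the validation step can rely on.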
Unicode encoding errors in logging
- Solution: Replaced Unicode characters in log messages with ASCII-safe characters
Missing data utilization validation
- Problem: No validation or reporting to confirm all input data was processed
- Solution: Added validation that:
  - Compares processed articles vs. expected total
  - Accounts for skipped articles (format errors)
  - Shows clear success/error messages
  - Reports data utilization percentage
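
The validation steps above can be sketched as follows. This is an assumption-heavy illustration: the class name, the field names, and the choice to define utilization as processed divided by expected are all hypothetical, and the check in pipeline.py may be organized differently. The status tags are plain ASCII, in line with the logging change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UtilizationReport:
    expected: int   # exact article count from the input files
    processed: int  # articles that made it into the output
    skipped: int    # articles dropped because of format errors

    @property
    def utilization_pct(self) -> float:
        """Share of expected articles that were processed (assumed definition)."""
        if self.expected == 0:
            return 100.0
        return 100.0 * self.processed / self.expected

    @property
    def all_accounted_for(self) -> bool:
        """True when processed plus skipped matches the expected total."""
        return self.processed + self.skipped == self.expected

    def message(self) -> str:
        status = "SUCCESS" if self.all_accounted_for else "ERROR"
        return (
            f"[{status}] processed={self.processed} skipped={self.skipped} "
            f"expected={self.expected} utilization={self.utilization_pct:.2f}%"
        )


if __name__ == "__main__":
    print(UtilizationReport(expected=1000, processed=990, skipped=10).message())
```
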
Progress bar accuracy: Switched to manual mode with fraction-based updates to prevent percentages exceeding 100%
Cross-platform compatibility: Added Windows multiprocessing support, cross-platform file splitting (split_file_round_robin), and automatic worker reduction on Windows.
Encoding fixes: Added encoding="utf-8" to all file operations to prevent encoding errors on Windows and with non-ASCII data.
Data utilization validation: Added exact article counting (count_exact_total_num_articles) and validation reporting to ensure 100% data utilization with clear success/warning messages.
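
To make the cross-platform file splitting mentioned above concrete, here is a pure-Python round-robin split that relies only on the standard library and no shell utilities. It is a sketch under assumptions: the function name, the `part_{i}.txt` naming, and the signature are hypothetical and do not describe `split_file_round_robin` itself.

```python
import tempfile
from contextlib import ExitStack
from pathlib import Path
from typing import List


def split_lines_round_robin(src: Path, out_dir: Path, num_parts: int) -> List[Path]:
    """Distribute the lines of src across num_parts files, one line at a time."""
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    out_dir.mkdir(parents=True, exist_ok=True)
    out_paths = [out_dir / f"part_{i}.txt" for i in range(num_parts)]
    with ExitStack() as stack:
        reader = stack.enter_context(open(src, "r", encoding="utf-8"))
        writers = [
            stack.enter_context(open(p, "w", encoding="utf-8")) for p in out_paths
        ]
        for index, line in enumerate(reader):
            # Line 0 goes to part 0, line 1 to part 1, and so on, wrapping around.
            writers[index % num_parts].write(line)
    return out_paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "input.txt"
        source.write_text("a\nb\nc\nd\ne\n", encoding="utf-8")
        for part in split_lines_round_robin(source, Path(tmp) / "parts", 2):
            print(part.name, repr(part.read_text(encoding="utf-8")))
```

Round-robin assignment keeps the parts within one line of each other in length, which helps when each part is handed to a separate worker.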

Changes Made

data_prep/pipeline.py:
- Added count_exact_total_num_articles() for accurate counting
- Switched progress bar to manual mode with fraction-based updates
- Added data utilization validation and reporting
- Fixed all file operations to use UTF-8 encoding
- Added progress bar update tracker to prevent exceeding 100%
data_prep/data_prep.py:
- Fixed counter logic to accurately count all processed articles
- Added proper handling for empty files
- Improved remainder calculation for batch updates
__main__.py, utils.py, add_metadata_to_dataset.py:
- Added encoding="utf-8" to all file operations
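
As an illustration of the fraction-based progress updates listed for pipeline.py, the sketch below keeps a running total and clamps the reported fraction so it can never pass 100%. The class is hypothetical and library-agnostic; the PR does not say which progress bar API consumes the value.

```python
class FractionTracker:
    """Tracks completion as a fraction clamped to the range [0.0, 1.0]."""

    def __init__(self, total: int) -> None:
        self.total = max(total, 0)
        self.done = 0

    def advance(self, n: int) -> float:
        """Record n more finished items and return the clamped fraction."""
        self.done += max(n, 0)  # ignore negative increments
        return self.fraction

    @property
    def fraction(self) -> float:
        if self.total == 0:
            return 1.0  # nothing to do counts as complete
        # Clamp so late or duplicate updates never report more than 100%.
        return min(self.done / self.total, 1.0)


if __name__ == "__main__":
    tracker = FractionTracker(total=200)
    print(tracker.advance(100))
    print(tracker.advance(150))
```

The returned fraction would be passed to the progress bar in place of an item count.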

Dependency versions for the pre-commit condition: regex>=2025.2.10, requests>=2.32.4, torch>=2.6.0, transformers>=4.53.0, urllib3>=2.5.0

rajratanmore-debug and others added 2 commits December 23, 2025 11:52

windows compatible file

437017b

updated the progress bar

2ecad93

rajratanmore-debug requested a review from a team as a code owner December 23, 2025 12:26

rajratanmore-debug requested a review from snova-bol December 23, 2025 12:26

Rajrantan More added 3 commits December 23, 2025 18:21

fixed pre commit issues

5bc74d9

flake8 errors are fixed.

4a2ade3

pre-commit issues are fixed

f97e77b

rajratanmore-debug requested a review from davidzyx December 23, 2025 15:31

davidzyx previously approved these changes Dec 29, 2025

snova-bol previously approved these changes Dec 29, 2025

changed the versions of the jinja2>=3.1.6

acdf1c7

regex>=2025.2.10 requests>=2.32.4 torch>=2.6.0 transformers>=4.53.0 urllib3>=2.5.0 for the pre commit condition

rajratanmore-debug dismissed stale reviews from snova-bol and davidzyx via acdf1c7 December 29, 2025 10:50

rajratanmore-debug requested a review from a team as a code owner December 29, 2025 10:50

rajratanmore-debug requested a review from snova-mengz December 29, 2025 10:50

rajratanmore-debug added 2 commits December 29, 2025 14:20

precommit

c049314

removed changes made for pre commit

09971b2

snova-mengz previously approved these changes Jan 1, 2026

snova-bol previously approved these changes Jan 14, 2026

Edited README.md file

1068b7a

rajratanmore-debug dismissed stale reviews from snova-bol and snova-mengz via 1068b7a January 15, 2026 06:38

snova-bol approved these changes Feb 3, 2026

snova-bol merged commit 2397428 into main Feb 4, 2026
3 of 4 checks passed

snova-bol deleted the dev branch February 4, 2026 19:00

snova-bol merged 9 commits into main from dev

4 participants
