Add Windows Support and Fix Cross-Platform Compatibility Issues #140

Merged
snova-bol merged 9 commits into main from dev on Feb 4, 2026

@rajratanmore-debug
Collaborator

  1. Missing encoding specifications

    • Problem: File operations without explicit encoding="utf-8" caused encoding errors on Windows
    • Solution: Added encoding="utf-8" to all file operations in:
      • __main__.py (categories file reading)
      • pipeline.py (metadata, category files, error logs)
      • utils.py (SHA256 file operations)
      • add_metadata_to_dataset.py (metadata file operations)
  2. Counter logic bug

    • Problem: Article counter undercounted processed articles due to incorrect batch counting logic
    • Solution: Fixed counter to use total_processed % 100 == 0 instead of (i != 0 and i % 100 == 0), ensuring all articles are counted accurately
  3. Inaccurate total article count

    • Problem: estimate_total_num_articles() only counted the first file, leading to inaccurate progress tracking
    • Solution: Added count_exact_total_num_articles() that counts all non-empty lines across all files for exact totals
  4. Unicode encoding errors in logging

    • Replaced Unicode characters with ASCII-safe characters
  5. Missing data utilization validation

    • Problem: No validation or reporting to confirm all input data was processed
    • Solution: Added validation that:
      • Compares processed articles vs. expected total
      • Accounts for skipped articles (format errors)
      • Shows clear success/error messages
      • Reports data utilization percentage
  6. Progress bar accuracy: Switched to manual mode with fraction-based updates to prevent percentages exceeding 100%

  7. Cross-platform compatibility: Added Windows multiprocessing support, cross-platform file splitting (split_file_round_robin), and automatic worker reduction on Windows.

  8. Encoding fixes: Added encoding="utf-8" to all file operations to prevent encoding errors on Windows and with non-ASCII data.

  9. Data utilization validation: Added exact article counting (count_exact_total_num_articles) and validation reporting to ensure 100% data utilization with clear success/warning messages.
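
The exact-count and validation steps described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: `count_non_empty_lines` and `utilization_report` are hypothetical names, and the real `count_exact_total_num_articles()` may have a different signature and message format.

```python
from pathlib import Path
from typing import Iterable


def count_non_empty_lines(paths: Iterable[Path]) -> int:
    """Count non-empty lines across every input file (hypothetical helper)."""
    total = 0
    for path in paths:
        # Explicit UTF-8 so the count does not depend on the platform default.
        with open(path, encoding="utf-8") as handle:
            total += sum(1 for line in handle if line.strip())
    return total


def utilization_report(expected: int, processed: int, skipped: int) -> str:
    """Compare processed plus skipped articles against the expected total."""
    accounted = processed + skipped
    # An empty input is treated as fully utilized (assumption for this sketch).
    percent = 100.0 * accounted / expected if expected else 100.0
    status = "OK" if accounted == expected else "WARNING"
    return (
        f"[{status}] {processed} processed, {skipped} skipped, "
        f"{percent:.1f}% of {expected} articles accounted for"
    )
```

Counting skipped articles as accounted for keeps format errors from being reported as lost data, while any remaining gap still surfaces as a warning.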

Changes Made

  • data_prep/pipeline.py:

    • Added count_exact_total_num_articles() for accurate counting
    • Switched progress bar to manual mode with fraction-based updates
    • Added data utilization validation and reporting
    • Fixed all file operations to use UTF-8 encoding
    • Added progress bar update tracker to prevent exceeding 100%
  • data_prep/data_prep.py:

    • Fixed counter logic to accurately count all processed articles
    • Added proper handling for empty files
    • Improved remainder calculation for batch updates
  • __main__.py, utils.py, add_metadata_to_dataset.py:

    • Added encoding="utf-8" to all file operations
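
A minimal sketch of the encoding pattern, assuming a JSON metadata file. The helper names `write_metadata` and `read_metadata` are hypothetical and do not come from this PR.

```python
import json


def write_metadata(path: str, metadata: dict) -> None:
    """Write metadata as UTF-8 regardless of the platform default encoding."""
    # Without encoding="utf-8", open() uses the locale encoding, which on
    # Windows is often a legacy code page that cannot represent all characters.
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(metadata, handle, ensure_ascii=False)


def read_metadata(path: str) -> dict:
    """Read metadata back with the same explicit encoding."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Passing the same encoding on both the write and the read side is what makes the round trip behave identically on Windows, Linux, and macOS.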

davidzyx previously approved these changes Dec 29, 2025
snova-bol previously approved these changes Dec 29, 2025
regex>=2025.2.10
requests>=2.32.4
torch>=2.6.0
transformers>=4.53.0
urllib3>=2.5.0
for the pre-commit condition
snova-mengz previously approved these changes Jan 1, 2026
snova-bol previously approved these changes Jan 14, 2026
snova-bol merged commit 2397428 into main on Feb 4, 2026
3 of 4 checks passed
snova-bol deleted the dev branch on February 4, 2026 at 19:00

4 participants