@jlamypoirier (Collaborator) commented Jan 9, 2025

✨ Description

  • Option to shuffle epochs independently, so that we can resume with more training samples (epochs) without changing the ordering of the samples already seen. (Opt-in for backward compatibility; to become the default eventually if we like it.)
  • Distributed dataset sampling/preparation. The task is split between the devices to make it a lot faster. It should be roughly num_gpus times faster for the current scheme (excluding blending), but we might lose the benefit if/once we replace blending of dataset shards with concatenation.
  • Trim sampling indices for the last epoch. This reduces disk usage and speeds up writing, especially when num_epochs << 1.
  • Skip build_sample_idx entirely. Instead, we use the cumulative sum (cumsum) of document sizes to calculate the sample index on the fly. Actual impact on performance TBD, since I don't know how much of the time is spent on this.
  • TODO: skip pre-computation of blending indices. These are deterministic and not too hard to compute on the fly.
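
For illustration only, here is a minimal sketch of two of the ideas above: independent per-epoch shuffling, and locating a sample from the cumsum of document sizes instead of a precomputed sample index. It is not the code from this PR. The function names (`epoch_order`, `locate_sample`), the seeding scheme, and the simplification that sample `i` starts at token `i * sequence_length` of the unshuffled token stream are all assumptions made for the example.

```python
import numpy as np


def epoch_order(num_documents: int, epoch: int, seed: int = 0) -> np.ndarray:
    """Document permutation for one epoch (hypothetical helper).

    The generator is seeded by (seed, epoch) only, so the order for a given
    epoch does not depend on how many epochs are requested in total.
    """
    rng = np.random.default_rng([seed, epoch])
    return rng.permutation(num_documents)


def locate_sample(token_cumsum: np.ndarray, sample_index: int, sequence_length: int):
    """Return (document position, token offset) where a sample starts.

    token_cumsum[i] is the total number of tokens in documents 0..i (inclusive).
    Assumes sample_index * sequence_length is smaller than token_cumsum[-1].
    """
    token_start = sample_index * sequence_length
    # First document whose cumulative token count exceeds token_start.
    doc = int(np.searchsorted(token_cumsum, token_start, side="right"))
    doc_begin = int(token_cumsum[doc - 1]) if doc > 0 else 0
    return doc, token_start - doc_begin


# Toy data: three documents with 5, 3 and 8 tokens.
document_sizes = np.array([5, 3, 8])
token_cumsum = np.cumsum(document_sizes)

order_epoch_0 = epoch_order(len(document_sizes), epoch=0)
sample_2_start = locate_sample(token_cumsum, sample_index=2, sequence_length=4)
```

With this scheme, extending a run from 2 to 5 epochs leaves the order of epochs 0 and 1 unchanged, and the lookup is a binary search per sample rather than a stored array. In the toy data, sample 2 starts at token 8, which is the first token of the third document.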

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@jlamypoirier jlamypoirier changed the base branch from main to modular_dataset January 9, 2025 00:16
@jlamypoirier jlamypoirier changed the title from "Dataset improvements" to "Fast and extendable dataset sampling" Jan 9, 2025
Base automatically changed from modular_dataset to main January 22, 2025 05:29
@jlamypoirier jlamypoirier changed the base branch from main to modular_dataset January 23, 2025 21:04
@jlamypoirier jlamypoirier changed the base branch from modular_dataset to main January 28, 2025 00:50
@jlamypoirier jlamypoirier deleted the dataset_improvements branch February 7, 2025 04:19

2 participants