Fast and extendable dataset sampling #110

jlamypoirier · 2025-01-09T00:16:05Z

✨ Description

Option to shuffle epochs independently, so that training can be resumed with more samples (epochs) without disturbing the existing ordering. Opt-in for backward compatibility; to become the default eventually if we like it.
Distributed dataset sampling/preparation. Splitting the task across devices makes it much faster: roughly num_gpus times faster for the current scheme (excluding blending), though we might lose the benefit if/once we replace blending of dataset shards with concatenation.
Trim sampling indices for the last epoch. This reduces disk usage and speeds up writing, especially when num_epochs << 1.
Skip build_sample_idx entirely. Instead, we use the cumsum of document sizes to calculate the sample index on the fly. Actual impact on performance TBD, since I don't know how much of the time is spent on this.
TODO: skip pre-computation of blending indices. These are deterministic and not too hard to compute on the fly.
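
For illustration only, here is a minimal sketch of how independent epoch shuffling and an on-the-fly cumsum lookup could fit together. It is not the code in this PR: the names `epoch_permutation` and `locate_sample`, the per-epoch seeding scheme, and the assumption that sample `i` starts at token `i * sequence_length` are all hypothetical.

```python
import numpy as np


def epoch_permutation(num_documents: int, epoch: int, seed: int = 0) -> np.ndarray:
    """Document order for one epoch (hypothetical helper).

    The generator is seeded from (seed, epoch), so the order of an epoch
    does not depend on how many epochs are requested in total.
    """
    rng = np.random.default_rng([seed, epoch])
    return rng.permutation(num_documents)


def locate_sample(
    document_sizes: np.ndarray,
    sequence_length: int,
    sample_index: int,
    seed: int = 0,
) -> tuple[int, int, int]:
    """Return (epoch, document id, token offset) for the start of a sample.

    Assumes sample i starts at global token i * sequence_length and that
    the dataset contains at least one token.
    """
    tokens_per_epoch = int(document_sizes.sum())
    token_index = sample_index * sequence_length
    epoch, token_in_epoch = divmod(token_index, tokens_per_epoch)

    # Recomputed on every call for clarity; a real implementation would
    # presumably cache the order and cumulative sizes per epoch.
    order = epoch_permutation(len(document_sizes), epoch, seed)
    cumulative = np.cumsum(document_sizes[order])

    # First position whose cumulative size exceeds the token index.
    position = int(np.searchsorted(cumulative, token_in_epoch, side="right"))
    start = int(cumulative[position - 1]) if position > 0 else 0
    return epoch, int(order[position]), token_in_epoch - start


if __name__ == "__main__":
    sizes = np.array([5, 3, 8])  # tokens per document, illustrative values
    for i in range(6):
        print(i, locate_sample(sizes, sequence_length=4, sample_index=i))
```

Because each epoch's order depends only on `(seed, epoch)`, asking for more samples later adds new epochs without changing the samples already consumed, and no sample index needs to be written to disk.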

🔍 Type of change

Select all that apply:

🐛 Bug fix (non-breaking change that addresses a specific issue)
🚀 New feature (non-breaking change that adds functionality)
⚠️ Breaking change (a change that could affect existing functionality)
📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
📝 Documentation change (updates documentation, including new content or typo fixes)
🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

jlamypoirier added 4 commits January 6, 2025 16:06

Modular dataset configuration

147e33b

fixes

c41a2c5

fix

e013ba2

Dataset improvements

06eaaa9

jlamypoirier changed the base branch from main to modular_dataset January 9, 2025 00:16

jlamypoirier changed the title ~~Dataset improvements~~ Fast and extendable dataset sampling Jan 9, 2025

tscholak and others added 11 commits January 9, 2025 12:17

Merge branch 'main' into modular_dataset

6b45944

Merge branch 'main' into modular_dataset

952a03d

Merge remote-tracking branch 'origin/modular_dataset' into dataset_improvements

6d05fff

fix

3992df7

Generalize indexed

82285ae

Merge branch 'modular_dataset' into dataset_improvements

3d3d119

fix

7011ca3

Modularize fim, decouple data from dataset, basic tests, misc

9574715

Make tests pass

5532b97

Remove split datasets

5d5e0ab

Make tests pass

baacc4e

jlamypoirier mentioned this pull request Jan 13, 2025

[meta] Fast-LLM Improvements Tracker 🌟 #100

Closed

jlamypoirier added 7 commits January 13, 2025 14:31

misc

09640d8

misc

a73acf6

Fix merge

bb1b87f

Type hints

148b448

Merge branch 'modular_dataset' into dataset_improvements

4877b10

fix

c51a5d8

Merge branch 'main' into dataset_improvements

6763d3e

Base automatically changed from modular_dataset to main January 22, 2025 05:29

jlamypoirier changed the base branch from main to modular_dataset January 23, 2025 21:04

jlamypoirier added 2 commits January 27, 2025 19:47

Merge branch 'main' into dataset_improvements

8d89f06

Fix merge

3403034

jlamypoirier changed the base branch from modular_dataset to main January 28, 2025 00:50

jlamypoirier added 3 commits January 27, 2025 19:51

Fix merge

f6c4b56

Fix merge

158a6f7

Fix merge

a4e288b

jlamypoirier mentioned this pull request Jan 28, 2025

[feat] Speed up dataset sampling #132

Closed

jlamypoirier closed this Feb 7, 2025

jlamypoirier deleted the dataset_improvements branch February 7, 2025 04:19
