Synchronize end of loader for repeat=False #146
base: develop
Changes from 5 commits
Documentation changes:

@@ -25,20 +25,23 @@ splits:
     split_part: val
 ```
 
-Now, the call to `get_train_dataset` requires the additional parameter `repeat=False` to interrupt iterating after one epoch:
+Now, the call to `get_train_dataset` requires the additional parameter `repeat=False` to interrupt iterating after one epoch, and the `RedistributeLoader` to synchronize the last samples when some ranks have already exhausted their data:
 
 ```py
 from megatron.energon import get_train_dataset, get_loader, WorkerConfig
+from megatron.energon.sync_end import RedistributeLoader
 
-loader = get_loader(get_train_dataset(
+# Set repeat=False to avoid repeating the dataset.
+# Also add RedistributeLoader to synchronize the end of rank exhaustion. Only works with initialized torch distributed.
+loader = RedistributeLoader(get_loader(get_train_dataset(
Review discussion on this line:

Collaborator:
I think it would be better if this was transparent to the user, so make
I.e. blend_epochized can either be a list as before (chooses default

Collaborator (Author):
Okay, we can make this be handled in
Regarding moving the configuration to the metadataset:
Thus voting for keeping this in code, not in the metadataset config.
     'metadataset.yaml',
     batch_size=2,
     shuffle_buffer_size=100,
     max_samples_per_sequence=100,
     worker_config=WorkerConfig.default_worker_config(),
     repeat=False,
-))
+)))
 
 # This will now stop iterating after the datasets have been iterated (coco 5 times, coyo-train 2
 # times and coyo-val 1 times). Of course, the data is still being shuffled between all those

@@ -54,3 +57,6 @@ for batch in loader:
 
 If used as dataset for `get_val_dataset`, the `repetitions` are ignored.
+The metadataset would also work without setting `repeat=False`, but then the shuffle buffer would shuffle samples across epoch boundaries.
+There are two available end-of-iteration synchronizers:
+* `RedistributeLoader`: Redistributes samples when a rank is exhausted before other ranks. On the next epoch, the incomplete batches' samples are included.
+* `StopFirstLoader`: Stops as soon as the first rank is exhausted. The next epoch will iterate until the next loader stops, restarting all ranks once.
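
The documentation example in this diff only shows `RedistributeLoader`. As a minimal sketch of the alternative, and assuming `StopFirstLoader` wraps the loader in the same way (the diff does not show this explicitly), the stop-first variant would look like this:

```py
from megatron.energon import get_train_dataset, get_loader, WorkerConfig
from megatron.energon.sync_end import StopFirstLoader

# Assumption: StopFirstLoader wraps the loader the same way as RedistributeLoader,
# but ends the epoch on all ranks as soon as the first rank runs out of samples.
# Like RedistributeLoader, it requires torch.distributed to be initialized.
loader = StopFirstLoader(get_loader(get_train_dataset(
    'metadataset.yaml',
    batch_size=2,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
    repeat=False,
)))

for batch in loader:
    ...  # training step; iteration stops once the first rank is exhausted
```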
New file in the `megatron.energon.sync_end` package:

@@ -0,0 +1,21 @@
+# Copyright (c) 2025, NVIDIA CORPORATION.
+# SPDX-License-Identifier: BSD-3-Clause
+from megatron.energon.sync_end.redistribute import RedistributeDataLoaderState, RedistributeLoader
+from megatron.energon.sync_end.stop_first_end import StopFirstDataLoaderState, StopFirstLoader
+
+"""
+Provides wrappers for the dataset loaders that allow for synchronization at the end of the dataset.
+I.e. if running a training with repeat=False, the loaders will typically exhaust at different times, which may require
+synchronization across ranks.
+
+The wrappers are:
+- RedistributeLoader: Redistributes the last samples to the ranks that are not exhausted.
+- StopFirstLoader: Stops iterating as soon as the first rank is exhausted.
+"""
+
+__all__ = [
+    "RedistributeLoader",
+    "RedistributeDataLoaderState",
+    "StopFirstLoader",
+    "StopFirstDataLoaderState",
+]
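
As a rough standalone illustration of the difference between the two wrappers described in the docstring above (plain Python, not the actual energon implementation; the per-rank sample counts are made up):

```py
# Three ranks hold a different number of leftover samples at the end of a
# repeat=False epoch (hypothetical numbers for illustration only).
leftover = {0: 5, 1: 2, 2: 0}  # samples remaining per rank

# StopFirstLoader-style behaviour: everyone stops when the first rank is
# exhausted, so the usable samples this epoch are limited by the smallest count.
stop_first = {rank: min(leftover.values()) for rank in leftover}

# RedistributeLoader-style behaviour: the remaining samples are shared between
# ranks, so no rank idles while another still has data.
total = sum(leftover.values())
redistribute = {rank: total // len(leftover) for rank in leftover}

print(stop_first)    # {0: 0, 1: 0, 2: 0} -> epoch ends immediately for all ranks
print(redistribute)  # {0: 2, 1: 2, 2: 2} -> leftovers balanced, remainder carried to the next epoch
```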