Fix sampling, add timeouts for test subprocess and data loaders #221
Conversation
fast_llm/data/dataset/gpt/sampled.py
Outdated
```diff
-        if unshuffled_tokens := data.get("unshuffled_tokens") is not None:
-            self._unshuffled_tokens = unshuffled_tokens
-        else:
-            self._unshuffled_tokens = data["unshuffled_epochs"] * data["dataset"]["tokens_per_epoch"]
+        if "unshuffled_tokens" in data:
+            self._unshuffled_tokens = data["unshuffled_tokens"]
```
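For context, a minimal standalone sketch of the precedence pitfall in the removed line (just an illustration, not the Fast-LLM code): the walrus operator binds the whole `is not None` comparison unless the assignment is parenthesized, so the name ends up holding `True` (i.e. 1) instead of the token count.

```python
data = {"unshuffled_tokens": 12345}

# Buggy form: `:=` has lower precedence than `is not`, so this assigns the
# result of the comparison (True) to `unshuffled_tokens`.
if unshuffled_tokens := data.get("unshuffled_tokens") is not None:
    print(unshuffled_tokens)  # True (== 1)

# Parenthesized form: assigns the actual value, then compares it to None.
if (unshuffled_tokens := data.get("unshuffled_tokens")) is not None:
    print(unshuffled_tokens)  # 12345
```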
Should we care about backwards compatibility? We could throw an error if `unshuffled_tokens` is not present in `loaded_yaml_data`. And speaking of backwards compatibility, I think we should also add `truncate_documents: True` to `loaded_yaml_data` if it's not present, otherwise loading previously prepared datasets will fail.
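A rough sketch of the kind of backward-compatibility handling suggested here, assuming `loaded_yaml_data` is the plain dict parsed from the cached metadata file (the function name and error message are illustrative, not the actual Fast-LLM code):

```python
def apply_legacy_defaults(loaded_yaml_data: dict) -> dict:
    # Older caches predate the `truncate_documents` option; default to the old behaviour.
    loaded_yaml_data.setdefault("truncate_documents", True)
    # `unshuffled_tokens` is needed by the new sampling code; fail loudly if an old cache lacks it.
    if "unshuffled_tokens" not in loaded_yaml_data:
        raise ValueError("Cached dataset metadata has no 'unshuffled_tokens'; regenerate the cache.")
    return loaded_yaml_data
```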
The `if` was there because the method was being called before `"unshuffled_tokens"` is added to `data`; I reorganized things so it's no longer needed. We don't really care about backward compatibility for the dataset cache, but I added a simple one.
Thanks for catching it, LGTM! Just one comment (above) about backwards compatibility.
✨ Description
#186 broke sampling because of an incorrect walrus operator which always set `unshuffled_tokens` to 1 (`True`). I fixed it and simplified the related code. This bug only showed up in the slow tests (@sohamparikh please make sure these pass before merging), which didn't terminate because a data loader crash was somehow not being caught. I added some timeouts to ensure this doesn't happen again.
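Not the actual change, but a minimal sketch of the two kinds of timeout the description refers to, using the standard `subprocess` and `torch.utils.data.DataLoader` options (the limits below are made up for illustration):

```python
import subprocess

import torch
from torch.utils.data import DataLoader, TensorDataset

DATA_LOADER_TIMEOUT = 60        # seconds a worker may take to deliver a batch (illustrative)
TEST_SUBPROCESS_TIMEOUT = 600   # seconds before a test subprocess is considered hung (illustrative)

# With num_workers > 0, a crashed worker raises a timeout error instead of hanging the run.
loader = DataLoader(
    TensorDataset(torch.arange(1024).unsqueeze(1)),
    batch_size=8,
    num_workers=2,
    timeout=DATA_LOADER_TIMEOUT,
)

# Bounding test subprocesses turns a silent hang into a TimeoutExpired / test failure.
subprocess.run(["python", "-c", "print('ok')"], check=True, timeout=TEST_SUBPROCESS_TIMEOUT)
```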
🔍 Type of change
Select all that apply: