Skip to content

Fix sampling, add timeouts for test suprocess and data loaders #221

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Apr 3, 2025

Conversation

jlamypoirier
Copy link
Collaborator

✨ Description

#186 Broke sampling because of an incorrect walrus which always set unshuffled_tokens to 1 (True). I fixed it and simplified related code.

This bug only showed in slow tests (@sohamparikh please make sure these pass before merging) which didn't terminate because of a data loader crash somehow not being caught. I added some timeouts to ensure this doesn't happen again.

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@jlamypoirier jlamypoirier marked this pull request as ready for review April 3, 2025 05:56
@jlamypoirier jlamypoirier requested review from tscholak and sohamparikh and removed request for tscholak April 3, 2025 05:56
if unshuffled_tokens := data.get("unshuffled_tokens") is not None:
self._unshuffled_tokens = unshuffled_tokens
else:
self._unshuffled_tokens = data["unshuffled_epochs"] * data["dataset"]["tokens_per_epoch"]
if "unshuffled_tokens" in data:
self._unshuffled_tokens = data["unshuffled_tokens"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we care about backwards compatibility?

We could throw an error if unshuffled_tokens is not present in loaded_yaml_data. And speaking of backwards compatibility, I think we should also add truncate_documents: True in loaded_yaml_data if not present, else it will fail loading previous datasets.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if was because of the method being called before "unshuffled_tokens" is added to data, I reorganized things so it's not needed.
We don't really care about backward compatibility for dataset cache, but I added a simple one.

Copy link
Member

@sohamparikh sohamparikh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching it, LGTM! Just one comment (above) about backwards compatibility

@jlamypoirier jlamypoirier requested a review from sohamparikh April 3, 2025 22:22
@jlamypoirier jlamypoirier merged commit 8ccf58d into main Apr 3, 2025
4 checks passed
@jlamypoirier jlamypoirier deleted the fix_sampling branch April 3, 2025 22:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants