Fix sampling, add timeouts for test subprocess and data loaders #221

jlamypoirier · 2025-04-03T05:56:33Z

✨ Description

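#186 broke sampling because of an incorrect walrus expression: unshuffled_tokens was bound to the result of the `is not None` comparison, so it was always set to True (1) instead of the token count. I fixed it and simplified the related code.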
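For reviewers unfamiliar with the pitfall, here is a minimal standalone sketch of the precedence issue. The dictionary contents and variable names are made up for illustration and are not the actual cache contents. The parenthesized variant is shown only for contrast; the change in this PR drops the walrus altogether.

```python
# Hypothetical cache contents; the real dictionary has more keys.
data = {"unshuffled_tokens": 4096}

# Buggy form: `:=` binds more loosely than `is not`, so the name
# captures the result of the comparison rather than the looked-up value.
if buggy := data.get("unshuffled_tokens") is not None:
    buggy_value = buggy

# Parenthesized form: the name captures the value first,
# and only then is the comparison applied.
if (fixed := data.get("unshuffled_tokens")) is not None:
    fixed_value = fixed

print(buggy_value, fixed_value)
```

Because `True == 1` in Python, the buggy form silently behaves like a token count of 1 in any arithmetic that follows.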

This bug only showed up in the slow tests (@sohamparikh, please make sure these pass before merging), which didn't terminate because a data loader crash was somehow not caught. I added timeouts to the test subprocesses and data loaders to ensure this doesn't happen again.
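As a rough illustration of the subprocess side, the sketch below bounds a child process with the standard library's `subprocess.run(..., timeout=...)`. The helper name `run_with_timeout` and the sample commands are hypothetical and are not taken from the test suite in this PR.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_with_timeout(command: Sequence[str], timeout: float) -> Optional[int]:
    """Run `command`; return its exit code, or None if it ran longer than `timeout` seconds."""
    try:
        completed = subprocess.run(list(command), timeout=timeout, capture_output=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so no stray process is left behind here.
        return None
    return completed.returncode


# Hypothetical stand-ins for a hanging test and a healthy one.
hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
quick = [sys.executable, "-c", "pass"]
```

With this shape, a hung child turns into an explicit failure the caller can report, rather than a test run that never finishes.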

🔍 Type of change

Select all that apply:

🐛 Bug fix (non-breaking change that addresses a specific issue)
🚀 New feature (non-breaking change that adds functionality)
⚠️ Breaking change (a change that could affect existing functionality)
📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
📝 Documentation change (updates documentation, including new content or typo fixes)
🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

sohamparikh · 2025-04-03T07:30:17Z

fast_llm/data/dataset/gpt/sampled.py

-        if unshuffled_tokens := data.get("unshuffled_tokens") is not None:
-            self._unshuffled_tokens = unshuffled_tokens
-        else:
-            self._unshuffled_tokens = data["unshuffled_epochs"] * data["dataset"]["tokens_per_epoch"]
+        if "unshuffled_tokens" in data:
+            self._unshuffled_tokens = data["unshuffled_tokens"]


Should we care about backwards compatibility?

We could throw an error if unshuffled_tokens is not present in loaded_yaml_data. Speaking of backwards compatibility, I think we should also add truncate_documents: True to loaded_yaml_data if it is not present; otherwise loading previous datasets will fail.

The if was there because the method was being called before "unshuffled_tokens" was added to data. I reorganized things so it's no longer needed.
We don't really care about backward compatibility for the dataset cache, but I added a simple fallback.
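One way such a fallback could look is sketched below. This is not the code added in this PR: the helper name `upgrade_cached_config` is hypothetical, and the derivation of unshuffled_tokens is copied from the else branch removed in the diff above.

```python
def upgrade_cached_config(loaded: dict) -> dict:
    """Return a copy of an older cached config with the newer keys filled in."""
    upgraded = dict(loaded)
    if "unshuffled_tokens" not in upgraded:
        # Older caches stored epochs only; derive the token count the same way
        # the removed else branch did.
        upgraded["unshuffled_tokens"] = (
            upgraded["unshuffled_epochs"] * upgraded["dataset"]["tokens_per_epoch"]
        )
    # Assumed default, following the suggestion in the review comment.
    upgraded.setdefault("truncate_documents", True)
    return upgraded


# Hypothetical old-format cache entry.
old_cache = {"unshuffled_epochs": 2, "dataset": {"tokens_per_epoch": 100}}
new_cache = upgrade_cached_config(old_cache)
```

Filling in defaults on a copy keeps the loaded data untouched, which makes it easier to compare against a freshly generated config afterwards.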

sohamparikh

Thanks for catching it, LGTM! Just one comment (above) about backwards compatibility.

jlamypoirier added 3 commits April 3, 2025 01:24

Fix sampling

e0eec53

misc

3fe02e1

fix

e347aa5

jlamypoirier marked this pull request as ready for review April 3, 2025 05:56

jlamypoirier requested review from tscholak and sohamparikh and removed request for tscholak April 3, 2025 05:56

fix

86818f3

sohamparikh reviewed Apr 3, 2025

sohamparikh approved these changes Apr 3, 2025

jlamypoirier added 4 commits April 3, 2025 17:57

misc

2a7e01d

fix

b10f6e3

fix

3c3879b

fix

fe8a916

jlamypoirier requested a review from sohamparikh April 3, 2025 22:22

jlamypoirier merged commit 8ccf58d into main Apr 3, 2025
4 checks passed

jlamypoirier deleted the fix_sampling branch April 3, 2025 22:48
