Sampling parameters, generalize batch config. #230
Merged
✨ Description
Deal with some structural issues and technical debt causing trouble for ongoing work.
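For context, a minimal sketch of the `SamplingParameter` idea described below — bundling the trainer-provided sampling parameters into one structure instead of an ever-growing `__init__` argument list. Class and field names here are illustrative, not the actual Fast-LLM API:

```python
from dataclasses import dataclass


@dataclass
class SamplingParameter:
    """Hypothetical bundle of data-sampling parameters coming from the trainer."""

    num_samples: int
    sequence_length: int
    seed: int = 0
    use_loss_masking_spans: bool = False


class GPTData:
    # Before: __init__(self, num_samples, sequence_length, seed, ...) grew with
    # every new feature; now a single structured argument carries them all.
    def __init__(self, parameters: SamplingParameter):
        self._parameters = parameters


data = GPTData(SamplingParameter(num_samples=1000, sequence_length=2048))
print(data._parameters.sequence_length)  # -> 2048
```

Adding a new sampling option then means adding one field with a default, rather than threading a new argument through every call site.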
- Add a `SamplingParameter` structure for holding the data sampling parameters that come from the trainer (batch config, model, etc.). This is needed as an alternative to the fast-growing argument list to `GPTData.__init__` and the associated bloat (ex. optionally prevent cross-document attention #177, improvements to MTP implementation #218, DPO #223).
- Generalize `BatchConfig`, extracting the model-specific parameters into `GPTBatchConfig`.
- Rename `num_micro_sequences` -> `micro_batch_splits`, since the generic batch config and schedule runner shouldn't have to know about model-specific sequences.
- Move `use_loss_masking_spans` to the batch config. This will make it easier to know if loss masking is enabled, ex. to prevent it in Knowledge distillation, fix and improve cross-entropy #229. @sohamparikh this may require some config changes. I added backward compatibility, but it will only work if set globally (`data.sampling.use_loss_masking_spans`).
- Use `cached_property` in a few places, following the discussion in Make the specified config parameters update the pretrained config #211. If it works we can use it all over the place to simplify derived fields.

🔍 Type of change
Select all that apply:
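For reference, a minimal sketch of the `cached_property` pattern mentioned above for derived config fields; the class and fields are illustrative, not Fast-LLM's actual `BatchConfig`:

```python
from functools import cached_property


class BatchConfig:
    """Illustrative config with a derived field, not the real Fast-LLM class."""

    def __init__(self, batch_size: int, micro_batch_splits: int):
        self.batch_size = batch_size
        self.micro_batch_splits = micro_batch_splits

    @cached_property
    def micro_batch_size(self) -> int:
        # Derived field: computed lazily on first access, then cached on the
        # instance, so validation code doesn't have to precompute and store it.
        assert self.batch_size % self.micro_batch_splits == 0
        return self.batch_size // self.micro_batch_splits


config = BatchConfig(batch_size=512, micro_batch_splits=4)
print(config.micro_batch_size)  # -> 128
```

Unlike a plain `property`, the division and validation run only once per instance, which is the simplification hoped for across the other derived fields.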