
changing params to go fast on H200s#64

Merged
daanelson merged 6 commits into main from smart-checkpoint on Jan 23, 2025

Conversation

@daanelson
Contributor

@daanelson daanelson commented Jan 23, 2025

To speed up training (losslessly!) in the default configuration on Replicate, I'm trading increased memory usage for increased speed (a rough sketch of the resulting logic follows the summary below). Specifically:

  • turning off gradient checkpointing
  • not quantizing the flux DiT

Important

Optimize training speed on H200s by disabling gradient checkpointing and model quantization under specific conditions in train.py.

  • Behavior:
    • Disable gradient checkpointing by default in train() to increase training speed.
    • Automatically enable gradient checkpointing if GPU memory < 100GB, batch size > 1, or resolution > 1024.
    • Disable model quantization by default; enable if GPU memory < 100GB.
  • Parameters:
    • Add gradient_checkpointing parameter to train() with default False.
    • Set quantize to False by default in train().
  • Misc:
    • Refactor resolution parsing in train() to use resolutions list.

This description was created by Ellipsis for db81504. It will automatically update as commits are pushed.
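For context, here is a minimal sketch of what these auto-tuning heuristics could look like in train.py. The helper name `resolve_memory_settings`, the threshold constant, and the default resolutions list are illustrative assumptions, not the actual implementation:

```python
import torch

# Assumed cutoff; the PR describes 100GB as the point below which
# memory-saving fallbacks are re-enabled.
GPU_MEMORY_THRESHOLD_GB = 100


def resolve_memory_settings(
    gradient_checkpointing: bool = False,  # new train() param, default False
    quantize: bool = False,                # quantization now off by default
    batch_size: int = 1,
    resolutions: list[int] | None = None,
) -> tuple[bool, bool]:
    """Return (gradient_checkpointing, quantize) after applying the auto-enable rules."""
    resolutions = resolutions or [512, 768, 1024]  # hypothetical defaults
    gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / 1e9

    # Re-enable gradient checkpointing on smaller GPUs, larger batches,
    # or resolutions above 1024 to avoid running out of memory.
    if (
        gpu_memory_gb < GPU_MEMORY_THRESHOLD_GB
        or batch_size > 1
        or max(resolutions) > 1024
    ):
        gradient_checkpointing = True

    # Quantize the flux DiT only when GPU memory is limited.
    if gpu_memory_gb < GPU_MEMORY_THRESHOLD_GB:
        quantize = True

    return gradient_checkpointing, quantize
```

On an H200 (141 GB of HBM) with the default batch size and resolutions, both flags stay off, so training runs without checkpointing overhead and with the unquantized DiT.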

@daanelson daanelson requested a review from a team January 23, 2025 04:48
@daanelson daanelson merged commit 9e68dd2 into main Jan 23, 2025
2 checks passed
@daanelson daanelson deleted the smart-checkpoint branch January 23, 2025 17:29