
[docs] Memory optims #11385


Open: stevhliu wants to merge 4 commits into main

Conversation

@stevhliu (Member) commented Apr 22, 2025

Refactors the memory optimization docs and combines them with the docs on working with big models (distributed setups).

Let me know if I'm missing anything!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@stevhliu stevhliu requested a review from sayakpaul April 22, 2025 23:16
@stevhliu stevhliu mentioned this pull request Apr 23, 2025
@stevhliu stevhliu marked this pull request as ready for review April 23, 2025 21:13
@Heasterian

AutoencoderKLWan and AsymmetricAutoencoderKL do not support tiling or slicing (the asymmetric one just has unused flags); this should most likely be mentioned.
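For reference, a minimal sketch of the tiling/slicing toggles as they exist on a standard `AutoencoderKL`; per the comment above, these calls are unsupported (or no-ops) on `AutoencoderKLWan` and `AsymmetricAutoencoderKL`:

```py
# Memory-saving VAE toggles on a standard AutoencoderKL; per the comment
# above, AutoencoderKLWan and AsymmetricAutoencoderKL don't support these.
pipeline.vae.enable_slicing()  # decode the batch one image at a time
pipeline.vae.enable_tiling()   # decode large latents in smaller tiles
```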

@sayakpaul (Member) left a comment


Thanks for the initiative! I left some minor comments, let me know if they make sense.

Use the [`~DiffusionPipeline.reset_device_map`] method to reset the `device_map`. This is necessary if you want to use methods like `.to()`, [`~DiffusionPipeline.enable_sequential_cpu_offload`], and [`~DiffusionPipeline.enable_model_cpu_offload`] on a pipeline that was device-mapped.

```py
pipeline.reset_device_map
```

Suggested change
```diff
- pipeline.reset_device_map
+ pipeline.reset_device_map()
```
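A minimal end-to-end sketch of the corrected call, assuming a device-mapped pipeline (the model ID and `device_map` value are illustrative):

```py
import torch
from diffusers import DiffusionPipeline

# Load the pipeline with its components mapped across available devices.
pipeline = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    device_map="balanced",
)

# Reset the device map (note the parentheses) before switching strategies.
pipeline.reset_device_map()
pipeline.enable_model_cpu_offload()
```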


Model offloading moves entire models to the GPU instead of selectively moving *some* layers or model components. One of the main pipeline models, usually the text encoder, UNet, or VAE, is placed on the GPU while the other components are held on the CPU. Components like the UNet that run multiple times stay on the GPU until they're completely finished and no longer needed. This eliminates the communication overhead of [CPU offloading](#cpu-offloading) and makes model offloading a faster alternative. The tradeoff is that the memory savings won't be as large.
@sayakpaul:

Do we want to add a warning after `enable_sequential_cpu_offload()` that it's terribly slow and can often be impractical?
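For context, a hedged sketch contrasting the two offloading calls discussed here (enable one or the other, not both):

```py
# Sequential CPU offloading: lowest memory use, but moves individual
# submodules on and off the GPU, so it is very slow.
pipeline.enable_sequential_cpu_offload()

# Model offloading: moves whole components (text encoder, UNet, VAE) instead;
# much faster, at the cost of smaller memory savings.
# pipeline.enable_model_cpu_offload()
```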


> [!WARNING]
> To properly offload models after they're called, it is required to run the entire pipeline and models in the expected order. Keep this in mind if models are reused outside the pipeline context after hooks have been installed (see [Removing Hooks](https://huggingface.co/docs/accelerate/en/package_reference/big_modeling#accelerate.hooks.remove_hook_from_module) for more details). This is a stateful operation that installs hooks on the model.
@sayakpaul:

This reads a bit incomplete:

> Keep this in mind if models are reused outside the pipeline context after hooks have been installed

@sayakpaul:

Also, not sure if it would make sense to include, but users can still benefit from `pipeline.enable_model_cpu_offload()` when doing stuff like `pipeline.encode_prompt()`. See #11376.
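A rough sketch of that pattern, assuming a Stable Diffusion-style pipeline whose `encode_prompt` accepts these arguments:

```py
pipeline.enable_model_cpu_offload()

# Only the text encoder is onloaded for this call; the UNet and VAE stay on the CPU.
prompt_embeds, negative_prompt_embeds = pipeline.encode_prompt(
    prompt="an astronaut riding a horse",
    device="cuda",
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
)
```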

Comment on lines +214 to +215
> [!WARNING]
> Group offloading may not work with all models if the forward implementation contains weight-dependent device casting of inputs because it may clash with group offloading's device casting mechanism.
@sayakpaul:

Providing some example models would be helpful here. Cc: @a-r-r-o-w

```py
# Use the apply_group_offloading method for other model components
apply_group_offloading(pipeline.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2)
apply_group_offloading(pipeline.vae, onload_device=onload_device, offload_type="leaf_level")
```
@sayakpaul:

`vae` is a subclass of `ModelMixin`, so we should be able to use `enable_group_offload()`.
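A sketch of the suggested method form, mirroring the arguments of the `apply_group_offloading` call above:

```py
import torch

onload_device = torch.device("cuda")

# Method form available on ModelMixin subclasses such as the VAE.
pipeline.vae.enable_group_offload(
    onload_device=onload_device,
    offload_type="leaf_level",
)
```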


PyTorch supports `torch.float8_e4m3fn` and `torch.float8_e5m2` as weight storage dtypes, but they can't be used for computation in many different tensor operations due to unimplemented kernel support. However, you can use these dtypes to store model weights in fp8 precision and upcast them on-the-fly when the layers are used in the forward pass. This is known as layerwise weight-casting.
## FP8 layerwise casting
@sayakpaul:

Suggested change
```diff
- ## FP8 layerwise casting
+ ## Layerwise casting
```

As it's not specific to FP8.


Typically, inference on most models is done with `torch.float16` or `torch.bfloat16` weight/computation precision. Layerwise weight-casting cuts down the memory footprint of the model weights by approximately half.
Layerwise casting stores weights in a smaller data format (`torch.float8_e4m3fn` and `torch.float8_e5m2`) to use less memory and upcasts those weights to `torch.float16` or `torch.bfloat16` for computation. Certain layers (normalization and modulation related weights) are skipped because storing them in fp8 can degrade generation quality.
@sayakpaul:

Suggested change
```diff
- Layerwise casting stores weights in a smaller data format (`torch.float8_e4m3fn` and `torch.float8_e5m2`) to use less memory and upcasts those weights to `torch.float16` or `torch.bfloat16` for computation. Certain layers (normalization and modulation related weights) are skipped because storing them in fp8 can degrade generation quality.
+ Layerwise casting stores weights in a smaller data format (for example, `torch.float8_e4m3fn` and `torch.float8_e5m2`) to use less memory and upcasts those weights to a higher precision, e.g. `torch.float16` or `torch.bfloat16`, for computation. Certain layers (normalization and modulation related weights) are skipped because storing them in fp8 can degrade generation quality.
```
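For illustration, a hedged sketch of layerwise casting via `enable_layerwise_casting` on a `ModelMixin` subclass (the `transformer` component name is an assumption; use whichever component your pipeline has):

```py
import torch

# Store weights in fp8 and upcast them on the fly for each forward pass.
pipeline.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,  # storage precision
    compute_dtype=torch.bfloat16,       # computation precision
)
```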
