PR: Fix memory leaks and reduce GPU memory requirements #17

yang999876 · 2025-04-16T09:33:25Z

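Hello authors, I am an independent game developer and I really like your IDOL model. While using it I found some memory leaks that make inference need far more RAM and VRAM than expected: although the whole model is only about 8 GB, inference needs more than 24 GB of both VRAM and RAM, which even a 4090 cannot cover. The root cause is that Python's garbage collection may not be effective with torch; unless some memory is reclaimed manually, garbage keeps accumulating in RAM (and VRAM). Here are the simple changes I made: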

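I changed the model-loading code. Originally, instantiate_from_config creates one instance, and the later call to load_from_checkpoint creates another. The first instance is not garbage-collected properly and lingers as garbage, occupying more than 8 GB of RAM. I removed the instantiate_from_config part; load_from_checkpoint alone instantiates the model correctly.
The original line model.encoder.to(torch.bfloat16), which converts the encoder to 16-bit precision, had no effect because the model it was applied to became garbage. Converting the precision after load_from_checkpoint does not run, because some modules strictly require fp32, so I simply removed the line.
I added a --low_ram argument and used it in forward_image_to_uv and decoder._decode_feature, the two modules that consume the most VRAM: once a variable is no longer needed, it is manually deleted and cleaned up, which greatly reduces the VRAM requirement. After optimization the peak VRAM is under 10 GB, and I verified that an 11 GB 1080 Ti can complete inference. Below is the CUDA memory report of my modified code.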

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   1597 MiB |   8799 MiB | 635356 MiB | 633759 MiB |
|       from large pool |   1591 MiB |   8782 MiB | 634313 MiB | 632721 MiB |
|       from small pool |      6 MiB |     18 MiB |   1043 MiB |   1037 MiB |
|---------------------------------------------------------------------------|
| Active memory         |   1597 MiB |   8799 MiB | 635356 MiB | 633759 MiB |
|       from large pool |   1591 MiB |   8782 MiB | 634313 MiB | 632721 MiB |
|       from small pool |      6 MiB |     18 MiB |   1043 MiB |   1037 MiB |
|---------------------------------------------------------------------------|
| Requested memory      |   1593 MiB |   8724 MiB | 634551 MiB | 632958 MiB |
|       from large pool |   1587 MiB |   8707 MiB | 633510 MiB | 631923 MiB |
|       from small pool |      6 MiB |     17 MiB |   1040 MiB |   1034 MiB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   1708 MiB |   9524 MiB | 403248 MiB | 401540 MiB |
|       from large pool |   1698 MiB |   9514 MiB | 403190 MiB | 401492 MiB |
|       from small pool |     10 MiB |     20 MiB |     58 MiB |     48 MiB |
|---------------------------------------------------------------------------|
| Non-releasable memory | 112956 KiB |   3464 MiB | 180727 MiB | 180617 MiB |
|       from large pool | 108924 KiB |   3459 MiB | 179653 MiB | 179547 MiB |
|       from small pool |   4032 KiB |      5 MiB |   1074 MiB |   1070 MiB |
|---------------------------------------------------------------------------|
| Allocations           |     137    |     984    |   15332    |   15195    |
|       from large pool |      40    |     348    |    6467    |    6427    |
|       from small pool |      97    |     639    |    8865    |    8768    |
|---------------------------------------------------------------------------|
| Active allocs         |     137    |     984    |   15332    |   15195    |
|       from large pool |      40    |     348    |    6467    |    6427    |
|       from small pool |      97    |     639    |    8865    |    8768    |
|---------------------------------------------------------------------------|
| GPU reserved segments |      24    |     254    |     698    |     674    |
|       from large pool |      19    |     244    |     669    |     650    |
|       from small pool |       5    |      10    |      29    |      24    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      24    |      52    |    5786    |    5762    |
|       from large pool |       9    |      47    |    2669    |    2660    |
|       from small pool |      15    |      19    |    3117    |    3102    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|
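The manual cleanup behind --low_ram and the report above can be illustrated with a minimal sketch. It is not the code from this PR: `forward_low_ram`, `release_memory`, `cuda_memory_report`, and the `encode`/`refine`/`decode` callables are hypothetical names, and the three-step structure is an assumption made for illustration. The only PyTorch calls used are `torch.cuda.is_available()`, `torch.cuda.empty_cache()`, and `torch.cuda.memory_summary()`.

```python
import argparse
import gc

try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


def cuda_ready() -> bool:
    """True only when PyTorch is installed and a CUDA device is usable."""
    return torch is not None and torch.cuda.is_available()


def release_memory() -> None:
    """Collect unreachable Python objects, then release cached, unused CUDA blocks."""
    gc.collect()
    if cuda_ready():
        torch.cuda.empty_cache()


def forward_low_ram(encode, refine, decode, image, low_ram: bool = False):
    """Hypothetical three-step forward pass; not the actual IDOL code."""
    tokens = encode(image)
    uv = refine(tokens)
    if low_ram:
        # decode() does not need tokens, so drop the last reference first.
        del tokens
        release_memory()
    return decode(uv)


def cuda_memory_report():
    """Return PyTorch's CUDA memory summary, or None when CUDA is unavailable."""
    if not cuda_ready():
        return None
    return torch.cuda.memory_summary()


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--low_ram", action="store_true",
                        help="free intermediates eagerly to lower peak memory")
    return parser.parse_args(argv)
```

Deleting an intermediate before the next large allocation is what lowers the peak; `torch.cuda.empty_cache()` additionally hands cached but unused blocks back to the driver, so the "GPU reserved memory" row shrinks as well. The extra `gc.collect()` calls cost some time, which is one reason to keep the behaviour behind a flag.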

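More optimization suggestions:

If you already load the model with load_from_checkpoint, sub-model initialization can create an empty class instance instead of loading the model from a file. For example, the SapiensWrapper_ts class loads the model and its parameters with self.model = torch.jit.load(model_path), but those parameters are then overwritten by load_from_checkpoint. During loading there are temporarily two copies of the model in memory; gc cleans one up afterwards, but this raises peak memory. With this optimization the model's RAM requirement can drop to 16 GB.
I inspected the IDOL ckpt and found the following: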

[DECODER]
Submodule                                Total Size (MB)
-------------------------------------------------------
decoder.upsample_conv                              89.70
decoder.deformer                                   80.16
decoder.select_coord                                1.53
decoder.color_net                                   0.08
decoder.base_net                                    0.04
decoder.base_bn                                     0.00
decoder.offset_net                                  0.00
decoder.density_net                                 0.00
decoder.renderer                                    0.00

[ENCODER]
Submodule                                Total Size (MB)
-------------------------------------------------------
encoder.model                                    4460.29

[LPIPS]
Submodule                                Total Size (MB)
-------------------------------------------------------
lpips.net                                          56.14

[NECK]
Submodule                                Total Size (MB)
-------------------------------------------------------
neck.decoder_blocks_depart                       1729.22
neck.decoder_embed                                144.09
neck.decoder_pos_embed                             54.00
neck.decoder_pred                                   3.00
neck.decoder_norm                                   0.01
neck.mask_token                                     0.01

Checkpoint File Size: 8422.90 MB

[Category Breakdown]
CATEGORY                     SIZE (MB)    PERCENT
MODEL_PARAMETERS               4388.13      52.1%
OPTIMIZER                      4034.30      47.9%
TRAINING_STATE                    0.00       0.0%
FRAMEWORK_METADATA                0.00       0.0%
OTHER                             0.00       0.0%
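A per-submodule breakdown like the one above can be computed by grouping state-dict entries by name prefix. The following is a minimal sketch, not the script used for this report: `size_by_submodule` is a hypothetical helper, and it assumes every value exposes `numel()` and `element_size()` as `torch.Tensor` does.

```python
from collections import defaultdict


def size_by_submodule(state_dict, depth: int = 2) -> dict:
    """Sum tensor sizes in bytes per name prefix, e.g. 'encoder.model'."""
    totals = defaultdict(int)
    for name, tensor in state_dict.items():
        prefix = ".".join(name.split(".")[:depth])
        totals[prefix] += tensor.numel() * tensor.element_size()
    # Largest submodules first, as in the report.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


def to_mb(num_bytes: int) -> float:
    """Convert bytes to MB (2**20 bytes), rounded to two decimals."""
    return round(num_bytes / 2**20, 2)
```

Because `element_size()` reflects the stored dtype, the same grouping also shows whether a submodule is stored in 16-bit or 32-bit precision when compared against its parameter count.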

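My guess is that the encoder in the ckpt is stored in half precision (fp16), so the model's parameters total only about 4 GB. Removing the optimizer state, which inference does not use, could lower the VRAM requirement further.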
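Dropping the optimizer state could look like the sketch below. It assumes the checkpoint was written by PyTorch Lightning (suggested by the use of load_from_checkpoint, but not confirmed here), where optimizer data is stored under the `optimizer_states` and `lr_schedulers` keys. `strip_training_state` and the file names in the comments are hypothetical.

```python
# Keys assumed to hold training-only data in a Lightning-style checkpoint.
TRAINING_ONLY_KEYS = ("optimizer_states", "lr_schedulers")


def strip_training_state(checkpoint: dict) -> dict:
    """Return a shallow copy of the checkpoint without optimizer-related entries."""
    return {k: v for k, v in checkpoint.items() if k not in TRAINING_ONLY_KEYS}


# Possible usage (file names are placeholders):
#   import torch
#   ckpt = torch.load("model.ckpt", map_location="cpu")
#   torch.save(strip_training_state(ckpt), "model_inference.ckpt")
```

All other keys are kept so that the slimmed file can still be read the same way as the original; the saving comes from not having to deserialize roughly 4 GB of optimizer tensors at load time.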

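Summary: this PR fixes the memory leak at load time and optimizes VRAM usage during inference, so that GPUs with little VRAM can also complete inference; peak VRAM drops to 9524 MiB.

The IDOL model is excellent, and we are exploring adding it to the development workflow of our new game. Thank you for your contribution to AIGC technology.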

yang999876 added 2 commits April 14, 2025 17:49

PR: Fix memory leaks and reduce GPU memory requirements

20ab69c

update README.md

c64369d
