Conversation

@zty-king (Contributor) commented Sep 3, 2025

PR types

New features

PR changes

Others

Description

Adapt flex_checkpoint and fix the bug where DP (data parallel) training hangs.

paddle-bot (bot) commented Sep 3, 2025

Thanks for your contribution!

@xingmingyyj (Contributor)

LGTM

@@ -949,14 +949,36 @@ def train(
         if delay_optimizer_creation:
             self.create_optimizer_and_scheduler(num_training_steps=max_steps)
             self._load_optimizer_and_scheduler(resume_from_checkpoint)
-        else:
+        elif not self.args.using_flex_checkpoint:
Collaborator commented:

Don't use negative logic like `elif not` for branch selection; it makes adding further branches later more complicated. Prefer:

elif self.args.using_flex_checkpoint:
    load_from_flex_checkpoint()
else:
    load_from_default()

@zty-king (Contributor, Author) replied:

Done
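
For context, a minimal sketch of how the resolved branching could read after this change, assuming a helper named _load_flex_checkpoint; this is an illustration, not the merged code:

if delay_optimizer_creation:
    self.create_optimizer_and_scheduler(num_training_steps=max_steps)
    self._load_optimizer_and_scheduler(resume_from_checkpoint)
elif self.args.using_flex_checkpoint:
    # Flex-checkpoint path (assumed helper name).
    self._load_flex_checkpoint(resume_from_checkpoint)
else:
    # Default path: the original loading behavior.
    self._load_optimizer_and_scheduler(resume_from_checkpoint)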

@@ -53,6 +53,21 @@
 from ..utils.pdc_sdk import PDCErrorCode, PDCErrorMessageMap, pdc_tool
 from .utils.helper import distributed_file

+try:
Collaborator commented:

Why is the try logic needed here?

@zty-king (Contributor, Author) replied:

After thinking it over carefully, the try is not needed here; it has been removed.
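
For illustration, the two import styles under discussion; the module path used here is an assumption for this sketch, not the actual import in the PR:

# Guarded import (the pattern that was removed):
try:
    from paddle.distributed import flex_checkpoint  # assumed module path
except ImportError:
    flex_checkpoint = None

# Unconditional import (sufficient when the dependency is always present):
from paddle.distributed import flex_checkpoint  # assumed module path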

@@ -407,6 +407,10 @@ class TrainingArguments:
             Whether to release gradients during training. Default is `False`.
         ckpt_quant_stage (`str`, *optional*):
             Whether activate checkpoint quantization. O0: deactivate, O1: Int8 compression, O2: Int4 compression. (default: O0).
+        using_flex_checkpoint (`bool`, *optional*):
Collaborator commented:

Considering round-trip conversion with sharding_io, should the switch be split into separate save and load options?

@zty-king (Contributor, Author) replied:

Done
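
A minimal sketch of what splitting the switch could look like as separate TrainingArguments fields; the names save_flex_checkpoint and load_flex_checkpoint are assumptions for illustration, not necessarily the merged API:

from dataclasses import dataclass, field

@dataclass
class TrainingArguments:
    # Hypothetical split flags (assumed names): saving in flex format and
    # loading from it are toggled independently, which eases round-trip
    # conversion with sharding_io checkpoints.
    save_flex_checkpoint: bool = field(
        default=False,
        metadata={"help": "Whether to save checkpoints in flex format."},
    )
    load_flex_checkpoint: bool = field(
        default=False,
        metadata={"help": "Whether to load checkpoints saved in flex format."},
    )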

@xingmingyyj (Contributor)

/re-run all-failed

@zty-king force-pushed the Pr_adapt_flex_checkpoint branch 2 times, most recently from 40c375d to cfc3e7f on September 17, 2025 at 12:33
@zty-king force-pushed the Pr_adapt_flex_checkpoint branch from cfc3e7f to 92d9b66 on September 17, 2025 at 13:16