Conversation

@zty-king (Contributor) commented Sep 3, 2025

PR types

New features

PR changes

Others

Description

Adapt flex_checkpoint and fix the bug where DP (data parallel) training hangs.

paddle-bot (bot) commented Sep 3, 2025

Thanks for your contribution!

@xingmingyyj (Contributor)

LGTM

@@ -949,14 +949,36 @@ def train(
         if delay_optimizer_creation:
             self.create_optimizer_and_scheduler(num_training_steps=max_steps)
             self._load_optimizer_and_scheduler(resume_from_checkpoint)
-        else:
+        elif not self.args.using_flex_checkpoint:
Collaborator commented:

Don't use negative logic like `elif not` for branch selection; it makes adding further branches later more complicated. Prefer:

elif self.args.using_flex_checkpoint:
    load_from_flex_checkpoint()
else:
    load_from_default()

@zty-king (Contributor, Author) replied:

Done
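
For context, a minimal sketch of how the resolved branching could read after this change, assuming a helper named _load_flex_checkpoint; this is an illustration, not the merged code:

if delay_optimizer_creation:
    self.create_optimizer_and_scheduler(num_training_steps=max_steps)
    self._load_optimizer_and_scheduler(resume_from_checkpoint)
elif self.args.using_flex_checkpoint:
    # Flex-checkpoint path (assumed helper name).
    self._load_flex_checkpoint(resume_from_checkpoint)
else:
    # Default path: the original loading behavior.
    self._load_optimizer_and_scheduler(resume_from_checkpoint)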

@@ -53,6 +53,21 @@
 from ..utils.pdc_sdk import PDCErrorCode, PDCErrorMessageMap, pdc_tool
 from .utils.helper import distributed_file

+try:
Collaborator commented:

Why is the try logic needed here?

@zty-king (Contributor, Author) replied:

After thinking it over carefully, the try is not needed here; it has been removed.
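
For illustration, the two import styles under discussion; the module path used here is an assumption for this sketch, not the actual import in the PR:

# Guarded import (the pattern that was removed):
try:
    from paddle.distributed import flex_checkpoint  # assumed module path
except ImportError:
    flex_checkpoint = None

# Unconditional import (sufficient when the dependency is always present):
from paddle.distributed import flex_checkpoint  # assumed module path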

@@ -407,6 +407,10 @@ class TrainingArguments:
             Whether to release gradients during training. Default is `False`.
         ckpt_quant_stage (`str`, *optional*):
             Whether activate checkpoint quantization. O0: deactivate, O1: Int8 compression, O2: Int4 compression. (default: O0).
+        using_flex_checkpoint (`bool`, *optional*):
Collaborator commented:

Considering round-trip conversion with sharding_io, should the switch be split into separate save and load options?

@zty-king (Contributor, Author) replied:

Done
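
A minimal sketch of what splitting the switch could look like as separate TrainingArguments fields; the names save_flex_checkpoint and load_flex_checkpoint are assumptions for illustration, not necessarily the merged API:

from dataclasses import dataclass, field

@dataclass
class TrainingArguments:
    # Hypothetical split flags (assumed names): saving in flex format and
    # loading from it are toggled independently, which eases round-trip
    # conversion with sharding_io checkpoints.
    save_flex_checkpoint: bool = field(
        default=False,
        metadata={"help": "Whether to save checkpoints in flex format."},
    )
    load_flex_checkpoint: bool = field(
        default=False,
        metadata={"help": "Whether to load checkpoints saved in flex format."},
    )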

@xingmingyyj (Contributor)

/re-run all-failed

@zty-king force-pushed the Pr_adapt_flex_checkpoint branch 2 times, most recently from 40c375d to cfc3e7f on September 17, 2025 at 12:33
@zty-king force-pushed the Pr_adapt_flex_checkpoint branch from cfc3e7f to 92d9b66 on September 17, 2025 at 13:16