
Conversation


@suncade suncade commented Dec 10, 2025

📌 PR Description

  • Pause/resume for fine-tuning jobs: new /v1/finetuneTasks/{job_id}:pause and /v1/finetuneTasks/{job_id}:resume API endpoints
  • Checkpoint resumption: new resume-from-checkpoint logic, supporting continued training from a specified checkpoint
  • Training progress improvements: added and improved fine-tuning progress tracking and state management
  • Supported fine-tuning stages: sft, pt, dpo; supported fine-tuning types: lora, qlora, full
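
For illustration, the new pause/resume endpoints could be driven by a small client like the sketch below. Only the `:pause`/`:resume` paths come from this PR; the base URL, the use of plain POST requests, and the helper names are assumptions.

```python
# Sketch of a client for the pause/resume endpoints listed above.
# Only the ":pause" / ":resume" path suffixes come from this PR; the base
# URL, HTTP method, and helper names are illustrative assumptions.
import urllib.request

BASE = "http://localhost:8000/v1/finetuneTasks"  # host/port assumed

def action_url(job_id: str, action: str) -> str:
    # Custom-method style path: /v1/finetuneTasks/{job_id}:pause or :resume
    return f"{BASE}/{job_id}:{action}"

def send_action(job_id: str, action: str) -> int:
    assert action in ("pause", "resume")
    req = urllib.request.Request(action_url(job_id, action), method="POST")
    with urllib.request.urlopen(req) as resp:  # raises on HTTP error status
        return resp.status
```

A paused job would then be resumed with `send_action(job_id, "resume")`, picking training back up from the last checkpoint per the description above.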

✅ Type of Change

  • [ ] Bug fix (non-breaking change that fixes an issue)
  • [x] New feature (non-breaking change that adds functionality)
  • [ ] Refactor (no functionality change, code structure optimized)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Documentation update (changes to docs only)
  • [ ] Performance optimization


suncade commented Dec 10, 2025

Supported stages: sft, pt, dpo
Supported finetuning_type values: lora, qlora, full
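
The supported values above can be captured in a small guard. The sets below mirror the comment exactly; the `validate_finetune_args` helper itself is hypothetical, not the PR's actual code.

```python
# Guard for the stage / finetuning_type values listed above.
# The sets mirror the comment; the validate helper itself is hypothetical.
SUPPORTED_STAGES = {"sft", "pt", "dpo"}
SUPPORTED_FINETUNING_TYPES = {"lora", "qlora", "full"}

def validate_finetune_args(stage: str, finetuning_type: str) -> None:
    if stage not in SUPPORTED_STAGES:
        raise ValueError(f"unsupported stage: {stage!r}")
    if finetuning_type not in SUPPORTED_FINETUNING_TYPES:
        raise ValueError(f"unsupported finetuning_type: {finetuning_type!r}")

validate_finetune_args("sft", "lora")  # OK: both values are supported
```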

@JingofXin (Collaborator) commented:

The PR description needs to be filled in completely.


JingofXin commented Dec 18, 2025

Verified two llamafactory tests locally, one pure language model and one multimodal model; training and inference both work.
Multimodal model works: (WeCom screenshot)
Pure language model works: (WeCom screenshot)


lazyllm.config.add('trainable_module_config_map_path', str, '', 'TRAINABLE_MODULE_CONFIG_MAP_PATH',
description='The default path for trainable module config map.')
lazyllm.config.add(
Contributor:

Don't make unrelated lint changes.

Contributor (Author):

Restored the original state and formatting.

Contributor (Author):

The line breaks were auto-formatted during the merge; they have now been restored to the original.


kw = kw or self._get_train_or_deploy_args(mode, disable=['base_model', 'target_path'])
# Get default args and merge with user-provided kw, with kw taking precedence
default_args = self._get_train_or_deploy_args(mode, disable=['base_model', 'target_path'])
Contributor:

This changes the original logic. Previously, if the user passed kw it was used as-is, and the defaults were used only when kw was empty; now the two are merged, with user-provided values overriding the defaults.
This kw is mainly used by _async_finetune and unused in other scenarios, so please confirm that the merge behavior matches expectations for the _async_finetune case.

Contributor (Author):

During self-testing I found that in different fine-tuning scenarios users may pass only some of the parameters, while certain other parameters are still required. So I changed to the current logic: user-passed parameters take priority, and any required parameters the user did not pass fall back to their defaults.
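
The behavioral difference under discussion can be shown with toy dictionaries; the keys and values below are illustrative stand-ins for what _get_train_or_deploy_args returns, not the actual defaults.

```python
# Old vs. new kw handling, with toy stand-ins for the defaults returned
# by _get_train_or_deploy_args. Keys and values here are illustrative only.
def old_behavior(kw, defaults):
    # Original: a non-empty user kw replaces the defaults entirely.
    return kw or dict(defaults)

def new_behavior(kw, defaults):
    # Changed: defaults are filled in; user-passed keys take precedence.
    return {**defaults, **kw} if kw else dict(defaults)

defaults = {"learning_rate": 1e-4, "num_epochs": 3}
user_kw = {"learning_rate": 5e-5}

old = old_behavior(user_kw, defaults)  # num_epochs is silently dropped
new = new_behavior(user_kw, defaults)  # num_epochs kept, learning_rate overridden
```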

kw = kw or self._get_train_or_deploy_args(mode, disable=['base_model', 'target_path'])
# Get default args and merge with user-provided kw, with kw taking precedence
default_args = self._get_train_or_deploy_args(mode, disable=['base_model', 'target_path'])
if kw:
Contributor:

kw = {**default_args, **kw} if kw else default_args

Contributor (Author):

Accepted; updated to this more concise one-liner.

@JingofXin (Collaborator) commented:

Anything not covered by this PR should be removed from the PR description.

@suncade suncade changed the title Add finetune 1stop 2pause 3resume 4finetune_cost 5finetune_progress and 6multi-gpu training or fine-tuning logic Add finetune stop pause resume finetune_cost finetune_progress multi-gpu training and fine-tuning logic etc. Dec 30, 2025
@suncade suncade changed the title Add finetune stop pause resume finetune_cost finetune_progress multi-gpu training and fine-tuning logic etc. Add finetune stop pause resume finetune_cost finetune_progress multi-gpu training and enhance model infer logic etc. Dec 30, 2025