[BUG] Error message points to batch_size, but that should not be the actual problem #1761

Open
@lesilel

Bug summary

Description

INFO:dpgen:start running
INFO:dpgen:continue from iter 000 task 00
INFO:dpgen:=============================iter.000000==============================
INFO:dpgen:-------------------------iter.000000 task 01--------------------------
2025-06-04 11:00:58,402 - INFO : info:check_all_finished: False
2025-06-04 11:00:58,409 - INFO : job: 55508939e039b34e0180ca7ccf84c91e7df26386 submit; job_id is 9092
2025-06-04 11:00:58,440 - INFO : job: 9889ffcb54cf23c89ec013644acac5d619823298 submit; job_id is 9099
2025-06-04 11:00:58,481 - INFO : job: 72ea3707f864606967efe33d818604f94df90405 submit; job_id is 9102
2025-06-04 11:00:58,520 - INFO : job: 28f9b672a7330198890741b76b901b066291088a submit; job_id is 9108
2025-06-04 11:01:30,467 - INFO : job: 55508939e039b34e0180ca7ccf84c91e7df26386 9092 terminated; fail_cout is 1; resubmitting job
2025-06-04 11:01:30,500 - INFO : job:55508939e039b34e0180ca7ccf84c91e7df26386 re-submit after terminated; new job_id is 9282
2025-06-04 11:01:30,840 - INFO : job:55508939e039b34e0180ca7ccf84c91e7df26386 job_id:9282 after re-submitting; the state now is <JobStatus.running: 3>
2025-06-04 11:01:30,841 - INFO : job: 9889ffcb54cf23c89ec013644acac5d619823298 9099 terminated; fail_cout is 1; resubmitting job
2025-06-04 11:01:30,877 - INFO : job:9889ffcb54cf23c89ec013644acac5d619823298 re-submit after terminated; new job_id is 9313
2025-06-04 11:01:31,265 - INFO : job:9889ffcb54cf23c89ec013644acac5d619823298 job_id:9313 after re-submitting; the state now is <JobStatus.running: 3>
2025-06-04 11:01:31,265 - INFO : job: 72ea3707f864606967efe33d818604f94df90405 9102 terminated; fail_cout is 1; resubmitting job
2025-06-04 11:01:31,300 - INFO : job:72ea3707f864606967efe33d818604f94df90405 re-submit after terminated; new job_id is 9345
2025-06-04 11:01:31,655 - INFO : job:72ea3707f864606967efe33d818604f94df90405 job_id:9345 after re-submitting; the state now is <JobStatus.running: 3>
2025-06-04 11:01:31,655 - INFO : job: 28f9b672a7330198890741b76b901b066291088a 9108 terminated; fail_cout is 1; resubmitting job
2025-06-04 11:01:31,699 - INFO : job:28f9b672a7330198890741b76b901b066291088a re-submit after terminated; new job_id is 9377
2025-06-04 11:01:32,065 - INFO : job:28f9b672a7330198890741b76b901b066291088a job_id:9377 after re-submitting; the state now is <JobStatus.running: 3>
2025-06-04 11:02:02,382 - INFO : job: 55508939e039b34e0180ca7ccf84c91e7df26386 9282 terminated; fail_cout is 2; resubmitting job
2025-06-04 11:02:02,420 - INFO : job:55508939e039b34e0180ca7ccf84c91e7df26386 re-submit after terminated; new job_id is 9453
2025-06-04 11:02:02,790 - INFO : job:55508939e039b34e0180ca7ccf84c91e7df26386 job_id:9453 after re-submitting; the state now is <JobStatus.running: 3>
2025-06-04 11:02:02,791 - INFO : job: 9889ffcb54cf23c89ec013644acac5d619823298 9313 terminated; fail_cout is 2; resubmitting job
2025-06-04 11:02:02,820 - INFO : job:9889ffcb54cf23c89ec013644acac5d619823298 re-submit after terminated; new job_id is 9465
2025-06-04 11:02:03,188 - INFO : job:9889ffcb54cf23c89ec013644acac5d619823298 job_id:9465 after re-submitting; the state now is <JobStatus.running: 3>
2025-06-04 11:02:03,189 - INFO : job: 72ea3707f864606967efe33d818604f94df90405 9345 terminated; fail_cout is 2; resubmitting job
2025-06-04 11:02:03,215 - INFO : job:72ea3707f864606967efe33d818604f94df90405 re-submit after terminated; new job_id is 9515
2025-06-04 11:02:03,573 - INFO : job:72ea3707f864606967efe33d818604f94df90405 job_id:9515 after re-submitting; the state now is <JobStatus.running: 3>
2025-06-04 11:02:03,574 - INFO : job: 28f9b672a7330198890741b76b901b066291088a 9377 terminated; fail_cout is 2; resubmitting job
2025-06-04 11:02:03,600 - INFO : job:28f9b672a7330198890741b76b901b066291088a re-submit after terminated; new job_id is 9529
2025-06-04 11:02:03,974 - INFO : job:28f9b672a7330198890741b76b901b066291088a job_id:9529 after re-submitting; the state now is <JobStatus.running: 3>
2025-06-04 11:02:34,323 - INFO : job: 55508939e039b34e0180ca7ccf84c91e7df26386 9453 terminated; fail_cout is 3; resubmitting job
Traceback (most recent call last):
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dpdispatcher/submission.py", line 356, in handle_unexpected_submission_state
job.handle_unexpected_job_state()
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dpdispatcher/submission.py", line 855, in handle_unexpected_job_state
raise RuntimeError(err_msg)
RuntimeError: job:55508939e039b34e0180ca7ccf84c91e7df26386 9453 failed 3 times.
Possible remote error message: ==> /home/customer/lq/dpgenwoork/50ab964acc7d497baa825ced5280623d5832f81d/002/train.log <==
3/envs/deepmd/lib/python3.12/site-packages/dargs/dargs.py", line 361, in traverse_value
self._traverse_sub(
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dargs/dargs.py", line 406, in _traverse_sub
subarg.traverse(value, key_hook, value_hook, sub_hook, variant_hook, path)
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dargs/dargs.py", line 343, in traverse
self.traverse_value(
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dargs/dargs.py", line 361, in traverse_value
self._traverse_sub(
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dargs/dargs.py", line 402, in _traverse_sub
sub_hook(self, value, path)
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dargs/dargs.py", line 500, in _check_strict
raise ArgumentKeyError(
dargs.dargs.ArgumentKeyError: [at location training] undefined key batch_size is not allowed in strict mode.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/customer/anaconda3/envs/deepmd/bin/dpgen", line 10, in <module>
sys.exit(main())
^^^^^^
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dpgen/main.py", line 255, in main
args.func(args)
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dpgen/generator/run.py", line 5474, in gen_run
run_iter(args.PARAM, args.MACHINE)
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dpgen/generator/run.py", line 4805, in run_iter
run_train(ii, jdata, mdata)
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dpgen/generator/run.py", line 724, in run_train
return run_train_dp(iter_index, jdata, mdata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dpgen/generator/run.py", line 927, in run_train_dp
submission.run_submission()
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dpdispatcher/submission.py", line 260, in run_submission
self.handle_unexpected_submission_state()
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dpdispatcher/submission.py", line 360, in handle_unexpected_submission_state
raise RuntimeError(
RuntimeError: Meet errors will handle unexpected submission state.
Debug information: remote_root==/home/customer/lq/dpgenwoork/50ab964acc7d497baa825ced5280623d5832f81d.
Debug information: submission_hash==50ab964acc7d497baa825ced5280623d5832f81d.
Please check error messages above and in remote_root. The submission information is saved in /home/customer/.dpdispatcher/submission/50ab964acc7d497baa825ced5280623d5832f81d.json.
For furthur actions, run the following command with proper flags: dpdisp submission 50ab964acc7d497baa825ced5280623d5832f81d

DP-GEN Version

0.13.1

Platform, Python Version, Remote Platform, etc

No response

Input Files, Running Commands, Error Log, etc.

param.json

Steps to Reproduce

dargs.dargs.ArgumentKeyError: [at location training] undefined key batch_size is not allowed in strict mode.
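
One possible reading of this error, offered as an assumption because the contents of param.json are not shown here: the DeePMD-kit training input generated from `default_training_param` has `batch_size` directly under `training`, while the installed DeePMD-kit expects it inside `training/training_data`. In that case the strict argument check rejects the key because of where it sits, not because of its value. The sketch below moves the key accordingly. The helper name `relocate_batch_size` and the example dictionary are hypothetical and only illustrate the layout change.

```python
import copy
import json


def relocate_batch_size(param: dict) -> dict:
    """Return a copy of a DP-GEN param dict in which training.batch_size
    is moved to training.training_data.batch_size.

    Assumes the training input lives under "default_training_param";
    the input dict itself is left unmodified.
    """
    fixed = copy.deepcopy(param)
    training = fixed.get("default_training_param", {}).get("training", {})
    if "batch_size" in training:
        value = training.pop("batch_size")
        # Keep an existing training_data.batch_size if one is already set.
        training.setdefault("training_data", {}).setdefault("batch_size", value)
    return fixed


if __name__ == "__main__":
    # Hypothetical fragment of param.json, not taken from the report.
    example = {
        "default_training_param": {
            "training": {"batch_size": "auto", "numb_steps": 400000}
        }
    }
    print(json.dumps(relocate_batch_size(example), indent=2))
```

If this matches the actual param.json, the same change can be made by hand: delete `batch_size` from `training` and add it under `training_data`. Checking the result against the input documentation for the installed DeePMD-kit version is still advisable.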

Further Information, Files, and Links

No response
