Bug summary
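dpgen run fails at iter.000000 task 01 (training). All four training jobs terminate about 30 seconds after submission and are resubmitted until the first one has failed 3 times, at which point dpdispatcher raises a RuntimeError. The tail of the remote train.log shows the underlying cause: a dargs.dargs.ArgumentKeyError reporting that the key batch_size at location training is not allowed in strict mode. Full log: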
INFO:dpgen:start running
INFO:dpgen:continue from iter 000 task 00
INFO:dpgen:=============================iter.000000==============================
INFO:dpgen:-------------------------iter.000000 task 01--------------------------
2025-06-04 11:00:58,402 - INFO : info:check_all_finished: False
2025-06-04 11:00:58,409 - INFO : job: 55508939e039b34e0180ca7ccf84c91e7df26386 submit; job_id is 9092
2025-06-04 11:00:58,440 - INFO : job: 9889ffcb54cf23c89ec013644acac5d619823298 submit; job_id is 9099
2025-06-04 11:00:58,481 - INFO : job: 72ea3707f864606967efe33d818604f94df90405 submit; job_id is 9102
2025-06-04 11:00:58,520 - INFO : job: 28f9b672a7330198890741b76b901b066291088a submit; job_id is 9108
2025-06-04 11:01:30,467 - INFO : job: 55508939e039b34e0180ca7ccf84c91e7df26386 9092 terminated; fail_cout is 1; resubmitting job
2025-06-04 11:01:30,500 - INFO : job:55508939e039b34e0180ca7ccf84c91e7df26386 re-submit after terminated; new job_id is 9282
2025-06-04 11:01:30,840 - INFO : job:55508939e039b34e0180ca7ccf84c91e7df26386 job_id:9282 after re-submitting; the state now is <JobStatus.running: 3>
2025-06-04 11:01:30,841 - INFO : job: 9889ffcb54cf23c89ec013644acac5d619823298 9099 terminated; fail_cout is 1; resubmitting job
2025-06-04 11:01:30,877 - INFO : job:9889ffcb54cf23c89ec013644acac5d619823298 re-submit after terminated; new job_id is 9313
2025-06-04 11:01:31,265 - INFO : job:9889ffcb54cf23c89ec013644acac5d619823298 job_id:9313 after re-submitting; the state now is <JobStatus.running: 3>
2025-06-04 11:01:31,265 - INFO : job: 72ea3707f864606967efe33d818604f94df90405 9102 terminated; fail_cout is 1; resubmitting job
2025-06-04 11:01:31,300 - INFO : job:72ea3707f864606967efe33d818604f94df90405 re-submit after terminated; new job_id is 9345
2025-06-04 11:01:31,655 - INFO : job:72ea3707f864606967efe33d818604f94df90405 job_id:9345 after re-submitting; the state now is <JobStatus.running: 3>
2025-06-04 11:01:31,655 - INFO : job: 28f9b672a7330198890741b76b901b066291088a 9108 terminated; fail_cout is 1; resubmitting job
2025-06-04 11:01:31,699 - INFO : job:28f9b672a7330198890741b76b901b066291088a re-submit after terminated; new job_id is 9377
2025-06-04 11:01:32,065 - INFO : job:28f9b672a7330198890741b76b901b066291088a job_id:9377 after re-submitting; the state now is <JobStatus.running: 3>
2025-06-04 11:02:02,382 - INFO : job: 55508939e039b34e0180ca7ccf84c91e7df26386 9282 terminated; fail_cout is 2; resubmitting job
2025-06-04 11:02:02,420 - INFO : job:55508939e039b34e0180ca7ccf84c91e7df26386 re-submit after terminated; new job_id is 9453
2025-06-04 11:02:02,790 - INFO : job:55508939e039b34e0180ca7ccf84c91e7df26386 job_id:9453 after re-submitting; the state now is <JobStatus.running: 3>
2025-06-04 11:02:02,791 - INFO : job: 9889ffcb54cf23c89ec013644acac5d619823298 9313 terminated; fail_cout is 2; resubmitting job
2025-06-04 11:02:02,820 - INFO : job:9889ffcb54cf23c89ec013644acac5d619823298 re-submit after terminated; new job_id is 9465
2025-06-04 11:02:03,188 - INFO : job:9889ffcb54cf23c89ec013644acac5d619823298 job_id:9465 after re-submitting; the state now is <JobStatus.running: 3>
2025-06-04 11:02:03,189 - INFO : job: 72ea3707f864606967efe33d818604f94df90405 9345 terminated; fail_cout is 2; resubmitting job
2025-06-04 11:02:03,215 - INFO : job:72ea3707f864606967efe33d818604f94df90405 re-submit after terminated; new job_id is 9515
2025-06-04 11:02:03,573 - INFO : job:72ea3707f864606967efe33d818604f94df90405 job_id:9515 after re-submitting; the state now is <JobStatus.running: 3>
2025-06-04 11:02:03,574 - INFO : job: 28f9b672a7330198890741b76b901b066291088a 9377 terminated; fail_cout is 2; resubmitting job
2025-06-04 11:02:03,600 - INFO : job:28f9b672a7330198890741b76b901b066291088a re-submit after terminated; new job_id is 9529
2025-06-04 11:02:03,974 - INFO : job:28f9b672a7330198890741b76b901b066291088a job_id:9529 after re-submitting; the state now is <JobStatus.running: 3>
2025-06-04 11:02:34,323 - INFO : job: 55508939e039b34e0180ca7ccf84c91e7df26386 9453 terminated; fail_cout is 3; resubmitting job
Traceback (most recent call last):
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dpdispatcher/submission.py", line 356, in handle_unexpected_submission_state
job.handle_unexpected_job_state()
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dpdispatcher/submission.py", line 855, in handle_unexpected_job_state
raise RuntimeError(err_msg)
RuntimeError: job:55508939e039b34e0180ca7ccf84c91e7df26386 9453 failed 3 times.
Possible remote error message: ==> /home/customer/lq/dpgenwoork/50ab964acc7d497baa825ced5280623d5832f81d/002/train.log <==
3/envs/deepmd/lib/python3.12/site-packages/dargs/dargs.py", line 361, in traverse_value
self._traverse_sub(
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dargs/dargs.py", line 406, in _traverse_sub
subarg.traverse(value, key_hook, value_hook, sub_hook, variant_hook, path)
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dargs/dargs.py", line 343, in traverse
self.traverse_value(
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dargs/dargs.py", line 361, in traverse_value
self._traverse_sub(
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dargs/dargs.py", line 402, in _traverse_sub
sub_hook(self, value, path)
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dargs/dargs.py", line 500, in _check_strict
raise ArgumentKeyError(
dargs.dargs.ArgumentKeyError: [at location training] undefined key batch_size is not allowed in strict mode.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/customer/anaconda3/envs/deepmd/bin/dpgen", line 10, in <module>
sys.exit(main())
^^^^^^
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dpgen/main.py", line 255, in main
args.func(args)
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dpgen/generator/run.py", line 5474, in gen_run
run_iter(args.PARAM, args.MACHINE)
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dpgen/generator/run.py", line 4805, in run_iter
run_train(ii, jdata, mdata)
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dpgen/generator/run.py", line 724, in run_train
return run_train_dp(iter_index, jdata, mdata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dpgen/generator/run.py", line 927, in run_train_dp
submission.run_submission()
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dpdispatcher/submission.py", line 260, in run_submission
self.handle_unexpected_submission_state()
File "/home/customer/anaconda3/envs/deepmd/lib/python3.12/site-packages/dpdispatcher/submission.py", line 360, in handle_unexpected_submission_state
raise RuntimeError(
RuntimeError: Meet errors will handle unexpected submission state.
Debug information: remote_root==/home/customer/lq/dpgenwoork/50ab964acc7d497baa825ced5280623d5832f81d.
Debug information: submission_hash==50ab964acc7d497baa825ced5280623d5832f81d.
Please check error messages above and in remote_root. The submission information is saved in /home/customer/.dpdispatcher/submission/50ab964acc7d497baa825ced5280623d5832f81d.json.
For furthur actions, run the following command with proper flags: dpdisp submission 50ab964acc7d497baa825ced5280623d5832f81d
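The dispatcher message only shows the tail of one train.log and asks to check remote_root for the rest. A minimal sketch for collecting the last lines of every task's train.log is given here. It assumes that task directories (such as 002 in the log above) sit directly under remote_root and that each one writes a train.log; the helper name `tail_train_logs` is hypothetical and not part of dpgen or dpdispatcher.

```python
from pathlib import Path


def tail_train_logs(remote_root, n_lines=20):
    """Return {task_dir_name: last n_lines of its train.log}.

    ASSUMPTION: each task directory is a direct child of remote_root
    and contains a file named train.log, as seen for task 002 above.
    """
    tails = {}
    for log_path in sorted(Path(remote_root).glob("*/train.log")):
        # errors="replace" keeps going if the log holds undecodable bytes
        lines = log_path.read_text(errors="replace").splitlines()
        tails[log_path.parent.name] = lines[-n_lines:]
    return tails


# Example usage (path taken from the debug information above):
# root = "/home/customer/lq/dpgenwoork/50ab964acc7d497baa825ced5280623d5832f81d"
# for task, lines in tail_train_logs(root).items():
#     print(f"==> {task}/train.log <==")
#     print("\n".join(lines))
```

If all tasks end with the same ArgumentKeyError, the problem is most likely in the shared training input rather than in the machine or scheduler settings.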
DP-GEN Version
0.13.1
Platform, Python Version, Remote Platform, etc
No response
Input Files, Running Commands, Error Log, etc.
Steps to Reproduce
dargs.dargs.ArgumentKeyError: [at location training] undefined key batch_size is not allowed in strict mode.
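The error says the training input has a batch_size key directly under training, which strict argument checking rejects. One possible fix is sketched here, on the assumption that the installed DeePMD-kit expects batch_size under training.training_data (as in the DeePMD-kit v2 input format) and that the training input comes from default_training_param in the dpgen param.json. Neither assumption is confirmed by the log above, and `relocate_batch_size` is a hypothetical helper, not a dpgen function.

```python
import copy


def relocate_batch_size(training_param):
    """Return a copy with training.batch_size moved under training.training_data.

    ASSUMPTION: the target schema accepts training.training_data.batch_size
    and rejects training.batch_size. Verify against the DeePMD-kit version
    in use before applying this to a real param.json.
    """
    fixed = copy.deepcopy(training_param)  # leave the caller's dict untouched
    training = fixed.get("training", {})
    if "batch_size" in training:
        value = training.pop("batch_size")
        # Keep an existing training_data.batch_size if one is already set
        training.setdefault("training_data", {}).setdefault("batch_size", value)
    return fixed


# Example usage (file name and key are assumptions about the setup):
# import json
# with open("param.json") as f:
#     param = json.load(f)
# param["default_training_param"] = relocate_batch_size(param["default_training_param"])
```

dpgen may set the batch size itself when it generates the training input, so simply deleting training.batch_size from the parameter file could also be enough.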
Further Information, Files, and Links
No response