Description
Bug summary
Dear DeePMD community,
I'm encountering an issue while using the DP-GEN workflow with the DPA-2 model and PyTorch backend. Here are the details:
Environment:
DeePMD-kit version: 3.0.0b4-GPU-py3.9-cuda120
Model: DPA-2
Backend: PyTorch
Workflow control: DP-GEN
Issue Description:
In my machine.json file, I enable parallel training with the following command: `"command": "torchrun --nnodes=1 --nproc_per_node=auto dp --pt"`
The training phase completes successfully for all four models. Each model directory contains the expected output files, including `*_task_tag_finished` and `frozen_model.pth`:
├── 000
│ ├── checkpoint
│ ├── dpa2.hdf5
│ ├── f74eaa2be2cab187505b354f787e5e5530d141f4_task_tag_finished
│ ├── frozen_model.pth
│ ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/000/input.json
│ ├── input_v2_compat.json
│ ├── lcurve.out
│ ├── model.ckpt-100.pt
│ ├── model.ckpt-200.pt
│ ├── model.ckpt-300.pt
│ ├── model.ckpt.pt -> model.ckpt-300.pt
│ ├── out.json
│ └── train.log
├── 001
│ ├── 84f1c8acd2f9dc640b2fea97f8aad68396a0fc93_task_tag_finished
│ ├── checkpoint
│ ├── dpa2.hdf5
│ ├── frozen_model.pth
│ ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/001/input.json
│ ├── input_v2_compat.json
│ ├── lcurve.out
│ ├── model.ckpt-100.pt
│ ├── model.ckpt-200.pt
│ ├── model.ckpt-300.pt
│ ├── model.ckpt.pt -> model.ckpt-300.pt
│ ├── out.json
│ └── train.log
├── 002
│ ├── checkpoint
│ ├── dpa2.hdf5
│ ├── dpdispatcher.log
│ ├── e193485d0db3952cdb32f6406c9580c43f010989_task_tag_finished
│ ├── frozen_model.pth
│ ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/002/input.json
│ ├── input_v2_compat.json
│ ├── lcurve.out
│ ├── model.ckpt-100.pt
│ ├── model.ckpt-200.pt
│ ├── model.ckpt-300.pt
│ ├── model.ckpt.pt -> model.ckpt-300.pt
│ ├── out.json
│ └── train.log
├── 003
│ ├── 19f28cb5828301f7434aaed206c3956f6890eb78_task_tag_finished
│ ├── checkpoint
│ ├── dpa2.hdf5
│ ├── frozen_model.pth
│ ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/003/input.json
│ ├── input_v2_compat.json
│ ├── lcurve.out
│ ├── model.ckpt-100.pt
│ ├── model.ckpt-200.pt
│ ├── model.ckpt-300.pt
│ ├── model.ckpt.pt -> model.ckpt-300.pt
│ ├── out.json
│ └── train.log
However, the workflow stops at the model_devi stage with the following error: `FileNotFoundError: cannot find download file frozen_model.pb`
I believe DP-GEN looks for `frozen_model.pb` (the TensorFlow format) by default, which is not compatible with the PyTorch model `frozen_model.pth`.
When I manually attempt to convert the format using `dp convert-backend frozen_model.pth frozen_model.pb`, I receive another error: `RuntimeError: Unknown descriptor type: dpa2. Did you mean: dpa1?`
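In case it is useful for debugging, here is a sketch of how the frozen PyTorch model could be checked directly with the PyTorch backend (the test-system path below is a placeholder, not my actual data):

```bash
# Sketch: evaluate the frozen PyTorch model directly with the PyTorch backend
# (the system path below is a placeholder, not my actual validation data)
dp --pt test -m frozen_model.pth -s /path/to/validation_system
```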
Analysis:
It appears that the DPA-2 model currently only supports PyTorch and cannot be converted to the TensorFlow format (frozen_model.pb). This prevents me from proceeding with subsequent DP-GEN operations for the DPA-2 model.
Questions:
1. Is there a way to configure DP-GEN to work with PyTorch's `frozen_model.pth` for the DPA-2 model?
2. Are there plans to support the TensorFlow backend or format conversion for the DPA-2 model in future releases?
3. Is there an alternative workflow or workaround for using the DPA-2 model with DP-GEN?
Any guidance or suggestions would be greatly appreciated. Thank you for your time and assistance.
DeePMD-kit Version
3.0.0b4
Backend and its version
PyTorch 2.1.2
How did you download the software?
conda
Input Files, Running Commands, Error Log, etc.
machine.json
"command": "torchrun --nnodes=1 --nproc_per_node=auto dp --pt",
Steps to Reproduce
Run a DP-GEN workflow with the DPA-2 model and the PyTorch backend; training completes, but the model_devi stage fails because DP-GEN looks for `frozen_model.pb`.
Further Information, Files, and Links
No response