How to train model using custom data? #36

Open
charlescwwang opened this issue Jun 28, 2024 · 5 comments
Labels
question Further information is requested

Comments

charlescwwang commented Jun 28, 2024

Issue Description
I tried to train the model on my own data with 12 labels (COCO dataset format).
When I start training, the following error occurs.

Additional Context
This is my command:

python yolo/lazy.py task=train task.epoch=10 task.data.batch_size=8 model=v9-m dataset=data device=cuda name=test-2

This is the log:

[06/28 18:19:05]   INFO  | 📄 Created log folder: runs/train/test-2
[06/28 18:19:05]   INFO  | 📦 Loaded train cache
[06/28 18:19:05]   INFO  | 🚜 Building YOLO
[06/28 18:19:05]   INFO  |   🏗️  Building backbone
[06/28 18:19:05]   INFO  |   🏗️  Building neck
[06/28 18:19:05]   INFO  |   🏗️  Building head
[06/28 18:19:05]   INFO  |   🏗️  Building detection
[06/28 18:19:05]   INFO  |   🏗️  Building auxiliary
[06/28 18:19:05]   INFO  | ✅ Success load model & weight
[06/28 18:19:06]   INFO  | 🧸 Found no stride of model, performed a dummy test for auto-anchor size
[06/28 18:19:08]   INFO  | ✅ Success load loss function
                             Model Layers                             
┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Index ┃     Layer Type     ┃ Tags ┃    Params ┃ Channels (IN->OUT) ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│   1   │        Conv        │      │       928 │       3 ->   32    │
│   2   │        Conv        │      │    18,560 │      32 ->   64    │
│   3   │    RepNCSPELAN     │      │   171,648 │      64 ->  128    │
│   4   │       AConv        │      │   276,960 │     128 ->  240    │
│   5   │    RepNCSPELAN     │  B3  │   629,520 │     240 ->  240    │
│   6   │       AConv        │      │   778,320 │     240 ->  360    │
│   7   │    RepNCSPELAN     │  B4  │ 1,414,080 │     360 ->  360    │
│   8   │       AConv        │      │ 1,556,160 │     360 ->  480    │
│   9   │    RepNCSPELAN     │  B5  │ 2,511,840 │     480 ->  480    │
│  10   │      SPPELAN       │  N3  │   577,440 │     480 ->  480    │
│  11   │      UpSample      │      │         0 │         -          │
│  12   │       Concat       │      │         0 │         -          │
│  13   │    RepNCSPELAN     │  N4  │ 1,586,880 │     840 ->  360    │
│  14   │      UpSample      │      │         0 │         -          │
│  15   │       Concat       │      │         0 │         -          │
│  16   │    RepNCSPELAN     │  P3  │   715,920 │     600 ->  240    │
│  17   │       AConv        │      │   397,808 │     240 ->  184    │
│  18   │       Concat       │      │         0 │         -          │
│  19   │    RepNCSPELAN     │  P4  │ 1,480,320 │     544 ->  360    │
│  20   │       AConv        │      │   778,080 │     360 ->  240    │
│  21   │       Concat       │      │         0 │         -          │
│  22   │    RepNCSPELAN     │  P5  │ 2,627,040 │     720 ->  480    │
│  23   │ MultiheadDetection │ Main │ 4,602,528 │       M -> 1080    │
│  24   │      CBLinear      │  R3  │    57,840 │     240 ->    M    │
│  25   │      CBLinear      │  R4  │   216,600 │     360 ->    M    │
│  26   │      CBLinear      │  R5  │   519,480 │     480 ->    M    │
│  27   │        Conv        │      │       928 │       3 ->   32    │
│  28   │        Conv        │      │    18,560 │      32 ->   64    │
│  29   │    RepNCSPELAN     │      │   171,648 │      64 ->  128    │
│  30   │       AConv        │      │   276,960 │     128 ->  240    │
│  31   │       CBFuse       │      │         0 │         -          │
│  32   │    RepNCSPELAN     │  A3  │   629,520 │     240 ->  240    │
│  33   │       AConv        │      │   778,320 │     240 ->  360    │
│  34   │       CBFuse       │      │         0 │         -          │
│  35   │    RepNCSPELAN     │  A4  │ 1,414,080 │     360 ->  360    │
│  36   │       AConv        │      │ 1,556,160 │     360 ->  480    │
│  37   │       CBFuse       │      │         0 │         -          │
│  38   │    RepNCSPELAN     │  A5  │ 2,511,840 │     480 ->  480    │
│  39   │ MultiheadDetection │ AUX  │ 4,602,528 │       M -> 1080    │
└───────┴────────────────────┴──────┴───────────┴────────────────────┘
[06/28 18:19:08] WARNING | ⚠️ Could not find graphviz backend, continue without drawing the model architecture
[06/28 18:19:08]   INFO  | 📦 Loaded validation cache
[06/28 18:19:08]   INFO  | 🚄 Start Training!
/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:143: UserWarning: Detected call of 
`lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before 
`lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at 
https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
⠧ Validate |  mAP.5  |mAP.5:.95| ━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/10 0:01:02
⠧ Run pycocotools                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/1  -:--:--
💾 success save at runs/train/test-2/weights/E000.pt
/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:156: UserWarning: The epoch parameter in 
`scheduler.step()` was not necessary and is being deprecated where possible. Please use `scheduler.step()` to step the scheduler. During the 
deprecation, if epoch is different from None, the closed form is used instead of the new chainable form, where available. Please open an issue if you 
are unable to replicate your use case: https://github.com/pytorch/pytorch/issues/new/choose.
  warnings.warn(EPOCH_DEPRECATION_WARNING, UserWarning)
⠸ Validate |  mAP.5  |mAP.5:.95| ━━━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2/10 0:01:05
⠸ Run pycocotools                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/1  -:--:--
┏━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Epoch ┃ Avg. Precision ┃       ┃ Avg. Recall    ┃       ┃
💾 success save at runs/train/test-2/weights/E001.pt
⠙ Validate |  mAP.5  |mAP.5:.95| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/10 0:01:00
⠙ Run pycocotools                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/1  -:--:--
┏━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Epoch ┃ Avg. Precision ┃       ┃ Avg. Recall    ┃       ┃
Error executing job with overrides: ['task=train', 'task.epoch=10', 'task.data.batch_size=8', 'model=v9-m', 'dataset=data', 'device=cuda', 
'name=test-2']
Traceback (most recent call last):
  File "/home/localadmin/YOLO/yolo/lazy.py", line 42, in <module>
    main()
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/localadmin/YOLO/yolo/lazy.py", line 38, in main
    solver.solve(dataloader)
  File "/home/localadmin/YOLO/yolo/tools/solver.py", line 145, in solve
    mAPs = self.validator.solve(self.validation_dataloader, epoch_idx=epoch_idx)
  File "/home/localadmin/YOLO/yolo/tools/solver.py", line 256, in solve
    result = calculate_ap(self.coco_gt, predict_json)
  File "/home/localadmin/YOLO/yolo/utils/solver_utils.py", line 12, in calculate_ap
    coco_dt = coco_gt.loadRes(pd_path)
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/pycocotools/coco.py", line 332, in loadRes
    assert set(annsImgIds) == (set(annsImgIds) & set(self.getImgIds())), \
AssertionError: Results do not correspond to current coco set
⠙ Validate |  mAP.5  |mAP.5:.95| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/10 0:01:00
⠙ Run pycocotools                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/1  -:--:--
┏━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Epoch ┃ Avg. Precision ┃       ┃ Avg. Recall    ┃       ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━┩
│    0  │ AP @ .5:.95    │  0.00 │ AP @        .5 │  0.00 │
│       │                │       │                │       │
│    1  │ AP @ .5:.95    │  0.00 │ AR maxDets   1 │  0.00 │
│    1  │ AP @     .5    │  0.00 │ AR maxDets  10 │  0.00 │
│    1  │ AP @    .75    │  0.00 │ AR maxDets 100 │  0.00 │
│    1  │ AP  (small)    │  0.00 │ AR     (small) │  0.00 │
│    1  │ AP (medium)    │  0.00 │ AR    (medium) │  0.00 │
│    1  │ AP  (large)    │  0.00 │ AR     (large) │  0.00 │
└───────┴────────────────┴───────┴────────────────┴───────┘
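
For anyone hitting the same `AssertionError`: the check at the end of the traceback requires every `image_id` in the prediction results to also appear among the ground-truth image IDs. The sketch below reproduces that subset check in plain Python so the offending IDs can be listed. The function name `find_unknown_image_ids` and the sample data are hypothetical; only the COCO field names (`images`, `id`, `image_id`) come from the format itself.

```python
def find_unknown_image_ids(ground_truth, predictions):
    """Return image_ids that occur in predictions but not in the ground truth.

    ground_truth: dict loaded from a COCO annotation file (has an "images" list).
    predictions:  list of COCO result dicts (each has an "image_id").
    """
    gt_ids = {image["id"] for image in ground_truth["images"]}
    pred_ids = {pred["image_id"] for pred in predictions}
    # Sort by string form so mixed int/str IDs do not raise a TypeError.
    return sorted(pred_ids - gt_ids, key=str)


# Hypothetical data: one prediction uses an ID the ground truth does not know.
ground_truth = {"images": [{"id": 1}, {"id": 2}]}
predictions = [
    {"image_id": 1, "category_id": 0, "bbox": [0, 0, 10, 10], "score": 0.9},
    {"image_id": "img_003", "category_id": 0, "bbox": [5, 5, 10, 10], "score": 0.8},
]

unknown = find_unknown_image_ids(ground_truth, predictions)
print(unknown)
```

An empty list means the results are consistent with the annotation file; anything else is an ID that would trip the assertion.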


@charlescwwang charlescwwang added the question Further information is requested label Jun 28, 2024
prithivi1 commented Jul 16, 2024

Hi @charlescwwang,
I tried to train my model with 1 label. However, I'm unable to load the pretrained 80-class weights into my 1-class model. I can see from your logs that you got past that step. Can you help me figure out how to do that?

`[07/16 10:00:40] INFO | 📄 Created log folder: runs/train/v9-dev
[07/16 10:00:40] INFO | 📦 Loaded train cache
[07/16 10:00:40] INFO | 🚜 Building YOLO
[07/16 10:00:40] INFO | 🏗️ Building backbone
[07/16 10:00:40] INFO | 🏗️ Building neck
[07/16 10:00:41] INFO | 🏗️ Building head
[07/16 10:00:41] INFO | 🏗️ Building detection
[07/16 10:00:41] INFO | 🏗️ Building auxiliary
[07/16 10:00:41] INFO | 🌐 Weight weights/v9-c.pt not found, try downloading
📥 Downloading v9-c.pt... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 102895262/102895262 bytes • 0:00:00
[07/16 10:00:42] INFO | ✅ Download completed.
Error executing job with overrides: ['task=train', 'task.data.batch_size=8', 'task.epoch=10', 'model=v9-c', 'class_num=1', 'dataset=dev.yaml', 'device=cuda']
Traceback (most recent call last):
File "/content/drive/MyDrive/Colab-Notebooks/yolov9/YOLO/yolo/lazy.py", line 27, in main
model = create_model(cfg.model, class_num=cfg.class_num, weight_path=cfg.weight)
File "/usr/local/lib/python3.10/dist-packages/yolo/model/yolo.py", line 136, in create_model
model.model.load_state_dict(torch.load(weight_path, map_location=torch.device("cpu")), strict=False)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2189, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ModuleList:
size mismatch for 22.heads.0.class_conv.2.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 256, 1, 1]).
size mismatch for 22.heads.0.class_conv.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
size mismatch for 22.heads.1.class_conv.2.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 256, 1, 1]).
size mismatch for 22.heads.1.class_conv.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
size mismatch for 22.heads.2.class_conv.2.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 256, 1, 1]).
size mismatch for 22.heads.2.class_conv.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
size mismatch for 38.heads.0.class_conv.2.weight: copying a param with shape torch.Size([80, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 512, 1, 1]).
size mismatch for 38.heads.0.class_conv.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
size mismatch for 38.heads.1.class_conv.2.weight: copying a param with shape torch.Size([80, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 512, 1, 1]).
size mismatch for 38.heads.1.class_conv.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
size mismatch for 38.heads.2.class_conv.2.weight: copying a param with shape torch.Size([80, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 512, 1, 1]).
size mismatch for 38.heads.2.class_conv.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.`
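
A possible workaround for the size mismatch above, as a sketch rather than the project's own method: `load_state_dict(..., strict=False)` tolerates missing or unexpected keys but still raises when a key exists with a different shape, so the mismatched class layers have to be dropped from the checkpoint before loading. The helper name `drop_mismatched` and the `FakeTensor` stand-in are hypothetical; the stand-in only exists so the example runs without PyTorch installed.

```python
from collections import namedtuple


def drop_mismatched(checkpoint, model_state):
    """Split checkpoint entries into those that fit the model and those that do not.

    Both arguments map parameter names to objects with a `.shape` attribute.
    """
    kept, skipped = {}, []
    for name, tensor in checkpoint.items():
        target = model_state.get(name)
        if target is not None and tuple(target.shape) == tuple(tensor.shape):
            kept[name] = tensor
        else:
            skipped.append(name)
    return kept, skipped


# Stand-in for torch.Tensor (assumption: only `.shape` is needed here).
FakeTensor = namedtuple("FakeTensor", "shape")

checkpoint = {
    "0.conv.weight": FakeTensor((32, 3, 3, 3)),
    "22.heads.0.class_conv.2.weight": FakeTensor((80, 256, 1, 1)),
    "22.heads.0.class_conv.2.bias": FakeTensor((80,)),
}
model_state = {
    "0.conv.weight": FakeTensor((32, 3, 3, 3)),
    "22.heads.0.class_conv.2.weight": FakeTensor((1, 256, 1, 1)),
    "22.heads.0.class_conv.2.bias": FakeTensor((1,)),
}

kept, skipped = drop_mismatched(checkpoint, model_state)
print("loaded:", sorted(kept))
print("skipped:", skipped)
```

With PyTorch, the same helper could be given `torch.load(weight_path, map_location="cpu")` and `model.state_dict()`, followed by `model.load_state_dict(kept, strict=False)`. The skipped class layers then keep their fresh initialization and are learned during training.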

Author

charlescwwang commented Jul 22, 2024

Sorry, I have no idea. Here is my package list; maybe it will help.

aiofiles==24.1.0
antlr4-python3-runtime==4.9.3
anyio==4.4.0
argcomplete==3.4.0
attrs==23.2.0
beautifulsoup4==4.12.3
boto3==1.34.135
botocore==1.34.135
Brotli==1.1.0
cachetools==5.3.3
certifi==2024.6.2
charset-normalizer==3.3.2
click==8.1.7
contourpy==1.2.1
cycler==0.12.1
dacite==1.7.0
Deprecated==1.2.14
dill==0.3.8
dnspython==2.6.1
docker-pycreds==0.4.0
einops==0.8.0
exceptiongroup==1.2.1
fiftyone==0.24.1
fiftyone-brain==0.16.1
fiftyone_db==1.1.4
filelock==3.15.4
fonttools==4.53.0
fsspec==2024.6.0
ftfy==6.2.0
future==1.0.0
gitdb==4.0.11
GitPython==3.1.43
glob2==0.7
graphql-core==3.2.3
graphviz==0.20.3
h11==0.14.0
h2==4.1.0
hpack==4.0.0
httpcore==1.0.5
httpx==0.27.0
humanize==4.9.0
hydra-core==1.3.2
Hypercorn==0.17.3
hyperframe==6.0.1
idna==3.7
imageio==2.34.2
importlib_resources==6.4.0
inflate64==1.0.0
iniconfig==2.0.0
Jinja2==3.0.3
jmespath==1.0.1
joblib==1.4.2
jsonlines==4.0.0
kaleido==0.2.1
kiwisolver==1.4.5
lazy_loader==0.4
loguru==0.7.2
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.0
mdurl==0.1.2
mongoengine==0.24.2
motor==3.5.0
mpmath==1.3.0
multivolumefile==0.2.3
networkx==3.2.1
numpy==2.0.0
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
omegaconf==2.3.0
opencv-python==4.10.0.84
opencv-python-headless==4.10.0.84
packaging==24.1
pandas==2.2.2
pillow==10.3.0
platformdirs==4.2.2
plotly==5.22.0
pluggy==1.5.0
pprintpp==0.4.0
priority==2.0.0
protobuf==5.27.2
psutil==6.0.0
py7zr==0.21.0
pybcj==1.0.2
pycocotools==2.0.8
pycryptodomex==3.20.0
Pygments==2.18.0
pymongo==4.8.0
pyparsing==3.1.2
pyppmd==1.1.0
pytest==8.2.2
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
pyzstd==0.16.0
rarfile==4.2
regex==2024.5.15
requests==2.32.3
retrying==1.3.4
rich==13.7.1
s3transfer==0.10.2
scikit-image==0.24.0
scikit-learn==1.5.0
scipy==1.13.1
sentry-sdk==2.7.0
setproctitle==1.3.3
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
sortedcontainers==2.4.0
soupsieve==2.5
sse-starlette==0.10.3
sseclient-py==1.8.0
starlette==0.37.2
strawberry-graphql==0.138.1
sympy==1.12.1
tabulate==0.9.0
taskgroup==0.0.0a4
tenacity==8.4.2
texttable==1.7.0
threadpoolctl==3.5.0
tifffile==2024.6.18
tomli==2.0.1
torch==2.3.1
torchvision==0.18.1
tqdm==4.66.4
triton==2.3.1
typing_extensions==4.12.2
tzdata==2024.1
tzlocal==5.2
universal-analytics-python3==1.1.1
urllib3==1.26.19
voxel51-eta==0.12.6
wandb==0.17.3
wcwidth==0.2.13
wrapt==1.16.0
wsproto==1.2.0
xmltodict==0.13.0
zipp==3.19.2

Contributor

Abdul-Mukit commented Aug 17, 2024

@charlescwwang This is the same issue as #67. In short, your image file names probably contain characters other than digits.
The root cause is the way the calculate_ap function is written. It should be something like this instead: https://lightning.ai/docs/torchmetrics/stable/detection/mean_average_precision.html
If the AP calculation didn't need image IDs to begin with, the data loader would not need to return image paths at every step.
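
A quick way to test the file-name hypothesis on a dataset, as a minimal sketch: list the image files whose stem is not purely ASCII digits. This assumes, following the comment above, that image IDs are derived from file-name stems; the helper name `non_numeric_names` and the sample file names are made up for illustration.

```python
from pathlib import Path


def non_numeric_names(file_names):
    """Return the file names whose stem is not made up solely of ASCII digits."""
    flagged = []
    for name in file_names:
        stem = Path(name).stem
        # isascii() guards against Unicode digits that int() cannot parse.
        if not (stem.isascii() and stem.isdigit()):
            flagged.append(name)
    return flagged


# Hypothetical file names: the first follows the COCO naming style.
names = ["000000000139.jpg", "cam1_0001.jpg", "frame-12.png"]
flagged = non_numeric_names(names)
print(flagged)
```

If the list is non-empty, renaming those files to numeric stems (and updating the annotation file to match) or using a branch that no longer depends on numeric names are the two obvious options.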

Contributor

Abdul-Mukit commented Aug 19, 2024

PR #79 should fix this.
@charlescwwang can you please try the branch https://github.com/Abdul-Mukit/YOLO/tree/67-fix-image-id-usage-consistency and let me know if you still face the same problem?

Author

charlescwwang commented Aug 19, 2024

@Abdul-Mukit I tried the branch, and the training was successfully completed.

3 participants