Migrating training scripts to torchrun #1933

Open

lkosh wants to merge 12 commits into mindee:main from lkosh:torchrun

Contributor

lkosh commented May 8, 2025

No description provided.

lkosh added 5 commits on May 6, 2025 11:15

recognition training script (4b64472)
detection training script (c8b0a62)
training scripts (c046f14)
docs (60ea0af)
docs (cbc4de9)

codecov bot commented May 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.74%. Comparing base (db6d0db) to head (06984f4).
Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1933      +/-   ##
==========================================
- Coverage   96.80%   96.74%   -0.06%     
==========================================
  Files         172      172              
  Lines        8442     8442              
==========================================
- Hits         8172     8167       -5     
- Misses        270      275       +5

Flag	Coverage Δ
unittests	`96.74% <ø> (-0.06%)`	⬇️


felixdittrich92 requested changes

Contributor

felixdittrich92 left a comment

Thanks 👍

Hmm, normally in this case I think we can merge both scripts (DDP and the normal train script) into one, because the logic is the same. Either way, we should test that logging still works with torchrun (W&B, for example).

if args.backend:
   torch.cuda.set_device(rank)
   dist.init_process_group(backend=args.backend)
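
A minimal sketch of how a single script might tell a plain `python` launch from a `torchrun` launch, assuming only the standard environment variables that torchrun exports (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`). The `DistEnv` dataclass and `read_dist_env` helper are hypothetical names and not part of this PR. The torch calls appear only as comments so the snippet runs without a GPU.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistEnv:
    """Hypothetical container for the process layout exported by torchrun."""

    rank: int
    local_rank: int
    world_size: int

    @property
    def is_distributed(self) -> bool:
        return self.world_size > 1

    @property
    def is_main_process(self) -> bool:
        # Typically only rank 0 talks to loggers such as W&B or writes checkpoints.
        return self.rank == 0


def read_dist_env(environ: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Read RANK, LOCAL_RANK and WORLD_SIZE, falling back to a single process."""
    env = os.environ if environ is None else environ
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


# Plain `python train_pytorch.py ...`: none of the variables are set.
single = read_dist_env({})
# Second worker of a two-process torchrun launch on one node.
worker = read_dist_env({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2"})

# In a real script the distributed branch could then do something like:
#   torch.cuda.set_device(env.local_rank)
#   dist.init_process_group(backend=args.backend)
```

On a single node `RANK` and `LOCAL_RANK` coincide; on multi-node jobs the device index is usually taken from `LOCAL_RANK`. Gating W&B initialisation on `is_main_process` is one way to check the logging concern raised above, though whether this PR does so is not shown here.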

references/detection/train_pytorch_ddp.py (outdated, resolved)

references/recognition/README.md (outdated, resolved)

references/recognition/train_pytorch_ddp.py (outdated, resolved)


bugfix (c1501c9)

felixdittrich92 added this to the 0.12.0 milestone

felixdittrich92 added topic: documentation type: enhancement ext: references framework: pytorch topic: text detection topic: text recognition labels

felixdittrich92 self-assigned this


unified training script (4bd2281)

felixdittrich92 requested changes

references/recognition/README.md (outdated, resolved)

references/recognition/train_pytorch.py (outdated, resolved)

references/recognition/train_pytorch.py (outdated, resolved)

references/recognition/train_pytorch.py (resolved)

references/recognition/train_pytorch.py (resolved)

references/recognition/train_pytorch.py (outdated, resolved)

references/recognition/train_pytorch.py (outdated, resolved)

references/recognition/train_pytorch.py (outdated, resolved)

references/recognition/train_pytorch.py (outdated, resolved)

references/recognition/train_pytorch.py (outdated, resolved)

lkosh added 5 commits on May 12, 2025 09:27

pr fixes (01f055f)
pr fixes (807bbe6)
pr fixes (6ad8083)
detection training script (302ba4b)
cpu fix (06984f4)

felixdittrich92 requested changes

references/detection/train_pytorch.py

		@@ -185,65 +192,105 @@ def evaluate(model, val_loader, batch_transforms, val_metric, amp=False, log=Non


		def main(args):
		"""

Contributor

felixdittrich92 May 19, 2025

Please remove the docstring.

references/detection/train_pytorch.py

+                      world_size (int): number of processes participating in the job
+                      args: other arguments passed through the CLI
+                  """
+                  world_size = int(os.environ.get("WORLD_SIZE", 1))

Contributor

felixdittrich92 May 19, 2025

Let's add the same comment here as you did in the recognition script :)

references/detection/train_pytorch.py

                       x, target = next(iter(train_loader))
                       plot_samples(x, target)
-                      return
+                      # return

Contributor

felixdittrich92 May 19, 2025

Let's keep the return please 👍

references/detection/train_pytorch.py

                   parser.add_argument(
                       "--save-interval-epoch", dest="save_interval_epoch", action="store_true", help="Save model every epoch"
                   )
                   parser.add_argument("--input_size", type=int, default=1024, help="model input size, H = W")
                   parser.add_argument("--lr", type=float, default=0.001, help="learning rate for the optimizer (Adam or AdamW)")
                   parser.add_argument("--wd", "--weight-decay", default=0, type=float, help="weight decay", dest="weight_decay")
-                  parser.add_argument("-j", "--workers", type=int, default=None, help="number of workers used for dataloading")
+                  parser.add_argument("-j", "--workers", type=int, default=0, help="number of workers used for dataloading")

Contributor

felixdittrich92 May 19, 2025

Let's keep None here as we did in the recognition script :)
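
A minimal sketch of why a `None` default can be useful here: it lets the script distinguish "flag not passed" from an explicit `-j 0`, and pick a CPU-based fallback only in the first case. The `resolve_workers` helper and the cap of 16 are assumptions for illustration, not necessarily what the reference scripts do.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="workers default sketch")
    parser.add_argument(
        "-j", "--workers", type=int, default=None, help="number of workers used for dataloading"
    )
    return parser


def resolve_workers(workers, cap=16):
    """Return the explicit value, or a CPU-based fallback when it is None.

    The cap of 16 is an assumed value for this sketch.
    """
    if workers is None:
        return min(cap, os.cpu_count() or 1)
    return workers


# Explicit `-j 0` is respected (load data in the main process).
explicit = resolve_workers(build_parser().parse_args(["-j", "0"]).workers)
# No flag: fall back to a value derived from the CPU count.
fallback = resolve_workers(build_parser().parse_args([]).workers)
```

With `default=0` the two cases above would be indistinguishable, so the fallback could never trigger.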

references/detection/train_pytorch.py

+                      class_names = val_set.class_names
+                  else:
+                      class_names = None

Contributor

felixdittrich92 May 19, 2025

Hmm, I think class_names needs to be shared, so no else branch is needed:

-> class_names=val_set.class_names,

references/detection/train_pytorch.py

+                              torch.save(params.state_dict(), Path(args.output_dir) / f"{exp_name}.pt")
+                              min_loss = val_loss
+                          if args.save_interval_epoch:
+                              pbar.write(f"Saving state at epoch: {epoch + 1}")

Contributor

felixdittrich92 May 19, 2025

The same applies to interval saving:

params = model.module if hasattr(model, "module") else model
torch.save(params.state_dict(), Path(args.output_dir) / f"{exp_name}_epoch{epoch + 1}.pt")
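
A minimal sketch of the unwrap-before-save pattern suggested above, using stand-in classes so it runs without torch. `TinyModel`, `FakeDDP`, `unwrap_model` and `epoch_checkpoint_path` are hypothetical names. The only assumption about the real API is that `DistributedDataParallel` exposes the wrapped model as `.module`.

```python
from pathlib import Path


def unwrap_model(model):
    """Return the wrapped module if `model` has a `.module` attribute."""
    return model.module if hasattr(model, "module") else model


def epoch_checkpoint_path(output_dir, exp_name, epoch):
    """Build the per-epoch file name; `epoch` is zero-based."""
    return Path(output_dir) / f"{exp_name}_epoch{epoch + 1}.pt"


class TinyModel:
    """Stand-in for an nn.Module (hypothetical, for illustration only)."""

    def state_dict(self):
        return {"weight": [1.0, 2.0]}


class FakeDDP:
    """Stand-in for DistributedDataParallel, which exposes the model as `.module`."""

    def __init__(self, module):
        self.module = module


plain = TinyModel()
wrapped = FakeDDP(plain)

# Both the wrapped and the plain model resolve to the same underlying object,
# so one code path serves single-process and torchrun launches alike.
params = unwrap_model(wrapped)
path = epoch_checkpoint_path("out", "exp", epoch=0)
# In the real script: torch.save(params.state_dict(), path)
```

Saving the unwrapped model's state dict avoids the `module.` prefix that DDP adds to parameter names, so the checkpoint should load into a plain model without key renaming.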

Labels

ext: references framework: pytorch topic: documentation topic: text detection topic: text recognition type: enhancement