Timeout function in Training causing training to fail #627

@thao-do

Describe the bug
In the Training tab, training fails if the timeout option box is checked. This bug has existed since Nov 2024 and is reproducible on both macOS and Windows.

To Reproduce
Steps to reproduce the behavior:

  1. Start a new model
  2. Go through Curation & into Training
  3. Fill in all training parameters as normal, check the "timeout" box, and enter the number of minutes before timing out
  4. Start training. A popup modal notifies the user that training has failed

Console logs:

Expected behavior
Training should run as normal and end at the specified timeout.

Screenshots
Image

Describe your data (image format, 2D/3D, etc.)
LaminB1 sample dataset

Environment (please complete the following information):

  • OS: macOS 13.6 (22G120), Windows OS build 20348.2655 (EC2 instance)
  • Plugin Version: 1.0.0rc8
  • PyTorch version 2.0.1 on Windows OS
  • GPU? yes on Windows OS
  • CUDA version [e.g. 10.0]

Additional context
We discussed removing this feature completely, provided that:

  1. The user can estimate the training time based on how long each epoch is likely to take, and set an appropriate number of epochs
  2. If training needs to be stopped before it reaches the set number of epochs, or before it is auto-stopped because model performance is no longer improving, the user can cancel the training
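
For condition 1, the estimate is simple arithmetic: divide the time the user is willing to spend by the expected time per epoch and round down. A minimal sketch follows. The function name `epochs_for_time_budget` and its parameters are hypothetical and are not part of the plugin, and the per-epoch time is assumed to come from the user's own measurement of an earlier run.

```python
import math


def epochs_for_time_budget(budget_minutes: float, minutes_per_epoch: float) -> int:
    """Return the largest whole number of epochs that fits in the time budget.

    Hypothetical helper for illustration only; not part of the plugin.
    Returns 0 when even a single epoch would exceed the budget.
    """
    if budget_minutes <= 0 or minutes_per_epoch <= 0:
        raise ValueError("budget_minutes and minutes_per_epoch must be positive")
    # Round down so the planned run does not overshoot the budget.
    return math.floor(budget_minutes / minutes_per_epoch)


if __name__ == "__main__":
    # Example: a 90-minute budget with epochs that take about 4 minutes each.
    print(epochs_for_time_budget(90, 4))
```

The result would be entered as the number of epochs in the Training tab. Because early stopping or a manual cancel (condition 2) can end training sooner, this value is an upper bound on the run time rather than a prediction.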
