-
Notifications
You must be signed in to change notification settings - Fork 90
Add error to point user to slurm resume log #676
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
14ea243 to
881d029
Compare
src/slurm_plugin/instance_manager.py
Outdated
| self._update_failed_nodes(set(nodes_resume_list), "InsufficientInstanceCapacity", override=False) | ||
| self._update_failed_nodes( | ||
| set(nodes_resume_list), | ||
| "InsufficientInstanceCapacity(Check slurm_resume log for ec2 error codes)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[minor] typo ec2 -> EC2
src/slurm_plugin/instance_manager.py
Outdated
| self._update_failed_nodes(set(nodes_resume_list), "InsufficientInstanceCapacity", override=False) | ||
| self._update_failed_nodes( | ||
| set(nodes_resume_list), | ||
| "InsufficientInstanceCapacity(Check slurm_resume log for ec2 error codes)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InsufficientInstanceCapacity is. a common pcluster error code used in different places. Can we define a constant for it?
Same for the sentence Check....codes. Since we repeat that in many places we can have a constant
src/slurm_plugin/instance_manager.py
Outdated
| self._update_failed_nodes(set(nodes_resume_list), "InsufficientInstanceCapacity", override=False) | ||
| self._update_failed_nodes( | ||
| set(nodes_resume_list), | ||
| "InsufficientInstanceCapacity(Check slurm_resume log for ec2 error codes)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BLOCKING] I agree in making the log line more helpful, redirecting the user to the right log. However, doing it this way actually changes the error code from InsufficientInstanceCapacity to InsufficientInstanceCapacity(Check...codes), which ultimately can have consequences in the way we monitor ICE errors on the clister dashboard.
See
| **{failure: "ice-failures" for failure in [*SlurmNode.EC2_ICE_ERROR_CODES, "LimitedInstanceCapacity"]}, |
| log.info( | ||
| "The following compute resources are in down state due to insufficient capacity: %s, " | ||
| "compute resources will be reset after insufficient capacity timeout (%s seconds) expired", | ||
| "compute resources will be reset after insufficient capacity timeout (%s seconds) expired. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Test] Can we reflect this change into the corresponding unit test. The same thing you did for the resume script.
Description of changes
Tests
In slurmctld:
[2025-10-30T12:16:05.106] update_node: node compute-dy-cit-10 reason set to: (Code:InsufficientInstanceCapacity)Failure when resuming nodes - Check the slurm_resume log for EC2 error codesIn clustermgtd:
2025-10-30 12:26:48,861 - [slurm_plugin.clustermgtd:_reset_timeout_expired_compute_resources] - INFO - The following compute resources are in down state due to insufficient capacity: {'compute': {'cit': ComputeResourceFailureEvent(timestamp=datetime.datetime(2025, 10, 30, 12, 16, 42, 904354, tzinfo=datetime.timezone.utc), error_code='InsufficientInstanceCapacity')}}, compute resources will be reset after insufficient capacity timeout (600.0 seconds) expired. Check the slurm_resume log for EC2 error codes.In slurm_resume:
2025-10-29 13:39:59,768 - 8667 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 2 - Error in CreateFleet request (aa9a6ad3-7ac3-4745-a4d1-b8c178907b8b): InvalidParameter - Security group sg-0f2789bcd3e49cdf3 and subnet subnet-0c766771e11dab28c belong to different networks.Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.