Skip to content

Conversation

@hgreebe
Copy link
Contributor

@hgreebe hgreebe commented Oct 29, 2025

Description of changes

Tests

  • Ran test_slurm_accounting with change to cause the InvalidParameter error

In slurmctld:
[2025-10-30T12:16:05.106] update_node: node compute-dy-cit-10 reason set to: (Code:InsufficientInstanceCapacity)Failure when resuming nodes - Check the slurm_resume log for EC2 error codes

In clustermgtd:
2025-10-30 12:26:48,861 - [slurm_plugin.clustermgtd:_reset_timeout_expired_compute_resources] - INFO - The following compute resources are in down state due to insufficient capacity: {'compute': {'cit': ComputeResourceFailureEvent(timestamp=datetime.datetime(2025, 10, 30, 12, 16, 42, 904354, tzinfo=datetime.timezone.utc), error_code='InsufficientInstanceCapacity')}}, compute resources will be reset after insufficient capacity timeout (600.0 seconds) expired. Check the slurm_resume log for EC2 error codes.

In slurm_resume:
2025-10-29 13:39:59,768 - 8667 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 2 - Error in CreateFleet request (aa9a6ad3-7ac3-4745-a4d1-b8c178907b8b): InvalidParameter - Security group sg-0f2789bcd3e49cdf3 and subnet subnet-0c766771e11dab28c belong to different networks.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hgreebe hgreebe force-pushed the develop-test branch 2 times, most recently from 14ea243 to 881d029 Compare October 29, 2025 16:43
@hgreebe hgreebe marked this pull request as ready for review October 29, 2025 16:52
@hgreebe hgreebe requested review from a team as code owners October 29, 2025 16:52
self._update_failed_nodes(set(nodes_resume_list), "InsufficientInstanceCapacity", override=False)
self._update_failed_nodes(
set(nodes_resume_list),
"InsufficientInstanceCapacity(Check slurm_resume log for ec2 error codes)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] typo ec2 -> EC2

self._update_failed_nodes(set(nodes_resume_list), "InsufficientInstanceCapacity", override=False)
self._update_failed_nodes(
set(nodes_resume_list),
"InsufficientInstanceCapacity(Check slurm_resume log for ec2 error codes)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InsufficientInstanceCapacity is. a common pcluster error code used in different places. Can we define a constant for it?

Same for the sentence Check....codes. Since we repeat that in many places we can have a constant

self._update_failed_nodes(set(nodes_resume_list), "InsufficientInstanceCapacity", override=False)
self._update_failed_nodes(
set(nodes_resume_list),
"InsufficientInstanceCapacity(Check slurm_resume log for ec2 error codes)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BLOCKING] I agree in making the log line more helpful, redirecting the user to the right log. However, doing it this way actually changes the error code from InsufficientInstanceCapacity to InsufficientInstanceCapacity(Check...codes), which ultimately can have consequences in the way we monitor ICE errors on the clister dashboard.
See

**{failure: "ice-failures" for failure in [*SlurmNode.EC2_ICE_ERROR_CODES, "LimitedInstanceCapacity"]},

(cherry picked from commit c84aeb5)
(cherry picked from commit bdc8706)
log.info(
"The following compute resources are in down state due to insufficient capacity: %s, "
"compute resources will be reset after insufficient capacity timeout (%s seconds) expired",
"compute resources will be reset after insufficient capacity timeout (%s seconds) expired. "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Test] Can we reflect this change into the corresponding unit test. The same thing you did for the resume script.

@hgreebe hgreebe merged commit 9457ba4 into aws:develop Oct 30, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants