@gmarciani gmarciani commented Sep 19, 2025

Description of changes

In aws/aws-parallelcluster-cookbook#3029 we moved from shared IMEX configurations to local ones. This PR adapts the prolog accordingly.
It also fixes an assertion on the IMEX logs: the assertion used to check the logs on the head node, but it should check the logs on the compute nodes.

Tests

[ONGOING] test_gb200

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani changed the title from "Wip/mgiacomo/3140/gb200 prolog 0919 1" to "[GB200] Make IMEX prolog use local IMEx configurations + test fixes" Sep 19, 2025
@gmarciani gmarciani added the skip-changelog-update, 3.x, and Test labels Sep 19, 2025
@gmarciani gmarciani changed the base branch from develop to release-3.14 September 19, 2025 18:25
@gmarciani gmarciani marked this pull request as ready for review September 19, 2025 20:43
@gmarciani gmarciani requested review from a team as code owners September 19, 2025 20:43
return 1 # Not Updated
fi

# Try to acquire lock with timeout
Contributor

what if we do keep this as part of the prolog?

Contributor Author

Why?

Contributor

@himani2411 himani2411 Sep 19, 2025

It's a deadlock prevention, even though we added it for the shared file.
If we keep it, any scenario where multiple processes access this file and end up in a deadlock can be prevented, and we have logs showing that we were in a deadlock.
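
To illustrate the deadlock-prevention point above, here is a minimal sketch of acquiring a lock with a timeout and logging when it cannot be acquired. It is not the code from this PR: the function names `acquire_lock` and `release_lock` and the use of an atomic `mkdir` as the locking primitive are assumptions made for illustration, and the actual prolog may use a different mechanism.

```shell
#!/bin/bash
# Hypothetical helpers, not taken from the PR.

# Try to acquire a lock by creating a directory, giving up after a timeout.
# mkdir either creates the directory or fails because it already exists,
# and it does so atomically, which is what makes it usable as a lock.
acquire_lock() {
  local lock_dir="$1"    # path of the lock directory (assumed name)
  local timeout_s="$2"   # maximum number of seconds to wait
  local waited=0
  until mkdir "${lock_dir}" 2>/dev/null; do
    if [ "${waited}" -ge "${timeout_s}" ]; then
      # Leave a trace so a stuck lock holder can be diagnosed from the logs.
      echo "Timed out after ${timeout_s}s waiting for lock ${lock_dir}" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Release the lock by removing the directory.
release_lock() {
  rmdir "$1"
}
```

A caller would wrap the config update between `acquire_lock` and `release_lock` and exit with an error when the lock cannot be acquired, so that a stuck process shows up as a logged timeout instead of a job that hangs indefinitely.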

IPS_FROM_CR=$(get_ips_from_node_names "${CR_NODES}")
-IMEX_MAIN_CONFIG="/opt/parallelcluster/shared/nvidia-imex/config_${QUEUE_NAME}_${COMPUTE_RESOURCE_NAME}.cfg"
-IMEX_NODES_CONFIG="/opt/parallelcluster/shared/nvidia-imex/nodes_config_${QUEUE_NAME}_${COMPUTE_RESOURCE_NAME}.cfg"
+IMEX_MAIN_CONFIG="/etc/nvidia-imex/config.cfg"
Contributor

You also need to change the nvidia-imex-status.job file, which points to a specific config file.

Contributor Author

good catch, done!

@gmarciani gmarciani force-pushed the wip/mgiacomo/3140/gb200-prolog-0919-1 branch from c611d14 to 94393f7 Compare September 19, 2025 21:13
QUEUE_NAME=$(cat "/etc/chef/dna.json" | jq -r ".cluster.scheduler_queue_name")
COMPUTE_RES_NAME=$(cat "/etc/chef/dna.json" | jq -r ".cluster.scheduler_compute_resource_name")
-IMEX_CONFIG_FILE="/opt/parallelcluster/shared/nvidia-imex/config_${QUEUE_NAME}_${COMPUTE_RES_NAME}.cfg"
+IMEX_CONFIG_FILE="/etc/nvidia-imex/config.cfg"
Contributor

We can remove this file. We should no longer specify the configuration file if not needed!

Contributor Author

I remember it was necessary, but apparently it is not when the default location is used

himani2411 previously approved these changes Sep 19, 2025
@himani2411 himani2411 enabled auto-merge (squash) September 19, 2025 22:17
@himani2411 himani2411 merged commit 3197bcc into aws:release-3.14 Sep 19, 2025
23 of 24 checks passed
gmarciani added a commit to gmarciani/aws-parallelcluster that referenced this pull request Sep 22, 2025
…ws#7013)

* [GB200] Adapt the prolog used to configure IMEX so that it uses the local IMEX nodes config, rather than the shared one.

* [GB200] In test_ultraserver, fix the job script that checks for IMEX status by using the local IMEX config file rather than the shared one.

* [GB200] In test_ultraserver, fix assertion on imex logs.

* [GB200] In test_ultraserver, fix assert_imex_nodes_config_is_correct.

* [GB200] In test_ultraserver, fix job to check imex status.

* [GB200] In test_ultraserver, fix assert_no_errors_in_logs
himani2411 pushed a commit to himani2411/aws-parallelcluster that referenced this pull request Oct 1, 2025
hanwen-cluster pushed a commit to hanwen-cluster/aws-parallelcluster that referenced this pull request Oct 1, 2025
himani2411 pushed a commit that referenced this pull request Oct 1, 2025