[GB200] Make IMEX prolog use local IMEx configurations + test fixes #7013
return 1 # Not Updated
fi

# Try to acquire lock with timeout
What if we keep this as part of the prolog?
Why?
It's deadlock prevention, even though we originally added it for the shared file.
Keeping it guards against any scenario where several processes access this file and end up deadlocked, and it gives us logs showing that the deadlock occurred.
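As a rough illustration of the point above, a lock with a timeout could look like the sketch below. It assumes the util-linux `flock` utility and a dedicated lock file; the function name `run_with_lock` and its arguments are hypothetical and are not taken from the prolog in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command while holding an exclusive lock,
# giving up after a timeout instead of blocking forever.
run_with_lock() {
  local lock_file="$1" timeout_s="$2"
  shift 2
  (
    # flock -w waits at most timeout_s seconds for the lock on fd 9.
    if ! flock -x -w "${timeout_s}" 9; then
      # The log line is what makes a stuck lock visible afterwards.
      echo "Timed out after ${timeout_s}s waiting for lock ${lock_file}" >&2
      exit 1
    fi
    "$@"
  ) 9>>"${lock_file}"  # append mode, so opening the lock file never truncates it
}
```

A caller would wrap the config update, for example `run_with_lock /tmp/imex.lock 60 some_update_command`, and treat a non-zero exit as "not updated". The lock is released when the subshell exits, so a crashed holder cannot keep it forever.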
IPS_FROM_CR=$(get_ips_from_node_names "${CR_NODES}")
IMEX_MAIN_CONFIG="/opt/parallelcluster/shared/nvidia-imex/config_${QUEUE_NAME}_${COMPUTE_RESOURCE_NAME}.cfg"
IMEX_NODES_CONFIG="/opt/parallelcluster/shared/nvidia-imex/nodes_config_${QUEUE_NAME}_${COMPUTE_RESOURCE_NAME}.cfg"
IMEX_MAIN_CONFIG="/etc/nvidia-imex/config.cfg"
You also need to change the nvidia-imex-status.job file, which points to a specific config file.
good catch, done!
…ocal IMEX nodes config, rather than the shared one.
…status by using the local IMEX config file rather than the shared one.
c611d14 to 94393f7
QUEUE_NAME=$(cat "/etc/chef/dna.json" | jq -r ".cluster.scheduler_queue_name")
COMPUTE_RES_NAME=$(cat "/etc/chef/dna.json" | jq -r ".cluster.scheduler_compute_resource_name")
IMEX_CONFIG_FILE="/opt/parallelcluster/shared/nvidia-imex/config_${QUEUE_NAME}_${COMPUTE_RES_NAME}.cfg"
IMEX_CONFIG_FILE="/etc/nvidia-imex/config.cfg"
We can remove this file. We should no longer specify the configuration file if not needed!
I remember it was necessary, but apparently it is not when the default location is used
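A minimal sketch of that idea: pass the config path only when it differs from the default location. The helper name `imex_status_cmd` is hypothetical, and the `-N` (node status) and `-c` (config path) options of `nvidia-imex-ctl` are assumptions here; the sketch only prints the command line and does not run it.

```shell
#!/usr/bin/env bash
# Default location taken from the diff above.
DEFAULT_IMEX_CONFIG="/etc/nvidia-imex/config.cfg"

# Hypothetical helper: print the status command, adding the config option
# only when a non-default config file is requested.
imex_status_cmd() {
  local config_file="${1:-${DEFAULT_IMEX_CONFIG}}"
  if [ "${config_file}" = "${DEFAULT_IMEX_CONFIG}" ]; then
    # Default location: no need to name the config file explicitly.
    echo "nvidia-imex-ctl -N"
  else
    # Assumed flag: -c selects an alternative config file.
    echo "nvidia-imex-ctl -N -c ${config_file}"
  fi
}
```

With no argument the helper prints the bare status command, which matches the suggestion to stop specifying the configuration file when it is not needed.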
…ws#7013)
* [GB200] Adapt the prolog used to configure IMEX so that it uses the local IMEX nodes config, rather than the shared one.
* [GB200] In test_ultraserver, fix the job script that checks for IMEX status by using the local IMEX config file rather than the shared one.
* [GB200] In test_ultraserver, fix assertion on imex logs.
* [GB200] In test_ultraserver, fix assert_imex_nodes_config_is_correct.
* [GB200] In test_ultraserver, fix job to check imex status.
* [GB200] In test_ultraserver, fix assert_no_errors_in_logs
Description of changes
In aws/aws-parallelcluster-cookbook#3029 we moved from shared IMEX configurations to local ones. This PR adapts the prolog accordingly.
We also fixed an assertion on IMEX logs: it used to check the logs on the head node, but it should check the logs on the compute nodes.
Tests
[ONGOING] test_gb200
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.