-
Notifications
You must be signed in to change notification settings - Fork 110
[Gb200] Skip IMEX configuration File creation if the Directory does not exist #3026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
mode '0644' | ||
action :create | ||
variables(imex_main_config_file_path: nvidia_imex_main_conf_file) | ||
only_if { Dir.exist?(node['cluster']['nvidia']['imex']['shared_dir']) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's more correct and safer to skip this resource if the service file already exists. In this way we would prevent overwriting the existing file when custom amis are used.
We can remove this condition and use :create_if_missing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The service file of IMEX for DLAMI is on a different location. So I am writing the conditions based on or locations instead of finding if their location exists or not as they could change their location in future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the details in description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The service file of IMEX for DLAMI is on a different location
This is an assumption that we are not in control of.
I suggest to create the file only if it is missing and avoid the only_if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its also an assumption that the service file exist in the location we want it to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The location i see if based on the vanilla DLAMI, its an assumption for now that this location will not chnage in future.
But if I add our own unit file at a particular location based on the check that it doesnt exist then I would fall in the weird scenario where i could potentially create 2 imex daemons on DLAMI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aligned on this offline. Agreement: skip the whole configure
action if the directory node['cluster']['nvidia']['imex']['shared_dir']
does not exists AND the service file "/etc/systemd/system/#{nvidia_imex_service}.service"
does not exist.
Define this condition into a self explanatory variable, e.g. imex_installed_by_parallelcluster
and use such variable in the only_if
blocks or , even better, at the beginning of the configure action.
group 'root' | ||
mode '0755' | ||
action :create_if_missing | ||
only_if { Dir.exist?(node['cluster']['nvidia']['imex']['shared_dir']) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're using this condition as a way to identify if the AMI executed the install phase. If the configuration phase requires such directory, why not creating it in the configuration phase and avoid every only_if
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the DLAMI already comes with nvidia-imex and we skip the installation during install phase. So if I make this change for creating the directory in config phase, we basically override their installation!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BLOCKING] If you create the directory, put the imex main config file and nodes config file there you're not overriding their installation.
To not override an existing IMEX installation is enough to:
- Create the files only if they do not exists (create_if_missing)
- Enable and start IMEX only if is is defined by our service unit file
At high level my point is: our goal is to not override an existing installation.
The safest way to prevent it is to 1/ do not create files if they already exists and 2/ do not start a service if it's not the one installed by us. A condition on the existance of a directory is too weak. If the user creates that specific directory (even if it is empty) IMEX may be broken by new compute nodes executing the config step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what you are saying is
- Create Directory in Config Phase
- Then create the 2 config and 1 service fle
- And start the Nvidia-imex
So esentially we will fall in the case where we support nvidia-imex for DLAMi?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The creation of directory and the 2 imex conf files are not a problem per see and I can agree on moving those.
The service file however is a problem as the condition would be difficult to handle
…ot exist * We create the Directory as part of AMI creation * Skip starting Imex service if the service file does not exist
* we remove /opt/parallelcluster/shared/nvidia-imex directory creation * We keep default path of `/etc/nvidia-imex/nodes_config.cfg` and `/etc/nvidia-imex/config.cfg` for IMEX configuration
6e3c1c5
to
f5dfc46
Compare
Description of changes - OLDER NOTES NEED TO BE UPDATED
Skip IMEX configuration File creation if the Directory does not exist
Without this change we see below error
/usr/lib/systemd/system/nvidia-imex.service
and they do not configure/etc/nvidia-imex/nodes_config.cfg
so we should not enable the service as wellTests
References
Checklist
develop
add the branch name as prefix in the PR title (e.g.[release-3.6]
).Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.