@himani2411 himani2411 commented Sep 18, 2025

Description of changes - OLDER NOTES NEED TO BE UPDATED

Skip IMEX configuration File creation if the Directory does not exist

  • We create the Directory as part of AMI creation
  • Skip starting the IMEX service if the service file does not exist.
    Without this change we see the error below:
* template[/opt/parallelcluster/shared/nvidia-imex/nodes_config_q1_cr1.cfg] action create_if_missing
[2025-09-18T00:59:55+00:00] INFO: Processing template[/opt/parallelcluster/shared/nvidia-imex/nodes_config_q1_cr1.cfg] action create_if_missing (aws-parallelcluster-platform::nvidia_config line 37)
Sep 18 01:00:00 ip-172-31-42-141 user-data[28237]: 
Sep 18 01:00:00 ip-172-31-42-141 user-data[28237]:      * Parent directory /opt/parallelcluster/shared/nvidia-imex does not exist.
Sep 18 01:00:00 ip-172-31-42-141 user-data[28237]:      ================================================================================
Sep 18 01:00:00 ip-172-31-42-141 user-data[28237]:      Error executing action `create_if_missing` on resource 'template[/opt/parallelcluster/shared/nvidia-imex/nodes_config_q1_cr1.cfg]'
Sep 18 01:00:00 ip-172-31-42-141 user-data[28237]:      ================================================================================
Sep 18 01:00:00 ip-172-31-42-141 user-data[28237]: 
Sep 18 01:00:00 ip-172-31-42-141 user-data[28237]:      Chef::Exceptions::EnclosingDirectoryDoesNotExist
Sep 18 01:00:00 ip-172-31-42-141 user-data[28237]:      ------------------------------------------------
Sep 18 01:00:00 ip-172-31-42-141 user-data[28237]:      Parent directory /opt/parallelcluster/shared/nvidia-imex does not exist.
Sep 18 01:00:00 ip-172-31-42-141 user-data[28237]: 
Sep 18 01:00:00 ip-172-31-42-141 user-data[28237]:      Resource Declaration:
Sep 18 01:00:00 ip-172-31-42-141 user-data[28237]:      ---------------------
  • On the DLAMI the service file is at /usr/lib/systemd/system/nvidia-imex.service and /etc/nvidia-imex/nodes_config.cfg is not configured, so we should not enable the service either
* ○ nvidia-imex.service - NVIDIA IMEX service
    Loaded: loaded (/usr/lib/systemd/system/nvidia-imex.service; enabled; preset: disabled)
    Active: inactive (dead) since Thu 2025-09-18 17:54:33 UTC; 1h 7min ago
  Duration: 2ms
   Process: 2284 ExecStart=/usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg (code=exited, status=0/SUCCESS)
  Main PID: 2371 (code=exited, status=0/SUCCESS)
       CPU: 21ms

Sep 18 17:54:31 ip-172-31-69-32.ec2.internal systemd[1]: Starting nvidia-imex.service - NVIDIA IMEX service...
Sep 18 17:54:33 ip-172-31-69-32.ec2.internal systemd[1]: Started nvidia-imex.service - NVIDIA IMEX service.
Sep 18 17:54:33 ip-172-31-69-32.ec2.internal systemd[1]: nvidia-imex.service: Deactivated successfully.

Tests

  • [ONGOING] test_gb200 with Custom AMI

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@himani2411 himani2411 requested review from a team as code owners September 18, 2025 17:45
@himani2411 himani2411 added the 3.x label Sep 18, 2025
@himani2411 himani2411 changed the base branch from develop to release-3.14 September 18, 2025 17:51
mode '0644'
action :create
variables(imex_main_config_file_path: nvidia_imex_main_conf_file)
only_if { Dir.exist?(node['cluster']['nvidia']['imex']['shared_dir']) }
Contributor

@gmarciani gmarciani Sep 18, 2025

It's more correct and safer to skip this resource if the service file already exists. This way we would prevent overwriting the existing file when custom AMIs are used.

We can remove this condition and use :create_if_missing.

Contributor Author

The IMEX service file on the DLAMI is in a different location. So I am writing the conditions based on our locations instead of checking whether their location exists, as they could change their location in the future.

Contributor Author

Added the details in description

Contributor

@gmarciani gmarciani Sep 18, 2025

The service file of IMEX for DLAMI is on a different location

This is an assumption that we are not in control of.

I suggest creating the file only if it is missing and avoiding the only_if.

Contributor Author

It's also an assumption that the service file exists in the location we want it to be.

Contributor Author

@himani2411 himani2411 Sep 18, 2025

The location I see is based on the vanilla DLAMI; it's an assumption for now that this location will not change in the future.
But if I add our own unit file at a particular location based on the check that it doesn't exist, then I would fall into the weird scenario where I could potentially create 2 IMEX daemons on the DLAMI.

Contributor

Aligned on this offline. Agreement: skip the whole configure action if the directory node['cluster']['nvidia']['imex']['shared_dir'] does not exist AND the service file "/etc/systemd/system/#{nvidia_imex_service}.service" does not exist.

Define this condition in a self-explanatory variable, e.g. imex_installed_by_parallelcluster, and use that variable in the only_if blocks or, even better, at the beginning of the configure action.
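
The agreed condition can be sketched in plain Ruby as below. This is only an illustration, not the cookbook's actual code: the helper name follows the variable name suggested above, the two paths are passed in as arguments instead of being read from node attributes, and it assumes the agreement means "skip only when both the directory and the service file are missing".

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper (name taken from the variable suggested in the review).
# Returns false only when BOTH the shared directory and the service unit file
# are missing, which is the case where the configure action would be skipped.
def imex_installed_by_parallelcluster?(shared_dir, service_file)
  Dir.exist?(shared_dir) || File.exist?(service_file)
end

# Usage with throwaway paths; real code would take these from node attributes.
Dir.mktmpdir do |root|
  shared_dir = File.join(root, 'nvidia-imex')
  service_file = File.join(root, 'nvidia-imex.service')

  # Neither path exists yet, so the configure action would be skipped.
  puts imex_installed_by_parallelcluster?(shared_dir, service_file)

  # Once the directory exists, the configure action would run.
  FileUtils.mkdir_p(shared_dir)
  puts imex_installed_by_parallelcluster?(shared_dir, service_file)
end
```

Evaluating the condition once at the top of the configure action, rather than repeating it in each only_if, keeps every resource in the action consistent with the same decision.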

group 'root'
mode '0755'
action :create_if_missing
only_if { Dir.exist?(node['cluster']['nvidia']['imex']['shared_dir']) }
Contributor

You're using this condition as a way to identify whether the AMI executed the install phase. If the configuration phase requires this directory, why not create it in the configuration phase and avoid every only_if?

Contributor Author

Because the DLAMI already comes with nvidia-imex and we skip the installation during install phase. So if I make this change for creating the directory in config phase, we basically override their installation!

Contributor

@gmarciani gmarciani Sep 18, 2025

[BLOCKING] If you create the directory and put the IMEX main config file and nodes config file there, you're not overriding their installation.
To avoid overriding an existing IMEX installation, it is enough to:

  1. Create the files only if they do not exist (create_if_missing)
  2. Enable and start IMEX only if it is defined by our service unit file

At a high level, my point is: our goal is to not override an existing installation.
The safest way to prevent that is to 1/ not create files if they already exist and 2/ not start a service if it's not the one installed by us. A condition on the existence of a directory is too weak. If the user creates that specific directory (even if it is empty), IMEX may be broken by new compute nodes executing the config step.
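
The two rules above can be modelled in plain Ruby, outside Chef, roughly as follows. Both helper names are hypothetical, and the sketch assumes that the presence of the unit file at the path our install phase writes to is enough to tell that the service was installed by us.

```ruby
require 'tmpdir'

# Hypothetical helper: write a file only when it is absent, mirroring the
# semantics of Chef's :create_if_missing action. Returns true if it wrote.
def create_if_missing(path, content)
  return false if File.exist?(path)

  File.write(path, content)
  true
end

# Hypothetical check: treat the service as ours only when the unit file exists
# at the path our install phase writes to (an assumption of this sketch).
def managed_by_us?(unit_path)
  File.exist?(unit_path)
end

Dir.mktmpdir do |root|
  config = File.join(root, 'nodes_config.cfg')
  File.write(config, "existing\n")      # pretend the AMI already shipped this file
  create_if_missing(config, "ours\n")   # leaves the existing file untouched
  puts File.read(config)

  unit = File.join(root, 'nvidia-imex.service')
  puts(managed_by_us?(unit) ? 'enable and start' : 'leave service alone')
end
```

With these two guards, an empty directory created by a user does not change the outcome: files already present are kept, and a service that was not installed by us is never started.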

Contributor Author

So what you are saying is:

  • Create the directory in the config phase
  • Then create the 2 config files and 1 service file
  • And start nvidia-imex

So essentially we will fall into the case where we support nvidia-imex for DLAMI?

Contributor Author

The creation of the directory and the 2 IMEX conf files is not a problem per se, and I can agree on moving those.

The service file, however, is a problem, as the condition would be difficult to handle.

Himani Anil Deshpande added 2 commits September 18, 2025 19:05
…ot exist

* We create the Directory as part of AMI creation
* Skip starting Imex service if the service file does not exist
* we remove /opt/parallelcluster/shared/nvidia-imex directory creation
* We keep default path of `/etc/nvidia-imex/nodes_config.cfg` and `/etc/nvidia-imex/config.cfg` for IMEX configuration