@sjpb
Collaborator

@sjpb sjpb commented Jan 6, 2026

Affects StackHPC CI only:

  • Uses new Leafcloud jumphost in CI project to make CI configuration less confusing and more secure
  • Stores jumphost fingerprints in repo variable to make them independent of specific checkouts
  • Defines bastion configuration for Ansible (ansible_ssh_common_args) from current branch in upgrade test workflow, even when latest release is checked out.
  • Prints the jumphost ssh public key derived from the private key held in the repo secret; this ensures the key format is valid and allows checking that the correct key is in use
  • Updates extrabuild workflow so CI cloud and Packer failure behaviour can be selected when running manually.
  • Adds retries to tofu apply when reimaging login/compute nodes in the upgrade test workflow, due to transient CI cloud issues (matches the retries on initial creation added in #833, "Add retries to CI tofu apply")
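
The retry behaviour in the last bullet can be sketched roughly as follows. This is a minimal Python illustration, not the workflow's actual implementation: the helper name `run_with_retries`, the attempt count and the delay are all assumptions, and the real workflow may well use a retry action or a shell loop instead.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def run_with_retries(
    cmd: Sequence[str],
    attempts: int = 3,
    delay_s: float = 30.0,
    runner: Optional[Callable[[Sequence[str]], int]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd until it exits 0; return the number of the attempt that succeeded.

    Hypothetical helper: attempts and delay_s are illustrative defaults,
    not values taken from the CI workflows.
    """
    if runner is None:
        # Default runner: execute the command and report its exit code.
        def runner(c: Sequence[str]) -> int:
            return subprocess.run(list(c)).returncode

    for attempt in range(1, attempts + 1):
        if runner(cmd) == 0:
            return attempt
        if attempt < attempts:
            # Give a transient cloud error a chance to clear before retrying.
            sleep(delay_s)
    raise RuntimeError(f"{list(cmd)!r} failed after {attempts} attempts")
```

For example, `run_with_retries(["tofu", "apply", "-auto-approve"])` would re-run the apply up to three times. The `runner` and `sleep` parameters exist only so the logic can be exercised without a cloud.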

NB: This cannot be merged until rebuild works properly on Leafcloud again, but can be used to test ssh works.
NB: The fat image builds below do not need to be in this PR - they are simply to check that ssh etc works to the build VM.

sjpb commented Jan 6, 2026

Image build - does not need to be merged: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20750594341

edit - failed, hadn't updated bastion details for Packer.

sjpb commented Jan 6, 2026

Image build - does not need to be merged: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20751351672

sjpb commented Jan 7, 2026

Image build - does not need to be merged: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20780362798

sjpb commented Jan 7, 2026

Image build - does not need to be merged: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20780857603

sjpb commented Jan 7, 2026

Ok so using the ssh_agent options, Packer has not added the ssh key to the instance! Horizon shows no keypair, and cloud-init logs from Horizon show:

ci-info: no authorized SSH keys fingerprints found for user rocky.

@sjpb sjpb force-pushed the ci/leafcloud-tf-jumphost branch from 3a59489 to 590286c Compare January 7, 2026 16:34
@sjpb sjpb force-pushed the ci/leafcloud-tf-jumphost branch from 590286c to 12d44f0 Compare January 7, 2026 16:38
@sjpb sjpb force-pushed the ci/leafcloud-tf-jumphost branch from 16e44cd to 9677e18 Compare January 7, 2026 17:13

sjpb commented Jan 8, 2026

Manual extra build: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20813201479/job/59782268969

Could communicate and was on manually-selected cloud. Cancelled, don't need it to complete.

sjpb commented Jan 8, 2026

Ok I suspect the stackhpc workflow is failing because Leafcloud rebuild is currently broken. I'm kind of confused how the correct bastion key is getting into known-hosts though, given we're checking out the last release ...

sjpb commented Jan 8, 2026

Extrabuild did successfully launch on LEAFCLOUD, as required by PR tag.

@sjpb sjpb changed the title Use TF-based jumphost for Leafcloud CI Use new OpenTofu-based jumphost for Leafcloud CI Jan 8, 2026

sjpb commented Jan 8, 2026

  • extrabuild - ansible connected
  • stackhpc - ansible DIDN'T connect during "Configure cluster at latest release"
    • instance public key in logs matches keypair matches workflow output
    • could login to instance from laptop w/ relevant key

Problem could be that this PR means ansible_ssh_common_args has changed since last release? But TBH, I'd have expected the last one to work as well. And wouldn't explain why I've had a successful run before.

sjpb commented Jan 8, 2026

Ah. No, the problem is that before, we were using the OLD bastion IP from the checkout together with the OLD bastion fingerprint. Now that I've moved the latter to env vars, we're using the OLD bastion IP + NEW fingerprint.

Before that change, we had the reverse situation, which is why the stackhpc workflow failed to connect when checked out at current release.

The OLD bastion is still up, hence confusion! I'm going to move its IP ...
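
The mix-up can be illustrated with a small model. This is a sketch only: it treats known_hosts as a simple address-to-key mapping, and the addresses (taken from a documentation range) and key strings are placeholders, not the real bastion details.

```python
from typing import Mapping, NamedTuple


class Bastion(NamedTuple):
    ip: str        # address the workflow connects to
    host_key: str  # host key the server at that address actually presents


# Placeholder values (assumptions for illustration only).
OLD = Bastion(ip="192.0.2.10", host_key="OLD-KEY")
NEW = Bastion(ip="192.0.2.20", host_key="NEW-KEY")
SERVERS = {b.ip: b for b in (OLD, NEW)}  # both bastions are still up


def host_key_accepted(ip: str, known_hosts: Mapping[str, str]) -> bool:
    """Strict host key checking: the key presented by the server at `ip`
    must equal the key recorded for that address."""
    presented = SERVERS[ip].host_key
    return known_hosts.get(ip) == presented


def attempt(ip_source: Bastion, fingerprint_source: Bastion) -> bool:
    """Connect to the IP from one config source, trusting the key from another."""
    known_hosts = {ip_source.ip: fingerprint_source.host_key}
    return host_key_accepted(ip_source.ip, known_hosts)


for ip_src, fp_src in [(OLD, OLD), (OLD, NEW), (NEW, OLD), (NEW, NEW)]:
    result = "ok" if attempt(ip_src, fp_src) else "host key mismatch"
    print(f"IP={ip_src.host_key[:3]} fingerprint={fp_src.host_key[:3]} -> {result}")
```

Only the matched pairs connect in this model. Because the old bastion still answers on its IP, a mixed pair would fail on the host key check rather than with a timeout, which is presumably what made the cause harder to spot.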

@sjpb sjpb force-pushed the ci/leafcloud-tf-jumphost branch 2 times, most recently from ba82599 to ca86fa7 Compare January 8, 2026 14:14
@sjpb sjpb force-pushed the ci/leafcloud-tf-jumphost branch from ca86fa7 to 9656175 Compare January 8, 2026 14:41

sjpb commented Jan 8, 2026

Ok extrabuild has comms, stackhpc has comms, but is failing in RL8 after swapping to current branch and doing rebuild due to Leafcloud flakiness. RL9 has comms after the swap/rebuild, so repo config is now OK.

@sjpb sjpb marked this pull request as ready for review January 8, 2026 15:53
@sjpb sjpb requested a review from a team as a code owner January 8, 2026 15:53

sjpb commented Jan 8, 2026

Checking fatimage again: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20822896535.

Ansible running OK, cancelling.

sjpb commented Jan 8, 2026

Extrabuilds failing above with:

FAT_IMAGES: {
  "cluster_image": {
    "RL8": "openhpc-RL8-260107-1747-7e14a51d",
    "RL9": "openhpc-RL9-260107-1747-7e14a51d"
  }
}

...
==> openstack.openhpc: Found Image ID: b82e6509-0c4c-44b7-9646-6cf79bf855f6
...
==> openstack.openhpc: Error creating volume: Bad request with: [POST https://create.leaf.cloud:8776/v3/f39848421b2747148400ad8eeae8d536/volumes], error message: {"badRequest": {"code": 400, "message": "Invalid input received: Image b82e6509-0c4c-44b7-9646-6cf79bf855f6 is not active."}}

Previous extrabuilds which worked:

FAT_IMAGES: {
  "cluster_image": {
    "RL8": "openhpc-RL8-260102-1202-9b0ab59d",
    "RL9": "openhpc-RL9-260102-1202-9b0ab59d"
  }
}

Why did it change when cluster_images.auto.tfvars has not changed?? Was something merged to main?

Ah yes, b/c workflows by default run on a merge branch (the PR branch merged with its base) ... so can just rerun it once image sync has finished.
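
One quick way to see exactly which image names differ between two runs is to diff the FAT_IMAGES blocks from their logs. A minimal sketch using the two blocks quoted above; the helper name `changed_images` is hypothetical:

```python
import json

# FAT_IMAGES from the earlier extrabuild that worked.
WORKED = json.loads("""
{"cluster_image": {"RL8": "openhpc-RL8-260102-1202-9b0ab59d",
                   "RL9": "openhpc-RL9-260102-1202-9b0ab59d"}}
""")
# FAT_IMAGES from the failing extrabuild.
FAILED = json.loads("""
{"cluster_image": {"RL8": "openhpc-RL8-260107-1747-7e14a51d",
                   "RL9": "openhpc-RL9-260107-1747-7e14a51d"}}
""")


def changed_images(before: dict, after: dict) -> dict:
    """Return {os_name: (old_image, new_image)} for entries that differ."""
    old = before.get("cluster_image", {})
    new = after.get("cluster_image", {})
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


for name, (old_image, new_image) in changed_images(WORKED, FAILED).items():
    print(f"{name}: {old_image} -> {new_image}")
```

Both entries changing together is consistent with the image definition having been updated on the base branch and picked up through the merge branch.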
