Use new OpenTofu-based jumphost for Leafcloud CI #886
Conversation
Image build - does not need to be merged: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20750594341 - edit: failed, hadn't updated bastion details for Packer.
Image build - does not need to be merged: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20751351672
Image build - does not need to be merged: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20780362798
Image build - does not need to be merged: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20780857603
Image build: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20781003408 - don't think this ran on the right commit! Try again: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20781290287
OK, so using the ssh_agent options, Packer has not added the ssh key to the instance! Horizon shows no keypair, and cloud-init logs from Horizon show: …
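For reference, a quick way to double-check this from the CLI rather than Horizon - a sketch assuming a configured openstack client; BUILD_VM is a hypothetical placeholder for the build instance name, not something from this repo:

```bash
# Show which keypair (if any) was attached to the Packer build VM.
# $BUILD_VM is a hypothetical placeholder for the instance name or ID.
openstack server show "$BUILD_VM" -f value -c key_name
# Empty output here matches what Horizon shows: no keypair was injected.
```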
Force-pushed from 3a59489 to 590286c
Force-pushed from 590286c to 12d44f0
Force-pushed from 16e44cd to 9677e18
Image build: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20790220411
Manual extra build: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20813201479/job/59782268969 - could communicate, and was on the manually-selected cloud. Cancelled - don't need it to complete.
OK, I suspect the stackhpc workflow failed because the Leafcloud rebuild is currently broken. I'm kind of confused how the correct bastion key is getting into known_hosts though, given we're checking out at the last release ...
Extrabuild did successfully launch on LEAFCLOUD, as required by the PR tag.
This reverts commit f5ac850.
The problem could be that this PR means ansible_ssh_common_args has changed since the last release? But TBH I'd have expected the last one to work as well, and that wouldn't explain why I've had a successful run before.
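For context, ansible_ssh_common_args is where the jumphost ProxyJump options live, so a change there alters how every CI ssh connection is made. A minimal sketch - the bastion user/host below are hypothetical placeholders, not the real CI values:

```bash
# Hedged example of the kind of options ansible_ssh_common_args carries.
# ci-user@bastion.example.org is a placeholder, not the actual CI bastion.
export ANSIBLE_SSH_COMMON_ARGS='-o ProxyJump=ci-user@bastion.example.org'
# Every Ansible ssh connection then hops through that bastion:
ansible all -i inventory/hosts -m ping
```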
… @ last release works
Ah. No - the problem is that before, we were using the OLD bastion IP (based on the checkout) with the OLD bastion fingerprint. Now I've moved the latter to env vars, we're using the OLD bastion IP + the NEW fingerprint. Before that change we had the reverse situation, which is why the stackhpc workflow failed to connect when checked out at the current release. The OLD bastion is still up, hence the confusion! I'm going to move its IP ...
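The env-var approach amounts to something like the sketch below in the workflow - variable names here are hypothetical, not the actual workflow's:

```bash
# Seed known_hosts from CI-provided env vars instead of the checked-out repo,
# so the expected fingerprint no longer tracks the (possibly stale) checkout.
mkdir -p ~/.ssh
echo "${BASTION_IP} ${BASTION_HOST_KEY}" >> ~/.ssh/known_hosts
# Pairing an OLD IP with a NEW host key (or vice versa) fails host key
# verification - exactly the mismatch described above.
ssh -o StrictHostKeyChecking=yes "ci-user@${BASTION_IP}" true
```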
Force-pushed from ba82599 to ca86fa7
Force-pushed from ca86fa7 to 9656175
OK - extrabuild has comms and stackhpc has comms, but the latter is failing in RL8 after swapping to the current branch and doing a rebuild, due to Leafcloud flakiness. RL9 has comms after the swap/rebuild, so the repo config is now OK.
Checking fatimage again: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20822896535. Ansible running OK, cancelling.
Extrabuilds failing above with: … Previous extrabuilds which worked: … Why did it change when cluster_images.auto.tfvars has not changed?? Was it merging something to main? Ah yes - because workflows by default run in a merge branch ... so I can just rerun it once the image sync has finished.
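The merge-branch behaviour is easy to miss: for pull_request-triggered workflows, GITHUB_SHA is an ephemeral merge commit of the PR head into the base branch, so the job picks up anything merged to main since the PR was opened. A shell sketch to see which commit actually ran:

```bash
# In a pull_request-triggered job, GITHUB_SHA is the temporary merge commit
# (refs/pull/<n>/merge), not the PR head itself.
git log --oneline -1 "$GITHUB_SHA"
# Its first parent is the base branch tip and its second parent is the PR
# head, so changes newly merged to main are included in the checkout:
git log --oneline -1 "${GITHUB_SHA}^1"   # base branch tip at run time
git log --oneline -1 "${GITHUB_SHA}^2"   # PR head commit
```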
Affects StackHPC CI only:
- … ansible_ssh_common_args) from the current branch in the upgrade test workflow, even when the latest release is checked out.
- Retries tofu apply when reimaging login/compute nodes in the upgrade test workflow, due to transient CI cloud issues (matches retries on initial creation added by Add retries to CI tofu apply #833); see the sketch below.

NB: This cannot be merged until rebuild works properly on Leafcloud again, but it can be used to test that ssh works.
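A minimal sketch of the kind of retry wrapper this adds around tofu apply - the loop shape, attempt count and sleep are illustrative, not the workflow's actual values:

```bash
#!/usr/bin/env bash
# Retry tofu apply a few times to ride out transient cloud errors.
max_attempts=3
for attempt in $(seq 1 "$max_attempts"); do
    tofu apply -auto-approve && exit 0
    echo "tofu apply failed (attempt ${attempt}/${max_attempts}), retrying ..." >&2
    sleep 30
done
echo "tofu apply failed after ${max_attempts} attempts" >&2
exit 1
```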
NB: The fat image builds below do not need to be in this PR - they are simply to check that ssh etc. works to the build VM.