
Enable autoscaling #128

Closed (wants to merge 110 commits)

Commits (110)
a123492
WIP autoscale PoC
sjpb Mar 30, 2021
c9c9bfc
add IMB package to allow testing
sjpb Mar 31, 2021
ab14526
move cloud_nodes config to right environment
sjpb Mar 31, 2021
67b16a4
fix /etc/openstack permissions for resume
sjpb Mar 31, 2021
a618aca
fix clouds.yaml
sjpb Mar 31, 2021
341a5c9
get resume/suspend scripts working manually
sjpb Mar 31, 2021
99fe7ad
note issue with adhoc slurm restart for combined headnode
sjpb Apr 1, 2021
a956a54
fix openhpc variables for autoscale
sjpb Apr 1, 2021
4ea81c5
set new image ID
sjpb Apr 1, 2021
354c67a
set autoscale branch for openhpc role requirements
sjpb Apr 1, 2021
c74a271
fix /etc/openstack for autoscale
sjpb Apr 1, 2021
73eed39
remove SlurmctldParameters unsupported in slurm 20.02.5
sjpb Apr 1, 2021
967d107
use openhpc_munge_key parameter
sjpb Apr 1, 2021
94de099
don't cache node ips in slurm
sjpb Apr 1, 2021
99793ad
tune slurm debug info for powersave only
sjpb Apr 1, 2021
1a3fd48
use default security groups
sjpb Apr 6, 2021
b992161
remove ssh proxying from inventory
sjpb Apr 6, 2021
0ebba20
add helloworld MPI program setup
sjpb Apr 6, 2021
79b0516
specify NFS server by hostname not IP
sjpb Apr 6, 2021
9f9430a
update to latest built image
sjpb Apr 6, 2021
95a8ed2
remove inventory hosts file from git
sjpb Apr 6, 2021
510a1bf
show cloud nodes even when powered off
sjpb Apr 8, 2021
9392c39
revert compute image to vanilla centos8.2
sjpb Apr 8, 2021
9467973
remove sausagecloud environment
sjpb Sep 8, 2021
000a4e7
move autoscale into slurm
sjpb Sep 8, 2021
f6514e6
allow for overriding slurm config in appliance
sjpb Sep 8, 2021
b285620
add autoscale group/group_vars
sjpb Sep 8, 2021
7ae5042
use autoscale branch of openhpc role
sjpb Sep 8, 2021
ea5c3bc
Add podman_cidr to allow changing podman network range
sjpb Sep 9, 2021
9fba933
Merge branch 'main' into feature/autoscale
sjpb Sep 9, 2021
290f01c
Merge branch 'fix/podman-cidr' into feature/autoscale - to allow test…
sjpb Sep 9, 2021
237b069
fix order of slurm.conf changes and {Resume,Suspend}Program creation …
sjpb Sep 9, 2021
82e4fac
turn up slurmctld logging
sjpb Sep 10, 2021
1353f86
add extension to templates
sjpb Sep 10, 2021
6047313
log exception tracebacks from resume/suspend programs
sjpb Sep 10, 2021
919ff50
change appcred owner
sjpb Sep 10, 2021
02377b1
fix try/except in resume/suspend
sjpb Sep 10, 2021
b0622d9
handle incorrect resume config
sjpb Sep 10, 2021
d1ba38e
fix autoscale config for smslabs
sjpb Sep 10, 2021
8e2a827
avoid suspend/resume exceptions on successful run
sjpb Sep 10, 2021
6f25ff8
Merge branch 'main' into feature/autoscale
sjpb Sep 23, 2021
37055b5
basic (messy) working autoscale
sjpb Sep 23, 2021
6a37f50
make clouds.yaml idempotent (TODO: fix for rebuild nodes)
sjpb Sep 24, 2021
49a76cc
fix /etc/openstack permissions for autoscale
sjpb Sep 24, 2021
9c9a69e
use openhpc_suspend_exc_nodes to prevent login nodes autoscaling
sjpb Sep 24, 2021
10a2036
install slurm user before adding slurm tools
sjpb Sep 28, 2021
7de823f
read node Features to get openstack instance information
sjpb Sep 28, 2021
d7bfa75
move autoscale node info to openhpc_slurm_partitions
sjpb Sep 28, 2021
544b1ab
rename openhpc vars
sjpb Sep 29, 2021
31d8e84
add vars from smslabs environment as demo
sjpb Sep 29, 2021
3257a85
cope with no non-cloud nodes in suspend_exc defaults
sjpb Sep 29, 2021
75a0069
smslabs: more complex partition example
sjpb Sep 29, 2021
4a61c5d
use cloud_features support
sjpb Sep 29, 2021
74404c2
fix feature extraction for multiple nodes
sjpb Sep 29, 2021
7d13831
smslabs: testable (default) burst partition
sjpb Sep 29, 2021
8d627f4
write instance ID to StateSaveLocation on creation
sjpb Sep 29, 2021
8b31189
use instance id on deletion
sjpb Sep 29, 2021
a1ba9ea
fixup rebuild/autoscale variable names
sjpb Sep 30, 2021
ebf3dd9
create autoscale role with auto-modification of openhpc_slurm_partitions
sjpb Sep 30, 2021
0bde5fc
set autoscale defaults with merged options
sjpb Sep 30, 2021
37a1070
enable rebuild from controller
sjpb Sep 30, 2021
138de0a
make suspend less picky about instance ID file format
sjpb Oct 1, 2021
dee0807
use existing compute-based rebuild
sjpb Oct 1, 2021
993d413
move suspend/resume program into slurm_openstack_tools
sjpb Oct 1, 2021
53e27fd
use autoscale defaults in role via set_fact
sjpb Oct 1, 2021
0516499
improve autoscale vars/defaults/docs
sjpb Oct 1, 2021
04198d5
use set_fact merging on rebuild and fix venv deployment
sjpb Oct 5, 2021
60e74a8
use openhpc role's extra_nodes feature
sjpb Oct 5, 2021
1ee10e9
fix actually generating cloud_node info
sjpb Oct 6, 2021
8a95667
retrieve cloud_node instance cpu/mem from openstack
sjpb Oct 6, 2021
a96e68c
WIP autoscale README
sjpb Oct 6, 2021
20d98fc
smslabs: update demo partition
sjpb Oct 6, 2021
173fe3e
add install tag to first run of stackhpc.openhpc:install.yml
sjpb Oct 6, 2021
474c838
fix changed_when
sjpb Oct 6, 2021
8054f77
add autoscale_clouds
sjpb Oct 7, 2021
8c1b4be
move suspend_excl_nodes definition from openhpc role to here
sjpb Oct 7, 2021
dfc859e
use separate tasks for rebuild and autoscale and move rebuild role in…
sjpb Oct 7, 2021
a236d36
move rebuild role back into collection
sjpb Oct 7, 2021
62b6cf2
move autoscale into collection
sjpb Oct 7, 2021
e140d6a
remove autoscale validation as needed vars not available
sjpb Oct 7, 2021
1132ccd
fix merging of enable_configless
sjpb Oct 7, 2021
4e7b28d
avoid multiple package installation tasks when using autoscale
sjpb Oct 7, 2021
99950ad
remove in-appliance rebuild role
sjpb Oct 8, 2021
3f6419d
fallback to working smslabs partition definition for demo
sjpb Oct 8, 2021
ef90759
smslabs: demo groups in openhpc_slurm_partitions
sjpb Oct 12, 2021
2ee9304
tidy for PR
sjpb Oct 15, 2021
6476e82
fix branch for ansible_collection_slurm_openstack_tools
sjpb Oct 15, 2021
4a342fd
Merge branch 'main' into feature/autoscale
sjpb Jan 24, 2022
2e9c926
Merge branch 'main' into feature/autoscale
sjpb Jan 27, 2022
2c6c642
fix up autoscale test environment
sjpb Jan 31, 2022
12e7de4
change autoscale group to be openstack-specific
sjpb Jan 31, 2022
c0370d6
fix security groups in smslabs for idempotency
sjpb Feb 17, 2022
2192ab7
fix smslabs env not being configless, add checks for this
sjpb Feb 17, 2022
0290115
WIP for smslabs autoscale
sjpb Feb 17, 2022
7c33f1c
add basic autoscale to CI
sjpb Feb 21, 2022
e9a0552
fix failure during CI at 'stackhpc.slurm_openstack_tools.autoscale : …
sjpb Feb 22, 2022
2414203
fix cloud instance name in CI
sjpb Feb 22, 2022
e649ab8
fix cloud network definition during CI image build
sjpb Feb 22, 2022
c9e956e
remove debugging exit
sjpb Feb 22, 2022
69e5d07
smslabs CI fix for cloud node image
sjpb Feb 22, 2022
b41b4ae
fix HPL-solo issue due to memory mismatch with 2 static + 2 cloud nod…
sjpb Feb 24, 2022
0e597f8
re-fix security groups in smslabs for idempotency
sjpb Feb 24, 2022
efc7ab4
smslabs: don't require OS_CLOUD= to be set for TF
sjpb Feb 24, 2022
776db15
update smslabs CI to do build in parallel with deploy
sjpb Mar 7, 2022
4b3ac96
add squid proxy for smslabs
sjpb Mar 7, 2022
f6b3efa
move pytools to feature/ports
sjpb Mar 7, 2022
2e3c77b
use CI project squid for smslabs
sjpb Mar 7, 2022
0f7dc83
change smslabs CI workflow name for clarity
sjpb Mar 7, 2022
c217f20
disable dnf proxy in smslabs for debugging
sjpb Mar 7, 2022
24bfeb2
add check_slurm tasks to sms-labs CI
sjpb Mar 7, 2022

Files changed

.github/workflows/smslabs.yml (18 additions, 22 deletions)

@@ -7,7 +7,7 @@ on:
pull_request:
concurrency: stackhpc-ci # openstack project
jobs:
openstack-example:
smslabs:
runs-on: ubuntu-20.04
steps:
- uses: actions/checkout@v2
@@ -77,50 +77,44 @@ jobs:
TF_VAR_cluster_name: ci${{ github.run_id }}
if: ${{ always() && steps.provision.outcome == 'failure' && contains('not enough hosts available', steps.provision_failure.messages) }}

- name: Configure infrastructure
- name: Directly configure cluster and build compute + login images
run: |
. venv/bin/activate
. environments/smslabs-example/activate
ansible all -m wait_for_connection
ansible-playbook ansible/adhoc/generate-passwords.yml
ansible-playbook -vv ansible/site.yml
env:
OS_CLOUD: openstack # required so openhpc_slurm_partitions filter used by stackhpc.slurm_openstack_tools.autoscale can use clouds.yaml file to run openstack cli to get node config
ANSIBLE_FORCE_COLOR: True

- name: Run MPI-based tests
- name: Test reimage of login and compute nodes
run: |
. venv/bin/activate
. environments/smslabs-example/activate
ansible-playbook -vv ansible/adhoc/hpctests.yml
env:
ANSIBLE_FORCE_COLOR: True

- name: Build control and compute images
run: |
. venv/bin/activate
. environments/smslabs-example/activate
cd packer
PACKER_LOG=1 PACKER_LOG_PATH=build.log packer build -var-file=$PKR_VAR_environment_root/builder.pkrvars.hcl openstack.pkr.hcl
ansible all -m wait_for_connection
ansible-playbook -vv $APPLIANCES_ENVIRONMENT_ROOT/ci/test_reimage.yml
env:
OS_CLOUD: openstack

- name: Reimage compute nodes via slurm and check cluster still up
ANSIBLE_FORCE_COLOR: True

- name: Update cloud image and reconfigure Slurm
run: |
. venv/bin/activate
. environments/smslabs-example/activate
ansible-playbook -vv $APPLIANCES_ENVIRONMENT_ROOT/ci/reimage-compute.yml
ansible-playbook -vv $APPLIANCES_ENVIRONMENT_ROOT/hooks/post.yml
ansible-playbook -vv $APPLIANCES_ENVIRONMENT_ROOT/ci/update_cloudnode_image.yml
ansible-playbook -vv ansible/slurm.yml --tags openhpc --skip-tags install
env:
ANSIBLE_FORCE_COLOR: True
OS_CLOUD: openstack

- name: Reimage login nodes via openstack and check cluster still up
- name: Run MPI-based tests (triggers autoscaling)
run: |
. venv/bin/activate
. environments/smslabs-example/activate
ansible-playbook -vv $APPLIANCES_ENVIRONMENT_ROOT/ci/reimage-login.yml
ansible-playbook -vv $APPLIANCES_ENVIRONMENT_ROOT/hooks/post.yml
ansible-playbook -vv ansible/adhoc/hpctests.yml
env:
OS_CLOUD: openstack
ANSIBLE_FORCE_COLOR: True

- name: Delete infrastructure
run: |
@@ -132,3 +126,5 @@ jobs:
OS_CLOUD: openstack
TF_VAR_cluster_name: ci${{ github.run_id }}
if: ${{ success() || cancelled() }}

# TODO: delete images!

README.md (12 additions, 4 deletions)

@@ -98,9 +98,17 @@ NB: This section describes generic instructions - check for any environment-spec

source environments/<environment>/activate

2. Deploy instances - see environment-specific instructions.
1. Activate your OpenStack credentials (required if Slurm-controlled rebuild or Slurm autoscaling is enabled):

3. Generate passwords:
# either source an openrc.sh file
source path_to/openrc.sh

# or if using a clouds.yaml file in ~/.config/openstack/clouds.yaml:
export OS_CLOUD=openstack

1. Deploy instances - see environment-specific instructions.

1. Generate passwords:

ansible-playbook ansible/adhoc/generate-passwords.yml

@@ -110,7 +118,7 @@ NB: This section describes generic instructions - check for any environment-spec

See the [Ansible vault documentation](https://docs.ansible.com/ansible/latest/user_guide/vault.html) for more details.

4. Deploy the appliance:
1. Deploy the appliance:

ansible-playbook ansible/site.yml

@@ -120,7 +128,7 @@ NB: This section describes generic instructions - check for any environment-spec

Tags as defined in the various sub-playbooks defined in `ansible/` may be used to only run part of the `site` tasks.

5. "Utility" playbooks for managing a running appliance are contained in `ansible/adhoc` - run these by activating the environment and using:
1. "Utility" playbooks for managing a running appliance are contained in `ansible/adhoc` - run these by activating the environment and using:

ansible-playbook ansible/adhoc/<playbook name>

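
For the clouds.yaml route described in the README changes above, a minimal sketch is shown below. This is illustration only: the cloud name `openstack` (which must match the value of OS_CLOUD), the auth URL and the application credential values are all placeholders, not values from this repository.

    # ~/.config/openstack/clouds.yaml (placeholder values)
    clouds:
      openstack:                 # name referenced by `export OS_CLOUD=openstack`
        auth:
          auth_url: https://keystone.example.com:5000/v3   # hypothetical endpoint
          application_credential_id: "REPLACE_ME"          # placeholder
          application_credential_secret: "REPLACE_ME"      # placeholder
        auth_type: v3applicationcredential
        region_name: RegionOne                             # placeholder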

ansible/adhoc/restart-slurm.yml (1 addition, 1 deletion)

@@ -20,7 +20,7 @@
name: slurmctld
state: restarted

- hosts: compute,login
- hosts: compute,login # FIXME: doesn't work if using `login` as combined slurmctld
become: yes
gather_facts: no
tasks:

ansible/slurm.yml (35 additions, 5 deletions)

@@ -9,24 +9,54 @@
- include_role:
name: geerlingguy.mysql

- name: Setup slurm
- name: Setup Slurm-driven reimage on OpenStack
hosts: rebuild
become: yes
tags:
- rebuild
- openhpc
tasks:
- import_role:
name: stackhpc.slurm_openstack_tools.rebuild

- name: Preinstall Slurm packages to create slurm user
# This is an optimisation for speed as it avoids having to do this once for `control` then again for `openhpc` nodes.
hosts: openhpc
become: yes
tags:
- openstack_autoscale
- openhpc
- install
tasks:
- import_role:
name: stackhpc.openhpc
tasks_from: install.yml
when: groups.get('openstack_autoscale', []) | length > 0

- name: Setup slurm-driven reimage
hosts: rebuild
- name: Setup autoscaling on OpenStack
hosts: openstack_autoscale
become: yes
tags:
- rebuild
- openstack_autoscale
- openhpc
tasks:
- import_role:
name: stackhpc.slurm_openstack_tools.rebuild
name: stackhpc.slurm_openstack_tools.autoscale

- name: Setup slurm
hosts: openhpc
become: yes
tags:
- openhpc
tasks:
- assert:
that: "'enable_configless' in openhpc_config.SlurmctldParameters | default([])"
fail_msg: |
'enable_configless' not found in openhpc_config.SlurmctldParameters - is variable openhpc_config overridden?
Additional slurm.conf parameters should be provided using variable openhpc_config_extra.
success_msg: Checked Slurm will be configured for configless operation
- import_role:
name: stackhpc.openhpc

- name: Set locked memory limits on user-facing nodes
hosts:

environments/common/inventory/group_vars/all/autoscale.yml (4 additions)

@@ -0,0 +1,4 @@
autoscale_rebuild_clouds: ~/.config/openstack/clouds.yaml
autoscale_suspend_exc_nodes_default: "{{ (groups.get('compute', []) + groups.get('login', [])) }}" # i.e. all non-CLOUD nodes, and prevent login-only slurmd nodes getting powered down
autoscale_suspend_exc_nodes_extra: []
autoscale_suspend_exc_nodes: "{{ autoscale_suspend_exc_nodes_default + autoscale_suspend_exc_nodes_extra }}"
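
These defaults keep every non-cloud node (the `compute` and `login` groups) in Slurm's suspend-exclusion list so they are never powered down, while autoscale_suspend_exc_nodes_extra gives an environment a hook to exclude further nodes without redefining the whole list. A minimal sketch of such an override, using a hypothetical environment path and node name:

    # environments/<environment>/inventory/group_vars/all/autoscale.yml (hypothetical override)
    autoscale_suspend_exc_nodes_extra:
      - mycluster-bigmem-0   # hypothetical node that should never be suspended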

environments/common/inventory/group_vars/all/openhpc.yml (7 additions, 2 deletions)

@@ -22,5 +22,10 @@ openhpc_packages_default:
openhpc_packages_extra: []
openhpc_packages: "{{ openhpc_packages_default + openhpc_packages_extra }}"
openhpc_munge_key: "{{ vault_openhpc_mungekey | b64decode }}"
openhpc_slurm_configless: true
openhpc_login_only_nodes: login
openhpc_login_only_nodes: login
openhpc_config_default:
SlurmctldParameters:
- enable_configless
openhpc_config_extra: {}
openhpc_config: "{{ openhpc_config_default | combine(openhpc_config_extra, list_merge='append') }}"
openhpc_ram_multiplier: 0.90 # TODO: DOCS: needs to be available to stackhpc.slurm_openstack_tools.autoscale role, plus lowered a bit to cope with autoscale problems
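
Because openhpc_config is assembled with combine(..., list_merge='append'), environment-specific Slurm configuration should be added via openhpc_config_extra rather than by overriding openhpc_config itself; this preserves the enable_configless flag that ansible/slurm.yml asserts on. A hedged sketch of such an override (the extra parameter values are illustrative, not taken from this repository):

    # hypothetical environment override
    openhpc_config_extra:
      SlurmctldParameters:
        - idle_on_node_suspend   # appended after enable_configless rather than replacing it
      SlurmctldDebug: debug      # illustrative additional slurm.conf key
    # resulting openhpc_config.SlurmctldParameters == [enable_configless, idle_on_node_suspend]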

environments/common/inventory/group_vars/all/pytools.yml (2 additions)

@@ -0,0 +1,2 @@
# pytools_editable: false
pytools_gitref: feature/ports

environments/common/inventory/group_vars/all/rebuild.yml (1 addition)

@@ -0,0 +1 @@
openhpc_rebuild_clouds: ~/.config/openstack/clouds.yaml

environments/common/inventory/groups (4 additions)

@@ -75,6 +75,10 @@ cluster
[update]
# All hosts to (optionally) run yum update on.

[autoscale]
# Add control to enable autoscaling on OpenStack.
# See ansible/collections/ansible_collections/stackhpc/slurm_openstack_tools/roles/autoscale/README.md

[block_devices]
# Superset of hosts to configure filesystems on - see ansible/roles/block_devices/README.md


environments/common/layouts/everything (3 additions)

@@ -41,3 +41,6 @@ cluster

[basic_users]
# Add `openhpc` group to add Slurm users via creation of users on each node.

[openstack_autoscale]
# Add `control` group to configure autoscaling on OpenStack clouds.
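
As the group comments above indicate, autoscaling is enabled by adding the Slurm control host to the openstack_autoscale group in an environment's inventory. A minimal sketch, assuming the controller sits in the usual `control` group:

    # environments/<environment>/inventory/groups (hypothetical environment)
    [openstack_autoscale:children]
    control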

environments/smslabs-example/ci/reimage-compute.yml (7 additions)

@@ -14,6 +14,13 @@
set_fact:
compute_build: "{{ manifest['builds'] | selectattr('custom_data', 'eq', {'source': 'compute'}) | last }}"

- name: Add compute image ID to autoscale definition
copy:
dest: "{{ lookup('env', 'APPLIANCES_ENVIRONMENT_ROOT') }}/inventory/group_vars/openhpc/autoscale.yml"
content: |
openhpc_autoscale_image: {{ compute_build.artifact_id }}
delegate_to: localhost

- name: Request compute node rebuild via Slurm
shell:
cmd: scontrol reboot ASAP nextstate=RESUME reason='rebuild image:{{ compute_build.artifact_id }}' {{ openhpc_cluster_name }}-compute-[0-1]

environments/smslabs-example/ci/test_reimage.yml (64 additions)

@@ -0,0 +1,64 @@
- hosts: login:!builder
become: no
tasks:
- name: Read packer build manifest
set_fact:
manifest: "{{ lookup('file', manifest_path) | from_json }}"
vars:
manifest_path: "{{ lookup('env', 'APPLIANCES_REPO_ROOT') }}/packer/packer-manifest.json"
delegate_to: localhost

- name: Get latest image builds
set_fact:
login_build: "{{ manifest['builds'] | selectattr('custom_data', 'eq', {'source': 'login'}) | last }}"
compute_build: "{{ manifest['builds'] | selectattr('custom_data', 'eq', {'source': 'compute'}) | last }}"

- name: Reimage login node via openstack
shell:
cmd: "openstack server rebuild {{ instance_id | default(inventory_hostname) }} --image {{ login_build.artifact_id }}"
delegate_to: localhost

- name: Check login node rebuild completed
shell:
cmd: openstack server show {{ inventory_hostname }} --format value -c image
register: openstack_login
delegate_to: localhost
retries: 5
delay: 30
until: login_build.artifact_id in openstack_login.stdout
changed_when: false

- name: Wait for login connection
wait_for_connection:
timeout: 800

- name: Check slurm up after reimaging login node
import_tasks: ../hooks/check_slurm.yml

- name: Request compute node rebuild via Slurm
shell:
cmd: scontrol reboot ASAP nextstate=RESUME reason='rebuild image:{{ compute_build.artifact_id }}' {{ openhpc_cluster_name }}-compute-[0-1]
become: yes

- name: Check compute node rebuild completed
shell:
cmd: openstack server show {{ item }} --format value -c image
register: openstack_compute
delegate_to: localhost
loop: "{{ groups['compute'] }}"
retries: 5
delay: 30
until: compute_build.artifact_id in openstack_compute.stdout
changed_when: false

- hosts: compute:!builder
become: no
gather_facts: no
tasks:
- name: Wait for compute connection
wait_for_connection:
timeout: 800

- name: Check slurm up after reimaging compute nodes
import_tasks: ../hooks/check_slurm.yml
run_once: true

environments/smslabs-example/ci/update_cloudnode_image.yml (22 additions)

@@ -0,0 +1,22 @@
- hosts: localhost
become: no
tasks:
- name: Read packer build manifest
set_fact:
manifest: "{{ lookup('file', manifest_path) | from_json }}"
vars:
manifest_path: "{{ lookup('env', 'APPLIANCES_REPO_ROOT') }}/packer/packer-manifest.json"
delegate_to: localhost

- name: Get latest image builds
set_fact:
login_build: "{{ manifest['builds'] | selectattr('custom_data', 'eq', {'source': 'login'}) | last }}"
compute_build: "{{ manifest['builds'] | selectattr('custom_data', 'eq', {'source': 'compute'}) | last }}"

- name: Add compute image ID to autoscale definition (for later autoscaling tests)
copy:
dest: "{{ lookup('env', 'APPLIANCES_ENVIRONMENT_ROOT') }}/inventory/group_vars/openhpc/autoscale.yml"
content: |
openhpc_autoscale_image: {{ compute_build.artifact_id }}
delegate_to: localhost
run_once: true

environments/smslabs-example/hooks/check_slurm.yml (21 additions)

@@ -0,0 +1,21 @@
- name: Run sinfo
shell: 'sinfo --noheader --format="%N %P %a %l %D %t" | sort' # using --format ensures we control whitespace: Partition,partition_state,max_jobtime,num_nodes,node_state,node_name
register: sinfo
changed_when: false
until: "'boot' not in sinfo.stdout_lines"
retries: 5
delay: 10
- name: Check nodes have expected slurm state
assert:
that: sinfo.stdout_lines == expected_sinfo
fail_msg: |
sinfo output not as expected:
actual:
{{ sinfo.stdout_lines }}
expected:
{{ expected_sinfo }}
<end>
vars:
expected_sinfo:
- "{{ openhpc_cluster_name }}-compute-[0-1] {{ openhpc_slurm_partitions[0].name }}* up 60-00:00:00 2 idle"
- "{{ openhpc_cluster_name }}-compute-[2-3] {{ openhpc_slurm_partitions[0].name }}* up 60-00:00:00 2 idle~"

environments/smslabs-example/hooks/post.yml (14 additions, 3 deletions)

@@ -4,11 +4,22 @@
tasks:
- block:
- name: Run sinfo
shell: 'sinfo --noheader --format="%N %P %a %l %D %t"' # using --format ensures we control whitespace: Partition,partition_state,max_jobtime,num_nodes,node_state,node_name
shell: 'sinfo --noheader --format="%N %P %a %l %D %t" | sort' # using --format ensures we control whitespace: Partition,partition_state,max_jobtime,num_nodes,node_state,node_name
register: sinfo
changed_when: false
- name: Check nodes have expected slurm state
assert:
that: "(sinfo.stdout_lines[0] | split)[1:] == ['small*', 'up', '60-00:00:00', '2', 'idle']" # don't know what instance names are as have CI run ID in them
fail_msg: "sinfo output not as expected: {{ sinfo.stdout }}"
that: sinfo.stdout_lines == expected_sinfo
fail_msg: |
sinfo output not as expected:
actual:
{{ sinfo.stdout_lines }}
expected:
{{ expected_sinfo }}
<end>
vars:
expected_sinfo:
- "{{ openhpc_cluster_name }}-compute-[0-1] small* up 60-00:00:00 2 idle"
- "{{ openhpc_cluster_name }}-compute-[2-3] small* up 60-00:00:00 2 idle~"

when: "'builder' not in group_names" # won't have a slurm control daemon when in build