schedmd-slurm-gcp-v5-node-group


Description

WARNING: This module is in active development and is therefore not guaranteed to work consistently. Expect the interface to change rapidly while this warning remains.

This module creates a node group data structure intended to be used as input to the schedmd-slurm-gcp-v5-partition module.

Node groups allow heterogeneous node types to be added to a single partition, and hence allow jobs that mix multiple node characteristics. See the heterogeneous jobs section of the SchedMD documentation for more information. An example of multiple node groups being used can be found in the slurm-gcp-v5-high-io.yaml blueprint. A sketch of a partition combining two node groups is shown below.
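
As a sketch, a partition mixing two node groups might look like the following. The group IDs, names, and machine types here are illustrative assumptions, not defaults, and a network1 module is assumed as in the example later in this document:

- id: cpu_group
  source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
  settings:
    name: cpu                    # illustrative group name
    machine_type: c2-standard-60

- id: highmem_group
  source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
  settings:
    name: highmem                # illustrative group name
    machine_type: m1-megamem-96

- id: mixed_partition
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  use:
  - network1
  - cpu_group
  - highmem_group
  settings:
    partition_name: mixed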

To specify nodes from a specific node group in a partition, the --nodelist (or -w) flag can be used, for example:

srun -N 3 -p compute --nodelist cluster-compute-group-[0-2] hostname

Here, the 3 nodes will be selected from the nodes cluster-compute-group-[0-2] in the compute partition.

Additionally, depending on how the nodes differ, a constraint can be added via the --constraint (or -C) flag, or other flags such as --mincpus can be used to select nodes with the desired characteristics.
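
As an illustration, assuming the compute partition above, nodes can be selected by a minimum CPU count or by a feature tag. The feature name "c2" below is hypothetical and would need to be defined for the node group, for example via the node_conf variable:

# Select 3 nodes that each provide at least 30 CPUs.
srun -N 3 -p compute --mincpus 30 hostname

# Select 3 nodes tagged with the hypothetical feature "c2".
srun -N 3 -p compute -C c2 hostname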

Example

The following code snippet creates a partition module using the node-group module as input with:

  • a max node count of 200
  • VM machine type of c2-standard-30
  • partition name of "compute"
  • default group name of "ghpc"
  • connected to the network1 module via use
  • nodes mounted to homefs via use
- id: node_group
  source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
  settings:
    node_count_dynamic_max: 200
    machine_type: c2-standard-30

- id: compute_partition
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  use:
  - network1
  - homefs
  - node_group
  settings:
    partition_name: compute

Custom Images

For more information on creating valid custom images for the node group VM instances or for custom instance templates, see our vm-images.md documentation page.
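
As a sketch, a custom image can be supplied through the instance_image setting. The family and project values below are placeholders for your own image:

- id: node_group
  source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
  settings:
    machine_type: c2-standard-30
    instance_image:
      family: my-slurm-image-family   # placeholder: your custom image family
      project: my-gcp-project         # placeholder: project hosting the image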

GPU Support

More information on GPU support in Slurm on GCP and other HPC Toolkit modules can be found in docs/gpu-support.md.
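
As an illustration, GPUs can be attached to node group instances via the guest_accelerator variable. The machine type and GPU count below are assumptions; see docs/gpu-support.md for valid machine type and GPU combinations:

- id: gpu_node_group
  source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
  settings:
    machine_type: n1-standard-8       # assumption: nvidia-tesla-t4 attaches to N1 machine types
    guest_accelerator:
    - type: nvidia-tesla-t4           # GPU type, as documented for the gpu/guest_accelerator inputs
      count: 1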

Support

The HPC Toolkit team maintains the wrapper around the slurm-on-gcp terraform modules. For support with the underlying modules, see the instructions in the slurm-gcp README.

License

Copyright 2023 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Requirements

Name Version
terraform >= 0.13.0
google >= 3.83

Providers

Name Version
google >= 3.83

Modules

No modules.

Resources

Name Type
google_compute_default_service_account.default data source

Inputs

Name Description Type Default Required
access_config Access configurations, i.e. IPs through which the node group instances can be accessed over the internet.
list(object({
network_tier = string
}))
[] no
additional_disks Configurations of additional disks to be included on the partition nodes.
list(object({
disk_name = string
device_name = string
disk_size_gb = number
disk_type = string
disk_labels = map(string)
auto_delete = bool
boot = bool
}))
[] no
bandwidth_tier Configures the network interface card and the maximum egress bandwidth for VMs.
- Setting platform_default respects the Google Cloud Platform API default values for networking.
- Setting virtio_enabled explicitly selects the VirtioNet network adapter.
- Setting gvnic_enabled selects the gVNIC network adapter (without Tier 1 high bandwidth).
- Setting tier_1_enabled selects both the gVNIC adapter and Tier 1 high bandwidth networking.
- Note: both gVNIC and Tier 1 networking require a VM image with gVNIC support as well as specific VM families and shapes.
- See official docs for more details.
string "platform_default" no
can_ip_forward Enable IP forwarding, for NAT instances for example. bool false no
disable_public_ips If set to false, the node group VMs will each be assigned a random public IP. Ignored if access_config is set. bool true no
disk_auto_delete Whether or not the boot disk should be auto-deleted. bool true no
disk_labels Labels specific to the boot disk. These will be merged with var.labels. map(string) {} no
disk_size_gb Size of boot disk to create for the partition compute nodes. number 50 no
disk_type Boot disk type, can be either pd-ssd, local-ssd, or pd-standard. string "pd-standard" no
enable_confidential_vm Enable the Confidential VM configuration. Note: the instance image must support this option. bool false no
enable_oslogin Enables Google Cloud os-login for user login and authentication for VMs.
See https://cloud.google.com/compute/docs/oslogin
bool true no
enable_shielded_vm Enable the Shielded VM configuration. Note: the instance image must support this option. bool false no
enable_smt Enables Simultaneous Multi-Threading (SMT) on the instance. bool false no
enable_spot_vm Enable the partition to use spot VMs (https://cloud.google.com/spot-vms). bool false no
gpu GPU information. Type and count of GPU to attach to the instance template. See
https://cloud.google.com/compute/docs/gpus for more details.
- type : the GPU type, e.g. nvidia-tesla-t4, nvidia-a100-80gb, nvidia-tesla-a100, etc
- count : number of GPUs

If both 'var.gpu' and 'var.guest_accelerator' are set, 'var.gpu' will be used.
object({
count = number,
type = string
})
null no
guest_accelerator Alternative method of providing 'var.gpu' with a consistent naming scheme to
other HPC Toolkit modules.

If both 'var.gpu' and 'var.guest_accelerator' are set, 'var.gpu' will be used.
list(object({
type = string,
count = number
}))
null no
instance_image Defines the image that will be used in the node group VM instances. This
value is overridden if any of source_image, source_image_family or
source_image_project are set.

Expected Fields:
name: The name of the image. Mutually exclusive with family.
family: The image family to use. Mutually exclusive with name.
project: The project where the image is hosted.

For more information on creating custom images that comply with Slurm on GCP
see the "Slurm on GCP Custom Images" section in docs/vm-images.md.
map(string)
{
"family": "schedmd-v5-slurm-22-05-8-hpc-centos-7",
"project": "projects/schedmd-slurm-public/global/images/family"
}
no
instance_template Self link to a custom instance template. If set, other VM definition
variables such as machine_type and instance_image will be ignored in favor
of the provided instance template.

For more information on creating custom images for the instance template
that comply with Slurm on GCP see the "Slurm on GCP Custom Images" section
in docs/vm-images.md.
string null no
labels Labels to add to partition compute instances. Key-value pairs. map(string) {} no
machine_type Compute Platform machine type to use for this partition's compute nodes. string "c2-standard-60" no
metadata Metadata, provided as a map. map(string) {} no
min_cpu_platform The name of the minimum CPU platform that you want the instance to use. string null no
name Name of the node group. string "ghpc" no
node_conf Map of Slurm node line configuration. map(any) {} no
node_count_dynamic_max Maximum number of dynamic nodes allowed in this partition. number 10 no
node_count_static Number of nodes to be statically created. number 0 no
on_host_maintenance Instance availability policy.

Note: Placement groups are not supported when on_host_maintenance is set to
"MIGRATE" and will be deactivated regardless of the value of
enable_placement. To support enable_placement, ensure on_host_maintenance is
set to "TERMINATE".
string "TERMINATE" no
preemptible Whether to use preemptible VMs for bursting. bool false no
project_id Project in which the HPC deployment will be created. string n/a yes
service_account Service account to attach to the compute instances. If not set, the
default compute service account for the given project will be used with the
"https://www.googleapis.com/auth/cloud-platform" scope.
object({
email = string
scopes = set(string)
})
null no
shielded_instance_config Shielded VM configuration for the instance. Note: not used unless
enable_shielded_vm is 'true'.
- enable_integrity_monitoring : Compare the most recent boot measurements to the
integrity policy baseline and return a pair of pass/fail results depending on
whether they match or not.
- enable_secure_boot : Verify the digital signature of all boot components, and
halt the boot process if signature verification fails.
- enable_vtpm : Use a virtualized trusted platform module, which is a
specialized computer chip you can use to encrypt objects like keys and
certificates.
object({
enable_integrity_monitoring = bool
enable_secure_boot = bool
enable_vtpm = bool
})
{
"enable_integrity_monitoring": true,
"enable_secure_boot": true,
"enable_vtpm": true
}
no
source_image The custom VM image. It is recommended to use instance_image instead. string "" no
source_image_family The custom VM image family. It is recommended to use instance_image instead. string "" no
source_image_project The project hosting the custom VM image. It is recommended to use instance_image instead. string "" no
spot_instance_config Configuration for spot VMs.
object({
termination_action = string
})
null no
tags Network tag list. list(string) [] no
zone_policy_allow Partition nodes will prefer to be created in the listed zones. If a zone appears
in both zone_policy_allow and zone_policy_deny, then zone_policy_deny will take
priority for that zone.
set(string) [] no
zone_policy_deny Partition nodes will not be created in the listed zones. If a zone appears in
both zone_policy_allow and zone_policy_deny, then zone_policy_deny will take
priority for that zone.
set(string) [] no

Outputs

Name Description
node_groups Details of the node group. Typically used as input to schedmd-slurm-gcp-v5-partition.