diff --git a/cloud_bursting/slurm-23.11.9-1/README.md b/cloud_bursting/slurm-23.11.9-1/README.md new file mode 100644 index 00000000..b12b7ea9 --- /dev/null +++ b/cloud_bursting/slurm-23.11.9-1/README.md @@ -0,0 +1,216 @@ +# Slurm Cloud Bursting Using CycleCloud + +This repository provides detailed instructions and scripts for setting up Slurm bursting using CycleCloud on Microsoft Azure, allowing you to seamlessly scale your Slurm cluster into the cloud for additional compute resources. + +## Overview + +Slurm bursting enables the extension of your on-premises Slurm cluster into Azure for flexible and scalable compute capacity. CycleCloud simplifies the management and provisioning of cloud resources, bridging your local infrastructure with cloud environments. + +Example architecture diagram of an external Slurm scheduler and CycleCloud compute nodes, both hosted in Azure, to demonstrate the hybrid setup. In a real scenario, the external Slurm scheduler would reside in an on-premises environment. + +![architecture](images/architecture_diagram.png) + +## Table of Contents + +- [Requirements](#requirements) +- [Setup Instructions](#setup-instructions) +- [Importing a Cluster Using the Slurm Headless Template in CycleCloud](#importing-a-cluster-using-the-slurm-headless-template-in-cyclecloud) +- [Slurm Scheduler Installation and Configuration](#slurm-scheduler-installation-and-configuration) +- [CycleCloud UI Configuration](#cyclecloud-ui-configuration) +- [CycleCloud Autoscaler Integration with Slurm Scheduler](#cyclecloud-autoscaler-integration-on-slurm-scheduler) +- [User and Group Setup (Optional)](#user-and-group-setup-optional) +- [Testing the Setup](#testing-the-setup) +- [Contributing](#contributing) + +## Requirements + +Ensure you have the following prerequisites in place: + +- **OS Version supported**: AlmaLinux release 8.7 (`almalinux:almalinux-hpc:8_7-hpc-gen2:latest`) & Ubuntu HPC 22.04 (`microsoft-dsvm:ubuntu-hpc:2204:latest`) +- **Slurm Version**: 23.11.9-1 +- **CycleCloud Version**: 8.6.4-3320 + + +### Network Ports and Security +The following NSG rules must be configured for successful communication between Master node, CycleCloud server and Compute nodes. 
+ +| **Service** | **Port** | **Protocol** | **Direction** | **Purpose** | **Requirement** | +|------------------------------------|-----------------|--------------|------------------|------------------------------------------------------------------------|---------------------------------------------------------------------------------| +| **SSH (Secure Shell)** | 22 | TCP | Inbound/Outbound | Secure command-line access to the Slurm Master node | Open on both on-premises firewall and Azure NSGs | +| **Slurm Control (slurmctld, slurmd)** | 6817, 6818 | TCP | Inbound/Outbound | Communication between Slurm Master and compute nodes | Open in on-premises firewall and Azure NSGs | +| **Munge Authentication Service** | 4065 | TCP | Inbound/Outbound | Authentication between Slurm Master and compute nodes | Open on both on-premises network and Azure NSGs | +| **CycleCloud Service** | 443 | TCP | Outbound | Communication between Slurm Master node and Azure CycleCloud | Allow outbound connections to Azure CycleCloud services from the Slurm Master node | +| **NFS ports** | 2049 | TCP | Inbound/Outbound | Shared filesystem access between Master node and Azure CycleCloud | Open on both on-premises network and Azure NSGs | +| **LDAP port** (Optional) | 389 | TCP | Inbound/Outbound | Centralized authentication mechanism for user management | Open on both on-premises network and Azure NSGs | + + +Please refer [Slurm Network Configuration Guide](https://slurm.schedmd.com/network.html) + + +### NFS File server +A shared file system between the external Slurm Scheduler node and the CycleCloud cluster. You can use Azure NetApp Files, Azure Files, NFS, or other methods to mount the same file system on both sides. In this example, we are using a Scheduler VM as an NFS server. + +### Centralized User management system (LDAP or AD) +In HPC environments, maintaining consistent user IDs (UIDs) and group IDs (GIDs) across the cluster is critical for seamless user access and resource management. A centralized user management system, such as LDAP or Active Directory (AD), ensures that UIDs and GIDs are synchronized across all compute nodes and storage systems. + +## Setup Instructions + +### Importing a Cluster Using the Slurm Headless Template in CycleCloud + +- This step must be executed on the **CycleCloud VM**. +- Make sure that the **CycleCloud 8.6.4 VM** is running and accessible via the `cyclecloud` CLI. +- Execute the `cyclecloud-project-build.sh` script and provide the desired cluster name (e.g., `hpc1`). This will set up a custom project based on the `cyclecloud-slurm-3.0.9` version and import the cluster using the Slurm headless template. +- In the example provided, `hpc1` is used as the cluster name. You can choose any cluster name, but be consistent and use the same name throughout the entire setup. + + +```bash +git clone https://github.com/Azure/cyclecloud-slurm.git +cd cyclecloud-slurm/cloud_bursting/slurm-23.11.9-1/cyclecloud +sh cyclecloud-project-build.sh +``` + +Output : + +```bash +[user1@cc86vm ~]$ cd cyclecloud-slurm/cloud_bursting/slurm-23.11.9-1/cyclecloud +[user1@cc86vm cyclecloud]$ sh cyclecloud-project-build.sh +Enter Cluster Name: hpc1 +Cluster Name: hpc1 +Use the same cluster name: hpc1 in building the scheduler +Importing Cluster +Importing cluster Slurm_HL and creating cluster hpc1.... 
+---------- +hpc1 : off +---------- +Resource group: +Cluster nodes: +Total nodes: 0 +Locker Name: HPC+AI storage +Fetching CycleCloud project +Uploading CycleCloud project to the locker +``` + +### Slurm Scheduler Installation and Configuration + +- A VM should be deployed using the specified **AlmaLinux 8.7** or **Ubuntu 22.04** image. +- If you already have a Slurm Scheduler installed, you may skip this step. However, it is recommended to review the script to ensure compatibility with your existing setup. +- Run the Slurm scheduler installation script (`slurm-scheduler-builder.sh`) and provide the cluster name (`hpc1`) when prompted. +- This script will setup NFS server and install and configure Slurm Scheduler. +- If you are using an external NFS server, you can remove the NFS setup entries from the script. + + + +```bash +git clone https://github.com/Azure/cyclecloud-slurm.git +cd cyclecloud-slurm/cloud_bursting/slurm-23.11.9-1/scheduler +sh slurm-scheduler-builder.sh +``` +Output + +```bash +------------------------------------------------------------------------------------------------------------------------------ +Building Slurm scheduler for cloud bursting with Azure CycleCloud +------------------------------------------------------------------------------------------------------------------------------ + +Enter Cluster Name: hpc1 +------------------------------------------------------------------------------------------------------------------------------ + +Summary of entered details: +Cluster Name: hpc1 +Scheduler Hostname: masternode2 +NFSServer IP Address: 10.222.1.26 +``` + +### CycleCloud UI Configuration + +- Access the **CycleCloud UI** and navigate to the settings for the `hpc1` cluster. +- Edit the cluster settings to configure the VM SKUs and networking options as needed. +- In the **Network Attached Storage** section, enter the NFS server IP address for the `/sched` and `/shared` mounts. +- Select the OS from Advance setting tab - **Ubuntu 22.04** or **AlmaLinux 8** from the drop down based on the scheduler VM. +- Once all settings are configured, click **Save** and then **Start** the `hpc1` cluster. + +![NFS settings](images/NFSSettings.png) + +### CycleCloud Autoscaler Integration on Slurm Scheduler + +- Integrate Slurm with CycleCloud using the `cyclecloud-integrator.sh` script. +- Provide CycleCloud details (username, password, and ip address) when prompted. + +```bash +cd cyclecloud-slurm/cloud_bursting/slurm-23.11.9-1/scheduler +sh cyclecloud-integrator.sh +``` +Output: + +```bash +[root@masternode2 scripts]# sh cyclecloud-integrator.sh +Please enter the CycleCloud details to integrate with the Slurm scheduler + +Enter Cluster Name: hpc1 +Enter CycleCloud Username: user1 +Enter CycleCloud Password: +Enter CycleCloud IP (e.g., 10.222.1.19): 10.222.1.19 +------------------------------------------------------------------------------------------------------------------------------ + +Summary of entered details: +Cluster Name: hpc1 +CycleCloud Username: user1 +CycleCloud URL: https://10.222.1.19 + +------------------------------------------------------------------------------------------------------------------------------ +``` + +### User and Group Setup (Optional) + +- Ensure consistent user and group IDs across all nodes. +- It is advisable to use a centralized User Management system like LDAP to maintain consistent UID and GID across all nodes. 
+- In this example, we are using the `useradd_example.sh` script to create a test user `user1` and a group for job submission. (User `user1` already exists in CycleCloud) + +```bash +cd cyclecloud-slurm/cloud_bursting/slurm-23.11.9-1/scheduler +sh useradd_example.sh +``` + +### Testing the Setup + +- Log in as a test user (e.g., `user1`) on the Scheduler node. +- Submit a test job to verify that the setup is functioning correctly. + +```bash +su - user1 +srun hostname & +``` +Output: +```bash +[root@masternode2 scripts]# su - user1 +Last login: Tue May 14 04:54:51 UTC 2024 on pts/0 +[user1@masternode2 ~]$ srun hostname & +[1] 43448 +[user1@masternode2 ~]$ squeue + JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) + 1 hpc hostname user1 CF 0:04 1 hpc1-hpc-1 +[user1@masternode2 ~]$ hpc1-hpc-1 +``` +![Node Creation](images/nodecreation.png) + +You should see the job running successfully, indicating a successful integration with CycleCloud. + +For further details and advanced configurations, refer to the scripts and documentation within this repository. + +--- + +These instructions provide a comprehensive guide for setting up Slurm bursting with CycleCloud on Azure. If you encounter any issues or have questions, please refer to the provided scripts and documentation for troubleshooting steps. Happy bursting! + +# Contributing + +This project welcomes contributions and suggestions. Most contributions require you to agree to a +Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us +the rights to use your contribution. For details, visit https://cla.microsoft.com. + +When you submit a pull request, a CLA-bot will automatically determine whether you need to provide +a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions +provided by the bot. You will only need to do this once across all repos using our CLA. + +This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). +For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or +contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. \ No newline at end of file diff --git a/cloud_bursting/slurm-23.11.9-1/cyclecloud/cyclecloud-project-build.sh b/cloud_bursting/slurm-23.11.9-1/cyclecloud/cyclecloud-project-build.sh new file mode 100644 index 00000000..b3fad5ce --- /dev/null +++ b/cloud_bursting/slurm-23.11.9-1/cyclecloud/cyclecloud-project-build.sh @@ -0,0 +1,23 @@ +#!/bin/sh +# This script need to run on cyclecloud VM. +# This script will fetch the CycleCloud project and upload it to the locker. 
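+# Usage (run from cloud_bursting/slurm-23.11.9-1/cyclecloud on the CycleCloud VM):
+#   sh cyclecloud-project-build.sh
+# Prerequisite: the `cyclecloud` CLI must already be initialized against the
+# CycleCloud server, since this script calls `cyclecloud import_cluster` and
+# `cyclecloud project fetch/upload`. It expects slurm.txt in the current directory.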
+# Author : Vinil Vadakkepurakkal +# Date : 01/10/2024 +set -e +read -p "Enter Cluster Name: " cluster_name +echo "Cluster Name: $cluster_name" +echo "Use the same cluster name: $cluster_name in building the scheduler" + +echo "Importing Cluster" +cyclecloud import_cluster $cluster_name -c Slurm_HL -f slurm.txt + +# creating custom project and upload it to the locker + +CCLOCKERNAME=$(cyclecloud locker list | sed 's/(.*)//; s/[[:space:]]*$//') +echo "Locker Name: $CCLOCKERNAME" +echo "Fetching CycleCloud project" +SLURM_PROJ_VERSION="3.0.9" +cyclecloud project fetch https://github.com/Azure/cyclecloud-slurm/releases/$SLURM_PROJ_VERSION slurm-$SLURM_PROJ_VERSION +cd slurm-$SLURM_PROJ_VERSION +echo "Uploading CycleCloud project to the locker" +cyclecloud project upload "$CCLOCKERNAME" \ No newline at end of file diff --git a/cloud_bursting/slurm-23.11.9-1/cyclecloud/slurm.txt b/cloud_bursting/slurm-23.11.9-1/cyclecloud/slurm.txt new file mode 100644 index 00000000..60249a65 --- /dev/null +++ b/cloud_bursting/slurm-23.11.9-1/cyclecloud/slurm.txt @@ -0,0 +1,588 @@ + +################################ +## Cluster Configuration File ## +################################ + + +[cluster Slurm_HL] +FormLayout = selectionpanel +Category = Schedulers + +Autoscale = $Autoscale + + [[node defaults]] + UsePublicNetwork = $UsePublicNetwork + Credentials = $Credentials + SubnetId = $SubnetId + Region = $Region + KeyPairLocation = ~/.ssh/cyclecloud.pem + Azure.Identities = $ManagedIdentity + + # Slurm autoscaling supports both Terminate and Deallocate shutdown policies + ShutdownPolicy = $configuration_slurm_shutdown_policy + + # Lustre mounts require termination notifications to unmount + EnableTerminateNotification = ${NFSType == "lustre" || NFSSchedType == "lustre" || AdditionalNFSType == "lustre" || EnableTerminateNotification} + TerminateNotificationTimeout = 10m + + [[[configuration]]] + + slurm.install_pkg = azure-slurm-install-pkg-3.0.9.tar.gz + slurm.autoscale_pkg = azure-slurm-pkg-3.0.9.tar.gz + + slurm.version = $configuration_slurm_version + slurm.disable_pmc = $configuration_slurm_disable_pmc + slurm.user.name = slurm + slurm.user.uid = 11100 + slurm.user.gid = 11100 + munge.user.name = munge + munge.user.uid = 11101 + munge.user.gid = 11101 + + # Disable ip-XXXXXXXX hostname generation + cyclecloud.hosts.standalone_dns.enabled = ${NodeNameIsHostname==false} + cyclecloud.hosts.simple_vpc_dns.enabled = ${NodeNameIsHostname==false} + + # For fast spin-up after Deallocate, force an immediate re-converge on boot + cyclecloud.converge_on_boot = true + + # Disable normal NFS exports and mounts + cyclecloud.mounts.sched.disabled = true + cyclecloud.mounts.shared.disabled = true + cyclecloud.exports.sched.disabled = true + cyclecloud.exports.shared.disabled = true + cyclecloud.exports.sched.samba.enabled = false + cyclecloud.exports.shared.samba.enabled = false + cyclecloud.exports.defaults.samba.enabled = false + cshared.server.legacy_links_disabled = true + + # May be used to identify the ID in cluster-init scripts + cluster.identities.default = $ManagedIdentity + + [[[cluster-init slurm:default:3.0.9]]] + Optional = true + + [[[volume boot]]] + Size = ${ifThenElse(BootDiskSize > 0, BootDiskSize, undefined)} + SSD = True + + [[[configuration cyclecloud.mounts.nfs_shared]]] + type = $NFSType + mountpoint = /shared + export_path = ${ifThenElse(NFSType == "lustre", strcat("tcp:/lustrefs", NFSSharedExportPath), NFSSharedExportPath)} + address = $NFSAddress + options = $NFSSharedMountOptions + + 
[[[configuration cyclecloud.mounts.nfs_sched]]] + type = $NFSSchedType + mountpoint = /sched + export_path = ${ifThenElse(NFSSchedType == "lustre", strcat("tcp:/lustrefs", NFSSchedExportPath), NFSSchedExportPath)} + address = $NFSSchedAddress + options = $NFSSchedMountOptions + + [[[configuration cyclecloud.mounts.additional_nfs]]] + disabled = ${AdditionalNFS isnt true} + type = $AdditionalNFSType + address = $AdditionalNFSAddress + mountpoint = $AdditionalNFSMountPoint + export_path = ${ifThenElse(AdditionalNFSType == "lustre", strcat("tcp:/lustrefs", AdditionalNFSExportPath), AdditionalNFSExportPath)} + options = $AdditionalNFSMountOptions + + [[node nodearraybase]] + Abstract = true + [[[configuration]]] + slurm.autoscale = true + + slurm.node_prefix = ${ifThenElse(NodeNamePrefix=="Cluster Prefix", StrJoin("-", ClusterName, ""), NodeNamePrefix)} + slurm.use_nodename_as_hostname = $NodeNameIsHostname + + [[[cluster-init slurm:execute:3.0.9]]] + + [[[network-interface eth0]]] + AssociatePublicIpAddress = $ExecuteNodesPublic + + [[nodearray hpc]] + Extends = nodearraybase + MachineType = $HPCMachineType + ImageName = $HPCImageName + MaxCoreCount = $MaxHPCExecuteCoreCount + Azure.MaxScalesetSize = $HPCMaxScalesetSize + AdditionalClusterInitSpecs = $HPCClusterInitSpecs + EnableNodeHealthChecks = $EnableNodeHealthChecks + + + [[[configuration]]] + slurm.default_partition = true + slurm.hpc = true + slurm.partition = hpc + + + [[nodearray htc]] + Extends = nodearraybase + MachineType = $HTCMachineType + ImageName = $HTCImageName + MaxCoreCount = $MaxHTCExecuteCoreCount + + Interruptible = $HTCUseLowPrio + MaxPrice = $HTCSpotMaxPrice + AdditionalClusterInitSpecs = $HTCClusterInitSpecs + + [[[configuration]]] + slurm.hpc = false + slurm.partition = htc + # set pcpu = false for all hyperthreaded VMs + slurm.use_pcpu = false + +[parameters About] +Order = 1 + + [[parameters About Slurm]] + + [[[parameter slurm]]] + HideLabel = true + Config.Plugin = pico.widget.HtmlTemplateWidget + Config.Template = '''
Slurm icon

Follow the instructions in the README for details on extending and configuring the project for your environment.


Slurm is the most widely used workload manager in HPC, as the scheduler of choice for six of the top ten systems in the TOP500 and with market penetration of more than 70%. Slurm is an advanced, open-source scheduler designed to satisfy the demanding needs of high-performance computing (HPC), high-throughput computing (HTC), and artificial intelligence (AI).

Commercial Support provided by SchedMD

Get more from your HPC investment! SchedMD, the company behind Slurm development, can answer your Slurm questions and explain our options for consultation, training, support, and migration.

Contact SchedMD

View more details about Slurm?

Slurm at a glance

Slurm provides massive scalability and can easily manage the performance requirements of small clusters, large clusters, and supercomputers. Slurm outperforms competitive schedulers with compute rates of:

  • 100K+ nodes/GPUs
  • 17M+ jobs per day
  • 120M+ jobs per week

Slurm’s plug-in-based architecture enables optimization and control in scheduling operations to meet organizational priorities. With first-class resource management for GPUs, Slurm allows users to request GPU resources alongside CPUs. This flexibility ensures that jobs are executed quickly and efficiently while maximizing resource utilization.


Other Slurm features include:

  • NVIDIA and AMD GPU support for AI, LLM, and ML environments
  • Advanced scheduling policies
  • Unique HPC, HTC, AI/ML workload expertise
  • Cloud bursting capabilities
  • Power saving capabilities, accounting, and reporting
  • Provided REST API daemon
  • Native support of containers
  • Tailored Slurm consulting and training available through SchedMD
''' + +[parameters Required Settings] +Order = 10 + + + [[parameters Virtual Machines ]] + Description = "The cluster, in this case, has two roles: the scheduler node with shared filer and the execute hosts. Configure which VM types to use based on the requirements of your application." + Order = 20 + + [[[parameter Region]]] + Label = Region + Description = Deployment Location + ParameterType = Cloud.Region + + [[[parameter HPCMachineType]]] + Label = HPC VM Type + Description = The VM type for HPC execute nodes + ParameterType = Cloud.MachineType + DefaultValue = Standard_F2s_v2 + + [[[parameter HTCMachineType]]] + Label = HTC VM Type + Description = The VM type for HTC execute nodes + ParameterType = Cloud.MachineType + DefaultValue = Standard_F2s_v2 + + + [[parameters Auto-Scaling]] + Description = "The cluster can autoscale to the workload, adding execute hosts as jobs are queued. To enable this check the box below and choose the initial and maximum core counts for the cluster." + Order = 30 + + [[[parameter Autoscale]]] + Label = Autoscale + DefaultValue = true + Widget.Plugin = pico.form.BooleanCheckBox + Widget.Label = Start and stop execute instances automatically + + [[[parameter MaxHPCExecuteCoreCount]]] + Label = Max HPC Cores + Description = The total number of HPC execute cores to start + DefaultValue = 100 + Config.Plugin = pico.form.NumberTextBox + Config.MinValue = 1 + Config.IntegerOnly = true + + [[[parameter MaxHTCExecuteCoreCount]]] + Label = Max HTC Cores + Description = The total number of HTC execute cores to start + DefaultValue = 100 + Config.Plugin = pico.form.NumberTextBox + Config.MinValue = 1 + Config.IntegerOnly = true + + + [[[parameter HPCMaxScalesetSize]]] + Label = Max VMs per VMSS + Description = The maximum number of VMs created per VM Scaleset e.g. switch in Slurm. + DefaultValue = 100 + Config.Plugin = pico.form.NumberTextBox + Config.MinValue = 1 + Config.IntegerOnly = true + + + [[[parameter HTCUseLowPrio]]] + Label = HTC Spot + DefaultValue = false + Widget.Plugin = pico.form.BooleanCheckBox + Widget.Label = Use Spot VMs for HTC execute hosts + + [[[parameter HTCSpotMaxPrice]]] + Label = Max Price HTC + DefaultValue = -1 + Description = Max price for Spot VMs in USD (value of -1 will not evict based on price) + Config.Plugin = pico.form.NumberTextBox + Conditions.Excluded := HTCUseLowPrio isnt true + Config.MinValue = -1 + + [[parameters Networking]] + Order = 40 + + [[[parameter SubnetId]]] + Label = Subnet ID + Description = Subnet Resource Path (ResourceGroup/VirtualNetwork/Subnet) + ParameterType = Azure.Subnet + Required = True + + +[parameters Network Attached Storage] +Order = 15 + + [[parameters Shared Storage]] + Order = 10 + + [[[parameter About Shared Storage]]] + HideLabel = true + Config.Plugin = pico.widget.HtmlTemplateWidget + Config.Template = '''

The directories /sched and /shared are network-attached mounts and exist on all nodes of the cluster.
+
+ Options for providing these mounts:
+ [Builtin]: The scheduler node is an NFS server that provides the mountpoint to the other nodes of the cluster (not supported for HA configurations).
+ [External NFS]: Network-attached storage such as Azure NetApp Files, HPC Cache, or another VM running an NFS server provides the mountpoint.
+ [Azure Managed Lustre]: An Azure Managed Lustre deployment provides the mountpoint.
+

+

+ Note: the cluster must be terminated for changes to filesystem mounts to take effect. +

''' + Conditions.Hidden := false + + [[parameters Scheduler Mount]] + Order = 20 + Label = File-system Mount for /sched + + [[[parameter About sched]]] + HideLabel = true + Config.Plugin = pico.widget.HtmlTemplateWidget + Config.Template = '''

Slurm's configuration is linked in from the /sched directory and is managed by the scheduler node.

''' + Order = 6 + + [[[parameter About sched part 2]]] + HideLabel = true + Config.Plugin = pico.widget.HtmlTemplateWidget + Config.Template = '''

To disable the built-in NFS export of the /sched directory, and to use an external filesystem, select the checkbox below.

''' + Order = 7 + Conditions.Hidden := configuration_slurm_ha_enabled + + [[[parameter UseBuiltinSched]]] + Label = Use Builtin NFS + Description = Use the builtin NFS for /sched + DefaultValue = false + ParameterType = Boolean + Conditions.Hidden := configuration_slurm_ha_enabled + Disabled = configuration_slurm_ha_enabled + + [[[parameter NFSSchedDiskWarning]]] + HideLabel = true + Config.Plugin = pico.widget.HtmlTemplateWidget + Config.Template := "

Warning: switching an active cluster over to NFS or Lustre from Builtin will delete the shared disk.

" + Conditions.Hidden := UseBuiltinSched || configuration_slurm_ha_enabled + + [[[parameter NFSSchedType]]] + Label = FS Type + ParameterType = StringList + Config.Label = Type of shared filesystem to use for this cluster + Config.Plugin = pico.form.Dropdown + Config.Entries := {[Label="External NFS"; Value="nfs"], [Label="Azure Managed Lustre"; Value="lustre"]} + DefaultValue = nfs + Conditions.Hidden := UseBuiltinSched && !configuration_slurm_ha_enabled + + [[[parameter NFSSchedAddress]]] + Label = IP Address + Description = The IP address or hostname of the NFS server or Lustre FS. Also accepts a list comma-separated addresses, for example, to mount a frontend load-balanced Azure HPC Cache. + Config.ParameterType = String + Conditions.Hidden := UseBuiltinSched && !configuration_slurm_ha_enabled + + [[[parameter NFSSchedExportPath]]] + Label = Export Path + Description = The path exported by the file system + DefaultValue = /sched + Conditions.Hidden := UseBuiltinSched && !configuration_slurm_ha_enabled + + [[[parameter NFSSchedMountOptions]]] + Label = Mount Options + Description = NFS Client Mount Options + Conditions.Hidden := UseBuiltinSched && !configuration_slurm_ha_enabled + + + [[[parameter SchedFilesystemSize]]] + Label = Size (GB) + Description = The filesystem size (cannot be changed after initial start) + DefaultValue = 30 + Config.Plugin = pico.form.NumberTextBox + Config.MinValue = 10 + Config.MaxValue = 10240 + Config.IntegerOnly = true + Conditions.Excluded := !UseBuiltinSched || configuration_slurm_ha_enabled + + + + [[parameters Default NFS Share]] + Order = 30 + Label = File-system Mount for /shared + + [[[parameter About shared]]] + HideLabel = true + Config.Plugin = pico.widget.HtmlTemplateWidget + Config.Template = '''

Users' home directories reside within the /shared mountpoint with the base homedir /shared/home.

''' + Order = 6 + + [[[parameter About shared part 2]]] + HideLabel = true + Config.Plugin = pico.widget.HtmlTemplateWidget + Config.Template = '''

To disable the built-in NFS export of the /shared directory, and to use an external filesystem, select the checkbox below.

''' + Order = 7 + Conditions.Hidden := configuration_slurm_ha_enabled + + [[[parameter UseBuiltinShared]]] + Label = Use Builtin NFS + Description = Use the builtin NFS for /share + DefaultValue = false + ParameterType = Boolean + Conditions.Hidden := configuration_slurm_ha_enabled + Disabled = configuration_slurm_ha_enabled + + [[[parameter NFSDiskWarning]]] + HideLabel = true + Config.Plugin = pico.widget.HtmlTemplateWidget + Config.Template := "

Warning: switching an active cluster over to NFS or Lustre from Builtin will delete the shared disk.

" + Conditions.Hidden := UseBuiltinShared || configuration_slurm_ha_enabled + + [[[parameter NFSType]]] + Label = FS Type + ParameterType = StringList + Config.Label = Type of shared filesystem to use for this cluster + Config.Plugin = pico.form.Dropdown + Config.Entries := {[Label="External NFS"; Value="nfs"], [Label="Azure Managed Lustre"; Value="lustre"]} + DefaultValue = nfs + Conditions.Hidden := UseBuiltinShared && !configuration_slurm_ha_enabled + + [[[parameter NFSAddress]]] + Label = IP Address + Description = The IP address or hostname of the NFS server or Lustre FS. Also accepts a list comma-separated addresses, for example, to mount a frontend load-balanced Azure HPC Cache. + Config.ParameterType = String + Conditions.Hidden := UseBuiltinShared && !configuration_slurm_ha_enabled + + [[[parameter NFSSharedExportPath]]] + Label = Export Path + Description = The path exported by the file system + DefaultValue = /shared + Conditions.Hidden := UseBuiltinShared && !configuration_slurm_ha_enabled + + [[[parameter NFSSharedMountOptions]]] + Label = Mount Options + Description = NFS Client Mount Options + Conditions.Hidden := UseBuiltinShared && !configuration_slurm_ha_enabled + + + [[[parameter FilesystemSize]]] + Label = Size (GB) + Description = The filesystem size (cannot be changed after initial start) + DefaultValue = 100 + Config.Plugin = pico.form.NumberTextBox + Config.MinValue = 10 + Config.MaxValue = 10240 + Config.IntegerOnly = true + Conditions.Excluded := !UseBuiltinShared || configuration_slurm_ha_enabled + + [[parameters Additional NFS Mount]] + Order = 40 + Label = Additional Filesystem Mount + [[[parameter Additional Shared FS Mount Readme]]] + HideLabel = true + Config.Plugin = pico.widget.HtmlTemplateWidget + Config.Template := "

Mount another shared filesystem endpoint on the cluster nodes.

" + Order = 20 + + [[[parameter AdditionalNFS]]] + HideLabel = true + DefaultValue = false + Widget.Plugin = pico.form.BooleanCheckBox + Widget.Label = Add Shared Filesystem mount + + [[[parameter AdditionalNFSType]]] + Label = FS Type + ParameterType = StringList + Config.Label = Shared filesystem type of the additional mount + Config.Plugin = pico.form.Dropdown + Config.Entries := {[Label="External NFS"; Value="nfs"], [Label="Azure Managed Lustre"; Value="lustre"]} + DefaultValue = nfs + Conditions.Excluded := AdditionalNFS isnt true + + [[[parameter AdditionalNFSAddress]]] + Label = IP Address + Description = The IP address or hostname of the additional mount. Also accepts a list comma-separated addresses, for example, to mount a frontend load-balanced Azure HPC Cache. + Config.ParameterType = String + Conditions.Excluded := AdditionalNFS isnt true + + [[[parameter AdditionalNFSMountPoint]]] + Label = Mount Point + Description = The path at which to mount the Filesystem + DefaultValue = /data + Conditions.Excluded := AdditionalNFS isnt true + + [[[parameter AdditionalNFSExportPath]]] + Label = Export Path + Description = The path exported by the file system + DefaultValue = /data + Conditions.Excluded := AdditionalNFS isnt true + + [[[parameter AdditionalNFSMountOptions]]] + Label = Mount Options + Description = Filesystem Client Mount Options + Conditions.Excluded := AdditionalNFS isnt true + + +[parameters Advanced Settings] +Order = 20 + + [[parameters Azure Settings]] + Order = 10 + + [[[parameter Credentials]]] + Description = The credentials for the cloud provider + ParameterType = Cloud.Credentials + + [[[parameter ManagedIdentity]]] + Label = Managed Id + Description = Optionally assign an Azure user assigned managed identity to all nodes to access Azure resources using assigned roles. + ParameterType = Azure.ManagedIdentity + DefaultValue = =undefined + + [[[parameter BootDiskSize]]] + Description = Optional: Size of the OS/boot disk in GB for all nodes in the cluster (leave at 0 to use Image size) + ParameterType = Integer + Config.Plugin = pico.form.NumberTextBox + Config.MinValue = 0 + Config.MaxValue = 32,000 + Config.IntegerOnly = true + Config.Increment = 64 + DefaultValue = 0 + + [[parameters Slurm Settings ]] + + Order = 5 + + [[[parameter slurm_version_warning]]] + HideLabel = true + Config.Plugin = pico.widget.HtmlTemplateWidget + Config.Template := "
Note: For SLES HPC, only the Slurm version supported by the SLES HPC zypper repos can be installed. At the time of this release, that is 23.02.7.
" + + + [[[parameter configuration_slurm_version]]] + Required = True + Label = Slurm Version + Description = Version of Slurm to install on the cluster + ParameterType = StringList + Config.Plugin = pico.form.Dropdown + Config.FreeForm = true + Config.Entries := {[Value="23.11.9-1"]} + DefaultValue = 23.11.9-1 + + + [[[parameter configuration_slurm_shutdown_policy]]] + Label = Shutdown Policy + description = By default, autostop will Delete stopped VMS for lowest cost. Optionally, Stop/Deallocate the VMs for faster restart instead. + DefaultValue = Terminate + config.plugin = pico.control.AutoCompleteDropdown + [[[[list Config.Entries]]]] + Name = Terminate + Label = Terminate + [[[[list Config.Entries]]]] + Name = Deallocate + Label = Deallocate + + [[[parameter EnableTerminateNotification]]] + Label = Enable Termination notifications + DefaultValue = False + + + [[parameters Software]] + Description = "Specify the scheduling software, and base OS installed on all nodes, and optionally the cluster-init and chef versions from your locker." + Order = 10 + + [[[parameter NodeNameIsHostname]]] + Label = Name As Hostname + Description = Should the hostname match the nodename for execute nodes? + ParameterType = Boolean + DefaultValue = true + + [[[parameter NodeNamePrefix]]] + Label = Node Prefix + Description = Prefix for generated node names, i.e. "prefix-" generates prefix-nodearray-1. Use 'Cluster Prefix' to get $ClusterName-nodearray-1 + ParameterType = StringList + Config.Plugin = pico.form.Dropdown + Config.FreeForm = true + DefaultValue = "Cluster Prefix" + Config.Entries := {[Value=""], [Value="Cluster Prefix"]} + Conditions.Hidden := NodeNameIsHostname != true + + + [[[parameter HPCImageName]]] + Label = HPC OS + ParameterType = Cloud.Image + Config.OS = linux + DefaultValue = almalinux8 + Config.Filter := Package in {"cycle.image.ubuntu22", "almalinux8"} + + [[[parameter HTCImageName]]] + Label = HTC OS + ParameterType = Cloud.Image + Config.OS = linux + DefaultValue = almalinux8 + Config.Filter := Package in {"cycle.image.ubuntu22", "almalinux8"} + + + [[[parameter HTCClusterInitSpecs]]] + Label = HTC Cluster-Init + DefaultValue = =undefined + Description = Cluster init specs to apply to HTC execute nodes + ParameterType = Cloud.ClusterInitSpecs + + [[[parameter HPCClusterInitSpecs]]] + Label = HPC Cluster-Init + DefaultValue = =undefined + Description = Cluster init specs to apply to HPC execute nodes + ParameterType = Cloud.ClusterInitSpecs + + + [[[parameter configuration_slurm_disable_pmc]]] + Label = Disable PMC + Description = Disable packages from packages.microsoft.com + ParameterType = Boolean + DefaultValue = true + + + [[parameters Advanced Networking]] + + [[[parameter ReturnProxy]]] + Label = Return Proxy + DefaultValue = false + ParameterType = Boolean + Config.Label = Use SSH tunnel to connect to CycleCloud (required if direct access is blocked) + + [[[parameter UsePublicNetwork]]] + Label = Public Head Node + DefaultValue = false + ParameterType = Boolean + Config.Label = Access scheduler node from the Internet + + [[[parameter ExecuteNodesPublic]]] + Label = Public Execute + DefaultValue = false + ParameterType = Boolean + Config.Label = Access execute nodes from the Internet + + + [[parameters Node Health Checks]] + Description = "Section for configuring Node Health Checks" + Order = 12 + + [[[parameter EnableNodeHealthChecks]]] + Label = Enable NHC tests + DefaultValue = false + Widget.Plugin = pico.form.BooleanCheckBox + Widget.Label = Run Node Health Checks on 
startup diff --git a/cloud_bursting/slurm-23.11.9-1/images/NFSSettings.png b/cloud_bursting/slurm-23.11.9-1/images/NFSSettings.png new file mode 100644 index 00000000..8d160546 Binary files /dev/null and b/cloud_bursting/slurm-23.11.9-1/images/NFSSettings.png differ diff --git a/cloud_bursting/slurm-23.11.9-1/images/architecture_diagram.png b/cloud_bursting/slurm-23.11.9-1/images/architecture_diagram.png new file mode 100644 index 00000000..d28f79bb Binary files /dev/null and b/cloud_bursting/slurm-23.11.9-1/images/architecture_diagram.png differ diff --git a/cloud_bursting/slurm-23.11.9-1/images/nodecreation.png b/cloud_bursting/slurm-23.11.9-1/images/nodecreation.png new file mode 100644 index 00000000..3e302fe3 Binary files /dev/null and b/cloud_bursting/slurm-23.11.9-1/images/nodecreation.png differ diff --git a/cloud_bursting/slurm-23.11.9-1/scheduler/cyclecloud-integrator.sh b/cloud_bursting/slurm-23.11.9-1/scheduler/cyclecloud-integrator.sh new file mode 100644 index 00000000..921aa219 --- /dev/null +++ b/cloud_bursting/slurm-23.11.9-1/scheduler/cyclecloud-integrator.sh @@ -0,0 +1,115 @@ +#!/bin/bash +# ----------------------------------------------------------------------------- +# Script: Install CycleCloud Autoscaler and Integrate with Slurm Scheduler +# +# This script automates the installation of the CycleCloud Autoscaler package, +# a key component used to dynamically scale compute resources in a cluster managed +# by the CycleCloud environment. It integrates with the Slurm scheduler to ensure +# efficient scaling based on workload demands. +# +# Key Features: +# - Installs the CycleCloud Autoscaler package. +# - Configures integration with the Slurm workload manager for automated scaling. +# - Ensures that the compute resources in the cluster can scale up or down based +# on the job queue and resource usage, optimizing both performance and cost. +# +# Prerequisites: +# - Root or sudo privileges are required to execute the installation steps. +# - Slurm scheduler should already be set up in the environment. +# +# Usage: +# sh cyclecloud-integrator.sh +# ----------------------------------------------------------------------------- +set -e +if [ $(whoami) != root ]; then + echo "Please run as root" + exit 1 +fi + + +# Prompt user to enter CycleCloud details for Slurm scheduler integration +echo "Please enter the CycleCloud details to integrate with the Slurm scheduler" +echo " " +# Prompt for Cluster Name +read -p "Enter Cluster Name: " cluster_name + +# Prompt for Username +read -p "Enter CycleCloud Username: " username + +# Prompt for Password (masked input) +echo -n "Enter CycleCloud password: " +stty -echo # Turn off echo +read password +stty echo # Turn echo back on +echo +echo "Password entered." 
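+# The credentials collected above are only used to generate the autoscaler
+# configuration via `azslurm initconfig` further down in this script.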
+# Prompt for URL +read -p "Enter CycleCloud IP (e.g.,10.222.1.19): " ip +url="https://$ip" + +# Display summary of entered details +echo "------------------------------------------------------------------------------------------------------------------------------" +echo " " +echo "Summary of entered details:" +echo "Cluster Name: $cluster_name" +echo "CycleCloud Username: $username" +echo "CycleCloud URL: $url" +echo " " +echo "------------------------------------------------------------------------------------------------------------------------------" + +# Define variables + +slurm_autoscale_pkg_version="3.0.9" +slurm_autoscale_pkg="azure-slurm-pkg-$slurm_autoscale_pkg_version.tar.gz" +slurm_script_dir="/opt/azurehpc/slurm" +config_dir="/sched/$cluster_name" + +# Create necessary directories +mkdir -p "$slurm_script_dir" + +# Activate Python virtual environment for Slurm integration +echo "------------------------------------------------------------------------------------------------------------------------------" +echo "Configuring virtual enviornment and Activating Python virtual environment" +echo "------------------------------------------------------------------------------------------------------------------------------" +python3 -m venv "$slurm_script_dir/venv" +. "$slurm_script_dir/venv/bin/activate" + +# Download and install CycleCloud Slurm integration package +echo "------------------------------------------------------------------------------------------------------------------------------" +echo "Downloading and installing CycleCloud Slurm integration package" +echo "------------------------------------------------------------------------------------------------------------------------------" + +wget https://github.com/Azure/cyclecloud-slurm/releases/download/$slurm_autoscale_pkg_version/$slurm_autoscale_pkg -P "$slurm_script_dir" +tar -xvf "$slurm_script_dir/$slurm_autoscale_pkg" -C "$slurm_script_dir" +cd "$slurm_script_dir/azure-slurm" +head -n -30 install.sh > integrate-cc.sh +chmod +x integrate-cc.sh +./integrate-cc.sh +#cleanup +rm -rf azure-slurm* + +# Initialize autoscaler configuration +echo "------------------------------------------------------------------------------------------------------------------------------" +echo "Initializing autoscaler configuration" +echo "------------------------------------------------------------------------------------------------------------------------------" + +azslurm initconfig --username "$username" --password "$password" --url "$url" --cluster-name "$cluster_name" --config-dir "$config_dir" --default-resource '{"select": {}, "name": "slurm_gpus", "value": "node.gpu_count"}' > "$slurm_script_dir/autoscale.json" +chown slurm:slurm "$slurm_script_dir/autoscale.json" +chown -R slurm:slurm "$slurm_script_dir" +# Connect and scale +echo "------------------------------------------------------------------------------------------------------------------------------" +echo "Connecting to CycleCloud and scaling resources" +echo "------------------------------------------------------------------------------------------------------------------------------" + +azslurm connect +azslurm scale --no-restart +chown -R slurm:slurm "$slurm_script_dir"/logs/*.log + +systemctl restart munge +systemctl restart slurmctld +echo " " +echo "------------------------------------------------------------------------------------------------------------------------------" +echo "Slurm scheduler integration with CycleCloud completed successfully" +echo " 
Create User and Group for job submission. Make sure that GID and UID is consistent across all nodes and home directory is shared" +echo "------------------------------------------------------------------------------------------------------------------------------" +echo " " \ No newline at end of file diff --git a/cloud_bursting/slurm-23.11.9-1/scheduler/slurm-scheduler-builder.sh b/cloud_bursting/slurm-23.11.9-1/scheduler/slurm-scheduler-builder.sh new file mode 100644 index 00000000..15bedfbc --- /dev/null +++ b/cloud_bursting/slurm-23.11.9-1/scheduler/slurm-scheduler-builder.sh @@ -0,0 +1,387 @@ +#!/bin/sh +# ----------------------------------------------------------------------------- +# Script: Install and Configure Slurm Scheduler +# +# This script automates the installation and configuration of the Slurm scheduler +# on your VM or machine. It sets up the Slurm software to manage and schedule +# workloads efficiently across the available resources in your environment. +# +# Key Features: +# - Installs and configures Slurm software on your VM or machine. +# - Sets up the Slurm configuration to manage compute resources. +# +# Prerequisites: +# - Root or sudo privileges are required to run the script. +# +# Usage: +# # sh slurm-scheduler-builder.sh +# ----------------------------------------------------------------------------- + + +set -e +if [ $(whoami) != root ]; then + echo "Please run as root" + exit 1 +fi + +# Check if the script is running on a supported OS with the required version of almaLinux 8.7 or Ubuntu 22.04 + +# Check if /etc/os-release exists +if [ ! -e /etc/os-release ]; then + echo "This script only supports AlmaLinux 8.7 or Ubuntu 22.04" + exit 1 +fi + +# Source /etc/os-release to get OS information +. /etc/os-release + +# Check OS name and version +if { [ "$ID" = "almalinux" ] && [ "$VERSION_ID" = "8.7" ]; } || \ + { [ "$ID" = "ubuntu" ] && [ "$VERSION_ID" = "22.04" ]; }; then + echo "OS version is supported." +else + echo "This script only supports AlmaLinux 8.7 or Ubuntu 22.04" + exit 1 +fi + +echo "------------------------------------------------------------------------------------------------------------------------------" +echo "Building Slurm scheduler for cloud bursting with Azure CycleCloud" +echo "------------------------------------------------------------------------------------------------------------------------------" +echo " " +# Prompt for Cluster Name +read -p "Enter Cluster Name: " cluster_name + +ip_address=$(hostname -I | awk '{print $1}') +echo "------------------------------------------------------------------------------------------------------------------------------" +echo " " +echo "Summary of entered details:" +echo "Cluster Name: $cluster_name" +echo "Scheduler Hostname: $(hostname)" +echo "NFSServer IP Address: $ip_address" +echo " " +echo "------------------------------------------------------------------------------------------------------------------------------" + +sched_dir="/sched/$cluster_name" +slurm_conf="$sched_dir/slurm.conf" +munge_key="/etc/munge/munge.key" +slurm_script_dir="/opt/azurehpc/slurm" +OS_ID=$(cat /etc/os-release | grep ^ID= | cut -d= -f2 | cut -d\" -f2 | cut -d. 
-f1) +OS_VERSION=$(cat /etc/os-release | grep VERSION_ID | cut -d= -f2 | cut -d\" -f2) +SLURM_VERSION="23.11.9-1" + +# Create directories +mkdir -p "$sched_dir" + +# Create Munge and Slurm users +echo "------------------------------------------------------------------------------------------------------------------------------" +echo "Creating Munge and Slurm users" +echo "------------------------------------------------------------------------------------------------------------------------------" + +#check if the users already exist +#uid 11100 is used for slurm and 11101 is used for munge in cyclecloud + +# Function to check and create a user and group +create_user_and_group() { + username=$1 + uid=$2 + gid=$3 + + if id "$username" >/dev/null 2>&1; then + echo "$username user already exists" + else + echo "Creating $username user and group..." + groupadd -g "$gid" "$username" + useradd -u "$uid" -g "$gid" -s /bin/false -M "$username" + echo "$username user and group created" + fi +} + +# Check and create 'munge' user and group if necessary +create_user_and_group "munge" 11101 11101 + +# Check and create 'slurm' user and group if necessary +create_user_and_group "slurm" 11100 11100 + +# Set up NFS server +echo "------------------------------------------------------------------------------------------------------------------------------" +echo "Setting up NFS server" +echo "------------------------------------------------------------------------------------------------------------------------------" +if [ "$OS_ID" = "almalinux" ]; then + dnf install -y nfs-utils +elif [ "$OS_ID" = "ubuntu" ]; then + apt-get install -y nfs-kernel-server +fi +mkdir -p /sched /shared +echo "/sched *(rw,sync,no_root_squash)" >> /etc/exports +echo "/shared *(rw,sync,no_root_squash)" >> /etc/exports +systemctl restart nfs-server.service +systemctl enable nfs-server.service +echo "NFS server setup complete" +showmount -e localhost + +# setting up Microsoft repo +# Set up Microsoft repository based on the OS +echo "Setting up Microsoft repo" + +# Check if OS is AlmaLinux +if [ "$OS_ID" = "almalinux" ]; then + echo "Detected AlmaLinux" + if [ ! -e /etc/yum.repos.d/microsoft-prod.repo ]; then + echo "Downloading and installing Microsoft repo for AlmaLinux..." + curl -sSL -O https://packages.microsoft.com/config/rhel/$(echo "$OS_VERSION" | cut -d. -f1)/packages-microsoft-prod.rpm + rpm -i packages-microsoft-prod.rpm + rm -f packages-microsoft-prod.rpm + echo "Microsoft repo setup complete for AlmaLinux" + else + echo "Microsoft repo already exists on AlmaLinux" + fi + +# Check if OS is Ubuntu +elif [ "$OS_ID" = "ubuntu" ]; then + echo "Detected Ubuntu" + if [ ! -e /etc/apt/sources.list.d/microsoft-prod.list ]; then + echo "Downloading and installing Microsoft repo for Ubuntu..." 
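+        # OS_VERSION was validated as 22.04 earlier in this script, so this resolves to
+        # https://packages.microsoft.com/config/ubuntu/22.04/packages-microsoft-prod.deb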
+ curl -sSL -O https://packages.microsoft.com/config/ubuntu/$OS_VERSION/packages-microsoft-prod.deb + dpkg -i packages-microsoft-prod.deb + rm -f packages-microsoft-prod.deb + echo "Microsoft repo setup complete for Ubuntu" + else + echo "Microsoft repo already exists on Ubuntu" + fi + +# If OS is neither AlmaLinux nor Ubuntu +else + echo "Unsupported OS: $OS_ID" + exit 1 +fi + +# Install and configure Munge +echo "------------------------------------------------------------------------------------------------------------------------------" +echo "Installing and configuring Munge" +echo "------------------------------------------------------------------------------------------------------------------------------" +if [ "$OS_ID" = "almalinux" ]; then + dnf install -y epel-release + dnf install -y munge munge-libs +elif [ "$OS_ID" = "ubuntu" ]; then + apt-get update + apt-get install -y munge +else + echo "Unsupported OS: $OS_ID" + exit 1 +fi + +# Generate the munge key and set proper permissions +dd if=/dev/urandom bs=1 count=1024 of="$munge_key" +chown munge:munge "$munge_key" +chmod 400 "$munge_key" + +# Start and enable the munge service +systemctl start munge +systemctl enable munge + +# Copy the munge key to the sched directory +cp "$munge_key" "$sched_dir/munge.key" +chown munge: "$sched_dir/munge.key" +chmod 400 "$sched_dir/munge.key" + +echo "------------------------------------------------------------------------------------------------------------------------------" +echo "Munge installed and configured" +echo "------------------------------------------------------------------------------------------------------------------------------" + +# Install and configure Slurm +echo "------------------------------------------------------------------------------------------------------------------------------" +echo "Installing Slurm" +echo "------------------------------------------------------------------------------------------------------------------------------" + +# Installing Slurm on AlmaLinux +if [ "$OS_ID" = "almalinux" ]; then + echo "Installing Slurm on AlmaLinux" + echo "Setting up Slurm repository..." + + # Create Slurm repository file for AlmaLinux + cat < /etc/yum.repos.d/slurm.repo +[slurm] +name=Slurm Workload Manager +baseurl=https://packages.microsoft.com/yumrepos/slurm-el8-insiders +enabled=1 +gpgcheck=1 +gpgkey=https://packages.microsoft.com/keys/microsoft.asc +priority=10 +EOF + + echo "Slurm repository setup complete." + echo "------------------------------------------------------------------------------------------------------------------------------" + echo "Installing Slurm packages" + echo "------------------------------------------------------------------------------------------------------------------------------" + + # List of Slurm packages to install + slurm_packages="slurm slurm-slurmrestd slurm-libpmi slurm-devel slurm-pam_slurm slurm-perlapi slurm-torque slurm-openlava slurm-example-configs" + sched_packages="slurm-slurmctld slurm-slurmdbd" + + # Install Slurm packages on AlmaLinux + OS_MAJOR_VERSION=$(echo "$OS_VERSION" | cut -d. 
-f1) + for pkg in $slurm_packages; do + yum -y install $pkg-${SLURM_VERSION}.el${OS_MAJOR_VERSION} --disableexcludes=slurm + done + for pkg in $sched_packages; do + yum -y install $pkg-${SLURM_VERSION}.el${OS_MAJOR_VERSION} --disableexcludes=slurm + done + +# Installing Slurm on Ubuntu +elif [ "$OS_ID" = "ubuntu" ]; then + echo "Installing Slurm on Ubuntu" + REPO="slurm-ubuntu-jammy" + + echo "Setting up Slurm repository for Ubuntu..." + # Add Slurm repository + echo "deb [arch=amd64] https://packages.microsoft.com/repos/$REPO/ insiders main" > /etc/apt/sources.list.d/slurm.list + + # Set package pinning preferences + echo "\ +Package: slurm, slurm-* +Pin: origin \"packages.microsoft.com\" +Pin-Priority: 990 + +Package: slurm, slurm-* +Pin: origin *ubuntu.com* +Pin-Priority: -1" > /etc/apt/preferences.d/slurm-repository-pin-990 + + echo "Slurm repository setup completed." + echo "------------------------------------------------------------------------------------------------------------------------------" + echo "Installing Slurm packages" + echo "------------------------------------------------------------------------------------------------------------------------------" + + # Update package lists and install Slurm + # remove the need restart prompt for outdated libraries + grep -qxF "\$nrconf{restart} = 'a';" /etc/needrestart/conf.d/no-prompt.conf || echo "\$nrconf{restart} = 'a';" | sudo tee -a /etc/needrestart/conf.d/no-prompt.conf > /dev/null + apt-get update + apt install -y libhwloc15 libmysqlclient-dev libssl-dev jq python3-venv chrony + systemctl enable chrony + systemctl start chrony + slurm_packages="slurm-smd slurm-smd-client slurm-smd-dev slurm-smd-libnss-slurm slurm-smd-libpam-slurm-adopt slurm-smd-slurmrestd slurm-smd-sview slurm-smd-slurmctld slurm-smd-slurmdbd" + for pkg in $slurm_packages; do + apt-get update + DEBIAN_FRONTEND=noninteractive apt install -y $pkg=$SLURM_VERSION + DEBIAN_FRONTEND=noninteractive apt-mark hold $pkg + done + + # Unsupported OS + +else + echo "Unsupported OS: $OS_ID" + exit 1 +fi + +echo "------------------------------------------------------------------------------------------------------------------------------" +echo "Slurm installation completed." +echo "------------------------------------------------------------------------------------------------------------------------------" + +# Configure Slurm +echo "------------------------------------------------------------------------------------------------------------------------------" +echo "Configuring Slurm" +echo "------------------------------------------------------------------------------------------------------------------------------" + +cat < "$slurm_conf" +MpiDefault=none +ProctrackType=proctrack/cgroup +ReturnToService=2 +PropagateResourceLimits=ALL +SlurmctldPidFile=/var/run/slurmctld.pid +SlurmdPidFile=/var/run/slurmd.pid +SlurmdSpoolDir=/var/spool/slurmd +SlurmUser=slurm +StateSaveLocation=/var/spool/slurmctld +SwitchType=switch/none +TaskPlugin=task/affinity,task/cgroup +SchedulerType=sched/backfill +SelectType=select/cons_tres +GresTypes=gpu +SelectTypeParameters=CR_Core_Memory +# We use a "safe" form of the CycleCloud cluster_name throughout slurm. 
+# First we lowercase the cluster name, then replace anything +# that is not letters, digits and '-' with a '-' +# eg My Cluster == my-cluster +ClusterName=$cluster_name +JobAcctGatherType=jobacct_gather/none +SlurmctldDebug=debug +SlurmctldLogFile=/var/log/slurmctld/slurmctld.log +SlurmctldParameters=idle_on_node_suspend +SlurmdDebug=debug +SlurmdLogFile=/var/log/slurmd/slurmd.log +# TopologyPlugin=topology/tree +# If you use the TopologyPlugin you likely also want to use our +# job submit plugin so that your jobs run on a single switch +# or just add --switches 1 to your submission scripts +# JobSubmitPlugins=lua +PrivateData=cloud +PrologSlurmctld=/opt/azurehpc/slurm/prolog.sh +TreeWidth=65533 +ResumeTimeout=1800 +SuspendTimeout=600 +SuspendTime=300 +ResumeProgram=/opt/azurehpc/slurm/resume_program.sh +ResumeFailProgram=/opt/azurehpc/slurm/resume_fail_program.sh +SuspendProgram=/opt/azurehpc/slurm/suspend_program.sh +SchedulerParameters=max_switch_wait=24:00:00 +# Only used with dynamic node partitions. +MaxNodeCount=10000 +# This as the partition definitions managed by azslurm partitions > /sched/azure.conf +Include azure.conf +# If slurm.accounting.enabled=true this will setup slurmdbd +# otherwise it will just define accounting_storage/none as the plugin +Include accounting.conf +# SuspendExcNodes is managed in /etc/slurm/keep_alive.conf +# see azslurm keep_alive for more information. +# you can also remove this import to remove support for azslurm keep_alive +Include keep_alive.conf +EOF + +# Configure Hostname in slurmd.conf +echo "SlurmctldHost=$(hostname -s)" >> "$slurm_conf" + +# Create cgroup.conf +cat < "$sched_dir/cgroup.conf" +CgroupAutomount=no +ConstrainCores=yes +ConstrainRamSpace=yes +ConstrainDevices=yes +EOF + +echo "# Do not edit this file. It is managed by azslurm" >> "$sched_dir/keep_alive.conf" + +# Set limits for Slurm +cat < /etc/security/limits.d/slurm-limits.conf +* soft memlock unlimited +* hard memlock unlimited +EOF + +# Add accounting configuration +echo "AccountingStorageType=accounting_storage/none" >> "$sched_dir/accounting.conf" + +# Set permissions and create symlinks + +ln -s "$slurm_conf" /etc/slurm/slurm.conf +ln -s "$sched_dir/keep_alive.conf" /etc/slurm/keep_alive.conf +ln -s "$sched_dir/cgroup.conf" /etc/slurm/cgroup.conf +ln -s "$sched_dir/accounting.conf" /etc/slurm/accounting.conf +ln -s "$sched_dir/azure.conf" /etc/slurm/azure.conf +ln -s "$sched_dir/gres.conf" /etc/slurm/gres.conf +touch "$sched_dir"/gres.conf "$sched_dir"/azure.conf +chown slurm:slurm "$sched_dir"/*.conf +chmod 644 "$sched_dir"/*.conf +chown slurm:slurm /etc/slurm/*.conf + +# Set up log and spool directories +mkdir -p /var/spool/slurmd /var/spool/slurmctld /var/log/slurmd /var/log/slurmctld +chown slurm:slurm /var/spool/slurmd /var/spool/slurmctld /var/log/slurmd /var/log/slurmctld +echo " " +echo "------------------------------------------------------------------------------------------------------------------------------" +echo "Slurm configured" +echo "------------------------------------------------------------------------------------------------------------------------------" +echo " " +echo "------------------------------------------------------------------------------------------------------------------------------" +echo " Go to CycleCloud Portal and edit the $cluster_name cluster configuration to use the external scheduler and start the cluster." 
+echo " Use $ip_address IP Address for File-system Mount for /sched and /shared in Network Attached Storage section in CycleCloud GUI " +echo " Once the cluster is started, proceed to run cyclecloud-integrator.sh script to complete the integration with CycleCloud." +echo "------------------------------------------------------------------------------------------------------------------------------" +echo " " \ No newline at end of file diff --git a/cloud_bursting/slurm-23.11.9-1/scheduler/useradd_example.sh b/cloud_bursting/slurm-23.11.9-1/scheduler/useradd_example.sh new file mode 100644 index 00000000..8d68c468 --- /dev/null +++ b/cloud_bursting/slurm-23.11.9-1/scheduler/useradd_example.sh @@ -0,0 +1,46 @@ +#!/bin/bash +# ----------------------------------------------------------------------------- +# Script: Create Shared Home Directory for a Test User in Slurm Scheduler +# +# This script creates a new user in a Slurm scheduler environment, setting up a +# shared home directory. The user is configured with a specific username, GID, +# and UID. It is primarily designed for environments like CycleCloud where +# consistent user IDs are important (starting with 20001 for the first user). +# +# CycleCloud Convention: +# - For the first user, the UID and GID default to 20001 in CycleCloud. Modify +# these values as needed for additional users. +# +# Prerequisites: +# - Script must be run with root privileges. +# - The desired UID, GID, and username should be set before execution. +# ----------------------------------------------------------------------------- + +set -e +if [ $(whoami) != root ]; then + echo "Please run as root" + exit 1 +fi + +# test user details +username="user1" +gid=20001 +uid=20001 + +mkdir -p /shared/home/$username +chmod 755 /shared/home/ + +# Create group if not exists +if ! getent group $gid >/dev/null; then + groupadd -g $gid $username +fi + +# Create user with specified uid, gid, home directory, and shell +useradd -g $gid -u $uid -d /shared/home/$username -s /bin/bash $username +chown -R $username:$username /shared/home/$username +# Switch to user to perform directory and file operations +su - $username -c "mkdir -p /shared/home/$username/.ssh" +su - $username -c "ssh-keygen -t rsa -N '' -f /shared/home/$username/.ssh/id_rsa" +su - $username -c "cat /shared/home/$username/.ssh/id_rsa.pub >> /shared/home/$username/.ssh/authorized_keys" +su - $username -c "chmod 600 /shared/home/$username/.ssh/authorized_keys" +su - $username -c "chmod 700 /shared/home/$username/.ssh" \ No newline at end of file