Merged
8 changes: 4 additions & 4 deletions README.md
@@ -5,7 +5,7 @@ CycleCloud Slurm Clusters in Azure
This project sets up an auto-scaling Slurm cluster
Slurm is a highly configurable open source workload manager. See the [Slurm project site](https://www.schedmd.com/) for an overview.
# Table of Contents:
1. [Managing Slurm Clusters in 4.0.4](#managing-slurm-clusters)
1. [Managing Slurm Clusters in 4.0.5](#managing-slurm-clusters)
1. [Making Cluster Changes](#making-cluster-changes)
2. [No longer pre-creating execute nodes](#no-longer-pre-creating-execute-nodes)
3. [Creating additional partitions](#creating-additional-partitions)
@@ -36,7 +36,7 @@ Slurm is a highly configurable open source workload manager. See the [Slurm proj
8. [Capturing logs and configuration for troubleshooting](#capturing-logs-and-configuration-data-for-troubleshooting)
6. [Contributing](#contributing)
---
## Managing Slurm Clusters in 4.0.4
## Managing Slurm Clusters in 4.0.5

### Making Cluster Changes
In CycleCloud, cluster changes can be made using the "Edit" dialog from the cluster page in the GUI or from the CycleCloud CLI. Cluster topology changes, such as new partitions, generally require editing and re-importing the cluster template. This can be applied to live, running clusters as well as terminated clusters. It is also possible to import changes as a new Template for future cluster creation via the GUI.
@@ -362,12 +362,12 @@ Cyclecloud Slurm clusters now include prolog and epilog scripts to enable and cl


### Setting KeepAlive
Added in 4.0.4: If the KeepAlive attribute is set in the CycleCloud UI, `azslurmd` adds that node's name to the `SuspendExcNodes` attribute via `scontrol`. This requires `ReconfigFlags=KeepPowerSaveSettings` to be set in slurm.conf, which is the default as of 4.0.4. Once KeepAlive is set back to false, `azslurmd` removes the node from `SuspendExcNodes`.
Added in 4.0.5: If the KeepAlive attribute is set in the CycleCloud UI, `azslurmd` adds that node's name to the `SuspendExcNodes` attribute via `scontrol`. This requires `ReconfigFlags=KeepPowerSaveSettings` to be set in slurm.conf, which is the default as of 4.0.5. Once KeepAlive is set back to false, `azslurmd` removes the node from `SuspendExcNodes`.

If a node is added to `SuspendExcNodes` either via `azslurm keep_alive` or via the `scontrol` command, `azslurmd` will not remove it from `SuspendExcNodes` while KeepAlive is false in CycleCloud. However, if the node is later set to KeepAlive true in the UI, `azslurmd` will remove it from `SuspendExcNodes` once KeepAlive is set back to false.
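
The rules above can be modelled as a small set reconciliation. The following is only an illustrative sketch, not the actual `azslurmd` code: the function name `reconcile_keep_alive` and the idea of remembering the previous pass's KeepAlive set are assumptions made for the example.

```python
from typing import Set, Tuple


def reconcile_keep_alive(
    keep_alive_now: Set[str],
    keep_alive_before: Set[str],
    suspend_exc_nodes: Set[str],
) -> Tuple[Set[str], Set[str]]:
    """Hypothetical model: return (new SuspendExcNodes, KeepAlive set to remember)."""
    # Nodes whose KeepAlive flag was true on the previous pass and is false now.
    released = keep_alive_before - keep_alive_now
    # Add every KeepAlive node and drop only the nodes that were just released.
    # Entries added by hand (never seen with KeepAlive=true) are left alone.
    updated = (suspend_exc_nodes | keep_alive_now) - released
    return updated, set(keep_alive_now)


# "htc-2" was excluded by hand; "hpc-1" has KeepAlive set in the UI.
exc, seen = reconcile_keep_alive({"hpc-1"}, set(), {"htc-2"})
# KeepAlive on "hpc-1" is switched off again: only "hpc-1" is removed.
exc, seen = reconcile_keep_alive(set(), seen, exc)
```

In this model a hand-added node stays excluded until it has been seen with KeepAlive true at least once, which matches the behaviour described above.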

### Slurmrestd
As of version 4.0.4, `slurmrestd` is automatically configured and started on the scheduler node and scheduler-ha node for all Slurm clusters. This REST API service provides programmatic access to Slurm functionality, allowing external applications and tools to interact with the cluster. For more information on the Slurm REST API, see the [official Slurm REST API documentation](https://slurm.schedmd.com/rest_api.html).
As of version 4.0.5, `slurmrestd` is automatically configured and started on the scheduler node and scheduler-ha node for all Slurm clusters. This REST API service provides programmatic access to Slurm functionality, allowing external applications and tools to interact with the cluster. For more information on the Slurm REST API, see the [official Slurm REST API documentation](https://slurm.schedmd.com/rest_api.html).
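
As a rough illustration of programmatic access, the sketch below builds a request for the `slurmrestd` ping endpoint. Several details are assumptions rather than facts from this README: the port (6820), the API version path (`v0.0.40`), plain HTTP, and JWT authentication through the `X-SLURM-USER-NAME` and `X-SLURM-USER-TOKEN` headers. Check the cluster's actual `slurmrestd` configuration before relying on any of them.

```python
import urllib.request


def build_ping_request(host, user, token, port=6820, api_version="v0.0.40"):
    """Build (but do not send) a GET request for the slurmrestd ping endpoint.

    The port and api_version defaults are assumptions for this example.
    """
    url = f"http://{host}:{port}/slurm/{api_version}/ping"
    req = urllib.request.Request(url, method="GET")
    # JWT-style headers; assumes the cluster has JWT authentication enabled.
    req.add_header("X-SLURM-USER-NAME", user)
    req.add_header("X-SLURM-USER-TOKEN", token)
    return req


# Hypothetical host, user, and token values, for illustration only.
req = build_ping_request("scheduler", "alice", "example-token")
# To send it from a machine that can reach the scheduler:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.read().decode())
```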

### Node Health Checks

2 changes: 1 addition & 1 deletion azure-slurm-install/setup.py
@@ -7,7 +7,7 @@
from setuptools.command.test import Command
from setuptools.command.test import test as TestCommand # noqa: N812

__version__ = "4.0.4"
__version__ = "4.0.5"
CWD = os.path.dirname(os.path.abspath(__file__))


2 changes: 1 addition & 1 deletion azure-slurm/setup.py
@@ -7,7 +7,7 @@
from setuptools.command.test import Command
from setuptools.command.test import test as TestCommand # noqa: N812

__version__ = "4.0.4"
__version__ = "4.0.5"
CWD = os.path.dirname(os.path.abspath(__file__))


2 changes: 1 addition & 1 deletion azure-slurm/slurmcc/cli.py
@@ -39,7 +39,7 @@
from . import topology


VERSION = "4.0.4"
VERSION = "4.0.5"


def csv_list(x: str) -> List[str]:
4 changes: 2 additions & 2 deletions project.ini
@@ -1,11 +1,11 @@
[project]
name = slurm
label = Slurm
version = 4.0.4
version = 4.0.5
type = scheduler

[blobs]
Files = azure-slurm-pkg-4.0.4.tar.gz, azure-slurm-install-pkg-4.0.4.tar.gz
Files = azure-slurm-pkg-4.0.5.tar.gz, azure-slurm-install-pkg-4.0.5.tar.gz

[spec scheduler]
run_list = role[slurm_scheduler_role]
@@ -1,6 +1,6 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
default[:slurm][:autoscale_version] = "4.0.4"
default[:slurm][:autoscale_version] = "4.0.5"
default[:slurm][:version] = "23.11.9-1"
default[:slurm][:user][:name] = 'slurm'
default[:slurm][:cyclecloud_api] = "cyclecloud_api-8.4.1-py2.py3-none-any.whl"
2 changes: 1 addition & 1 deletion specs/default/chef/site-cookbooks/slurm/metadata.rb
@@ -4,7 +4,7 @@
license 'All Rights Reserved'
description 'Installs/Configures slurm'
long_description 'Installs/Configures slurm'
version '4.0.4'
version '4.0.5'
chef_version '>= 12.1' if respond_to?(:chef_version)

%w{ cuser cshared }.each {|c| depends c}
2 changes: 1 addition & 1 deletion specs/default/cluster-init/files/install-non-scheduler.sh
@@ -5,7 +5,7 @@ mode=$1
echo $mode | grep -Eqw "login|execute" || (echo "Usage: $0 [login|execute]" && exit 1)

do_install=$(jetpack config slurm.do_install True)
install_pkg=$(jetpack config slurm.install_pkg azure-slurm-install-pkg-4.0.4.tar.gz)
install_pkg=$(jetpack config slurm.install_pkg azure-slurm-install-pkg-4.0.5.tar.gz)
slurm_project_name=$(jetpack config slurm.project_name slurm)


4 changes: 2 additions & 2 deletions specs/scheduler/cluster-init/scripts/00-install.sh
@@ -2,8 +2,8 @@
set -e

do_install=$(jetpack config slurm.do_install True)
install_pkg=$(jetpack config slurm.install_pkg azure-slurm-install-pkg-4.0.4.tar.gz)
autoscale_pkg=$(jetpack config slurm.autoscale_pkg azure-slurm-pkg-4.0.4.tar.gz)
install_pkg=$(jetpack config slurm.install_pkg azure-slurm-install-pkg-4.0.5.tar.gz)
autoscale_pkg=$(jetpack config slurm.autoscale_pkg azure-slurm-pkg-4.0.5.tar.gz)
slurm_project_name=$(jetpack config slurm.project_name slurm)

find_python3() {
6 changes: 3 additions & 3 deletions templates/slurm-cs.txt
@@ -22,8 +22,8 @@ Autoscale = $Autoscale

[[[configuration]]]

slurm.install_pkg = azure-slurm-install-pkg-4.0.4.tar.gz
slurm.autoscale_pkg = azure-slurm-pkg-4.0.4.tar.gz
slurm.install_pkg = azure-slurm-install-pkg-4.0.5.tar.gz
slurm.autoscale_pkg = azure-slurm-pkg-4.0.5.tar.gz

slurm.version = $configuration_slurm_version
slurm.accounting.enabled = $configuration_slurm_accounting_enabled
@@ -40,7 +40,7 @@ Autoscale = $Autoscale
# For fast spin-up after Deallocate, force an immediate re-converge on boot
cyclecloud.converge_on_boot = false

[[[cluster-init cyclecloud/slurm:default:4.0.4]]]
[[[cluster-init cyclecloud/slurm:default:4.0.5]]]
Optional = true


8 changes: 4 additions & 4 deletions templates/slurm.txt
@@ -71,7 +71,7 @@ Autoscale = $Autoscale

[[[cluster-init cyclecloud/healthagent:default]]]
[[[cluster-init cyclecloud/monitoring:default]]]
[[[cluster-init cyclecloud/slurm:default:4.0.4]]]
[[[cluster-init cyclecloud/slurm:default:4.0.5]]]

[[[volume boot]]]
Size = ${ifThenElse(BootDiskSize > 0, BootDiskSize, undefined)}
@@ -126,7 +126,7 @@ Autoscale = $Autoscale

[[[cluster-init cyclecloud/healthagent:default]]]
[[[cluster-init cyclecloud/monitoring:default]]]
[[[cluster-init cyclecloud/slurm:scheduler:4.0.4]]]
[[[cluster-init cyclecloud/slurm:scheduler:4.0.5]]]

[[[network-interface eth0]]]
AssociatePublicIpAddress = $UsePublicNetwork
@@ -191,7 +191,7 @@ Autoscale = $Autoscale

[[[cluster-init cyclecloud/healthagent:default]]]
[[[cluster-init cyclecloud/monitoring:default]]]
[[[cluster-init cyclecloud/slurm:login:4.0.4]]]
[[[cluster-init cyclecloud/slurm:login:4.0.5]]]
[[[configuration]]]
slurm.role = login
autoscale.enabled = false
@@ -209,7 +209,7 @@ Autoscale = $Autoscale

[[[cluster-init cyclecloud/healthagent:default]]]
[[[cluster-init cyclecloud/monitoring:default]]]
[[[cluster-init cyclecloud/slurm:execute:4.0.4]]]
[[[cluster-init cyclecloud/slurm:execute:4.0.5]]]

[[[network-interface eth0]]]
AssociatePublicIpAddress = $ExecuteNodesPublic
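
Because this change repeats the same version string across many files, a check along the following lines could help spot a file that was missed. It is a minimal sketch and not part of this repository: `find_stale_versions` is a hypothetical helper that takes file contents as strings and reports the line numbers that still mention the old version.

```python
from typing import Dict, List


def find_stale_versions(files: Dict[str, str], old_version: str) -> Dict[str, List[int]]:
    """Map each file name to the 1-based line numbers that still contain old_version."""
    stale: Dict[str, List[int]] = {}
    for name, text in files.items():
        hits = [
            lineno
            for lineno, line in enumerate(text.splitlines(), start=1)
            if old_version in line
        ]
        if hits:
            stale[name] = hits
    return stale


# Example input: project.ini still references a 4.0.4 package on its second line.
sample = {
    "project.ini": "version = 4.0.5\nFiles = azure-slurm-pkg-4.0.4.tar.gz\n",
    "setup.py": '__version__ = "4.0.5"\n',
}
stale = find_stale_versions(sample, "4.0.4")
```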