Not sure if this is the right channel, but I'm implementing my own CycleCloud project to retroactively add support for Slurm job priority and DefCpuPerGPU after the cyclecloud-slurm install finishes. I didn't see a way to do this natively, and doing it natively would probably be cleaner. I don't currently use dynamic partitions because, as far as I know, DefCpuPerGPU is not implemented there either. For example, if I had a dynamic partition with NC A100 v4 series VMs, DefCpuPerGPU should be set to 24, but I don't see any way to ensure that. GPU users generally don't care about CPU/memory as long as they get the maximum allowed per GPU.

The CycleCloud project I implemented sets up partitions and priority as shown in [1]: it calculates those values and appends them to the end of the partition config. I also added a priority.conf Slurm config [2] and added partitions for our reserved static (keep-alive) Slurm nodes; this also seems to be missing, or at least I could not find how to configure it natively. Would appreciate thoughts and opinions.
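For context, the calculation the project performs can be sketched roughly as below. This is a minimal illustration, not the actual project code: the `partition_defaults` helper is hypothetical, and the VM numbers are just the NC24ads A100 v4 figures that appear in the config in [1].

```python
# Sketch: derive DefCpuPerGPU / DefMemPerCPU for a partition from the VM's
# resources, then build the partition line to append to the Slurm config.
# (Illustrative only -- the helper name and hard-coded specs are assumptions.)

def partition_defaults(cpus, real_memory_mb, gpus):
    """Give GPU jobs an even share of the node: CPUs per GPU and MB per CPU."""
    def_cpu_per_gpu = cpus // gpus if gpus else 0
    def_mem_per_cpu = real_memory_mb // cpus
    return def_cpu_per_gpu, def_mem_per_cpu

# Standard_NC24ads_A100_v4: 24 vCPUs, 214016 MB usable RAM, 1 A100 GPU
cpu_per_gpu, mem_per_cpu = partition_defaults(24, 214016, 1)

partition_line = (
    f"PartitionName=a100-ond Nodes=nc24adsv4ond-[1-4] Default=NO "
    f"DefMemPerCPU={mem_per_cpu} MaxTime=INFINITE State=UP "
    f"DefCpuPerGPU={cpu_per_gpu}"
)
print(partition_line)
```

The point of the defaults: with DefCpuPerGPU=24 on a 24-CPU, 1-GPU node, a job submitted with only `--gres=gpu:1` gets the node's full CPU allotment without the user having to ask for it.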
[1]
PartitionName=cpu-ond Nodes=cpuond-[1-6] Default=YES DefMemPerCPU=3891 MaxTime=INFINITE State=UP
NodeName=cpuond-[1-6] Feature=cloud State=CLOUD CPUs=24 ThreadsPerCore=2 RealMemory=93388
PartitionName=cpu-spt Nodes=cpuspt-[1-10] Default=NO DefMemPerCPU=3891 MaxTime=INFINITE State=UP
NodeName=cpuspt-[1-10] Feature=cloud State=CLOUD CPUs=24 ThreadsPerCore=2 RealMemory=93388
PartitionName=t4-ond Nodes=nc16ast4v3ond-1 Default=NO DefMemPerCPU=6688 MaxTime=INFINITE State=UP DefCpuPerGPU=16
NodeName=nc16ast4v3ond-1 Feature=cloud State=CLOUD CPUs=16 ThreadsPerCore=1 RealMemory=107008 Gres=gpu:1
PartitionName=t4-spt Nodes=nc16ast4v3spt-[1-2] Default=NO DefMemPerCPU=6688 MaxTime=INFINITE State=UP DefCpuPerGPU=16
NodeName=nc16ast4v3spt-[1-2] Feature=cloud State=CLOUD CPUs=16 ThreadsPerCore=1 RealMemory=107008 Gres=gpu:1
PartitionName=a100-ond Nodes=nc24adsv4ond-[1-4] Default=NO DefMemPerCPU=8917 MaxTime=INFINITE State=UP DefCpuPerGPU=24
NodeName=nc24adsv4ond-[1-4] Feature=cloud State=CLOUD CPUs=24 ThreadsPerCore=1 RealMemory=214016 Gres=gpu:1
PartitionName=a100-spt Nodes=nc24adsv4spt-[1-8] Default=NO DefMemPerCPU=8917 MaxTime=INFINITE State=UP DefCpuPerGPU=24
NodeName=nc24adsv4spt-[1-8] Feature=cloud State=CLOUD CPUs=24 ThreadsPerCore=1 RealMemory=214016 Gres=gpu:1
PartitionName=a100x2-ond Nodes=nc48adsv4ond-[1-2] Default=NO DefMemPerCPU=8917 MaxTime=INFINITE State=UP DefCpuPerGPU=24
NodeName=nc48adsv4ond-[1-2] Feature=cloud State=CLOUD CPUs=48 ThreadsPerCore=1 RealMemory=428032 Gres=gpu:2
PartitionName=a100x2-spt Nodes=nc48adsv4spt-[1-4] Default=NO DefMemPerCPU=8917 MaxTime=INFINITE State=UP DefCpuPerGPU=24
NodeName=nc48adsv4spt-[1-4] Feature=cloud State=CLOUD CPUs=48 ThreadsPerCore=1 RealMemory=428032 Gres=gpu:2
PartitionName=a100x4-ond Nodes=nc96adsv4ond-1 Default=NO DefMemPerCPU=8917 MaxTime=INFINITE State=UP DefCpuPerGPU=24
NodeName=nc96adsv4ond-1 Feature=cloud State=CLOUD CPUs=96 ThreadsPerCore=1 RealMemory=856064 Gres=gpu:4
PartitionName=a100x4-spt Nodes=nc96adsv4spt-[1-2] Default=NO DefMemPerCPU=8917 MaxTime=INFINITE State=UP DefCpuPerGPU=24
NodeName=nc96adsv4spt-[1-2] Feature=cloud State=CLOUD CPUs=96 ThreadsPerCore=1 RealMemory=856064 Gres=gpu:4
PartitionName=a100x8-rsv Nodes=nd96asrv4-[1-8] Default=NO DefMemPerCPU=9120 MaxTime=INFINITE State=UP DefCpuPerGPU=12
NodeName=nd96asrv4-[1-8] Feature=cloud State=CLOUD CPUs=96 ThreadsPerCore=1 RealMemory=875520 Gres=gpu:8
PartitionName=a10-ond Nodes=nv36adsv5ond-[1-75] Default=NO DefMemPerCPU=23779 MaxTime=INFINITE State=UP DefCpuPerGPU=18
NodeName=nv36adsv5ond-[1-75] Feature=cloud State=CLOUD CPUs=18 ThreadsPerCore=2 RealMemory=428032 Gres=gpu:1
PartitionName=a10-spt Nodes=nv36adsv5spt-[1-75] Default=NO DefMemPerCPU=23779 MaxTime=INFINITE State=UP DefCpuPerGPU=18
NodeName=nv36adsv5spt-[1-75] Feature=cloud State=CLOUD CPUs=18 ThreadsPerCore=2 RealMemory=428032 Gres=gpu:1
PartitionName=a10x2-rsv Nodes=nv72adsv5-[1-3] Default=NO DefMemPerCPU=23779 MaxTime=INFINITE State=UP DefCpuPerGPU=18
NodeName=nv72adsv5-[1-3] Feature=cloud State=CLOUD CPUs=36 ThreadsPerCore=2 RealMemory=856064 Gres=gpu:2
PartitionName=a100x8-rsv-HP Nodes=nd96asrv4-[1-8] Default=NO DefMemPerCPU=9120 MaxTime=INFINITE State=UP DefCpuPerGPU=12 Priority=5000
PartitionName=a10x2-rsv-HP Nodes=nv72adsv5-[1-3] Default=NO DefMemPerCPU=23779 MaxTime=INFINITE State=UP DefCpuPerGPU=18 Priority=5000
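For what it's worth, the generated values above are self-consistent: every DefMemPerCPU is RealMemory // CPUs and every DefCpuPerGPU is CPUs // GPUs from the matching node line. A quick check (the table below just restates the GPU node values from the config):

```python
# Each tuple: (node prefix, CPUs, RealMemory in MB, GPUs,
#              DefMemPerCPU and DefCpuPerGPU as written in the config above)
rows = [
    ("nc16ast4v3", 16, 107008, 1, 6688, 16),
    ("nc24adsv4",  24, 214016, 1, 8917, 24),
    ("nc48adsv4",  48, 428032, 2, 8917, 24),
    ("nc96adsv4",  96, 856064, 4, 8917, 24),
    ("nd96asrv4",  96, 875520, 8, 9120, 12),
    ("nv36adsv5",  18, 428032, 1, 23779, 18),
    ("nv72adsv5",  36, 856064, 2, 23779, 18),
]

for name, cpus, mem_mb, gpus, mem_per_cpu, cpu_per_gpu in rows:
    assert mem_mb // cpus == mem_per_cpu, name
    assert cpus // gpus == cpu_per_gpu, name
print("all DefMemPerCPU / DefCpuPerGPU values consistent")
```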
[2]
# JOB PRIORITY
#PriorityFlags=
PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
PriorityFavorSmall=NO
PriorityMaxAge=14-0
#PriorityUsageResetPeriod=
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0
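For anyone reviewing the weights: under priority/multifactor, Slurm computes job priority as a weighted sum of normalized (0.0–1.0) factors, so with the weights above fair-share dominates by 10x. A rough sketch of how they combine (the factor values below are made up, and the TRES, association, site, and nice terms are omitted):

```python
# Weights from the priority.conf above; each factor is a 0.0-1.0 value
# Slurm computes internally (the example factor values are hypothetical).
weights = {
    "age": 1000,
    "fairshare": 10000,
    "job_size": 1000,
    "partition": 1000,
    "qos": 0,
}

def job_priority(factors):
    """Simplified multifactor sum: weight * normalized factor, summed."""
    return int(sum(weights[k] * factors.get(k, 0.0) for k in weights))

# A job old enough to max out the age factor (PriorityMaxAge=14-0), with a
# half-consumed fair share, small size, in the highest-priority partition:
print(job_priority({"age": 1.0, "fairshare": 0.5,
                    "job_size": 0.1, "partition": 1.0}))  # → 7100
```

With PriorityWeightQOS=0, the QOS factor contributes nothing, which matches the intent of differentiating jobs by partition Priority (the -HP partitions at Priority=5000) and fair share instead.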