GCS v9.0.2 #44
ernst-bablick announced in Announcements
Enhanced NVIDIA GPU Support with qgpu
The `qgpu` command has been added to simplify workload management for GPU resources and to let administrators manage them more efficiently. It is available for Linux amd64 and Linux arm64. `qgpu` is a multi-purpose command. It can act as a load sensor that reports the characteristics and metrics of NVIDIA GPU devices; for this it requires NVIDIA DCGM to be installed on the GPU nodes. It also works as a prolog and epilog for jobs to set up the NVIDIA runtime and environment variables. Further, it sets up per-job GPU accounting, so that GPU usage and power consumption are automatically reported in the accounting and are visible in the standard `qacct -j` output. It supports all NVIDIA GPUs that are supported by NVIDIA's DCGM, including NVIDIA's latest Grace Hopper superchips. For more information about `qgpu`, please refer to the Admin Guide.
(Available in Gridware Cluster Scheduler only)
Automatic Session Management
Patch 9.0.2 introduces the new concept of automatic sessions. Sessions allow the Gridware Cluster Scheduler system to synchronize internal data stores, so that client commands are guaranteed to get the most recent data. Session management is enabled by default but can be disabled by setting the `DISABLE_AUTOMATIC_SESSIONS` parameter to true in the `qmaster_params` of the cluster configuration.

The default for the `qmaster_params` setting `DISABLE_SECONDARY_DS_READER` is now also false. This means that the reader thread pool is enabled by default and does not need to be enabled manually as in patch 9.0.1.

The reader thread pool in combination with sessions ensures that commands that trigger changes within the cluster (write-requests), such as submitting a job, modifying a queue or changing a complex value, are executed and that the outcome of those commands is guaranteed to be visible to the user who initiated the change. Commands that only read data (read-requests), such as `qstat`, `qhost` or `qconf -s...`, and that are triggered by the same user always return the most recent data, although all read-requests in the system are executed completely in parallel to the other Gridware Cluster Scheduler core components. This additional synchronization ensures that the data is consistent for the user with each read-request, but on the other hand it might slow down individual read-requests.

Assume the following script:
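A minimal sketch of such a submit-then-query scenario is given here. It assumes a hypothetical job script passed as the first argument and relies on `qsub -terse` printing only the job ID; the function name `submit_and_query` is illustrative and not part of the product.

```shell
#!/bin/sh
# Sketch only: submit a job (write-request), then immediately query it
# (read-request) as the same user.
# "$1" is a hypothetical job script, e.g. job.sh.
submit_and_query() {
    # -terse makes qsub print just the job ID instead of a full message
    job_id=$(qsub -terse "$1") || return 1
    # Read-request issued right after the write-request
    qstat -j "$job_id"
}

# Usage (on a submit host): submit_and_query job.sh
```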
Without activated sessions it is not guaranteed that the `qstat -j` command will see the job that was submitted before. With sessions enabled, the `qstat -j` command will always see the job, but the command will be slightly slower compared to the same scenario without sessions.

Sessions eliminate the need to poll for information about an action until it is visible in the system. Unlike in other workload management systems, session management in Gridware Cluster Scheduler is automatic. There is no need to manually create or destroy sessions after they have been enabled globally.
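The two `qmaster_params` settings named above can be inspected from the command line. The following is a hedged sketch: the helper name `show_qmaster_params` is made up for illustration, and the `PARAM=value` notation in the comments is the usual form for `qmaster_params` entries, assumed here rather than quoted from the release notes.

```shell
#!/bin/sh
# Sketch only: print the qmaster_params line of the global cluster
# configuration so the session-related settings can be checked.
show_qmaster_params() {
    qconf -sconf global | grep '^qmaster_params'
}

# To disable automatic sessions, edit the global configuration with
#   qconf -mconf global
# and set, for example (assumed notation):
#   qmaster_params   DISABLE_AUTOMATIC_SESSIONS=true
```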
The `sge_qmaster` monitoring has been improved. Beginning with this patch, the output section for reader and worker threads shows the following numbers:
All three values show internal request queue lengths. Usually they are all 0, but in high-load situations or when sessions are enabled they can increase:
Increasing values are not critical as long as the numbers also decrease again. If the numbers increase continuously, the system is under high load and performance might be impacted.
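The rule of thumb above can be expressed as a small check over a series of sampled queue lengths. This is only an illustrative sketch: how the samples are collected from the monitoring output is left open, and the function name `is_continuously_increasing` is hypothetical.

```shell
#!/bin/sh
# Sketch only: returns 0 if every sample is larger than the one before it
# (continuous increase), and 1 otherwise or if fewer than two samples are given.
is_continuously_increasing() {
    [ "$#" -ge 2 ] || return 1
    prev=$1
    shift
    for cur in "$@"; do
        [ "$cur" -gt "$prev" ] || return 1
        prev=$cur
    done
    return 0
}

# Usage: is_continuously_increasing 0 3 7 12 && echo "queue keeps growing"
```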
(Available in Open Cluster Scheduler and Gridware Cluster Scheduler)
Departments, Users and Jobs - Department View
With the release of patch 9.0.2, we have removed the restriction that users can only be assigned to one department. Users can now be assigned to multiple departments. This is particularly useful in environments where users are members of multiple departments in a company and access to resources is based on department affiliation.
Jobs must still be assigned to a single department. This means that a user who is a member of multiple departments can submit jobs to any of those departments by specifying the department in the job submission command using the `-dept` switch. If a user does not specify a particular department, `sge_qmaster` assigns the job to the first department found.

Using `qstat` and `qhost`, the output can be filtered based on access lists and departments using the `-sdv` switch. When this switch is used, the following applies:
Please note that this may result in situations where users are no longer able to see their own jobs if the access permissions are changed for a user who has jobs running in the system.

Users having the manager role always see all hosts/queues and jobs independent of the use of the `-sdv` switch.

Please note that this specific functionality is still in beta phase. It is only available in Gridware Cluster Scheduler and the implementation will change with upcoming patch releases.
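As a rough illustration of the `-dept` and `-sdv` switches described in this section, the sketch below submits a job under a chosen department and then lists jobs and hosts in the department view. The department name and job script are hypothetical, the helper names are made up, and combining `-dept` with `-terse` is an assumption.

```shell
#!/bin/sh
# Sketch only: submit a job under a specific department.
# "$1" is a department name, "$2" a job script (both hypothetical).
submit_in_department() {
    dept=$1
    script=$2
    job_id=$(qsub -terse -dept "$dept" "$script") || return 1
    echo "submitted job $job_id to department $dept"
}

# Show jobs and hosts filtered by access lists and departments.
department_view() {
    qstat -sdv
    qhost -sdv
}

# Usage: submit_in_department research job.sh && department_view
```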
This discussion was created from the release GCS v9.0.2.