@jskswamy jskswamy commented Oct 3, 2025

Description

This PR fixes #62, where the Slurm operator ignores the controller.spec.service.spec.clusterIP configuration and so prevents the creation of the headless services required for StatefulSet pod DNS resolution.

Problem

  • Worker nodes (slurmd) failed to connect to the controller (slurmctld) with DNS resolution errors
  • The operator always created regular ClusterIP services, ignoring user-specified clusterIP: None
  • StatefulSet pod DNS patterns require headless services to function properly

Solution

  • Modified BuildControllerService() to check ServiceSpec.ClusterIP and set Headless: true when clusterIP: None
  • Added comprehensive test coverage for headless service functionality
  • Preserved existing behavior for regular ClusterIP services

Changes Made

/internal/builder/controller_service.go

 func (b *Builder) BuildControllerService(controller *slinkyv1alpha1.Controller) (*corev1.Service, error) {
  spec := controller.Spec.Service
  opts := ServiceOpts{
    Key:         controller.ServiceKey(),
    Metadata:    controller.Spec.Template.PodMetadata,
    ServiceSpec: controller.Spec.Service.ServiceSpecWrapper.ServiceSpec,
    Selector: labels.NewBuilder().
      WithControllerSelectorLabels(controller).
      Build(),
+   Headless:    controller.Spec.Service.ServiceSpecWrapper.ServiceSpec.ClusterIP == corev1.ClusterIPNone,
  }
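
The effect of the added line can be illustrated in isolation. The following is a minimal sketch, not the operator's actual code: `ServiceSpec` and `ServiceOpts` are trimmed-down, hypothetical stand-ins for `corev1.ServiceSpec` and the builder's option struct, and `renderSpec` is an assumed rendering step inferred only from what the PR's test asserts (ClusterIP=None and PublishNotReadyAddresses=true for headless services).

```go
package main

// ClusterIPNone mirrors the value of corev1.ClusterIPNone ("None").
const ClusterIPNone = "None"

// ServiceSpec is a hypothetical, trimmed-down stand-in for corev1.ServiceSpec.
type ServiceSpec struct {
	ClusterIP                string
	PublishNotReadyAddresses bool
}

// ServiceOpts is a hypothetical stand-in for the builder's option struct.
type ServiceOpts struct {
	ServiceSpec ServiceSpec
	Headless    bool
}

// optsFromSpec derives Headless from the user-supplied spec, which is the
// check this PR adds to BuildControllerService().
func optsFromSpec(spec ServiceSpec) ServiceOpts {
	return ServiceOpts{
		ServiceSpec: spec,
		Headless:    spec.ClusterIP == ClusterIPNone,
	}
}

// renderSpec is an assumption about how a builder might apply Headless.
// Non-headless specs pass through unchanged.
func renderSpec(opts ServiceOpts) ServiceSpec {
	out := opts.ServiceSpec
	if opts.Headless {
		out.ClusterIP = ClusterIPNone
		out.PublishNotReadyAddresses = true
	}
	return out
}
```

Under these assumptions, a spec with `ClusterIP: "None"` renders as a headless service, while a spec with a concrete cluster IP is left as the user wrote it.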

/internal/builder/controller_service_test.go

  • Added test case for headless service configuration
  • Added verification that ClusterIP=None and PublishNotReadyAddresses=true
  • Ensured backward compatibility with existing functionality

Testing

  • All existing tests pass
  • New test case verifies headless service creation when clusterIP: None is specified
  • Manual testing confirms DNS resolution works with generated headless services

Impact

  • Fixes DNS resolution for worker-to-controller communication
  • Enables StatefulSet pod DNS patterns (pod-name.service-name.namespace)
  • Respects user configuration as expected from Kubernetes operators
  • Backward compatible - no impact on existing ClusterIP service deployments
  • Required for Helm chart deployments that specify headless services
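
For reference, the StatefulSet pod DNS pattern mentioned above can be sketched as a small helper. This is an illustration only: `podFQDN` and `statefulSetPodName` are hypothetical names, and `cluster.local` is assumed as the cluster domain (it is the common default, but clusters can override it).

```go
package main

import "fmt"

// statefulSetPodName returns the pod name Kubernetes assigns to a
// StatefulSet replica: <statefulset-name>-<ordinal>.
func statefulSetPodName(statefulSet string, ordinal int) string {
	return fmt.Sprintf("%s-%d", statefulSet, ordinal)
}

// podFQDN builds the per-pod DNS name that a headless governing service
// makes resolvable: <pod>.<service>.<namespace>.svc.<cluster-domain>.
func podFQDN(pod, service, namespace, clusterDomain string) string {
	return fmt.Sprintf("%s.%s.%s.svc.%s", pod, service, namespace, clusterDomain)
}
```

With the names used in this PR, `podFQDN(statefulSetPodName("slurm-controller", 0), "slurm-controller", "slurm", "cluster.local")` yields `slurm-controller-0.slurm-controller.slurm.svc.cluster.local`. Without a headless service, no per-pod record of this form exists, which matches the `slurm_set_addr` failure reported by the workers.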

Helm Chart Deployment Issue

Installing the SlinkyProject Helm chart with headless service configuration fails due to this bug:

Configuration that was failing:

# Helm values.yaml
clusterName: slurm-test
controller:
  service:
    spec:
      clusterIP: None # Required for StatefulSet pod DNS

Symptoms observed:

# Worker pods failed to start
kubectl get pods -n slurm
# slurm-worker-slinky-0     1/2     CrashLoopBackOff

# DNS resolution errors
kubectl logs -n slurm slurm-worker-slinky-0 -c slurmd
# error: slurm_set_addr: Unable to resolve "slurm-controller-0.slurm-controller.slurm"

# Service was wrong type
kubectl get service slurm-controller -n slurm
# ClusterIP: 10.73.116.152 (should be None for headless)

After this fix:

# Worker pods start successfully
kubectl get pods -n slurm
# slurm-worker-slinky-0     2/2     Running

# DNS resolution works
kubectl exec -n slurm slurm-worker-slinky-0 -c slurmd -- \
  getent hosts slurm-controller-0.slurm-controller.slurm
# 10.72.1.133     slurm-controller-0.slurm-controller.slurm.svc.cluster.local

# Slurm cluster operational
kubectl exec -n slurm slurm-controller-0 -c slurmctld -- sinfo
# PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
# slinky       up   infinite      2   idle slinky-[0-1]

This commit addresses an issue where the Slurm operator was
ignoring the `clusterIP` configuration, preventing the creation of
headless services necessary for StatefulSet pod DNS resolution.

The `BuildControllerService()` function now checks the
`ServiceSpec.ClusterIP` and sets `Headless` to true
when `clusterIP: None` is specified.

- Added a test case to verify headless service functionality
- Ensured backward compatibility with existing ClusterIP services

Signed-off-by: Krishnaswamy Subramanian <[email protected]>

@SkylerMalinowski SkylerMalinowski left a comment

Thanks for the PR!

The other Slurm cluster components, which allow a service configuration, suffer from this same bug. If you would, please extend the fix to all Slurm component services. Thanks.

Comment on lines +94 to +103

// Test headless service configuration
if tt.name == "headless service" {
	if got.Spec.ClusterIP != corev1.ClusterIPNone {
		t.Errorf("Expected headless service (ClusterIP=None), got ClusterIP=%v", got.Spec.ClusterIP)
	}
	if !got.Spec.PublishNotReadyAddresses {
		t.Errorf("Expected PublishNotReadyAddresses=true for headless service, got %v", got.Spec.PublishNotReadyAddresses)
	}
}

This is a fragile methodology to express the test.
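
One possible reading of this concern is that branching on `tt.name` ties the assertions to a string label instead of to the test table itself. A less brittle shape, sketched here under that assumption, is to carry the expected values in each table entry and compare every case against them. The `spec` and `testCase` types and the `mismatches` helper are hypothetical and use plain Go rather than the repository's actual test fixtures; in a real test, the returned messages would be passed to `t.Errorf`.

```go
package main

import "fmt"

// spec is a hypothetical stand-in for the fields of corev1.ServiceSpec
// that the headless check cares about.
type spec struct {
	ClusterIP                string
	PublishNotReadyAddresses bool
}

// testCase carries its own expectations, so no assertion depends on the
// case name.
type testCase struct {
	name  string
	input spec
	want  spec
}

// mismatches compares a built spec against the case's expectations and
// returns one message per differing field.
func mismatches(tc testCase, got spec) []string {
	var errs []string
	if got.ClusterIP != tc.want.ClusterIP {
		errs = append(errs, fmt.Sprintf("%s: ClusterIP = %q, want %q",
			tc.name, got.ClusterIP, tc.want.ClusterIP))
	}
	if got.PublishNotReadyAddresses != tc.want.PublishNotReadyAddresses {
		errs = append(errs, fmt.Sprintf("%s: PublishNotReadyAddresses = %v, want %v",
			tc.name, got.PublishNotReadyAddresses, tc.want.PublishNotReadyAddresses))
	}
	return errs
}
```

Renaming a case then has no effect on what is verified, and regular ClusterIP cases are checked with the same code path as the headless one.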

@SkylerMalinowski

The other Slurm cluster components, which allow a service configuration, suffer from this same bug.

Perhaps "bug" is not the right word. Headless services are strictly defined by clusterIP=None. Using publishNotReadyAddresses=true with clusterIP=None is typical but not required for headless services.

You should be able to do the following:

controller:
  service:
    spec:
      clusterIP: None
      publishNotReadyAddresses: true

Once a service is created, Kubernetes does not seem to allow the spec to be altered, but it does not throw an error either. That is an entirely different issue to handle.

Development

Successfully merging this pull request may close these issues.

[Bug]: Slurm Operator Does Not Respect Controller Service Configuration - Breaking StatefulSet Pod DNS Resolution
