Is your feature request related to a problem? Please describe.
The node-local load balancing (NLLB) Envoy Pod is a static pod generated entirely in k0s code (makePodManifest() in pkg/component/worker/nllb/envoy.go) and served to the kubelet over the internal static-pod HTTP server (staticPodURL). Its EnvoyProxy config (pkg/apis/k0s/v1beta1/nllb.go) currently only exposes image, imagePullPolicy, apiServerBindPort, and konnectivityServerBindPort.
Because of this limitation, two operationally critical Pod settings cannot be configured:
- No priorityClassName: The Envoy Pod runs at priority 0. Since NLLB is the node's load-balanced path to the control plane (kube-apiserver and konnectivity), its failure causes the node to lose its API path and become unstable. Under node-pressure, the kubelet's eviction manager prioritizes killing low-priority pods first, making this critical infrastructure pod a primary candidate for termination exactly when it is most needed.
- No terminationGracePeriodSeconds: Envoy is locked to the default 30-second grace period, preventing clean draining of in-flight connections during shutdowns or restarts.
These fields are immutable once a Pod is created, and the manifest is regenerated by k0s on every restart, rendering manual edits or external workarounds like admission webhooks ineffective.
Concrete failure: graceful node shutdown breaks on NLLB nodes
This configuration gap has direct negative impacts on reliability. When graceful node shutdown is enabled via a worker profile, the kubelet's shutdown manager separates pods into phases based on priority.
spec:
  workerProfiles:
    - name: default
      values:
        shutdownGracePeriod: 30s
        shutdownGracePeriodCriticalPods: 15s
Pods at or above the system-cluster-critical threshold are terminated last, while all others are terminated first. Because the NLLB Envoy pod has a priority of 0, it is killed immediately in the first phase:
nodeshutdown_manager.go:153 "Shutdown manager killing pod with gracePeriod" pod="kube-system/nllb-<node>" gracePeriod=15
Since this pod serves as the worker's path to the API server, killing it early severs the connection for all other pods still attempting to drain. These remaining pods are left in inconsistent states (Running, Pending, or Unknown) as they can no longer report status to the API server. Consequently, graceful node shutdown is effectively broken on any node using NLLB.
Assigning the Envoy pod system-node-critical ensures it is shut down last, maintaining the API path while other workloads finish draining.
Describe the solution you would like
Expose two new optional fields on spec.network.nodeLocalLoadBalancing.envoyProxy:
- priorityClassName: Assign a PriorityClass (e.g., system-node-critical) to protect the Envoy Pod from node-pressure eviction and ensure it is shut down last during graceful node shutdown events.
- terminationGracePeriodSeconds: Allow overriding the default termination grace period for cleaner connection draining.
These fields should be plumbed through envoyPodParams into makePodManifest() and set on the Pod spec. For strict backward compatibility, both could default to unset.
Recommended default: Preferably, priorityClassName would default to system-node-critical rather than unset. This is a reliability fix rather than just an optimization, because it addresses the malfunction of graceful node shutdown on NLLB nodes.
Describe alternatives you've considered
- Mutating HTTP proxy for staticPodURL: Highly fragile due to random upstream ports, tmpfs-based kubelet configs, and startup races.
- Restart=always systemd unit: Does not address the kill-ordering problem during shutdown and cannot durably set priority on a regenerated manifest.
- Mutating admission webhooks: Ineffective as they only see the mirror pod, not the actual pod run by the kubelet.
- Self-hosting the Envoy static pod: Requires manual reimplementation of k0s's automated Envoy configuration generation.
Additional context
We do not always have the luxury of draining nodes or performing maintenance under ideal conditions, as our operating environments can be subject to unexpected power loss and abrupt node failures. This change improves resiliency by ensuring critical components are available wherever workloads are scheduled, allowing the system to recover more reliably from real-world disruptions.
@Josh-Tracy