*** WORK IN PROGRESS DO NOT USE **
WARNING: This tool is not officially supported by Microsoft, Azure or AKS applying these changes can result in broken clusters, nodes or otherwise remove the official support for your AKS cluster.
aks-turbobutton
applies fixes and configuration settings to resolve many
common AKS issues. Many of the issue are related to one another, sharing
a common set of basic root causes.
Issues aks-turbobutton
resolves or partially mitigates:
- Cluster nodes going NotReady (intermittent/under load, periodic tasks)
- Performance and stability issues when using istio or complex operator configurations.
- Networking Errors (intermittent/under load) and Latency (intermittent) (Pod, container, networking, latency inbound or outbound) including high latency when reaching other azure services from worker nodes.
- API server timeouts, disconnects, tunnelfront and kube-proxy failures under load.
- Connection timed out accessing the Kubernetes API server
- Slow pod, docker container, job execution
- Slow DNS queries / core-dns latency spikes
- "GenericPLEG" / Docker PLEG errors on worker nodes
- RPC Context deadline exceeded in kubelet/docker logs
- Slow PVC attach/detach time at container start or under load / fail over
- Unable to schedule workloads to expected VM limits
- Failed/timed out
helm install
commands - Slow docker image pulls / frequent ImagePullBackoff
- Watches and re-applies needed annotations or labels to enable kube-proxy, CoreDNS metrics reporting.
values.yaml
and turbobutton.yaml
:
In order to facilitate user re-creation and identification of these issues as well as the root-cause of the issues a trimmed-down deployment of the [prometheus-operator][prop] is included.
The deployment of the operator is modified to:
- Enable eBPF disk IO latency metrics collection.
- Remove extra options and configuration from the operator
- Includes the Cluster and Node USE (Utilization and Saturation) grafana dashboards
- Adds in the needed re-labelling of metrics for AKS
- Forces metrics persistence
- Installs Brendan Gregg's IO flamegraph tools
Pre reqs:
- Bash compatible shell
- Installed & configured
azure cli
- git
- kubectl configured with context set to target cluster
- Priviliged containers enabled on the target cluster
git clone [email protected]:jnoller/aks-turbobutton.git
./autotune
aks-turbobutton
is a set of (hopefully) clear and concise bash scripts as well
as a kubernetes daemonset to deploy and manage those scripts and the metrics
collection and reporting needed for issue identification.
Simply; aks-turbobutton
applies a set of Linux best practices to the AKS
worker nodes as well as any other supporting tools.
- The Docker (moby) and kubelet working directories are moved to the node's
ephemeral tempfs disk located in
/mnt
. - Disables (sort of) the Linux OOMKiller
vm.panic_on_oom=1
vm.swappiness=1
kernel.panic=5
- Sets common Linux tuning values for higher IO performance including changing
the Linux IO scheduler on worker nodes:
- scheduler: "mq-deadline"
- read_ahead_kb: "4096"
- max_sectors_kb: "128"
- queue_depth: "64"
- transparent_hugepage: "always"
- Node Problem Detector installation (stand alone mode)
- Auto-disable the Azure disk write cache on all drives
- Disable or re-home OMS Agent process IO