This file is structured like a FAQ, with topics around managing and troubleshooting inference containers running in Azure. The intended audience is a user relatively new to Kubernetes who wants a quick reference list of useful commands for various purposes.
- Inference Script Logs
- Pods
- Namespaces
- Agent Pools
- Cores
- Memory
- Disk Pressure (Ephemeral storage)
- Billing
To see the inference script log, the simplest method is to use the Orcanode Monitor dashboard:
- Go to orcanodemonitor.azurewebsites.net
- Click on the cell in the "OrcaHello Lag" column for the hydrophone of interest
Alternatively, you can see the log using commands on your own machine:
- Get the pod name from the namespace. For example, with NAMESPACE=andrews-bay:
> kubectl get pods -n $NAMESPACE
NAME READY STATUS RESTARTS AGE
inference-system-859f4f4dc7-85wpf 1/1 Running 0 172m
- Get the last (say) 30 lines of the log for that pod using the pod name:
kubectl logs -n $NAMESPACE inference-system-859f4f4dc7-85wpf --tail=30
To find all the inference pods across namespaces, on Windows:
kubectl get pods --all-namespaces | findstr infer
Linux:
kubectl get pods --all-namespaces | grep infer
To redeploy and restart an inference container, use the following commands, replacing $NAMESPACE with the appropriate namespace (e.g., andrews-bay):
kubectl apply -f deploy/$NAMESPACE.yaml
kubectl rollout restart deployment inference-system -n $NAMESPACE
To check the environment inside a pod (processor, Python, and Torch versions):
- Get the pod name from the namespace. For example, with NAMESPACE=andrews-bay:
> kubectl get pods -n $NAMESPACE
NAME READY STATUS RESTARTS AGE
inference-system-859f4f4dc7-85wpf 1/1 Running 0 172m
- Get an interactive shell in that pod:
kubectl exec -it inference-system-859f4f4dc7-85wpf -n $NAMESPACE -- /bin/bash
Processor:
lscpu
Python:
python3 --version
Torch:
python3 -c "import torch; print(torch.__config__.show())"
To see the CPU limit and request of each pod:
kubectl get pods --all-namespaces -o custom-columns=NAME:.metadata.name,CPU_LIMIT:.spec.containers[*].resources.limits.cpu,CPU_REQUEST:.spec.containers[*].resources.requests.cpu
To see actual CPU usage per pod:
kubectl top pod --all-namespaces
The CPU(cores) column is in millicores, so 929m means 92.9% utilization for a pod with a 1-core limit. Example:
> kubectl top pod --all-namespaces
NAMESPACE NAME CPU(cores) MEMORY(bytes)
andrews-bay benchmark-pod-7c445dfc74-ttqpl 0m 138Mi
andrews-bay inference-system-7d679f4ccb-7jqsw 929m 1488Mi
bush-point inference-system-d8d67c775-4gp2g 983m 2029Mi
kube-system ama-logs-glqjn 50m 241Mi
kube-system ama-logs-l4chf 64m 276Mi
kube-system ama-logs-m5gwk 58m 258Mi
kube-system ama-logs-rs-7bfcc69558-4vb2b 42m 207Mi
kube-system azure-ip-masq-agent-c9xbp 1m 20Mi
kube-system azure-ip-masq-agent-f7zwq 1m 17Mi
kube-system azure-ip-masq-agent-rl4p5 1m 19Mi
kube-system cloud-node-manager-jlr5h 1m 20Mi
kube-system cloud-node-manager-nv7df 1m 17Mi
kube-system cloud-node-manager-qklxn 1m 20Mi
kube-system coredns-6865d647c6-cdkmg 6m 34Mi
kube-system coredns-6865d647c6-gl29g 6m 35Mi
kube-system coredns-autoscaler-67d9d668db-nps46 1m 15Mi
kube-system csi-azuredisk-node-2znl5 1m 62Mi
kube-system csi-azuredisk-node-84682 1m 74Mi
kube-system csi-azuredisk-node-md4bw 1m 66Mi
kube-system csi-azurefile-node-8dhbr 2m 69Mi
kube-system csi-azurefile-node-cw8js 2m 69Mi
kube-system csi-azurefile-node-dc4l9 2m 80Mi
kube-system konnectivity-agent-9d6647f89-f9jdw 4m 26Mi
kube-system konnectivity-agent-9d6647f89-kprrh 8m 27Mi
kube-system konnectivity-agent-autoscaler-6ff7779788-vpqhj 2m 18Mi
kube-system kube-proxy-2x75x 2m 45Mi
kube-system kube-proxy-kjhf6 2m 44Mi
kube-system kube-proxy-ksjqv 3m 55Mi
kube-system metrics-server-5554f5bfbd-4p7h4 10m 51Mi
kube-system metrics-server-5554f5bfbd-9cfwl 7m 50Mi
mast-center inference-system-6cc784cb6c-n2mwv 984m 2081Mi
orcasound-lab inference-system-76c476fdc7-zrs6z 820m 1662Mi
point-robinson inference-system-6487fb8c59-466sz 994m 1873Mi
port-townsend inference-system-6f84d95d79-mf2t6 804m 2000Mi
sunset-bay inference-system-66bb79b8c7-x7vdw 826m 2039Mi
In the example above, the point-robinson, mast-center, and andrews-bay inference pods are all pegged (at or near their 1-core limit).
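Rather than eyeballing the table, the check can be scripted. The sketch below flags any pod at or above 900 millicores, assuming (as in the examples here) a 1-core limit; the sample lines are copied from the output above, and in practice you would pipe in `kubectl top pod --all-namespaces` instead:

```shell
# Sample 'kubectl top pod' lines taken from the example output above.
TOP='andrews-bay inference-system-7d679f4ccb-7jqsw 929m 1488Mi
bush-point inference-system-d8d67c775-4gp2g 983m 2029Mi
kube-system coredns-6865d647c6-cdkmg 6m 34Mi
point-robinson inference-system-6487fb8c59-466sz 994m 1873Mi'
# Strip the trailing "m" and flag anything >= 900 millicores (90% of 1 core).
PEGGED=$(echo "$TOP" | awk '{ m = $3; sub(/m$/, "", m); if (m + 0 >= 900) print $1, m / 10 "%" }')
echo "$PEGGED"
```

Note that by this threshold bush-point (983m) also qualifies, even though at a glance it may not stand out in the table.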
To see just one namespace and verify the config:
kubectl get pods -n $NAMESPACE -o custom-columns=NAME:.metadata.name,CPU_LIMIT:.spec.containers[*].resources.limits.cpu,CPU_REQUEST:.spec.containers[*].resources.requests.cpu
kubectl top pod -n $NAMESPACE
Example using NAMESPACE=andrews-bay:
> kubectl get pods -n andrews-bay -o custom-columns=NAME:.metadata.name,CPU_LIMIT:.spec.containers[*].resources.limits.cpu,CPU_REQUEST:.spec.containers[*].resources.requests.cpu
NAME CPU_LIMIT CPU_REQUEST
benchmark-pod-7c445dfc74-ttqpl 1 1
inference-system-7d679f4ccb-7jqsw 1 1
> kubectl top pod -n andrews-bay
NAME CPU(cores) MEMORY(bytes)
benchmark-pod-7c445dfc74-ttqpl 0m 138Mi
inference-system-7d679f4ccb-7jqsw 994m 1401Mi
In the above example, the CPU is pegged: 994m against the 1-core limit is 99.4% utilization.
To update the hydrophone configmap, using (say) north-sjc as the namespace:
View the current config:
kubectl describe configmap hydrophone-configs -n north-sjc
Apply the updated config and verify it:
kubectl apply -f deploy/north-sjc-configmap.yaml
kubectl describe configmap hydrophone-configs -n north-sjc
Restart the deployment so it picks up the change, and watch the rollout:
kubectl rollout restart deployment inference-system -n north-sjc
kubectl rollout status deployment inference-system -n north-sjc
To stop the inference pods in a namespace, use the following commands, replacing $NAMESPACE with the namespace:
kubectl scale deployment inference-system -n $NAMESPACE --replicas=0
Or, remove the existing pods (including errored ones, etc.) and let a new one load:
kubectl delete pod -n $NAMESPACE -l app=inference-system
To see the status of each inference pod and which node it is scheduled on:
kubectl get pod -o wide --all-namespaces -l app=inference-system
To change an agent pool's existing max-count to 4:
az aks nodepool update --resource-group LiveSRKWNotificationSystem --cluster-name inference-system-AKS --name $POOLNAME --min-count 1 --max-count 4 --update-cluster-autoscaler
az aks nodepool show --resource-group LiveSRKWNotificationSystem --cluster-name inference-system-AKS --name $POOLNAME --query "{autoscaler:enableAutoScaling, min:minCount, max:maxCount}"
Or, to enable auto-scaling:
az aks nodepool update --resource-group LiveSRKWNotificationSystem --cluster-name inference-system-AKS --name $POOLNAME --min-count 1 --max-count 2 --enable-cluster-autoscaler
To add a new agent pool:
az aks nodepool add --resource-group LiveSRKWNotificationSystem --cluster-name inference-system-AKS --name $POOLNAME --node-vm-size Standard_F4s_v2 --node-count 1 --mode User
To list the agent pools:
az aks nodepool list --resource-group LiveSRKWNotificationSystem --cluster-name inference-system-AKS -o table
To change the node count later:
az aks nodepool scale --resource-group LiveSRKWNotificationSystem --cluster-name inference-system-AKS --name $POOLNAME --node-count 3
To inspect a node's hardware, start a debug pod on the node:
kubectl debug node/aks-agentpool-41025176-vmss00001q -it --image=ubuntu
Then, inside the debug pod:
lscpu
cat /proc/cpuinfo | grep "model name"
free -h
numactl --hardware
To clean up the debug pods later:
kubectl delete pod -n default --field-selector status.phase=Succeeded
kubectl delete pod -n default --field-selector status.phase=Failed
To see which threads in a pod are using CPU, first find the pod:
kubectl get pods --all-namespaces | findstr infer
kubectl exec -it inference-system-547c9699-rcxzs -n andrews-bay -- /bin/bash
ps -L -p 1 -o pid,tid,psr,%cpu,comm --sort=-%cpu
To see per-CPU utilization instead, run top inside the pod:
kubectl exec -it inference-system-547c9699-rcxzs -n andrews-bay -- /bin/bash
top
Press 1 to toggle the per-CPU display, which shows output like:
%Cpu0 : 11.7 us, 5.2 sy, 0.0 ni, 81.4 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st
%Cpu1 : 48.5 us, 2.2 sy, 0.0 ni, 47.0 id, 0.0 wa, 0.0 hi, 2.2 si, 0.0 st
%Cpu2 : 77.6 us, 3.3 sy, 0.0 ni, 18.3 id, 0.0 wa, 0.0 hi, 0.8 si, 0.0 st
%Cpu3 : 61.5 us, 2.7 sy, 0.0 ni, 33.9 id, 0.0 wa, 0.0 hi, 1.9 si, 0.0 st
Utilization is 100 minus the id(le) value: 81.4% idle means 18.6% busy, 47.0% idle means 53.0% busy, etc.
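The 100-minus-idle arithmetic can also be done with a quick awk one-liner. This is a sketch using two of the sample lines above; in a live shell you could feed it `top -bn1 | grep '^%Cpu'` instead:

```shell
# Sample per-CPU lines from the 'top' output above.
CPUS='%Cpu0 : 11.7 us, 5.2 sy, 0.0 ni, 81.4 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st
%Cpu1 : 48.5 us, 2.2 sy, 0.0 ni, 47.0 id, 0.0 wa, 0.0 hi, 2.2 si, 0.0 st'
# Find the field just before the "id," token and subtract it from 100.
BUSY=$(echo "$CPUS" | awk '{ for (i = 1; i <= NF; i++) if ($i == "id,") printf "%s %.1f%% busy\n", $1, 100 - $(i-1) }')
echo "$BUSY"
```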
To see the current CPU and memory use of each node:
kubectl top node
To see what pods are currently running vs pending:
kubectl get pod -o wide --all-namespaces | findstr inference-system
If one is Pending:
kubectl describe pod inference-system-59d785488-45p7n -n andrews-bay
How do I see memory request of each running pod?
kubectl get pods --all-namespaces -o custom-columns=NAME:.metadata.name,MEM_REQUEST:.spec.containers[*].resources.requests.memory,MEM_LIMIT:.spec.containers[*].resources.limits.memory | findstr infer
Example:
inference-system-59d785488-45p7n 1700Mi 1700Mi
inference-system-5d45857b77-n6zfm 3Gi 3Gi
inference-system-557bcf76dd-5rvqb 1700Mi 1700Mi
inference-system-6d6f8bd78c-hmtz5 1900Mi 1900Mi
inference-system-557bcf76dd-f2xhb 1700Mi 1700Mi
inference-system-5d45857b77-nxg4l 3Gi 3Gi
inference-system-589ddb4546-5bbc6 3G 3G
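Note the mixed units in the example above: Mi and Gi are binary (mebibytes/gibibytes), while the bare G on the last pod is decimal gigabytes, which is slightly smaller than Gi. A sketch of a helper that normalizes all three to MiB so the pods can be compared directly:

```shell
# Convert a Kubernetes memory quantity (Mi, Gi, or decimal G) to MiB.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;                 # binary gigabytes
    *Mi) echo "${1%Mi}" ;;                             # already MiB
    *G)  echo $(( ${1%G} * 1000000000 / 1048576 )) ;;  # decimal gigabytes
  esac
}
to_mib 1700Mi   # 1700
to_mib 3Gi      # 3072
to_mib 3G       # 2861 -- note: 3G (decimal) is less than 3Gi (binary)
```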
To see memory usage per pod:
kubectl top pod --all-namespaces
To see why a node is utilized:
kubectl describe node aks-f4sv2pool-22767839-vmss000000
To change the limits, edit the deploy/andrews-bay.yaml file, then:
kubectl scale deployment inference-system -n andrews-bay --replicas=0
kubectl apply -f deploy/andrews-bay.yaml
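The deploy YAML itself is not shown in this doc. As a rough sketch, the stanza to edit is presumably the standard Kubernetes resources block below; the values are illustrative (taken from the limits shown elsewhere in this doc), so check the actual file:

```yaml
# Hypothetical excerpt of deploy/andrews-bay.yaml -- the real file may differ.
    resources:
      requests:
        cpu: "1"
        memory: "1700Mi"
      limits:
        cpu: "1"
        memory: "1700Mi"
```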
kubectl describe node aks-f4sv2pool-22767839-vmss000001 | findstr memory
To see recent cluster events, sorted by time:
kubectl get events -A --sort-by=.metadata.creationTimestamp
kubectl top pod --all-namespaces --sort-by=memory
To check a node's ephemeral storage:
kubectl describe node aks-f4sv2pool-22767839-vmss000000 | findstr /i ephemeral-storage
ephemeral-storage: 129886128Ki <- capacity
ephemeral-storage: 119703055367 <- allocatable (reported in bytes)
ephemeral-storage 0 (0%) 0 (0%) <- requested/limit totals, not actual usage
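The first two lines use different units (KiB vs plain bytes), which makes them hard to compare at a glance. Converting both to GiB with shell arithmetic:

```shell
# 129886128Ki -> GiB (divide KiB by 1024*1024)
echo $(( 129886128 / 1048576 ))        # about 123 GiB
# 119703055367 bytes -> GiB (divide bytes by 1024*1024*1024)
echo $(( 119703055367 / 1073741824 ))  # about 111 GiB
```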
To check whether the node is reporting pressure conditions:
kubectl describe node aks-f4sv2pool-22767839-vmss000000 | findstr Pressure
To see what is using the space, start a debug pod on the node:
kubectl debug node/aks-f4sv2pool-22767839-vmss000000 -it --image=ubuntu
chroot /host
du -sh /var/lib/kubelet/pods/* | sort -h
du -sh /var/log/pods/* | sort -h
For example, a large container log might look like:
# ls -l
total 7552
-rw-r----- 1 root root 7725673 Dec 12 03:24 0.log
# pwd
/var/log/pods/mast-center_inference-system-5bb97c5889-h22lf_dcb5c816-e25c-47a9-9897-eb4324e3dd36/inference-system
To check containerd's disk usage, start a debug pod with host access:
kubectl debug node/aks-f4sv2pool-22767839-vmss000000 -it --image=ubuntu --profile=general
du -sh /host/var/lib/containerd/* | sort -h
This might show lines like this:
19G /host/var/lib/containerd/io.containerd.content.v1.content
46G /host/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
Also check:
du -sh /host/var/lib/kubelet/pods/* | sort -h
du -sh /host/var/log/* | sort -h
Then prune unused container images from the host:
chroot /host
crictl rmi --prune
To recycle the node:
kubectl cordon aks-f4sv2pool-22767839-vmss000000
kubectl drain aks-f4sv2pool-22767839-vmss000000 --ignore-daemonsets --delete-emptydir-data --force
az aks nodepool show --resource-group LiveSRKWNotificationSystem --cluster-name inference-system-AKS --name f4sv2pool --query count -o tsv
to get the current node count <n>. For example, if <n> is 2, scale down and back up:
az aks nodepool scale --resource-group LiveSRKWNotificationSystem --cluster-name inference-system-AKS --name f4sv2pool --node-count 1
az aks nodepool scale --resource-group LiveSRKWNotificationSystem --cluster-name inference-system-AKS --name f4sv2pool --node-count 2
To see billing information:
- Log into portal.azure.com
- Click "Subscriptions" and then "Microsoft Azure Sponsorship 2"
- Expand "Cost Management" in the left bar and select "Cost analysis"