Debugging on Kubernetes

Even with the best designs, patterns, and processes in place, errors tend to show up and crash the party. If you are coming from another field or are just starting out with Kubernetes, it will take time and effort for the shift in mindset to happen.

Kubernetes is all about containers and distribution, so the first step is to get familiar with containers and distributed systems. Containers imply Linux and the terminal, so get friendly with those as well - particularly with networking and storage on Linux.

The art of debugging here is a layered problem. Why? Simple: the abstraction of the infrastructure. Kubernetes can run in the cloud or on premises; either way, it typically runs on virtual machines, which in turn run on bare metal. Bare metal implements networking, virtual machines implement networking, and Kubernetes implements networking. Many moving components increase the chance of failure.

If you take a systematic approach to debugging, life gets easier. But don't be too optimistic: hair-pulling issues will still pop up from time to time.

The types of issues on Kubernetes can be broadly categorized as:

  • Network issues
  • Storage issues
  • Compute and memory resource issues
  • Kubernetes object misconfiguration
  • Application misconfiguration
  • Faulty application (Dead end)

This broad categorization will make our job easier since we know which category we need to focus on. The first task: put the problem in one of these categories.

Usual suspects:

  • Is it a network issue?
  • Is it a storage issue?
  • Is it a CPU/Memory issue?
  • Is everything configured correctly in Kubernetes?
  • Is it a software configuration issue?
  • Does the software support this scenario?

Ephemeral containers

As of Kubernetes 1.25, ephemeral containers are stable. They allow you to spawn an additional container in an existing pod for debugging. This ephemeral container can use any image that provides the needed debugging tools - without modifying the existing workload container.

kubectl debug -it -c debugger --target=workload --image=busybox ${POD}

The new debugger container will also see the processes of the workload container, and it is possible to list the workload container's files from the debugger via the /proc directory. Note: --target=workload enables process namespace sharing with the workload container.

ls /proc/${PROCESS_ID}/root/usr/bin

If the Kubernetes version is lower than 1.25, then you need to provide the needed tools inside the existing container in the pod for debugging purposes.

Another way around this is to spawn a new pod for debugging, or to attach a new container to the existing pod by modifying the object itself.
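
For pre-1.25 clusters, attaching a permanent debug container can be sketched directly in the workload manifest. The snippet below is a hypothetical Deployment fragment; all names and images are placeholders:

```yaml
# Hypothetical fragment: a "debugger" sidecar kept alive with sleep so you
# can kubectl exec into it next to the workload container.
spec:
  template:
    spec:
      shareProcessNamespace: true     # optional: containers see each other's processes
      containers:
        - name: workload
          image: myorg/workload:1.0   # placeholder image
        - name: debugger
          image: busybox:1.36
          command: ["sleep", "infinity"]
```

Alternatively, kubectl debug ${POD} --copy-to=<name> spawns a debuggable copy of the pod and leaves the original untouched.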

Network issues

As the infrastructure is layered, so is networking. The first thing to check is whether the pod can reach the hostname/IP address. This is where tools like nslookup, ping, telnet, traceroute, and curl/wget come in.

$ kubectl debug -it -c debugger --target=workload --image=leodotcloud/swiss-army-knife ${POD}

$$ nslookup fakeendpoint.local.svc
$$ ping fakeendpoint.local.svc
$$ curl fakeendpoint.local.svc

First, check whether the DNS resolver knows about fakeendpoint.local.svc. If resolution works correctly, use ping to check if the endpoint is reachable.

Ping may not work on some public cloud providers, where ICMP is disabled by default.

On the transport layer, check whether a TCP connection can be established using telnet.

$$ telnet fakeendpoint.local.svc 80

Tip: Telnet can check whether a TCP connection can be established to any type of service, since it works on L4 rather than the L7 application level - for example Kafka endpoints, RabbitMQ, databases, etc.
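
If telnet is missing from the image, a similar L4 check can be sketched with bash's built-in /dev/tcp redirection. The host and port below are placeholders:

```shell
# Fallback TCP reachability check using bash's /dev/tcp, useful in minimal
# images without telnet. Host and port are placeholders for your endpoint.
host="fakeendpoint.local.svc"
port=80
if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
  echo "TCP connection to ${host}:${port} established"
else
  echo "cannot establish TCP connection to ${host}:${port}"
fi
```

The exit status of the bash -c probe tells you whether the three-way handshake succeeded, just like telnet would.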

Finally, check whether the application responds correctly on the application layer. For a web server or API endpoint, use tools like curl; otherwise, find a suitable client for the type of service, e.g. the MySQL client.

If any of the commands above fail, here is what to check:

  • DNS resolution not working - check that the DNS server is configured correctly and that the records exist.
  • Ping not working - check whether the provider supports ping; if yes, follow the instructions after this list.
  • Telnet not working - follow the instructions after this list.
  • curl (or another client) not working - check the endpoint service if possible; if not, contact the owner of the service. Otherwise, see the chapter Kubernetes object misconfiguration.

How you resolve a failing TCP connection or ping to the hostname/IP depends on where the endpoint is located.

  • Check whether a network policy is present on the cluster and whether it allows connecting to the specified IP.
  • Check whether firewalls at the egress point allow traffic to leave the network where the cluster is located (outside of Kubernetes).
  • Check whether node-to-node communication is possible (multi-node cluster).
  • Check whether inbound connections to the cluster are allowed.
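
As a sketch of the first point, this hypothetical NetworkPolicy allows egress from "workload" pods to one external IP on port 443; the names and CIDR are placeholders:

```yaml
# Hypothetical NetworkPolicy: without a matching egress rule, pods selected
# by a policy of type Egress cannot reach the endpoint at all.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-to-endpoint
spec:
  podSelector:
    matchLabels:
      app: workload
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 203.0.113.10/32   # placeholder endpoint IP
      ports:
        - protocol: TCP
          port: 443
```

Remember that NetworkPolicy is enforced only if the cluster's CNI plugin supports it.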

Storage issues

Kubernetes handles persistent storage using volumes; you can read more in the blog post I wrote: Kubernetes storage and CSI drivers.

Incorrect mount points

Check if mount points are correct in the pod by using mount.

Mixing subPath and mountPath

Problems with persistent storage are often incorrect mount points caused by mixing up subPath and mountPath. subPath references a path inside the volume itself, while mountPath refers to the mount path inside the container.

    - name: test
      image: busybox:1.28
      volumeMounts:
        - name: config-vol
          mountPath: /etc/config
          subPath: config

Sharing PVC among many pods

If you want to share a PVC among many pods, check whether the underlying storage supports shared access - i.e. the ReadWriteMany access mode.
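
A hedged sketch of such a claim; the name and storageClassName are placeholders, and the class must be backed by a driver that supports ReadWriteMany (e.g. NFS or Azure Files):

```yaml
# Hypothetical PVC requesting ReadWriteMany, which is required when several
# pods on different nodes mount the same volume simultaneously.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile   # placeholder; must support RWX
  resources:
    requests:
      storage: 5Gi
```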

Filling ephemeral local storage

Apart from persistent storage, Kubernetes by default uses local ephemeral storage for the writable layer of the container filesystem. If this storage fills up, the pod will be evicted or crash.

Scenario: the application is misconfigured and uses a lot of storage. Instead of a persistent volume, it now writes its files to local ephemeral storage. This can easily fill the space and take the pod down.
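
You can cap this with the ephemeral-storage resource; the values below are illustrative:

```yaml
# Hypothetical container fragment capping local ephemeral storage. When the
# limit is exceeded, the kubelet evicts the pod instead of filling the node disk.
resources:
  requests:
    ephemeral-storage: "500Mi"
  limits:
    ephemeral-storage: "1Gi"
```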

How to debug?

The Linux df command is your friend here.

root@workload-7457457d8b-98vd9:/# df
Filesystem     1K-blocks     Used Available Use% Mounted on
overlay        129886128 22619760 107249984  18% /
tmpfs              65536        0     65536   0% /dev
tmpfs            3558484        0   3558484   0% /sys/fs/cgroup
tmpfs             409600        4    409596   1% /tmp/ca
/dev/sdb1      129886128 22619760 107249984  18% /etc/hosts
shm                65536        0     65536   0% /dev/shm
tmpfs             409600       12    409588   1% /run/secrets/
tmpfs            3558484        0   3558484   0% /proc/acpi
tmpfs            3558484        0   3558484   0% /proc/scsi
tmpfs            3558484        0   3558484   0% /sys/firmware

The first row shows the overlay filesystem at 18% used; the overlay is the local ephemeral storage. You can track the usage of this storage and find which process is filling it using lsof.
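
Before reaching for lsof, ranking directory sizes narrows down where the space went. A minimal sketch, run inside the affected container; TARGET defaults to /tmp here as a safe example:

```shell
# Rank top-level directory sizes under TARGET to find what is filling the
# writable layer. -x stays on one filesystem, so other mounts are excluded.
TARGET="${TARGET:-/tmp}"
du -x --max-depth=1 "$TARGET" 2>/dev/null | sort -rn | head -n 10
```

Repeat with TARGET set to the largest directory until you reach the offending files, then use lsof to see which process holds them open.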

Compute/memory issues

By default, there are no restrictions on how many resources a pod can consume. If a pod uses all the resources on the node, other pods - and the pod itself - will struggle to run efficiently. That is the reason why Kubernetes has requests and limits.
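
A minimal sketch of a container's resource section; the values are illustrative, not recommendations:

```yaml
# Hypothetical container fragment: requests are what the scheduler reserves
# on the node, limits are the hard cap enforced at runtime.
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
```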

Detect CPU/Memory consuming pods and nodes

Compare the usage of pods against available resources on the node.

➜  ~ kubectl top nodes
NAME                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
aks-default-48120648-vmss000001   159m         8%     2633Mi          57%       
➜  ~ kubectl top pods 
NAME                                            CPU(cores)   MEMORY(bytes)   
workload-55d8b4fcbd-k8zw7              11m          223Mi           

It's easy to pinpoint which nodes and pods are using the most resources. The harder part is finding out why a process is eating memory or CPU.

Current usage vs reserved resources

If a pod cannot be scheduled even though kubectl top shows enough free resources, remember that kubectl top shows only current usage. The scheduler, on the other hand, reserves capacity based on the pods' resource requests. Inspect the sum of requests on the node and adjust them according to the node's allocatable resources.
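
The "Allocated resources" section of kubectl describe node shows exactly this sum. The sketch below extracts it with awk; the sample input is fabricated for illustration only - in a real cluster, pipe kubectl describe node <node-name> into the awk program instead:

```shell
# Extract the "Allocated resources" section from `kubectl describe node`
# output to compare the sum of requests against node capacity.
# The here-doc is fabricated sample output, used so the sketch is runnable.
awk '/^Allocated resources:/{show=1} /^Events:/{show=0} show' <<'EOF'
Name:               aks-default-48120648-vmss000001
Allocated resources:
  Resource           Requests      Limits
  cpu                750m (39%)    1500m (78%)
  memory             1200Mi (26%)  2400Mi (52%)
Events:             <none>
EOF
```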

Kubernetes object misconfiguration

If network, storage, or compute resources are not the problem, you need to validate the Kubernetes objects themselves.

The most basic checklist:

  • Check if the secrets reference a valid source
  • Check if the configMaps reference a valid source
  • Check if secrets are placed in the correct files/environment variables
  • Check if configMaps are placed in the correct files/environment variables
  • Check if volumeMounts reference a valid volume source
  • Check if volumeMounts are mounted at a valid mountPath
  • Check if the service ports match the containerPorts in the pod
  • Check if the persistentVolumeClaim references a valid PersistentVolume
  • Check if the PersistentVolume has enough storage left
  • Check if container images are valid and the imagePullPolicy is correct
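
A quick way to triage most of the items above is to read the pod's events, which flag bad secret/configMap references, image pull errors, and unbound PVCs. A hedged sketch; the pod name is a placeholder and the guard makes it a no-op where kubectl is unavailable:

```shell
# Pod events are appended at the end of `kubectl describe pod` output;
# sed trims everything before the Events: section.
POD="${POD:-workload-55d8b4fcbd-k8zw7}"   # placeholder pod name
if command -v kubectl >/dev/null 2>&1; then
  kubectl describe pod "${POD}" | sed -n '/^Events:/,$p'
  kubectl get events --sort-by=.metadata.creationTimestamp
fi
```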

Software misconfiguration

Complex software needs additional configuration for running efficiently.

For example, you can tweak JVM options when running Java software, the truststore/keystore for the JVM, Nginx server configuration, Caddy server configuration, TLS options and certificates, etc.
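
For the JVM case, one common pattern is passing options through the environment in the pod spec. A hypothetical fragment; the values are illustrative, not recommendations:

```yaml
# Hypothetical container fragment: JAVA_TOOL_OPTIONS is picked up by the JVM
# at startup without changing the image's entrypoint.
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-Xms256m -Xmx512m -Djavax.net.ssl.trustStore=/etc/ssl/truststore.jks"
```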

It really depends on the technology stack and the software used, but if all the checks above pass, the debugging process should focus on software configuration.

Faulty software

Sometimes the software itself is faulty - especially if you are running software that is still under development.

This post scratched the surface of debugging on Kubernetes and pointed out some common scenarios. It should give you a sense of direction when debugging:

  • Categorization
  • Process of elimination
  • Find a specific problem in the category
  • Solve if possible

Simple Linux programs can help you a lot in locating the problem. Happy debugging!
