kubezonnet: Monitor Cross-Zone Network Traffic in Kubernetes

How we got our cloud bill under control

January 9, 2025

We’re excited to open-source kubezonnet, a solution for identifying and measuring cross-zone pod network traffic in Kubernetes clusters.

Why Does Cross-Zone Network Monitoring Matter?

On many cloud providers, traffic within the same zone is free, but cross-zone traffic often isn’t. When large amounts of data move across availability zones, costs can quickly add up.

Around June/July 2024 we realized that almost half of our entire cloud bill was caused by cross-zone traffic. We tried a few experiments, but because Google Cloud's billing was our only source of data, feedback on any change we made was only visible days later. On top of that, billing only showed the total usage and cost for the whole day, not broken down by pod or workload.

As with many other engineering problems, we first needed to measure where and why the traffic was occurring before we could try to improve anything.

At first, we explored whether Cilium would be able to provide these features directly. Unfortunately, fundamental pieces of work would need to happen first, and it was unclear whether they would happen at all. See cilium/cilium#33601, cilium/cilium#34133, and cilium/cilium#16188 for more details. Shout out to the various Isovalent engineers who took the time to listen to us and help us understand our options!

It didn't seem like a particularly difficult project, so we ultimately created kubezonnet, solving this problem for ourselves and hopefully for others.

What Exactly Is kubezonnet?

Kubezonnet is a two-component system consisting of an agent and a server. The agent leverages eBPF in the Linux kernel (via a netfilter postrouting hook) to watch packets leaving pods and aggregates traffic data by source/destination IP before sending it to a central server. The server then resolves which pods the IPs belong to, which nodes those pods run on, and therefore which zone each pod is in.
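To make the resolution step concrete, here is a minimal Go sketch of what the server-side logic conceptually does. This is not the actual kubezonnet implementation; the flow record shape, maps, and names are illustrative assumptions, and in practice this state would be built from the Kubernetes API (pod IPs, the pod's node name, and the node's topology.kubernetes.io/zone label).

```go
// Illustrative sketch only: turning per-IP byte counts reported by agents
// into per-pod cross-zone traffic. Names and data shapes are hypothetical.
package main

import "fmt"

// Flow is the kind of record an agent could report: bytes sent from one
// pod IP to another, aggregated over a short interval.
type Flow struct {
	SrcIP, DstIP string
	Bytes        uint64
}

// Hypothetical in-memory view of cluster state; a real server would keep
// this up to date by watching Pods and Nodes via the Kubernetes API.
var (
	ipToPod    = map[string]string{"10.0.1.5": "api-0", "10.0.2.9": "db-0"}
	podToNode  = map[string]string{"api-0": "node-a", "db-0": "node-b"}
	nodeToZone = map[string]string{"node-a": "us-central1-a", "node-b": "us-central1-b"}
)

// crossZoneBytes accumulates bytes per source pod for flows whose source
// and destination pods live in different zones.
func crossZoneBytes(flows []Flow) map[string]uint64 {
	out := map[string]uint64{}
	for _, f := range flows {
		srcPod, ok1 := ipToPod[f.SrcIP]
		dstPod, ok2 := ipToPod[f.DstIP]
		if !ok1 || !ok2 {
			continue // not pod-to-pod traffic; ignore
		}
		srcZone := nodeToZone[podToNode[srcPod]]
		dstZone := nodeToZone[podToNode[dstPod]]
		if srcZone != dstZone {
			out[srcPod] += f.Bytes
		}
	}
	return out
}

func main() {
	flows := []Flow{{SrcIP: "10.0.1.5", DstIP: "10.0.2.9", Bytes: 4096}}
	fmt.Println(crossZoneBytes(flows)) // map[api-0:4096]
}
```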

The server makes the collected statistics available in two forms:

  • Metrics: A Prometheus /metrics endpoint exposing the total network traffic sent by a pod that crossed a zone boundary. Using the `pod_cross_zone_network_traffic_bytes_total` metric combined with the `rate` or `increase` functions, it becomes trivial to see at a high level which workloads cause the most cross-zone traffic (see the query sketch after this list).
  • Logs: More specifically, flow logs that detail the exact source and destination pods along with the number of bytes transferred over the network. These are useful when 1-to-N or N-to-N relationships between services make it non-obvious which pair of workloads is responsible for the traffic.
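As an example of the kind of question the metric answers, here is a hedged sketch that asks Prometheus for the top cross-zone senders using the official Go client. The metric name is taken from this post; the `pod` label, the query shape, and the Prometheus address are assumptions about a typical setup rather than documented kubezonnet behavior.

```go
// Illustrative sketch: query the top cross-zone senders from Prometheus.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed Prometheus address; adjust for your environment.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// Top 10 pods by cross-zone egress rate over the last 5 minutes.
	// The "pod" label is an assumption about how the metric is labeled.
	query := `topk(10, sum by (pod) (rate(pod_cross_zone_network_traffic_bytes_total[5m])))`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```

The same query can of course be run directly in the Prometheus UI or Grafana; swapping `rate` for `increase` gives total bytes over the window instead of a per-second rate.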

Successes

The biggest wins we were able to realize are:

  • We moved most of our Thanos-based monitoring stack to be duplicated in each zone. The most expensive part (network traffic caused by rule evaluations) now never leaves the zone, and we maintain high availability by having a full replica in each zone.
  • A particular service was using our main database excessively. This was previously hard to detect because many services interact with the main database and many services move a lot of bytes over the network. The flow logs made it clear that a single workload dominated the traffic, and once we knew that, it was trivial to fix.

There is one more large change that we are still working on and will talk more about in the future. When all of these improvements are done, cross-zone traffic will be a diminishingly small item on our bill.

Most importantly, now that we have monitoring in place, we can set up alerting to prevent accidental increases from going unnoticed.

Limitations

Unlike much of what we write about on this blog, this is software we built only for ourselves, not directly for our customers. That means we stopped at the point where the project was good enough for us.

As a result, some of the limitations the project currently has are:

  • CNI: Must be Cilium in legacy host-routing mode. In other Cilium configurations, packets bypass netfilter entirely and are therefore missed. GKE Dataplane V2 clusters use this mode.
  • Kernel: Requires Linux 6.4+, as that's when netfilter program support for eBPF landed.
  • IPv4 only, since our clusters are IPv4 only.
  • The metric data excludes IP header sizes, so it's best used to track relative usage. This can probably be fixed, but we couldn't get the eBPF verifier to accept extracting the Ethernet frame size.
  • It only captures pod-to-pod traffic, since that's the only situation where we can control the placement of workloads or the routing of packets.

These are all solvable problems; we just didn't need to solve them for ourselves. We'd love contributions from the community!

Acknowledgments

Shout out to the various people who helped put pieces of this project together. In no particular order, that includes, but is not limited to:

Try it out!

We've released v0.1.0 of the project, and we'd love for you to try it out if you're facing the same problems and/or contribute to it to make it more widely applicable.
