Lessons from Scaling GKE: L4 ILB Tops Out at 250 Nodes

My team at Cruise operates tens of Kubernetes clusters totaling tens of thousands of cores and hundreds of terabytes of RAM. Since migrating to GCP, we have hit several interesting scaling issues, one of which caused a cluster-wide ingress outage for all tenants. In this post, I will revisit the symptom, root cause, and mitigations for this incident. We struggled so you do not have to.

Architecture: Private Ingress with ILB and Nginx Controller

            Client
              |
              | L3/L4
              |
              v
Internal TCP/UDP Load Balancer
              |
              | L7 HTTP
              |
--------------+------------------
|             |                 |
|             v                 |
|   Nginx Ingress Controllers   |
|             |                 |
|             | L7 HTTP         |
|             |                 |
|             v                 |
|            Pods               |
|GKE                            |
---------------------------------
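
The ingress path above is produced by a Kubernetes Service of type LoadBalancer that fronts the Nginx ingress controllers and is annotated to get an internal load balancer. Below is a minimal sketch using the Python Kubernetes client; the Service name, namespace, selector labels, and ports are assumptions for illustration, not our exact manifests.

# Sketch: an internal L4 load balancer fronting the Nginx ingress controllers.
# Names, namespace, and labels below are assumptions for illustration.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

svc = client.V1Service(
    metadata=client.V1ObjectMeta(
        name="nginx-ingress-lb",        # hypothetical Service name
        namespace="ingress-nginx",      # hypothetical namespace
        annotations={
            # Ask GKE for an internal (L4) TCP/UDP load balancer
            # instead of an external one.
            "cloud.google.com/load-balancer-type": "Internal",
        },
    ),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "nginx-ingress-controller"},  # assumed pod labels
        ports=[
            client.V1ServicePort(name="http", port=80, target_port=80),
            client.V1ServicePort(name="https", port=443, target_port=443),
        ],
        # Skips the kube-proxy hop and preserves client source IPs, but only
        # nodes that actually run an Nginx pod pass the ILB health check.
        external_traffic_policy="Local",
    ),
)

client.CoreV1Api().create_namespaced_service(namespace="ingress-nginx", body=svc)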

Symptom

The Internal Load Balancer (ILB) used for private ingress in the cluster stopped responding to connections. New connection attempts (e.g. via curl or traceroute) hung, except for requests originating from within the cluster.
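
A quick way to confirm this kind of symptom is a plain TCP connect against the ingress VIP with a timeout, run from outside and then from inside the cluster. The address below is a placeholder, not the real ingress IP.

# Sketch: probe TCP connectivity to the ingress VIP; the IP is a placeholder.
import socket

ILB_VIP = "10.0.0.100"  # hypothetical internal ingress IP

try:
    with socket.create_connection((ILB_VIP, 443), timeout=5):
        print("connected")
except OSError as exc:
    # During the incident, probes from outside the cluster timed out here,
    # while the same probe from a pod inside the cluster connected fine.
    print(f"connection failed: {exc}")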

Root Causes

  1. Currently, the L4 ILB supports at most 250 backends (source). The Kubernetes cloud provider that manages ILB creation sets all Kubernetes nodes in the cluster as ILB backends. As a result, at most 250 GKE nodes can receive traffic from the ILB. When more than 250 nodes are present in the GKE cluster, the ILB deterministically (using node ID) selects a subset of 250 backends to health-check.

  2. The Nginx controllers are deployed in the cluster as well. Their Service uses externalTrafficPolicy: Local to bypass iptables and send traffic directly to pods on the receiving node. From the ILB's perspective, every node that hosts a destination pod is healthy, and every node that does not is permanently unhealthy.

  3. A tenant workload scale-up caused the cluster to autoscale above 250 nodes. None of the nodes running Nginx were selected by the ILB, so the ILB considered every node in its subset unhealthy. When all ILB backends are unhealthy, no traffic through the ILB reaches the cluster. (A conceptual sketch of this failure mode follows the list.)
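
The snippet below models how the 250-backend subset interacts with externalTrafficPolicy: Local. It is only an illustration; the hashing here is not GCP's actual selection algorithm, just a stand-in for "deterministic subset of 250 nodes".

# Illustrative model only: not GCP's real subsetting algorithm.
import hashlib

BACKEND_LIMIT = 250  # documented L4 ILB backend limit

def ilb_subset(nodes):
    """Deterministically pick at most BACKEND_LIMIT nodes (stand-in for GCP's selection)."""
    ranked = sorted(nodes, key=lambda n: hashlib.sha256(n.encode()).hexdigest())
    return set(ranked[:BACKEND_LIMIT])

def ingress_is_reachable(all_nodes, nginx_nodes):
    """With externalTrafficPolicy: Local, only nodes running an Nginx pod pass
    the health check, and the ILB only health-checks its selected subset."""
    return bool(ilb_subset(all_nodes) & set(nginx_nodes))

all_nodes = [f"node-{i}" for i in range(300)]    # cluster autoscaled past 250
nginx_nodes = {"node-7", "node-42", "node-261"}  # hypothetical Nginx placements
# If no Nginx node lands in the 250-node subset, this returns False: every
# health-checked backend is unhealthy and the ILB drops all ingress traffic.
print(ingress_is_reachable(all_nodes, nginx_nodes))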

Mitigations

Immediate

To immediately restore cluster ingress, another ILB was manually created using the cluster’s well-known ingress IP address. This ILB was configured to point at a couple of instance groups such that the number of backing nodes did not exceed the 250-backend limit.

Short-term

One could deploy a cluster with larger nodes (vertical scaling) in the hope that the total node count stays within 250.

Alternatively, one may use externalTrafficPolicy: Cluster so that the Nginx Service relies on kube-proxy/iptables to load-balance traffic to the correct pod(s), regardless of which node the traffic lands on. A sketch of this change follows the list. The downsides of this approach:

  • Extra latency. When the ILB sends traffic to a node in the 250-node subset, that traffic takes a second hop to reach the destination pod IPs.
  • Less secure, because it requires opening firewalls for port 80/443 on all nodes.
  • The client source IP is not preserved, because of the iptables hop.
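
Switching the policy is a one-field change to the Service spec. A minimal sketch with the Python Kubernetes client, again with an assumed Service name and namespace:

# Sketch: switch the (hypothetical) Nginx Service to externalTrafficPolicy: Cluster
# so kube-proxy on any node in the ILB's subset can forward traffic to Nginx pods.
from kubernetes import client, config

config.load_kube_config()

client.CoreV1Api().patch_namespaced_service(
    name="nginx-ingress-lb",      # assumed Service name
    namespace="ingress-nginx",    # assumed namespace
    body={"spec": {"externalTrafficPolicy": "Cluster"}},
)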

Long-term

Move the Nginx controllers out of the cluster and into a GCE instance group, so that the ILB health checks work properly.