Resolving Istio 503s and 504s
As Kubernetes' popularity grows, more and more companies are adopting it and migrating their workloads onto it. The company I currently work for is no exception, and over the months we've migrated a number of services to Kubernetes almost painlessly. Until today.
Before continuing with my story, it's worth mentioning that we use Istio as a service mesh.
Of all the services, one in particular, which we will call "Service X", has given me and my teammates a few headaches.
The first attempt to migrate "Service X" to Kubernetes was made in September 2019. "Service X" is a high-throughput, low-latency service that, at the time, was suffering from some performance issues. When we migrated it to Kubernetes things only got worse, and because of the importance of the service, the lack of time to investigate the issue properly, and the fact that we couldn't afford to have it throwing thousands of errors every day, we decided to move it back to ECS.
Fast forward to March 2020, and after several improvements the service was running super smoothly on ECS. Because of how well the service was doing, we were confident that this time the migration of "Service X" would go smoothly. And so it went. Until we noticed that "Service X" was sporadically throwing a bunch of 503s and 504s, all very close to each other, and then running smoothly again.
This came as a big surprise to us because, as we immediately verified, there were no signs of 50x errors when "Service X" was on ECS.
Network hops with Istio Ingressgateways
Before proceeding, it’s worth discussing how we’ve configured Kubernetes and Istio in production, and what it means for an HTTP call to traverse the networking stack. In the production cluster, we decided to adopt Ingress Gateways to empower Istio to manage both north-south and east-west communication.
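To give an idea of what this looks like in practice, here is a minimal sketch of a Gateway resource binding our ingress gateway to a host (names and hosts are hypothetical; TLS terminates at the ELB in our setup, so the gateway itself listens on plain HTTP):

```yaml
# Hypothetical sketch: a Gateway that binds the istio-ingressgateway pods
# (selected by label) to HTTP traffic for our host. VirtualServices then
# attach to this gateway by name to route north-south traffic.
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: my-gateway
spec:
  selector:
    istio: ingressgateway   # matches the ingress gateway deployment's pods
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - myhost.mycompany.com
```
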
Below is a representation of the networking stack:
Let’s briefly explain what happens when the Load Balancer receives an HTTP call.
- In Kubernetes land, a Load Balancer is implemented by an actual cloud provider load balancer, backed on each Kubernetes worker node by a NodePort listening on a specific port number.
- The NodePort examines the host header and checks its routing tables; in this case, the request is for a pod belonging to the istio-ingressgateway deployment.
- The istio-ingressgateway pod is nothing more than an Envoy proxy that, in turn, examines the payload and re-routes the request to the pod that will eventually serve it.
- The request is now about to hit (finally!) the pod that will actually serve it, but, because we've enabled mTLS, the Istio sidecar will intercept the request, verify its legitimacy, and then, at last, pass it to the container that will serve it.
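For completeness, the mTLS enforcement mentioned in the last step can be expressed, on Istio 1.5+, with a policy roughly like the sketch below (older Istio versions used the `MeshPolicy`/`Policy` resources instead; this is an illustration, not our exact manifest):

```yaml
# Hypothetical sketch: mesh-wide strict mTLS, so sidecars only accept
# mutually-authenticated traffic from other mesh workloads.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # applying it in the root namespace makes it mesh-wide
spec:
  mtls:
    mode: STRICT
```
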
Pretty straightforward. Isn’t it?
I had never realised until now how many hops an HTTP request goes through in a networking stack like the one above, but luckily all the intermediate proxies and the underlying network add very little overhead.
So now, back to our 503s and 504s. Why are they happening, and how can we fix them?
If you, like me, weren't familiar with the stack I've described above, you probably, like me, wouldn't have thought that all the requests going through the load balancer also go through the istio-ingressgateway pods, and these, like any other deployment, need to scale.
Our first problem was here. Periodically, the traffic towards our "Service X" would surge very quickly, and the istio-ingressgateway pods wouldn't scale quickly enough to handle the requests; they would start refusing connections and generating 503 Service Unavailable errors.
Once we knew the problem, fixing it was a piece of cake.
istio-ingressgateway pods are super lightweight, and a bit of over-provisioning wouldn't cost too much, so I decided to double the number of pods and implement a very aggressive scaling policy. The result is in the YAML snippet below:
```yaml
custom-ingressgateway:
  enabled: true
  labels:
    app: istio-ingressgateway
    istio: ingressgateway
  serviceAnnotations:
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "my-certificate"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "false"
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "60"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "https"
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 80
      name: http2
      nodePort: 31781
    - port: 443
      name: https
      nodePort: 31782
  autoscaleEnabled: true
  autoscaleMin: 10
  autoscaleMax: 20
  rollingMaxSurge: 100%
  rollingMaxUnavailable: 25%
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 2
      memory: 1Gi
  cpu:
    targetAverageUtilization: 50
```
As you can see, I set a minimum of 10 pods and a targetAverageUtilization of 50%. This means traffic could double, and we should technically still be able to handle the load.
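Under the hood, those Helm autoscaling values translate into a HorizontalPodAutoscaler roughly like the following (a sketch of what the chart generates, not a manifest we wrote by hand):

```yaml
# Hypothetical sketch: the HPA that autoscaleMin/autoscaleMax and the 50%
# CPU target roughly correspond to for the ingress gateway deployment.
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: istio-ingressgateway
  namespace: istio-system
spec:
  minReplicas: 10
  maxReplicas: 20
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: istio-ingressgateway
  metrics:
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 50   # scale out when average CPU exceeds 50%
```
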
This fixed the 503 Service Unavailable issue for us.
Idle and Connection Timeout
It’s always the same story again and again. At some stage, in every company I work for, I have to deal with these nasty timeouts.
This time though, because of the many hops an HTTP call has to go through, it was very difficult to understand where these parameters had to be tuned.
I eventually found out that both can be specified in Istio manifests. So without further ado, let’s jump right into the Istio manifests.
Let's start with the easy one. As per the official Istio documentation on timeouts:
The default timeout for HTTP requests is 15 seconds, which means that if the service doesn’t respond within 15 seconds, the call fails.
Boom, that was easy. Let’s fix it:
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-virtual-service
  labels:
    app.kubernetes.io/name: "service-name"
spec:
  hosts:
    - myhost.mycompany.com
  gateways:
    - my-gateway
  http:
    - route:
        - destination:
            host: my-service
      timeout: 30s
```
This is pretty self-explanatory. Now on to the less intuitive idle timeout.
Establishing a connection costs time and resources; keeping a connection alive is almost free. This is the reason why TCP connections are kept alive. If they don't get used for a certain amount of time, though, they should be recycled, and that's what the idle timeout is for. One catch: these values have to be larger at every step of the chain.
In our example there are three main actors: the Load Balancer, the istio-ingressgateway, and the actual service running in the pod.
The idle timeout on the service has to be longer than the one on the istio-ingressgateway, which, in turn, has to be at least a second longer than the one on the ELB. The reason is simple. Let's say the Load Balancer has an idle timeout of 60s and the istio-ingressgateway one of 30s instead, and let's say that at second 35 the Load Balancer receives a request and is ready to forward it: it verifies that its timeout has not expired, and then starts sending packets. On the other end though, the istio-ingressgateway has already recycled the connection (it expired 5 seconds earlier!) and will discard packets sent its way. The Load Balancer won't hear back from the istio-ingressgateway and will just keep waiting until it eventually times out. Great, we know the problem; how do we fix it?
Istio DestinationRule manifests to the rescue:
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service-dest-rule
spec:
  host: my-service
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
    loadBalancer:
      simple: ROUND_ROBIN
    connectionPool:
      http:
        idleTimeout: 160s
```
Now my Load Balancer has a 60s idle timeout, Istio has 160s, and the service has an idle timeout longer still.
As our production cluster runs on Kubernetes 1.14, I've also found this article interesting: kube-proxy Subtleties: Debugging an Intermittent Connection Reset, which, maybe, could be responsible for the last few 504s we are still experiencing. We will hopefully migrate to 1.15 soon, and this issue should disappear… so stay tuned!