Resolving Istio 503s and 504s

As Kubernetes’ popularity grows, more and more companies are adopting it and migrating their workloads onto it. The company I currently work for is no exception, and over the past months we’ve been migrating, almost painlessly, a number of services to Kubernetes. Until today.

Before continuing with my story, it’s worth mentioning that we use Istio as our service mesh.

Of all the services, one in particular, which we will call “Service X”, has given me and my team-mates a few headaches. The first attempt to migrate “Service X” to Kubernetes was made in September 2019. “Service X” is a high-throughput, low-latency service that, at the time, was suffering from some performance issues. When we migrated it to Kubernetes, things only got worse, and because of the importance of the service, the lack of time to properly investigate the issue, and the fact
we couldn’t afford to have it throwing thousands of errors every day, we decided to move it back to ECS.

Fast forward to March 2020: after several improvements, the service was running super smoothly on ECS. Because of how well it was doing, we were confident that this time the migration of “Service X” would go smoothly. And so it did. Until we noticed that “Service X” was sporadically throwing a bunch of 503s and 504s, all very close to each other, and then running smoothly again.

This came as a big surprise to us because, as we immediately verified, there were no signs of 50x errors when “Service X” was running on ECS.

Network hops with Istio Ingressgateways

Before proceeding, it’s worth discussing how we’ve configured Kubernetes and Istio in production, and what it means for an HTTP call to traverse the networking stack. In the production cluster, we decided to adopt Ingress Gateways to empower Istio to manage both north-south and east-west communication.

Below is a representation of the networking stack:

Kubernetes and Istio Ingress Gateway

Let’s briefly explain what happens when the Load Balancer receives an HTTP call.

  1. In Kubernetes land, a Load Balancer is implemented by an actual cloud provider load balancer, backed on each Kubernetes worker node by a NodePort listening on a specific port number.
  2. The NodePort forwards the request according to its routing rules; in this case the request is destined for a pod belonging to the istio-ingressgateway deployment.
  3. The istio-ingressgateway pod is nothing more than an Envoy proxy which, in turn, examines the request and routes it to the pod that will eventually serve it (a sketch of the Gateway resource that drives this routing follows the list).
  4. The request is now about to hit (finally!) the pod that will actually serve it, but, because we’ve enabled mTLS, the Istio sidecar intercepts the request, verifies its legitimacy, and then, at last, hands it to the container that serves it.
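
For completeness, the routing in step 3 is driven by an Istio Gateway resource bound to the ingressgateway pods. Here is a minimal sketch, assuming plain HTTP behind the ELB and reusing the gateway name and host that appear in the VirtualService later in this post:

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: my-gateway
spec:
  # Select the ingressgateway pods via the labels set on the deployment (see the snippet further down)
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http2
      protocol: HTTP
    hosts:
    - myhost.mycompany.com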

Pretty straightforward, isn’t it?

I had never realised until now how many hops an HTTP request goes through in a networking stack like the one above, but luckily, all the intermediate proxies and the underlying network add very little overhead.

So now, back to our 503s and 504s: why are they happening, and how can we fix them?

Scaling istio-ingressgateway

If you, like me, weren’t familiar with the stack I’ve described above, you probably, like me, wouldn’t have realised that all the requests going through the load balancer also go through the istio-ingressgateway pods, and that this deployment, like any other, needs to scale. Our first problem was right here. Periodically, the traffic towards “Service X” would surge very quickly, the istio-ingressgateway pods wouldn’t scale quickly enough to handle it, and they would start refusing connections and returning 503 Service Unavailable errors.

Once we knew the problem, fixing it was a piece of cake. The istio-ingressgateway pods are super lightweight, and a bit of over-provisioning wouldn’t cost too much, so I decided to double the number of pods and implement a very aggressive scaling policy. The result is in the YAML snippet below:

  custom-ingressgateway:
    enabled: true
    labels:
      app: istio-ingressgateway
      istio: ingressgateway
    serviceAnnotations:
      service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "my-certificate"
      service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "false"
      service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "60"
      service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
      service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "https"
    type: LoadBalancer
    ports:
    - port: 80
      targetPort: 80
      name: http2
      nodePort: 31781
    - port: 443
      name: https
      nodePort: 31782
    autoscaleEnabled: true
    autoscaleMin: 10  # double the previous pod count, for head-room
    autoscaleMax: 20
    rollingMaxSurge: 100%
    rollingMaxUnavailable: 25%
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 2
        memory: 1Gi
    cpu:
      targetAverageUtilization: 50  # scale out once average CPU usage crosses 50%

As you can see, I set a minimum of 10 pods and a targetAverageUtilization of 50%. This means traffic could double and we should, technically, still be able to handle the load. This fixed the 503 Service Unavailable issue for us.
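
For context, those autoscale values are rendered by the chart into a standard HorizontalPodAutoscaler. Roughly, the equivalent object (the deployment name here is illustrative; the chart generates the real one) would look like this:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: custom-ingressgateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: custom-ingressgateway
  minReplicas: 10    # keep head-room even when traffic is low
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 50    # add pods once average CPU crosses 50%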

Idle and Connection Timeout

It’s always the same story. At some stage, in every company I work for, I have to deal with these nasty timeouts.

This time, though, because of the many hops an HTTP call has to go through, it was quite difficult to understand where these parameters had to be tuned.

I eventually found out that both can be specified in Istio manifests. So, without further ado, let’s jump right in.

Let’s start with the easy one. As per the official Istio documentation on timeouts:

The default timeout for HTTP requests is 15 seconds, which means that if the service doesn’t respond within 15 seconds, the call fails.

Boom, that was easy. Let’s fix it:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-virtual-service
  labels:
    app.kubernetes.io/name: "service-name"
spec:
  hosts:
  - myhost.mycompany.com
  gateways:
  - my-gateway
  http:
  - route:
    - destination:
        host: my-service
    timeout: 30s  # raise the 15s default request timeout

This is pretty self-explanatory. Now on to the less intuitive idle timeout.

Establishing a connection costs time and resources, while keeping one alive is almost free. This is why TCP connections are kept alive and reused; if a connection isn’t used for a certain amount of time, it gets recycled. That’s what the idle timeout is for. The catch is that these values have to grow larger at every step of the chain. In our example there are three main actors: the Load Balancer, the istio-ingressgateway and the actual service running in the pod. The idle timeout on the service has to be longer than the one on the istio-ingressgateway, which, in turn, has to be at least a second longer than the one on the ELB.

The reason is simple. Let’s say the Load Balancer has an idle timeout of 60s while the istio-ingressgateway has 30s, and that at second 35 the Load Balancer receives a request. It’s ready to forward it, it verifies its own timeout hasn’t expired, and it starts sending packets. On the other end, though, the istio-ingressgateway has already recycled the connection (it expired 5 seconds earlier!) and will discard the packets sent its way. The Load Balancer won’t hear back from the istio-ingressgateway and will just keep waiting until it eventually times out. Great, we know the problem; how do we fix it?

Istio DestinationRule manifests to the rescue:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service-dest-rule
spec:
  host: my-service
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
    loadBalancer:
      simple: ROUND_ROBIN
    connectionPool:
      http:
        idleTimeout: 160s  # longer than the ELB's 60s, shorter than the service's 180s

Now the Load Balancer has a 60s idle timeout (set through the aws-load-balancer-connection-idle-timeout annotation in the ingressgateway snippet above), Istio has 160s, and the service itself has an idle timeout of 180s.

As our production cluster is running Kubernetes 1.14, I’ve also found this article interesting: kube-proxy Subtleties: Debugging an Intermittent Connection Reset. The issue it describes might be responsible for the last few 504s we are still experiencing. We will hopefully migrate to 1.15 soon, and that issue should disappear… so stay tuned!
