Make Your K8s Apps Istio-Retry Aware!

Learn how to effectively code API requests to microservices in Istio-enabled Kubernetes.

Ted Spinks
Engineer
Made the Kessel Run in less than twelve parsecs.

Published on January 08, 2025


Table of Contents

Overview

Istio Changes Communication

Mitigate The Additional Retries

Get Insight Into Istio’s Retries

Summary


Overview

The addition of Istio to Kubernetes (K8s) can remove a lot of the burden of retry and timeout logic from our microservice code. BUT, if our code isn’t aware of what Istio is doing, it can undo these benefits or, worse, introduce new problems. In this article, we’re going to explore how Istio changes API requests, and how to code our requests to respect those changes and maximize their benefits.

Istio Changes Communication

When microservices on K8s communicate with each other, they must be tolerant of transient issues like pod terminations, network glitches, and request overloads. This means that their retry logic needs to be robust.

On the flip side, when an upstream microservice is floundering (perhaps it’s overloaded with requests, or maybe one of its upstream services is in trouble), we don’t want our retries to be too aggressive, as that could further overload the upstream microservice and actually prevent its recovery.

NOTE: In network applications, “upstream” refers to the direction data flows when a request is made from a client to a server. When your microservice makes a request to another service, that service is said to be “upstream.”
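
To make that balance concrete, here is a minimal sketch (with illustrative names and defaults of my own choosing) of the kind of client-side retry logic an app dev team might write before Istio enters the picture: exponential backoff plus random jitter, capped at a small number of attempts.

import random
import time

import requests

def get_with_backoff(url, max_retries=3, base_delay_secs=0.5, timeout_secs=5):
    response, last_exc = None, None
    for attempt in range(max_retries + 1):  # initial attempt + retries
        try:
            response = requests.get(url, timeout=timeout_secs)
            if response.status_code not in (500, 502, 503, 504):
                return response  # success, or an error not worth retrying
        except requests.RequestException as exc:
            response, last_exc = None, exc
        if attempt < max_retries:
            # Exponential backoff plus jitter spreads retries out so a
            # struggling upstream isn't hit by synchronized bursts.
            time.sleep(base_delay_secs * (2 ** attempt) + random.uniform(0, 0.2))
    if response is None:
        raise last_exc
    return response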

So, let’s say you’re part of an application development (app dev) team that has achieved this delicate balance for all its microservices by implementing some sophisticated retry logic in its request code. To help other app dev teams quickly achieve similar functionality, the K8s platform team has decided to install Istio in all the K8s clusters, utilizing Istio’s VirtualServices to provide similar retry logic for all microservices.

What does that mean for your team? The addition of Istio-level retry logic presents two challenges. First, the additional retries could push your microservices out of their balanced retry stance and into an overly aggressive one. Second, since Istio is adding retries outside of your code, your code will need some way to gain insight into what’s happening with those retries so that it can react appropriately. Let’s start with the first challenge…

Mitigate The Additional Retries

To add retries to microservices, a K8s platform team would typically attach an Istio VirtualService resource to each one, similar to this:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: example-api
spec:
  hosts:
    ...
  http:
  - route:
    - destination:
        ...
    retries:
      retryOn: 5xx
      attempts: 3
      perTryTimeout: 1s

And here’s the really interesting part: when your microservice sends a request to an upstream microservice, it triggers the retry logic in the VirtualService attached to the upstream microservice (not yours). So, if you don’t have the permissions to see that VirtualService in K8s, you’ll want to reach out to the app dev team or K8s platform team that owns it and ask for a copy. Then, check out its retries section; you can read more about each field in this section of the Istio Docs.

NOTE: In the absence of VirtualServices, Istio still adds a default timeout and a default retry policy. These are initially set to 10s and 2 retries on errors “connect-failure, refused-stream, unavailable, cancelled, or retriable-status-codes.” Retriable-status-codes includes 503 errors.

Your first order of business should be to figure out the total time a request could take and then make sure the timeout in your request code is larger. For example, if you’re requesting a microservice with the above VirtualService, a single request could result in up to 3 retries with a 1s perTryTimeout each, taking up to 4 * 1s = 4s (the initial attempt plus 3 retries). So, the request timeout in your code should be at least 4s, plus a bit of buffer. Otherwise, your timeout could cut off the request while Istio is still working on it.

Next, you’ll want to calculate the maximum number of requests that could result from retries; too many too quickly could overwhelm the upstream microservice. If you’re making a request to the microservice with the above VirtualService, and your request code has 3 retries of its own, then the math looks like this: your 4 attempts * the upstream VirtualService‘s 4 attempts = 16 requests.
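
To make those two calculations repeatable, here is a small sketch that mirrors the simplified math above (it ignores any backoff Envoy may add between its own retries); the variable names are my own.

istio_retry_attempts = 3      # "attempts" in the upstream VirtualService
per_try_timeout_secs = 1.0    # "perTryTimeout" in the upstream VirtualService
client_retries = 3            # retries configured in your own request code

istio_tries = istio_retry_attempts + 1    # initial attempt + Istio retries
client_tries = client_retries + 1         # initial attempt + your retries

min_client_timeout_secs = istio_tries * per_try_timeout_secs
max_upstream_requests = client_tries * istio_tries

print("Client timeout should be at least {:.0f}s plus a buffer".format(min_client_timeout_secs))
print("Worst case: {} requests reach the upstream microservice".format(max_upstream_requests))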

I’ll illustrate this scenario by making a request with 3 client-side retries to a microservice in my K8s cluster whose VirtualService is set to 3 retry attempts.

Here is the log from my request:

DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): test-api.test-api.svc.cluster.local:80
send: b'GET / HTTP/1.1\r\nHost: test-api.test-api.svc.cluster.local\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'
header: server: istio-envoy
header: date: Wed, 08 Jan 2025 18:54:24 GMT
header: content-type: text/html; charset=utf-8
header: access-control-allow-origin: *
header: access-control-allow-credentials: true
header: content-length: 0
header: x-envoy-upstream-service-time: 138
DEBUG:urllib3.connectionpool:http://test-api.test-api.svc.cluster.local:80 "GET / HTTP/1.1" 503 0

DEBUG:urllib3.util.retry:Incremented Retry for (url='/'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /
send: b'GET / HTTP/1.1\r\nHost: test-api.test-api.svc.cluster.local\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'
header: server: istio-envoy
header: date: Wed, 08 Jan 2025 18:54:25 GMT
header: content-type: text/html; charset=utf-8
header: access-control-allow-origin: *
header: access-control-allow-credentials: true
header: content-length: 0
header: x-envoy-upstream-service-time: 109
DEBUG:urllib3.connectionpool:http://test-api.test-api.svc.cluster.local:80 "GET / HTTP/1.1" 503 0

DEBUG:urllib3.util.retry:Incremented Retry for (url='/'): Retry(total=1, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /
send: b'GET / HTTP/1.1\r\nHost: test-api.test-api.svc.cluster.local\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'
header: server: istio-envoy
header: date: Wed, 08 Jan 2025 18:54:25 GMT
header: content-type: text/html; charset=utf-8
header: access-control-allow-origin: *
header: access-control-allow-credentials: true
header: content-length: 0
header: x-envoy-upstream-service-time: 120
DEBUG:urllib3.connectionpool:http://test-api.test-api.svc.cluster.local:80 "GET / HTTP/1.1" 503 0

DEBUG:urllib3.util.retry:Incremented Retry for (url='/'): Retry(total=0, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /
send: b'GET / HTTP/1.1\r\nHost: test-api.test-api.svc.cluster.local\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'
header: server: istio-envoy
header: date: Wed, 08 Jan 2025 18:54:25 GMT
header: content-type: text/html; charset=utf-8
header: access-control-allow-origin: *
header: access-control-allow-credentials: true
header: content-length: 0
header: x-envoy-upstream-service-time: 90
DEBUG:urllib3.connectionpool:http://test-api.test-api.svc.cluster.local:80 "GET / HTTP/1.1" 503 0
Request failed: 503 Server Error: Service Unavailable for url: http://test-api.test-api.svc.cluster.local/

The upstream microservice is responding (via Istio) with 503 errors, so we see 4 responses as expected. Now let’s look at the Istio logs and see what else is going on…

test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:24 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:24 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:24 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:24 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:24 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}

Whoa, that’s 16 total requests to the upstream microservice, even though our request code only had 3 retries! We’d better reduce our code’s retries or find a way to make more intelligent retry decisions. That takes us to our second challenge…

Get Insight Into Istio’s Retries

When you request an upstream microservice that is struggling, Istio will iterate through the retry logic specified by the VirtualService, trying to get you a good response. But what happens if it exhausts its retry logic? Or if it runs into other problems? Will you be stuck with some generic 503 or 504 error? The answer is…

It depends! You had to see that coming–I am a consultant, after all.

Fortunately, there is a very powerful, yet little-known feature of Istio that can help us with this: VirtualServices allow us to dynamically inject or remove headers in requests and responses. In our case, we want to see Istio’s RESPONSE_FLAGS (more on that in a moment) by adding a headers section to the upstream VirtualService like this:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: example-api
spec:
  hosts:
    ...
  http:
  - route:
    - destination:
        ...
    retries:
      retryOn: 5xx
      attempts: 3
      perTryTimeout: 1s
    headers:
      response:
        set:
          x-envoy-response-flags: "%RESPONSE_FLAGS%"

Dynamically injecting headers with Istio is a very underdocumented feature that even seasoned Istio users might not be aware of, so it’s likely that you will have to ask your K8s platform team to add this headers section to the upstream VirtualServices.

In this example, I’ve called the header x-envoy-response-flags, but it could be called anything. I know the x- prefix is somewhat out of vogue now, but I chose x-envoy- to match the other headers that Istio adds to the response, like x-envoy-upstream-service-time.

Anyway, Istio can add a lot of RESPONSE_FLAGS to this header. For the complete list, go to this doc page and search for “RESPONSE_FLAGS.”

Don’t be weirded out by the fact that this is an Envoy doc page; Istio uses Envoy to manage its inter-service network traffic.

Let’s say your upstream microservice+VirtualService does include this x-envoy-response-flags header. When Istio exhausts the retry logic in the upstream VirtualService, it will send back a URX flag (UpstreamRetryLimitExceeded) in the x-envoy-response-flags header.

With this header added to our previous retry example, we can see it in the log:


DEBUG:urllib3.util.retry:Incremented Retry for (url='/'): Retry(total=0, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /
send: b'GET / HTTP/1.1\r\nHost: test-api.test-api.svc.cluster.local\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'
header: server: istio-envoy
header: date: Wed, 08 Jan 2025 20:48:03 GMT
header: content-type: text/html; charset=utf-8
header: access-control-allow-origin: *
header: access-control-allow-credentials: true
header: content-length: 0
header: x-envoy-upstream-service-time: 89
header: x-envoy-response-flags: URX
DEBUG:urllib3.connectionpool:http://test-api.test-api.svc.cluster.local:80 "GET / HTTP/1.1" 503 0
Request failed: 503 Server Error: Service Unavailable for url: http://test-api.test-api.svc.cluster.local/

Now we need our code’s request logic to check for that URX flag. If it sees it, then it should stop retrying because it knows that Istio has already sufficiently retried.

Here is some example code in Python that will make a request and then look for this particular header in the response. Python folks typically use a requests.Session with a urllib3.util.Retry object to manage their retries. However, I haven’t found a good way to inject a check function into the urllib3.util.Retry object, so I’m using my own retry loop. This allows me to include my should_retry() function, which can look for the x-envoy-response-flags header along with all my other checks.

#!/usr/bin/env python3

import requests
import time

def should_retry(response, max_retries=0, retries_completed=0):
    if retries_completed >= max_retries:
        print("Finished all ({}) retries".format(max_retries))
        return False
    # Check for any response codes that you want to be retried
    if response.status_code not in [500, 502, 503, 504]:
        print("{} is not a retry-able status code".format(response.status_code))
        return False
    # Check for any Istio Response Flag headers that should stop retries
    if response.headers.get('x-envoy-response-flags') == 'URX':
        print('UpstreamRetryLimitExceeded (URX) response flag received from Istio, stopping retries')
        return False
    return True

def simple_get_with_retry(url, retries=0, data_dict=None, headers=None, timeout_secs=5):
    retries_completed = 0
    response = requests.get(url.strip(), headers=headers, json=data_dict, timeout=timeout_secs)
    while should_retry(response, retries, retries_completed):
        time.sleep(timeout_secs)
        response = requests.get(url.strip(), headers=headers, json=data_dict, timeout=timeout_secs)
        retries_completed += 1
    return response

def main():
    response = simple_get_with_retry("http://my-microservice.local", 3)
    print("Response code: " + str(response.status_code))
    print("Response Headers: " + str(response.headers))

if __name__ == "__main__":
    main()

Feel free to run this example code in a debug pod or from wherever you can reach an upstream microservice–just replace http://my-microservice.local with the upstream URL. The point here is that you’ll need some sort of check function in your retry loop, where you can react to the x-envoy-response-flags header.

Checking for URX is a great starting point. Over time, as your microservices evolve, be sure to check the doc page for additional RESPONSE_FLAGS that could help make your check function more intelligent.
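
As one possible next step, the check function could branch on a few more flags. The hypothetical sketch below treats circuit-breaker overflow (UO) and rate limiting (RL) as signals to stop retrying; double-check the exact flag meanings against the Envoy access log docs before relying on this mapping.

# Hypothetical extension of should_retry(): a small mapping from Envoy
# response flags to reasons for stopping retries. Verify each flag's
# meaning against the Envoy access log docs before relying on it.
STOP_RETRY_FLAGS = {
    "URX": "Upstream retry limit exceeded (Istio already retried for us)",
    "UO": "Upstream overflow / circuit breaking (more retries add load)",
    "RL": "Rate limited (an immediate retry will likely be rejected too)",
}

def should_stop_for_flag(response):
    flag = response.headers.get("x-envoy-response-flags")
    if flag in STOP_RETRY_FLAGS:
        print("Stopping retries, {}: {}".format(flag, STOP_RETRY_FLAGS[flag]))
        return True
    return False

A function like this could replace the single URX check inside should_retry() as your handling grows.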

And while we’re talking about the evolution of our microservices, let’s also consider the evolution of our K8s platform. Once our K8s platform team has defined appropriate retries within VirtualServices for every microservice, we can shift our thinking and consider retries to be the platform’s responsibility. Ideally, we’ll stop managing retry logic within each microservice and allow the platform to own and manage all retries centrally within its VirtualServices. Even then, our microservice request code would still want to check the x-envoy-response-flags response header and use it in its error handling.
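
Under that platform-owned model, the request code could shrink to a single attempt that simply surfaces Istio’s response flag in its error handling. Here is a minimal sketch of what that might look like (the logging style and defaults are placeholders):

import requests

def get_once(url, timeout_secs=5):
    # No client-side retries: the platform's VirtualServices own retry policy.
    response = requests.get(url, timeout=timeout_secs)
    if not response.ok:
        # Surface the Istio/Envoy response flag so error handling and logs
        # can tell "Istio gave up after retrying" (URX) apart from other failures.
        flag = response.headers.get("x-envoy-response-flags", "none")
        print("Request failed: HTTP {} (response flag: {})".format(
            response.status_code, flag))
    return response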

Summary

When requesting an upstream microservice, you should calculate the total possible retry time defined in its Istio VirtualService, and use that to size the timeout in your request code. You should also reduce the number of retries in your request code, since the VirtualService will already be instructing Istio to perform them.

Additionally, your request code should check for Istio’s RESPONSE_FLAGS within response headers so that it can react intelligently to them in its retry and error-handling logic.

And finally, we should work toward removing all retries from our code and relying on the K8s+Istio platform to own them centrally and do what is best. Coming up with appropriate retry and timeout logic for every upstream microservice that our code requests is tedious. When we multiply that effort by all the other teams whose code is also making requests to the same upstream microservices, it wastes a lot of time. Instead, let’s spend that time developing the features that our users really care about, and leave network details like retries and timeouts to the platform!

If you found this guide helpful, you might also enjoy our live Istio training workshops. We spend > 50% of our workshop time doing hands-on lab work, which is a really fun way to learn. Sometimes I help out as a lab coach–maybe I’ll see you there!
