Retry resiliency policies

Configure resiliency policies for retries

Requests can fail due to transient errors, such as network congestion or reroutes to overloaded instances. Requests can also fail because another resiliency policy took effect, for example when a defined timeout or circuit breaker policy is triggered.

In these cases, configuring retries can either:

  • Send the same request to a different instance, or
  • Retry sending the request after the condition has cleared.

Retries and timeouts work together, with timeouts ensuring your system fails fast when needed, and retries recovering from temporary glitches.

Dapr provides default resiliency policies, which you can override with user-defined retry policies.

Retry policy format

Example 1

spec:
  policies:
    # Retries are named templates for retry configurations and are instantiated for the life of the operation.
    retries:
      pubsubRetry:
        policy: constant
        duration: 5s
        maxRetries: 10

      retryForever:
        policy: exponential
        maxInterval: 15s
        maxRetries: -1 # Retry indefinitely

Example 2

spec:
  policies:
    retries:
      retry5xxOnly:
        policy: constant
        duration: 5s
        maxRetries: 3
        matching:
          httpStatusCodes: "429,500-599" # Retry 429 and any code from 500 to 599. All other codes are not retried.
          gRPCStatusCodes: "1-4,8-11,13,14" # Retry the listed gRPC code ranges and single codes. All other codes are not retried.

Spec metadata

The following retry options are configurable:

| Retry option | Description |
| --- | --- |
| policy | Determines the back-off and retry interval strategy. Valid values are constant and exponential. Defaults to constant. |
| duration | Determines the time interval between retries. Only applies to the constant policy. Valid values are of the form 200ms, 15s, 2m, etc. Defaults to 5s. |
| maxInterval | Determines the maximum interval between retries to which the exponential back-off policy can grow. Additional retries always occur after a duration of maxInterval. Defaults to 60s. Valid values are of the form 5s, 1m, 1m30s, etc. |
| maxRetries | The maximum number of retries to attempt. -1 denotes an unlimited number of retries, while 0 means the request will not be retried (essentially behaving as if the retry policy were not set). Defaults to -1. |
| matching.httpStatusCodes | Optional: a comma-separated string of HTTP status codes or code ranges to retry. Status codes not listed are not retried. Valid values: 100-599. Format: <code> or range <start>-<end>. Example: "429,501-503". Default: empty string "" or field is not set, which retries on all HTTP errors. |
| matching.gRPCStatusCodes | Optional: a comma-separated string of gRPC status codes or code ranges to retry. Status codes not listed are not retried. Valid values: 0-16. Format: <code> or range <start>-<end>. Example: "4,8,14". Default: empty string "" or field is not set, which retries on all gRPC errors. |
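
To make the maxRetries and duration semantics concrete, here is a minimal Python sketch of a constant retry loop. It is an illustration only, not Dapr's implementation: the function name retry_constant and its parameters are hypothetical, and it retries on any exception instead of filtering by status code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_constant(
    operation: Callable[[], T],
    duration_s: float = 5.0,   # mirrors the documented default of 5s
    max_retries: int = -1,     # mirrors the documented default of -1 (unlimited)
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, waiting a fixed duration between attempts.

    Hypothetical helper: max_retries = -1 retries without limit,
    max_retries = 0 makes a single attempt and never retries.
    """
    retries_done = 0
    while True:
        try:
            return operation()
        except Exception:
            # Give up once the retry budget is spent (never when max_retries is -1).
            if max_retries >= 0 and retries_done >= max_retries:
                raise
            retries_done += 1
            sleep(duration_s)
```

With duration_s=5.0 and max_retries=10, an operation that fails twice and then succeeds is called three times, with two 5-second waits in between. The sleep parameter is injectable so the loop can be exercised without real delays.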

Exponential back-off policy

The exponential back-off window uses the following formula:

BackOffDuration = PreviousBackOffDuration * (Random value from 0.5 to 1.5) * 1.5
if BackOffDuration > maxInterval {
  BackOffDuration = maxInterval
}
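
The formula above can be traced with a short Python sketch. This is a literal reading of the formula as written, not Dapr's actual code. The starting value initial_s is an assumption, since this section does not state the first interval, and the names backoff_intervals and rand are hypothetical.

```python
import random
from typing import Callable, Iterator


def backoff_intervals(
    initial_s: float,                 # assumed starting interval, not given in this section
    max_interval_s: float = 60.0,     # mirrors the documented maxInterval default
    rand: Callable[[float, float], float] = random.uniform,
) -> Iterator[float]:
    """Yield successive back-off durations, in seconds, following the formula above."""
    previous = initial_s
    while True:
        # BackOffDuration = PreviousBackOffDuration * (random value from 0.5 to 1.5) * 1.5
        current = previous * rand(0.5, 1.5) * 1.5
        # Cap the result at maxInterval.
        current = min(current, max_interval_s)
        yield current
        previous = current
```

With the random factor pinned to 1.0, an initial value of 1 second and a 4-second cap, the first five durations are 1.5, 2.25, 3.375, 4.0 and 4.0 seconds. With real jitter, a draw below roughly 0.67 makes the next wait shorter than the previous one under this literal reading.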

Retry status codes

When applications span multiple services, especially in dynamic environments like Kubernetes, services can disappear for many reasons and network calls can start hanging. Status codes indicate whether an operation failed in production and why.

HTTP

The following table includes some examples of HTTP status codes you may receive and whether you should or should not retry certain operations.

| HTTP Status Code | Retry Recommended? | Description |
| --- | --- | --- |
| 404 Not Found | ❌ No | The resource doesn’t exist. |
| 400 Bad Request | ❌ No | Your request is invalid. |
| 401 Unauthorized | ❌ No | Try getting new credentials. |
| 408 Request Timeout | ✅ Yes | The server timed out waiting for the request. |
| 429 Too Many Requests | ✅ Yes | Respect the Retry-After header, if present. |
| 500 Internal Server Error | ✅ Yes | The server encountered an unexpected condition. |
| 502 Bad Gateway | ✅ Yes | A gateway or proxy received an invalid response. |
| 503 Service Unavailable | ✅ Yes | Service might recover. |
| 504 Gateway Timeout | ✅ Yes | Temporary network issue. |

gRPC

The following table includes some examples of gRPC status codes you may receive and whether you should or should not retry certain operations.

| gRPC Status Code | Retry Recommended? | Description |
| --- | --- | --- |
| Code 1 CANCELLED | ❌ No | N/A |
| Code 3 INVALID_ARGUMENT | ❌ No | N/A |
| Code 4 DEADLINE_EXCEEDED | ✅ Yes | Retry with backoff |
| Code 5 NOT_FOUND | ❌ No | N/A |
| Code 8 RESOURCE_EXHAUSTED | ✅ Yes | Retry with backoff |
| Code 14 UNAVAILABLE | ✅ Yes | Retry with backoff |
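
As an illustration, the recommendations in the two tables can be turned into strings in the matching.httpStatusCodes and matching.gRPCStatusCodes format. The Python sketch below is a hypothetical helper, not part of Dapr. It copies the retryable codes from the tables and collapses consecutive codes into ranges.

```python
from typing import Iterable, List

# Codes marked "Yes" in the HTTP and gRPC tables above.
HTTP_RETRYABLE = {408, 429, 500, 502, 503, 504}
GRPC_RETRYABLE = {4, 8, 14}


def to_matching_string(codes: Iterable[int]) -> str:
    """Render codes as a comma-separated string, collapsing consecutive runs into ranges."""
    ordered = sorted(set(codes))
    parts: List[str] = []
    i = 0
    while i < len(ordered):
        # Extend j to the end of the run of consecutive codes starting at i.
        j = i
        while j + 1 < len(ordered) and ordered[j + 1] == ordered[j] + 1:
            j += 1
        if i == j:
            parts.append(str(ordered[i]))
        else:
            parts.append(f"{ordered[i]}-{ordered[j]}")
        i = j + 1
    return ",".join(parts)
```

Here to_matching_string(HTTP_RETRYABLE) returns "408,429,500,502-504" and to_matching_string(GRPC_RETRYABLE) returns "4,8,14". The tables list examples only, so a real policy may need a different set of codes.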

Retry filter based on status codes

The retry filter enables granular control over retry policies by allowing users to specify HTTP and gRPC status codes or ranges for which retries should apply.

spec:
  policies:
    retries:
      retry5xxOnly:
        # ...
        matching:
          httpStatusCodes: "429,500-599" # Retry 429 and any code from 500 to 599. All other codes are not retried.
          gRPCStatusCodes: "4,8-11,13,14" # Retry the listed gRPC code ranges and single codes. All other codes are not retried.
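
To show how a matching string such as "429,500-599" could be interpreted, here is a small Python sketch. It follows the format and defaults described on this page (comma-separated codes or start-end ranges, with an empty string meaning retry on all errors). It is not Dapr's own parser, and the function names are hypothetical.

```python
from typing import Set


def parse_status_codes(spec: str, lowest: int, highest: int) -> Set[int]:
    """Expand a string like "429,500-599" into the set of codes it covers."""
    codes: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        # A part is either a single code or an inclusive range "start-end".
        start_text, dash, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if dash else start
        if not (lowest <= start <= end <= highest):
            raise ValueError(f"invalid code or range: {part!r}")
        codes.update(range(start, end + 1))
    return codes


def should_retry(error_code: int, spec: str, lowest: int, highest: int) -> bool:
    """Decide whether a failed call with error_code matches the retry filter."""
    allowed = parse_status_codes(spec, lowest, highest)
    # An empty filter means every error is retried.
    return True if not allowed else error_code in allowed
```

For HTTP, pass the bounds 100 and 599. For gRPC, pass 0 and 16. With the filter from the example above, should_retry(503, "429,500-599", 100, 599) is True and should_retry(404, "429,500-599", 100, 599) is False.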

Demo

Watch a demo presented during Diagrid’s Dapr v1.15 celebration to see how to set retry status code filters using Diagrid Conductor.

Next steps

Try out one of the Resiliency quickstarts.


Last modified January 16, 2025: add Bilgin review (783dae9e)