Article From:

1. Reference

Reference: 607/p/8330551.html

Source code analysis of the ribbon + spring retry strategy:


2. Background

High availability is a basic requirement for services these days.

To keep client requests from failing just because one of the machines providing the service goes down, we need to either retry the same server or route the request intelligently to the next available one.

After researching some material online, we settled on the retry strategy of ribbon + spring retry.


As the referenced article shows, fault retry boils down to two things:

1. Add the spring retry dependency.

2. Enable the zuul and ribbon retry configuration:

    retryable: true                  # enable retries
    ribbon:
      MaxAutoRetriesNextServer: 2    # number of replacement servers to try
      MaxAutoRetries: 0              # number of retries on the current server
      OkToRetryOnAllOperations: true # set to false to retry only failed GET requests


Of course, the purpose of this article is not limited to that. After adding these configurations, we found that some limitations remain.

1. When the number of machine instances in the cluster is smaller than MaxAutoRetriesNextServer, only a round-robin load-balancing policy works reliably.

2. When the number of instances is larger than MaxAutoRetriesNextServer, round-robin or random policies work only some of the time, and a minimum-concurrency policy or a pinned-server policy (usually used to solve session-loss problems, i.e. the same client is always routed to a fixed server) does not work at all.

      Why? Suppose we have five machines providing the service. The first machine is healthy, and the second has the lowest concurrency.

      When servers two through five are down, round-robin is in use, and MaxAutoRetriesNextServer = 2, ribbon will only try the third and fourth servers. The outcome is self-evident: the request fails even though the first server is available. Of course, with luck the third or fourth server happens to be up and the request succeeds; a random policy likewise depends on luck.

      With a minimum-concurrency or pinned-server policy it is worse: no matter how many retries are made, the rule keeps choosing the same hung second node, so the request always fails.
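The failure mode above can be sketched with a toy simulation (hypothetical code, not ribbon's actual implementation): with a fixed retry budget, a round-robin chooser may or may not reach the healthy node, while a pinned-server chooser never leaves the dead one.

```java
import java.util.List;
import java.util.function.IntUnaryOperator;

public class RetryDemo {
    // true = server is up; server 0 is healthy, servers 1..4 are down (the scenario above)
    static final List<Boolean> UP = List.of(true, false, false, false, false);

    // Try the starting server, then up to maxNextServer replacement servers.
    static boolean requestSucceeds(IntUnaryOperator chooser, int start, int maxNextServer) {
        int server = start;
        for (int attempt = 0; attempt <= maxNextServer; attempt++) {
            if (UP.get(server)) return true;      // request served
            server = chooser.applyAsInt(server);  // pick the "next" server and retry
        }
        return false;                             // retry budget exhausted
    }

    public static void main(String[] args) {
        IntUnaryOperator roundRobin = s -> (s + 1) % UP.size();
        IntUnaryOperator pinned = s -> s; // pinned-server rule: always the same node

        // Round-robin starting at dead server 1 with MaxAutoRetriesNextServer = 2
        // tries servers 1, 2, 3 - all dead - and fails although server 0 is up.
        System.out.println(requestSucceeds(roundRobin, 1, 2)); // false

        // With a budget of 4, round-robin eventually reaches healthy server 0.
        System.out.println(requestSucceeds(roundRobin, 1, 4)); // true

        // A pinned rule never leaves the dead node, whatever the budget.
        System.out.println(requestSucceeds(pinned, 1, 100)); // false
    }
}
```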
So, what’s the solution?


3. Dynamically setting MaxAutoRetriesNextServer

The root of these problems is that MaxAutoRetriesNextServer is hard-coded, while the number of servers may grow as cluster load grows (shrinking does not matter).

We can hardly edit the MaxAutoRetriesNextServer configuration every time a server is added. Since we don't want to touch the configuration, we set the value of MaxAutoRetriesNextServer dynamically instead.

Look at the retry source code, RibbonLoadBalancedRetryPolicy.java:

    public boolean canRetryNextServer(LoadBalancedRetryContext context) {
        //this will be called after a failure occurs and we increment the counter
        //so we check that the count is less than or equal to make sure
        //we try the next server the right number of times
        return nextServerCount <= lbContext.getRetryHandler().getMaxRetriesOnNextServer() && canRetry(context);
    }

You can see that the value of MaxAutoRetriesNextServer comes from DefaultLoadBalancerRetryHandler. But DefaultLoadBalancerRetryHandler does not expose a setter for MaxAutoRetriesNextServer either.

Tracing the source code up to where DefaultLoadBalancerRetryHandler is instantiated:

    @Bean
    public RibbonLoadBalancerContext ribbonLoadBalancerContext(ILoadBalancer loadBalancer,
            IClientConfig config, RetryHandler retryHandler) {
        return new RibbonLoadBalancerContext(loadBalancer, config, retryHandler);
    }

    @Bean
    public RetryHandler retryHandler(IClientConfig config) {
        return new DefaultLoadBalancerRetryHandler(config);
    }

It turns out that the DefaultLoadBalancerRetryHandler object can be reached through the RibbonLoadBalancerContext instance, and the RibbonLoadBalancerContext in turn can be obtained from SpringClientFactory. So we just need to create a new RetryHandler and set it back on the RibbonLoadBalancerContext.


1. Register IClientConfig as a Spring bean

    @Bean
    public IClientConfig ribbonClientConfig() {
        DefaultClientConfigImpl config = new DefaultClientConfigImpl();
        config.set(CommonClientConfigKey.ConnectTimeout, DEFAULT_CONNECT_TIMEOUT);
        config.set(CommonClientConfigKey.ReadTimeout, DEFAULT_READ_TIMEOUT);
        return config;
    }

2. Create a new RetryHandler and set it on the RibbonLoadBalancerContext

    private void setMaxAutoRetiresNextServer(int size) { // size: number of service instances in the cluster
        SpringClientFactory factory = SpringContext.getBean(SpringClientFactory.class); // get the Spring-managed singleton
        IClientConfig clientConfig = SpringContext.getBean(IClientConfig.class);
        int retrySameServer = clientConfig.get(CommonClientConfigKey.MaxAutoRetries, 0); // value from the configuration file, default 0
        boolean retryEnable = clientConfig.get(CommonClientConfigKey.OkToRetryOnAllOperations, false); // default false
        RetryHandler retryHandler = new DefaultLoadBalancerRetryHandler(retrySameServer, size, retryEnable); // new RetryHandler
        factory.getLoadBalancerContext(name).setRetryHandler(retryHandler);
    }
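How this method gets invoked is up to the caller. One minimal approach (a sketch; `onServerListRefresh` and the rebuild counter are hypothetical, not part of ribbon) is to call it on every server-list refresh but only rebuild the RetryHandler when the cluster size actually changed:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RetryBudgetSync {
    private int lastKnownSize = -1;
    final AtomicInteger rebuilds = new AtomicInteger(); // counts handler rebuilds (for illustration)

    // Call on every server-list refresh; rebuild the RetryHandler only when
    // the cluster size changed, to avoid churning the load-balancer context.
    public void onServerListRefresh(int clusterSize) {
        if (clusterSize != lastKnownSize) {
            // here the article's setMaxAutoRetiresNextServer(clusterSize) would run
            rebuilds.incrementAndGet();
            lastKnownSize = clusterSize;
        }
    }
}
```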

That solves the problem of setting MaxAutoRetriesNextServer dynamically.


4. Eliminating unavailable services

Eureka seems to have the ability to remove and restore service instances, so if you use a Eureka registry you may not need to read further; I am not sure about the exact configuration.

Because we do not use Eureka, the list of servers consulted during failure retries still contains the ones that are down.

This is what breaks the minimum-concurrency and pinned-server policies.

Tracing the source code, we found that canRetryNextServer is called whenever a server fails, so that method is the natural place to hook in.


Define a custom RetryPolicy that extends RibbonLoadBalancedRetryPolicy and overrides canRetryNextServer:

public class ServerRibbonLoadBalancedRetryPolicy extends RibbonLoadBalancedRetryPolicy {

    private RetryTrigger trigger;

    public ServerRibbonLoadBalancedRetryPolicy(String serviceId, RibbonLoadBalancerContext context, ServiceInstanceChooser loadBalanceChooser, IClientConfig clientConfig) {
        super(serviceId, context, loadBalanceChooser, clientConfig);
    }

    public void setTrigger(RetryTrigger trigger) {
        this.trigger = trigger;
    }

    @Override
    public boolean canRetryNextServer(LoadBalancedRetryContext context) {
        boolean retryEnable = super.canRetryNextServer(context);
        if (retryEnable && trigger != null) {
            trigger.exec(context); // callback trigger
        }
        return retryEnable;
    }

    public interface RetryTrigger {
        void exec(LoadBalancedRetryContext context);
    }
}


Define a custom RetryPolicyFactory that extends RibbonLoadBalancedRetryPolicyFactory and overrides the create method:

public class ServerRibbonLoadBalancedRetryPolicyFactory extends RibbonLoadBalancedRetryPolicyFactory {

    private SpringClientFactory clientFactory;
    private ServerRibbonLoadBalancedRetryPolicy policy;
    private ServerRibbonLoadBalancedRetryPolicy.RetryTrigger trigger;

    public ServerRibbonLoadBalancedRetryPolicyFactory(SpringClientFactory clientFactory) {
        super(clientFactory);
        this.clientFactory = clientFactory;
    }

    @Override
    public LoadBalancedRetryPolicy create(String serviceId, ServiceInstanceChooser loadBalanceChooser) {
        RibbonLoadBalancerContext lbContext = this.clientFactory.getLoadBalancerContext(serviceId);
        policy = new ServerRibbonLoadBalancedRetryPolicy(serviceId, lbContext, loadBalanceChooser, clientFactory.getClientConfig(serviceId));
        policy.setTrigger(trigger);
        return policy;
    }

    public void setTrigger(ServerRibbonLoadBalancedRetryPolicy.RetryTrigger trigger) {
        policy.setTrigger(trigger); // we don't know which call happens first, so the trigger is set in both places
        this.trigger = trigger;
    }
}


Register the LoadBalancedRetryPolicyFactory as a Spring bean:

    @Bean
    @ConditionalOnClass(name = "")
    public LoadBalancedRetryPolicyFactory loadBalancedRetryPolicyFactory(SpringClientFactory clientFactory) {
        return new ServerRibbonLoadBalancedRetryPolicyFactory(clientFactory);
    }

Then we implement the RetryTrigger interface on our load-balancer rule class:

public class ServerLoadBalancerRule extends AbstractLoadBalancerRule implements ServerRibbonLoadBalancedRetryPolicy.RetryTrigger {

    private static final Logger LOGGER = LoggerFactory.getLogger(ServerLoadBalancerRule.class);

    /** Unavailable servers, keyed by request batchNo */
    private Map<String, List<String>> unreachableServer = new HashMap<>(256);

    /** Tag of the last request */
    private String lastRequest;

    @Autowired
    LoadBalancedRetryPolicyFactory policyFactory;

    @Override
    public Server choose(Object key) {
        retryTrigger(); // initialize the retry trigger
        return getServer(getLoadBalancer(), key);
    }

    private Server getServer(ILoadBalancer loadBalancer, Object key) {
        // filter out the services recorded in unreachableServer
        // (selection logic omitted)
    }

    private void retryTrigger() {
        RequestContext ctx = RequestContext.getCurrentContext();
        String batchNo = (String) ctx.get(Constant.REQUEST_BATCH_NO);
        if (!isLastRequest(batchNo)) {
            // a new request rather than a retry of the same one:
            // clean up all cached unavailable services
            unreachableServer.clear();
        }
        if (policyFactory instanceof ServerRibbonLoadBalancedRetryPolicyFactory) {
            ((ServerRibbonLoadBalancedRetryPolicyFactory) policyFactory).setTrigger(this);
        }
    }

    private boolean isLastRequest(String batchNo) {
        return batchNo != null && batchNo.equals(lastRequest);
    }

    @Override
    public void exec(LoadBalancedRetryContext context) {
        RequestContext ctx = RequestContext.getCurrentContext();
        // batchNo is a UUID that stays the same across failure retries; each new
        // client request gets a fresh batchNo, which can be generated in a preFilter.
        String batchNo = (String) ctx.get(Constant.REQUEST_BATCH_NO);
        lastRequest = batchNo;

        List<String> hostAndPorts = unreachableServer.get(batchNo);
        if (hostAndPorts == null) {
            hostAndPorts = new ArrayList<>();
        }
        if (context != null && context.getServiceInstance() != null) {
            String host = context.getServiceInstance().getHost();
            int port = context.getServiceInstance().getPort();
            if (!hostAndPorts.contains(host + Constant.COLON + port)) {
                hostAndPorts.add(host + Constant.COLON + port);
            }
            unreachableServer.put(batchNo, hostAndPorts);
        }
    }
}


This way we collect the unavailable services, and on retry we filter out any server found in unreachableServer.

One thing to note: MaxAutoRetriesNextServer must still be set to the size of the unfiltered server list.
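The filtering step itself (elided in the rule class above) can be sketched as plain Java over "host:port" strings, matching how the exec callback records unreachable servers. This is a hypothetical helper, not code from the original project:

```java
import java.util.List;
import java.util.stream.Collectors;

public class ServerFilter {
    // Keep only servers not recorded as unreachable for this request batch.
    // Servers are identified as "host:port" strings, as in the exec() callback.
    public static List<String> filterReachable(List<String> allServers, List<String> unreachable) {
        return allServers.stream()
                .filter(s -> !unreachable.contains(s))
                .collect(Collectors.toList());
    }
}
```

Note that MaxAutoRetriesNextServer is still computed from `allServers.size()`, not from the filtered list.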


Of course, some will wonder what happens when there are many servers and the total retry time exceeds ReadTimeout. I do not extend the timeout here, because keeping a client waiting indefinitely is not a reasonable requirement.

So just set a sensible ReadTimeout in the configuration file. If no available server is reached within that window, the timeout is simply propagated back to the client.
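For instance, the timeout budget can sit alongside the retry settings in the same ribbon block (the values below are illustrative, not recommendations; tune them to the latency your clients can tolerate):

```yaml
ribbon:
  ConnectTimeout: 1000   # ms to establish a connection
  ReadTimeout: 3000      # ms per attempt; on expiry the timeout goes back to the client
```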

Source address:
