Using load shedding to avoid overload

Posted by lipsius at 2020-02-19

For several years, I worked on the Service Frameworks team at Amazon. Our team wrote tools that helped the owners of AWS services such as Amazon Route 53 and Elastic Load Balancing build their services more quickly, and helped service clients call those services more easily. Other Amazon teams provided service owners with functionality such as metering, authentication, monitoring, client library generation, and documentation generation. Instead of each service team having to integrate those features manually, the Service Frameworks team did that integration once and exposed the functionality to each service through configuration.

One challenge we faced was determining how to provide sensible defaults, especially for settings related to performance or availability. For example, we could not easily set a default client-side timeout, because our framework had no idea what the latency characteristics of an API call might be. Service owners and clients would not have found this any easier to work out themselves, so we kept trying, and gained some useful insights along the way.

One common question we struggled with was the default number of connections the server would allow clients to have open at the same time. This setting was meant to prevent a server from taking on too much work and becoming overloaded. More specifically, we wanted to configure the server's maximum connections setting in proportion to the maximum connections setting on the load balancer. This was before the days of Elastic Load Balancing, so hardware load balancers were in widespread use.

We set out to help Amazon service owners and service clients determine the ideal maximum connections value to set on the load balancer, and the corresponding value to set in the frameworks we provided. We decided that if we could work out how a human would reason their way to a choice, we could write software to emulate that reasoning.

Determining the ideal value turned out to be very difficult. When maximum connections was set too low, the load balancer might cut off increases in the number of requests even when the service had plenty of capacity. When it was set too high, servers would become slow and unresponsive. When it was set just right for a particular workload, the workload would shift or the performance of a dependency would change. Then the value would be wrong again, resulting in unnecessary outages or overloads.

In the end, we found that the maximum connections concept was too imprecise to solve the problem completely. This article describes other approaches, such as load shedding, that have worked well for us.

The anatomy of overload

At Amazon, we avoid overload by designing our systems to scale proactively before they run into overload. However, protecting a system takes several layers. It starts with automatic scaling, but also includes mechanisms to shed excess load gracefully and, most importantly, continuous testing.

When we load test our services, we find that the latency of a server at low utilization is lower than its latency at high utilization. Under heavy load, thread contention, context switching, garbage collection, and I/O contention become more pronounced. Eventually, services reach an inflection point where their performance starts to degrade even more rapidly.

The theory behind this observation is known as the Universal Scalability Law, which is derived from Amdahl's law. It states that although a system's throughput can be increased through parallelization, it is ultimately limited by the throughput of the points of serialization (that is, by the tasks that cannot be parallelized).
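As a rough illustration of this law, the following Python sketch evaluates its usual textbook form, X(N) = lam * N / (1 + sigma * (N - 1) + kappa * N * (N - 1)). The function name and the coefficient values are hypothetical and chosen only to show the shape of the curve; setting kappa to zero reduces the formula to Amdahl's law.

```python
def usl_throughput(n, lam=100.0, sigma=0.05, kappa=0.001):
    """Throughput predicted by the Universal Scalability Law at concurrency n.

    lam   -- throughput of a single worker (hypothetical value)
    sigma -- contention coefficient: the serialized fraction of the work
    kappa -- coherency coefficient: cost of workers coordinating with each other
    """
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


if __name__ == "__main__":
    # Throughput rises, peaks, and then falls as concurrency keeps growing.
    for n in (1, 8, 32, 64, 128):
        print(n, round(usl_throughput(n), 1))
```

With these made-up coefficients the curve peaks at a concurrency of roughly 30 and declines afterward, which is the degradation under overload that the rest of this article is concerned with.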

Unfortunately, throughput is not only bounded by the system's resources; it usually degrades when the system is overloaded. When a system is given more work than its resources support, it becomes slow. Computers take on work even when they are overloaded, but they spend more and more of their time context switching, and become too slow to be useful.

In a distributed system where a client talks to a server, the client usually loses patience after some amount of time and stops waiting for the server to respond. This duration is known as a timeout. When a server becomes so overloaded that its latency exceeds the client's timeout, requests begin to fail. The following graph shows how server response time increases as offered throughput (in transactions per second) increases, until it reaches an inflection point where performance degrades rapidly.

When the response time exceeds the client timeout, it is clear that things are bad, but the preceding graph does not show just how bad. To illustrate this, we can plot client-perceived availability alongside latency. Instead of a generic response time measurement, we can use median response time. The median means that 50 percent of requests were faster than that value. If the service's median latency equals the client timeout, half of the requests time out, so availability is 50 percent. This is where a latency increase turns a latency problem into an availability problem, as the following graph shows.

Unfortunately, this graph is hard to read. A simpler way to describe the availability problem is to distinguish goodput from throughput. Throughput is the total number of requests per second sent to the server. Goodput is the subset of the throughput that is handled without errors and with low enough latency for the client to make use of the response.
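To make the distinction concrete, here is a minimal Python sketch that computes throughput, goodput (the effective part of the throughput), and client-perceived availability for one measurement window. The function name, the record layout, and the sample numbers are assumptions made for illustration only.

```python
from statistics import median


def goodput_report(latencies_s, errors, client_timeout_s, window_s):
    """Summarize one measurement window.

    latencies_s      -- server response time of every request, in seconds
    errors           -- parallel list of booleans, True if the server returned an error
    client_timeout_s -- how long the client waits before giving up
    window_s         -- length of the window in seconds
    """
    total = len(latencies_s)
    # A request counts toward goodput only if it succeeded AND arrived in time.
    good = sum(
        1
        for latency, failed in zip(latencies_s, errors)
        if not failed and latency <= client_timeout_s
    )
    return {
        "throughput_rps": total / window_s,
        "goodput_rps": good / window_s,
        "availability": good / total if total else 1.0,
        "median_latency_s": median(latencies_s) if total else 0.0,
    }


if __name__ == "__main__":
    # Half of the requests are slower than the 1-second client timeout.
    print(goodput_report([0.2, 0.4, 1.5, 2.0], [False] * 4, 1.0, 1.0))
```

In the sample call, the server answered all four requests without an error, yet only two of them were useful to the client, so goodput is half of throughput.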

Positive feedback loops

The insidious part of an overload is how it amplifies itself in a feedback loop. When a client times out, it is bad enough that the client got an error. What is even worse is that all the progress the server made on that request goes to waste. And the last thing a system should do in an overload, when capacity is constrained, is waste work.

Making matters worse, clients often retry their requests. This multiplies the offered load on the system. If a service-oriented architecture has a deep enough call graph (that is, a client calls a service, which calls other services, which call other services), and each layer performs a number of retries, then an overload in the bottom layer causes cascading retries that amplify the offered load exponentially.
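The arithmetic behind that exponential growth can be sketched in a few lines of Python. The function name is hypothetical, and the model assumes the worst case: every layer exhausts all of its retries because the bottom layer keeps failing.

```python
def worst_case_amplification(retries_per_layer, layers):
    """Number of calls that can reach the bottom service for one original request.

    Each layer makes 1 initial attempt plus `retries_per_layer` retries, and each
    of those attempts triggers the same behavior in the layer below it.
    """
    attempts = 1 + retries_per_layer
    return attempts ** layers


if __name__ == "__main__":
    # With 2 retries at each of 5 layers, one user request can become 243 calls.
    print(worst_case_amplification(2, 5))
```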

When these factors combine, an overload creates its own feedback loop, which results in overload as a steady state.

Preventing work from going to waste

On the surface, load shedding is simple. When a server approaches overload, it should start rejecting excess requests so that it can focus on the requests it decides to let in. The goal of load shedding is to keep latency low for the requests the server decides to accept, so that the service replies before the client times out. With this approach, the server maintains high availability for the requests it accepts, and only the availability of the excess traffic is affected.

Keeping latency in check by shedding excess load makes the system more available. However, the benefits of this approach are hard to see in the previous graph. The overall availability line still drifts downward, which looks bad. The key difference is that the requests the server decided to accept remain available, because they were served quickly. Load shedding lets a server maintain its goodput and complete as many requests as it can, even as offered throughput increases. However, shedding load is not free, so eventually the server falls prey to Amdahl's law and goodput drops.
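One simple way to picture this is a server-side limit on the number of requests being processed at once. The Python sketch below is an assumption-laden illustration, not the mechanism used by any particular Amazon service: the class name, the `max_in_flight` parameter, and the 503-style status code are all hypothetical.

```python
import threading


class LoadShedder:
    """Rejects work once max_in_flight requests are already being processed."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()
        self.rejected = 0  # counted so that shedding stays visible in metrics

    def try_acquire(self):
        with self._lock:
            if self._in_flight >= self._max:
                self.rejected += 1
                return False
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight -= 1


def handle(shedder, work):
    """Run work() if admitted; otherwise fail fast with a 503-style response."""
    if not shedder.try_acquire():
        return 503, None  # cheap rejection: no work was started
    try:
        return 200, work()
    finally:
        shedder.release()
```

The rejection path does as little as possible, which is the point: the accepted requests keep their low latency, and only the excess traffic sees errors.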


Testing

When I talk to other engineers about load shedding, I like to point out that if they have not load tested their service to the point where it breaks, and far beyond that point, they should assume the service will fail in the least desirable way possible. At Amazon, we spend a great deal of time load testing our services. Generating graphs like the ones described earlier in this article helps us establish a baseline for overload performance and track how it changes as we change our services.

There are multiple types of load tests. Some ensure that a fleet scales automatically as load increases, whereas others use a fixed fleet size. If, in an overload test, a service's availability degrades rapidly to zero as throughput increases, that is a good sign that the service needs additional load shedding mechanisms. The ideal load test result is for goodput to plateau when the service is close to being fully utilized, and to remain flat even as more throughput is applied.

Tools like Chaos Monkey help perform chaos engineering tests on services. For example, they can overburden the CPU or introduce packet loss to simulate the conditions of an overload. Another technique we use is to take an existing load generation test or canary, drive sustained load (instead of increasing load) at a test environment, and begin removing servers from that environment. This increases the offered throughput per instance, so it tests instance capacity. This technique of artificially increasing load by shrinking the fleet is useful for testing a service in isolation, but it is not a complete substitute for full load tests. A full, end-to-end load test also increases load on the service's dependencies, which can uncover other bottlenecks.

During tests, we measure client-perceived availability and latency in addition to server-side availability and latency. When client-side availability begins to decrease, we push the load far beyond that point. If load shedding is working, goodput remains steady even as offered throughput increases well beyond the scaled capacity of the service.

Overload testing is crucial before exploring mechanisms to avoid overload. Each mechanism introduces complexity. For example, consider all the configuration options in the service frameworks I mentioned at the beginning of the article, and how hard it was to get the defaults right. Each mechanism for avoiding overload adds different protections and has limited effectiveness. Through testing, a team can detect their system's bottlenecks and determine the combination of protections they need to handle overload.


Visibility

At Amazon, regardless of which techniques we use to protect our services from overload, we think carefully about the metrics and visibility we need when these overload countermeasures take effect.

When brownout protection rejects a request, that rejection reduces the availability of the service. When the service gets it wrong and rejects a request despite having capacity (for example, when the maximum number of connections is set too low), it generates a false positive. We strive to keep a service's false positive rate at zero. If a team finds that their service's false positive rate is regularly nonzero, either the service is tuned too sensitively, or individual hosts are constantly and genuinely overloaded, which may indicate a scaling or load balancing problem. In cases like this, we might have application performance tuning to do, or we might switch to larger instance types that handle load imbalances more gracefully.

In terms of visibility, when load shedding rejects requests, we make sure we have the instrumentation to know who the client was, which operation they were calling, and any other information that helps us tune our protection measures. We also use alarms to detect whether the countermeasures are rejecting a significant volume of traffic. When there is a brownout, our priority is to add capacity and address the current bottleneck.

There is another subtle but important consideration around visibility in load shedding. We have found that it is important not to pollute our services' latency metrics with the latency of failed requests. After all, the latency of shedding a request should be extremely low compared to other requests. For example, if a service is shedding 60 percent of its traffic, its median latency might look excellent even if the latency of its successful requests is terrible, because the fast-failing requests cause it to be under-reported.
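The effect is easy to reproduce with made-up numbers. In the Python sketch below, the record format, the outcome labels, and the latency values are all hypothetical; the only point is that the median over all requests hides what successful requests experience.

```python
from statistics import median


def latency_summary(records):
    """records: list of (latency_ms, outcome) with outcome "ok" or "shed".

    Assumes at least one "ok" record is present.
    """
    ok = [latency for latency, outcome in records if outcome == "ok"]
    everything = [latency for latency, _ in records]
    return {
        "median_all_ms": median(everything),  # polluted by fast rejections
        "median_ok_ms": median(ok),           # what accepted requests experienced
        "shed_fraction": 1 - len(ok) / len(everything),
    }


if __name__ == "__main__":
    # 60% of requests are shed in about 1 ms; the accepted 40% take 900 ms.
    records = [(1, "shed")] * 6 + [(900, "ok")] * 4
    print(latency_summary(records))
```

Here the combined median is 1 ms while the median of successful requests is 900 ms, which is why the two populations are worth recording as separate metrics.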

Load shedding effects on automatic scaling and Availability Zone failure

If misconfigured, load shedding can disable reactive automatic scaling. Consider a service that is configured for CPU-based reactive scaling and also has load shedding configured to reject requests at a similar CPU target. In this case, the load shedding system reduces the number of requests to keep CPU load low, and reactive scaling either receives a delayed signal to launch new instances or never receives one at all.
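A configuration check along the following lines could catch this kind of conflict before deployment. This is only a sketch: the function name and the `headroom_pct` safety gap are assumptions introduced here, not values from any real scaling policy.

```python
def shedding_hides_scaling_signal(scale_out_cpu_pct, shed_cpu_pct, headroom_pct=10.0):
    """Return True if load shedding is likely to mask the scale-out signal.

    scale_out_cpu_pct -- CPU target at which reactive scaling adds instances
    shed_cpu_pct      -- CPU level at which the server starts rejecting requests
    headroom_pct      -- hypothetical safety gap required between the two
    """
    # If shedding starts at or near the scaling target, CPU never climbs high
    # enough (or long enough) for the scaling policy to react.
    return shed_cpu_pct < scale_out_cpu_pct + headroom_pct


if __name__ == "__main__":
    print(shedding_hides_scaling_signal(70, 70))  # thresholds collide
    print(shedding_hides_scaling_signal(60, 85))  # scaling can react first
```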

We are also careful to consider load shedding logic when we set automatic scaling limits for handling Availability Zone failures. Services are scaled to the point where an Availability Zone's worth of capacity can become unavailable while our latency goals are still met. Amazon teams often look at system metrics such as CPU utilization to approximate how close a service is to its capacity limit. However, with load shedding, a fleet might run much closer to the point at which requests would be rejected than system metrics indicate, and might not have the excess capacity provisioned to handle an Availability Zone failure. With load shedding, we need to be extra careful to test our services to breakage, so that we understand our fleet's capacity and headroom at any point in time.

In fact, we can use load shedding to save costs by shaping off-peak, non-critical traffic. For example, a fleet that handles website traffic might decide that search crawler traffic is not worth the expense of being scaled for full Availability Zone redundancy. However, we are very cautious with this approach. Not all requests cost the same, and proving that a service should provide Availability Zone redundancy for human traffic while shedding excess crawler traffic requires careful design, continuous testing, and buy-in from the business. And if a service's clients do not know that the service is configured this way, its behavior during an Availability Zone failure could look like a massive critical availability drop instead of non-critical load shedding. For this reason, in a service-oriented architecture, we try to push this kind of shaping as early as possible (for example, into the service that receives the initial request from the client) instead of trying to make global prioritization decisions throughout the stack.

Load shedding mechanisms

When discussing load shedding and unpredictable scenarios, it is also important to focus on the many predictable conditions that lead to brownout. At Amazon, services maintain enough excess capacity to handle Availability Zone failures without having to add more capacity. They use throttling to ensure fairness among clients.

However, despite these protections and operational practices, a service has a certain amount of capacity at any point in time, and can therefore become overloaded for a variety of reasons. These include unexpected surges in traffic, sudden loss of fleet capacity (from bad deployments or otherwise), and clients shifting from cheap requests (such as cached reads) to expensive ones (such as cache misses or writes). When a service becomes overloaded, it has to finish the requests it has taken on; that is, it has to protect itself from brownout. In the rest of this section, we discuss some of the considerations and techniques we have used over the years to manage overload.

Understanding the cost of dropping requests

We make sure to load test our services far beyond the point where goodput plateaus. One of the main reasons for this is to ensure that when we drop requests during load shedding, the cost of dropping a request is as small as possible. We have seen how easy it is to miss an accidental log statement or socket setting that makes dropping a request far more expensive than it needs to be.

In rare cases, quickly dropping a request can be more expensive than holding on to it. In these cases, we slow down rejected requests to match (at a minimum) the latency of successful responses. However, it is important to do this only when the cost of holding on to requests is as low as possible (for example, when they are not tying up an application thread).

Prioritizing requests

When a server is overloaded, it has an opportunity to triage incoming requests and decide which to accept and which to turn away. The most important request a server receives is a ping request from a load balancer. If the server does not respond to ping requests in time, the load balancer stops sending new requests to that server for a period of time, and the server sits idle. In a brownout, the last thing we want is to reduce the size of our fleet. Beyond ping requests, prioritization options vary from service to service.

Consider a web service that provides data to render Amazon's website. A service call that supports rendering a web page for a search index crawler is likely to be less critical than a request that originates from a human. Crawler requests are important to serve, but ideally they can be shifted to off-peak times. However, in a complex environment such as Amazon's website, where a large number of services cooperate, conflicting prioritization heuristics across services could affect system-wide availability and lead to wasted work.

Prioritization and throttling can be used together to avoid strict throttling ceilings while still protecting a service from overload. At Amazon, in cases where we allow clients to burst above their configured throttle limits, the excess requests from those clients might be prioritized lower than within-quota requests from other clients. We spend a lot of time on placement algorithms that minimize the probability of burst capacity being unavailable. However, given these tradeoffs, we favor the predictable provisioned workload over the unpredictable one.
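A minimal sketch of this kind of triage follows. The request classes and the utilization cut-offs are hypothetical values picked for illustration; a real service would derive its own classes and thresholds from load testing.

```python
# Hypothetical utilization levels (0.0 to 1.0) above which each class is shed.
# Health checks are never shed, so the load balancer keeps the server in service.
SHED_ABOVE = {
    "health_check": float("inf"),
    "human": 0.95,
    "over_quota": 0.85,  # clients bursting above their throttle limit
    "crawler": 0.80,
}


def admit(request_class, utilization):
    """Decide whether to accept a request at the current utilization."""
    return utilization < SHED_ABOVE[request_class]


if __name__ == "__main__":
    for cls in SHED_ABOVE:
        print(cls, admit(cls, 0.9))
```

As utilization climbs, the lowest-priority traffic is turned away first, and requests from humans are only rejected when the server is nearly saturated.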

Keeping an eye on the clock

If a service is partway through serving a request and notices that the client has timed out, it can skip the rest of the work and fail the request at that point. Otherwise, the server keeps working on the request, and its late reply is ignored. From the server's perspective, it returned a successful response. But from the perspective of the client that timed out, it was an error.

One way to avoid this wasted work is for clients to include a timeout hint in each request, telling the server how long they are willing to wait. The server can evaluate these hints and drop doomed requests at little cost.

A timeout hint can be expressed as an absolute time or as a duration. Unfortunately, servers in distributed systems are notoriously bad at agreeing on the exact current time. The Amazon Time Sync Service compensates by synchronizing the clocks of Amazon Elastic Compute Cloud (Amazon EC2) instances with a fleet of redundant satellite-controlled and atomic clocks in each AWS Region. Well-synchronized clocks are also important at Amazon for logging. Comparing two log files from servers with out-of-sync clocks makes troubleshooting even harder than it already is.

The other way to watch the clock is to measure a duration on a single machine. Servers are good at measuring elapsed time locally because they do not need to reach consensus with other servers. Unfortunately, expressing timeouts as durations has its problems too. For one, the timer must be monotonic and must not go backward when the server synchronizes with the Network Time Protocol (NTP). A far more difficult problem is that to measure a duration, the server has to know when to start the stopwatch. In extreme overload scenarios, huge volumes of requests can queue up in TCP buffers, so by the time the server reads a request from its buffer, the client has already timed out.

Whenever systems at Amazon express client timeout hints, we try to apply them transitively. Where a service-oriented architecture includes multiple hops, we propagate the remaining time between hops, so that a downstream service at the end of the call chain knows how much time it has for its response to be useful.
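The following Python sketch shows one way a duration-based timeout hint could be tracked with a monotonic clock and passed on to the next hop. The `Deadline` class, the `call_downstream` helper, the `timeout_hint_s` parameter, and the network margin are all hypothetical names and values, not part of any Amazon framework.

```python
import time


class Deadline:
    """Tracks the time remaining for a request using a monotonic clock."""

    def __init__(self, timeout_s, clock=time.monotonic):
        # time.monotonic never goes backward, even if NTP adjusts the wall clock.
        self._clock = clock
        self._expires_at = clock() + timeout_s

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())

    def expired(self):
        return self.remaining() <= 0.0


def call_downstream(deadline, send, network_margin_s=0.01):
    """Forward the remaining time as a timeout hint, or give up early.

    send -- hypothetical callable that performs the downstream request
    """
    budget = deadline.remaining() - network_margin_s
    if budget <= 0:
        # The client has (almost) given up already; doing the call is wasted work.
        raise TimeoutError("deadline exceeded before calling downstream")
    return send(timeout_hint_s=budget)
```

The stopwatch in this sketch starts when the `Deadline` object is created, so it cannot account for time a request spent in socket buffers beforehand, which is the limitation described above.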

Finishing what was started

We don't want useful work to go to waste, especially in an overload. Throwing away work creates a positive feedback loop that increases the overload, because clients often retry a request if the service does not respond in time. When that happens, one resource-consuming request turns into many, multiplying the load on the service. When clients time out and retry, they often stop waiting for a reply on their first connection while they make a new request on a separate connection. If the server finishes the first request and replies, the client might not be listening, because it is now waiting for a reply to its retried request.

This wasted-work problem is why we try to design services to perform bounded work. Where we expose an API that can return a large dataset, we expose it as an API that supports pagination. These APIs return partial results and a token that the client can use to request more data. We have found it easier to estimate the additional load on a service when the server handles requests that have an upper bound on memory, CPU, and network bandwidth. It is very difficult to perform admission control when a server has no idea what it will take to process a request.
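A paginated API of this shape can be sketched in a few lines of Python. The function name, the `MAX_PAGE_SIZE` cap, and the use of a plain offset as the token are simplifying assumptions; a real service would typically hand out an opaque token.

```python
MAX_PAGE_SIZE = 100  # hypothetical upper bound on the work done per request


def list_items(items, max_results=MAX_PAGE_SIZE, next_token=None):
    """Return at most MAX_PAGE_SIZE items plus a token for the next page."""
    # Clamp the page size so a client cannot request unbounded work.
    size = max(1, min(max_results, MAX_PAGE_SIZE))
    start = int(next_token) if next_token is not None else 0
    page = items[start:start + size]
    end = start + len(page)
    return {
        "items": page,
        "next_token": str(end) if end < len(items) else None,
    }


if __name__ == "__main__":
    data = list(range(250))
    token, pages = None, 0
    while True:
        response = list_items(data, next_token=token)
        pages += 1
        token = response["next_token"]
        if token is None:
            break
    print(pages)  # number of calls needed to read all 250 items
```

Because every call touches at most `MAX_PAGE_SIZE` items, the cost of a single request has a known ceiling regardless of how large the dataset grows.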

A more subtle opportunity to prioritize requests lies in how clients use a service's APIs. For example, suppose a service has two APIs: start() and end(). To finish their work, clients need to be able to call both. In this case, the service should prioritize end() requests over start() requests. If it prioritized start(), clients would not be able to complete the work they had started, resulting in brownouts.

Pagination is another place to watch for wasted work. If a client has to make several sequential requests to page through results from a service, sees a failure after page N-1, and throws away the results, it wastes N-2 service calls and any retries it performed along the way. This suggests that, just as end() requests take priority over start() requests, requests for subsequent pages should be prioritized ahead of first-page requests. It also underscores why we design services to perform bounded work and not to paginate endlessly through a service that they call during a synchronous operation.

Watching out for queues

It is also helpful to look at request duration when managing internal queues. Many modern service architectures use in-memory queues to connect thread pools that process requests during various stages of work. A web service framework with an executor is likely to have a queue configured in front of it. With any TCP-based service, the operating system maintains a buffer for each socket, and those buffers can hold a huge volume of pent-up requests.

Load balancers might also queue incoming requests or connections when services are overloaded, using a feature called surge queues. These queues can lead to brownout, because when a server finally receives a request, it has no idea how long the request has been queued. A generally safe default is a spillover configuration, which fast-fails excess requests instead of queuing them. At Amazon, this lesson was built into the next generation of the Elastic Load Balancing (ELB) service. The Classic Load Balancer used a surge queue, but the Application Load Balancer rejects excess traffic. Regardless of configuration, teams at Amazon monitor the relevant load balancer metrics, such as surge queue depth or spillover count, for their services.
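The same two ideas, failing fast when a queue is full and checking how long an item waited before working on it, can be applied to an in-memory queue inside a service. The Python sketch below is illustrative only; the class name and the `max_depth` and `max_age_s` parameters are assumptions.

```python
import queue
import time


class SheddingQueue:
    """Bounded queue that spills over when full and drops stale entries."""

    def __init__(self, max_depth, max_age_s, clock=time.monotonic):
        self._q = queue.Queue(maxsize=max_depth)
        self._max_age_s = max_age_s
        self._clock = clock
        self.spilled = 0  # rejected because the queue was full
        self.expired = 0  # dropped because they waited too long

    def offer(self, request):
        """Enqueue the request, or fail fast when the queue is full."""
        try:
            self._q.put_nowait((self._clock(), request))
            return True
        except queue.Full:
            self.spilled += 1
            return False

    def poll(self):
        """Return the next request that is still fresh, or None if none is left."""
        while True:
            try:
                enqueued_at, request = self._q.get_nowait()
            except queue.Empty:
                return None
            if self._clock() - enqueued_at <= self._max_age_s:
                return request
            self.expired += 1  # the client has probably timed out already
```

Recording the enqueue time next to each request is what makes the age check possible; both counters are kept so that the shedding remains visible in metrics.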

In our experience, the importance of watching out for queues cannot be overstated. I am often surprised to find in-memory queues where I did not think to look for them, in the systems and libraries I depend on. When I dig into a system, I find it helpful to assume there are queues somewhere that I do not know about yet. Of course, overload testing provides more useful information than digging into code, as long as I can come up with realistic test cases.

Protecting against overload in lower layers

Services are made up of several layers, from load balancers, to operating systems with netfilter and iptables capabilities, to service frameworks, to the code itself. Each layer provides some capability to protect the service.

At the beginning of this article, I described a challenge from my time on the Service Frameworks team. We were trying to give Amazon teams a recommended default connection limit to configure on their load balancers. In the end, we suggested that teams set maximum connections high on their load balancer and proxy, and let the server implement more accurate load shedding algorithms using local information. However, it was also important that the maximum connections value not exceed the number of listener threads, listener processes, or file descriptors on a server, so that the server had the resources to handle critical health check requests from the load balancer.

Operating system features for limiting server resource usage are powerful and can be helpful in emergencies. Because we know that overload can happen, we prepare for it with the right runbooks and specific commands at the ready. The iptables utility can put an upper bound on the number of connections the server will accept, and can reject excess connections far more cheaply than any server process. It can also be configured with more sophisticated controls, such as allowing new connections at a bounded rate, or allowing only a limited connection rate or count per source IP address. Source IP filters are powerful, but they do not apply with traditional load balancers. However, an ELB Network Load Balancer preserves the source IP address of the caller, even at the operating system layer, through network virtualization, so iptables rules such as source IP filters work as expected.
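The idea of admitting new connections at a bounded rate is commonly modeled as a token bucket. The Python sketch below illustrates that model at the application level; it is not how iptables is configured or implemented, and the class name and parameters are hypothetical.

```python
import time


class TokenBucket:
    """Admits events at a sustained rate, with a limited burst allowance."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self._rate = rate_per_s
        self._capacity = float(burst)
        self._tokens = float(burst)
        self._clock = clock
        self._last = clock()

    def allow(self):
        now = self._clock()
        # Refill in proportion to the elapsed time, never beyond the burst size.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    # One bucket per source IP address gives a per-client connection rate limit.
    buckets = {}
    source_ip = "192.0.2.10"  # documentation-range address used as a placeholder
    bucket = buckets.setdefault(source_ip, TokenBucket(rate_per_s=2, burst=2))
    print([bucket.allow() for _ in range(3)])
```

Enforcing the same limit in the operating system, as described above, is cheaper than doing it in the server process, because rejected connections never reach application code.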

Protecting in layers

In some cases, a server runs out of resources even to reject requests without slowing down. With this in mind, we look at all the hops between a server and its clients to see how they can cooperate to help shed excess load. For example, several AWS services include load shedding options by default. With Amazon API Gateway, we can configure a maximum request rate that any API will accept. When a service is fronted by API Gateway, an Application Load Balancer, or Amazon CloudFront, we can configure AWS WAF to shed excess traffic along a number of dimensions.

Visibility creates a difficult tension. Early rejection is important because it is the cheapest place to drop excess traffic, but it comes at a cost to visibility. This is why we protect in layers: we let a server take on more than it can work on and drop the excess, and log enough information to know what traffic it is dropping. Because there is only so much traffic a server can drop, we rely on the layer in front of it to protect it from extreme volumes of traffic.

Thinking about overload differently

In this article, we discussed how the need to shed load arises from the fact that systems become slower as they take on more concurrent work, as forces such as resource limits and contention kick in. The overload feedback loop is driven by latency, which ultimately results in wasted work, an amplified request rate, and even more overload. It is important to sidestep this force, described by the Universal Scalability Law and Amdahl's law, by shedding excess load and maintaining predictable, consistent performance in the face of overload. Focusing on predictable, consistent performance is a key design principle on which services at Amazon are built.

For example, Amazon DynamoDB is a database service that offers predictable performance and availability at scale. Even if a workload bursts quickly and exceeds the provisioned resources, DynamoDB maintains predictable goodput latency for that workload. Features such as DynamoDB auto scaling, adaptive capacity, and on-demand react quickly to increase goodput rates and adapt to an increase in workload. During that time, goodput remains steady, which keeps the services in the layers above DynamoDB predictable as well, and improves the stability of the whole system.

AWS Lambda provides an even broader example of the focus on predictable performance. Each API call runs in its own execution environment with a consistent amount of compute resources allocated to it, and that execution environment works on only one request at a time. This differs from a server-based paradigm, where a given server works on multiple API calls.

Isolating each API call to its own independent resources (compute, memory, disk, and network) circumvents Amdahl's law in some ways, because the resources of one API call do not contend with those of another. Therefore, if throughput exceeds goodput, goodput remains flat instead of dropping as it does in a more traditional server-based environment. This is not a panacea, because dependencies can slow down and cause concurrency to go up. However, in this scenario, at least the types of on-host resource contention discussed in this article do not apply.

This resource isolation is a somewhat subtle but important benefit of modern, serverless compute environments such as AWS Fargate, Amazon Elastic Container Service (Amazon ECS), and AWS Lambda. At Amazon, we have found that it takes a lot of work to implement load shedding, from tuning thread pools to picking the perfect configuration for maximum load balancer connections. Sensible defaults for these configurations are hard or impossible to find, because they depend on the unique operational characteristics of each system. These newer, serverless compute environments provide lower-level resource isolation and expose higher-level knobs, such as throttling and concurrency controls, to protect against overload. In some ways, instead of chasing the perfect default configuration value, we can sidestep that configuration altogether and protect against categories of overload without any configuration at all.

About the author

David Yanacek is a Senior Principal Engineer working on AWS Lambda. David has been a software developer at Amazon since 2006, previously working on Amazon DynamoDB and AWS IoT, as well as internal web service frameworks and fleet operations automation systems. One of David's favorite activities at work is performing log analysis and sifting through operational metrics to find ways to make systems run more and more smoothly over time.

Related content

Timeouts, retries, and backoff with jitter
Implementing health checks
Instrumenting distributed systems for operational visibility