0x01 background
With the rapid growth of the company's business, we urgently needed a high-performance, highly reliable layer-4 load balancing device. Faced with expensive commercial equipment and solutions, we decided to build our own layer-4 load balancer (TVS). During technology selection we evaluated the open-source LVS project, but its performance and availability did not meet our expectations. Given the shortcomings of LVS and the needs of the company, TVS includes many optimizations that greatly improve performance, platform stability, security, and ease of use.
TVS supports two forwarding models, FULLNAT and tunnel, to handle different business scenarios. This article covers the FULLNAT mode.
0x02 introduction to load balancing
In the early days of the Internet, enterprise traffic was small and reliability requirements were low, so a single server was enough to provide external services.
Later, as load and reliability requirements grew, DNS was used to resolve different server IPs to different clients. However, when a server fails, DNS caching at every level means that even after the authoritative DNS record is changed, the change does not take effect on clients in time, so some clients' access is affected. Moreover, DNS load-balancing algorithms are relatively simple and cannot meet service requirements.
So, without relying on DNS round-robin resolution, how do we balance load across back-end servers? This is where a load balancing device comes in. Multiple servers are mounted behind the load balancer, which distributes traffic to different servers using a weighted round-robin algorithm and performs health checks; when a back-end server fails, the load balancer automatically removes it, keeping the business highly available.
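A minimal sketch of a weighted round-robin pick is shown below. The real_server structure and the smooth-WRR variant used here are assumptions for illustration only, not TVS's published scheduler.

```c
/* Minimal sketch of smooth weighted round-robin selection; the structure
 * and field names are hypothetical, not TVS's actual implementation. */
#include <stddef.h>

struct real_server {
    const char *ip;
    int weight;          /* configured weight */
    int current_weight;  /* running value used by the algorithm */
};

struct real_server *wrr_select(struct real_server *pool, size_t n)
{
    struct real_server *best = NULL;
    int total = 0;

    for (size_t i = 0; i < n; i++) {
        pool[i].current_weight += pool[i].weight;
        total += pool[i].weight;
        if (best == NULL || pool[i].current_weight > best->current_weight)
            best = &pool[i];
    }
    if (best != NULL)
        best->current_weight -= total;  /* spreads picks evenly over time */
    return best;
}
```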
The load balancer discussed here is a layer-4 device: when it receives traffic, it forwards it to different servers based on the scheduling algorithm and the session table. The industry uses several forwarding models for load balancing; their respective advantages and disadvantages are compared below.
Since TVS is mainly deployed in FULLNAT mode, the FULLNAT forwarding process is described step by step below; a minimal sketch of the address rewriting follows the list.
1. The client sends a request to the VIP.
2. TVS selects a real server and a local IP using the scheduling algorithm.
3. A new session is created, recording the client IP, VIP, RS IP, and local IP.
4. The packet is rewritten: the client IP is replaced with the local IP, and the VIP with the RS IP.
5. The real server receives the packet, processes it, and returns a response.
6. When TVS receives the response, it looks up the session.
7. The packet is rewritten back: the RS IP is replaced with the VIP, and the local IP with the client IP.
8. The response is sent back to the client.
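The following is a minimal sketch of the FULLNAT address rewriting described above. The session and pkt structures and their field names are purely illustrative; TVS's actual data structures are not published.

```c
/* Minimal sketch of FULLNAT address rewriting; structures are illustrative. */
#include <stdint.h>

struct session {
    uint32_t client_ip, vip;      /* client-facing addresses */
    uint32_t local_ip, rs_ip;     /* server-facing addresses */
    uint16_t client_port, vport, local_port, rs_port;
};

struct pkt { uint32_t sip, dip; uint16_t sport, dport; };

/* Inbound: client -> VIP is rewritten to local IP -> RS IP. */
static void fullnat_in(const struct session *s, struct pkt *p)
{
    p->sip = s->local_ip;  p->sport = s->local_port;
    p->dip = s->rs_ip;     p->dport = s->rs_port;
}

/* Outbound: RS IP -> local IP is rewritten back to VIP -> client IP. */
static void fullnat_out(const struct session *s, struct pkt *p)
{
    p->sip = s->vip;       p->sport = s->vport;
    p->dip = s->client_ip; p->dport = s->client_port;
}
```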
The load balancer plays a critical role in the whole business flow, so we designed TVS as a forwarding platform with high reliability and high performance.
0x03 overall structure
TVS uses two clusters in the same data center that act as primary and standby for each other. TVS is divided into cluster A and cluster B, which operate independently. Within a cluster, four machines share the load and back each other up; at the same time, the two clusters back each other up. The topology is as follows:
Main features of TVS:
1. Per-core session tables
2. Synchronization of established and long-lived connection sessions
3. Dynamic addition and removal of TVS servers within a cluster
4. Hot switching of business traffic between TVS clusters
0x04 high availability between clusters
TVS peers with the upstream switch over BGP. Both clusters announce the same VIP (VIP1) at the same time, but cluster B applies a route policy that lowers the priority of its announcement, while cluster A keeps the default priority. All traffic is therefore processed by cluster A; when cluster A fails, traffic automatically switches to cluster B. TVS does not synchronize sessions between clusters.
0x05 high availability in cluster
Within a TVS cluster, high availability is achieved through ECMP routing, long- and short-mask route announcements, and session synchronization. If one or more servers fail, the business is not affected.
1. High availability through ECMP load sharing
The four machines in the TVS cluster announce the VIP at the same time with the same priority, forming an ECMP route on the switch. The switch hashes traffic by source IP + destination IP. When a server fails, the number of ECMP next hops shrinks, but all traffic is still sent to the remaining machines in the cluster, so no traffic is lost.
However, when TVS1 fails, the ECMP route entries change and the switch rehashes. Even though TVS2 is healthy, sessions that were on TVS2 may be rescheduled to another machine, causing unnecessary jitter.
Ideally, only the traffic of the failed TVS should be evenly redistributed to the remaining three machines, while traffic already on the healthy machines stays where it is. To achieve this, consistent hashing must be used for the ECMP group; a minimal sketch is shown below.
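Below is a minimal sketch of a consistent (rendezvous-style) hash over ECMP members, assuming a fixed-size slot table keyed by source IP + destination IP. Real switches implement this in hardware; the structures and hash constants here are arbitrary and purely illustrative.

```c
/* Minimal sketch of consistent hashing over ECMP members: removing one
 * member only remaps the slots it owned, leaving other flows untouched. */
#include <stdint.h>

#define RING_SLOTS 4096   /* fixed-size slot table; each slot maps to a member */

struct ecmp_ring {
    int member[RING_SLOTS];   /* index of the TVS machine owning the slot */
};

/* Rebuild the slot table with rendezvous (highest-random-weight) hashing. */
void ring_build(struct ecmp_ring *r, const int *alive, int n_members)
{
    for (int slot = 0; slot < RING_SLOTS; slot++) {
        uint32_t best_score = 0;
        int best = -1;
        for (int m = 0; m < n_members; m++) {
            if (!alive[m])
                continue;
            /* score depends only on (slot, member), so it is stable
             * across membership changes */
            uint32_t score = ((uint32_t)slot * 2654435761u) ^ ((uint32_t)m * 40503u);
            score *= 2654435761u;
            if (best < 0 || score > best_score) {
                best = m;
                best_score = score;
            }
        }
        r->member[slot] = best;
    }
}

/* Map a flow (sip, dip) to a member via its slot. */
int ring_lookup(const struct ecmp_ring *r, uint32_t sip, uint32_t dip)
{
    uint32_t h = (sip ^ dip) * 2654435761u;
    return r->member[h % RING_SLOTS];
}
```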
2. High availability through long- and short-mask routing
A real server receives the packet translated by TVS1, processes it, and sends back a response. Under normal circumstances, the response must return to TVS1 so that the forward and return paths are the same. To ensure this, TVS does the following:
• Each TVS is assigned its own local IP
• Each TVS announces its local IP through BGP
When TVS1 fails, the local IP route announced by TVS1 disappears, and the real server's response can no longer be forwarded back, even if session synchronization has been done.
The routing table follows the longest-prefix-match principle. Therefore, in addition to announcing its own long-mask (host) local IP route, each TVS also announces a short-mask route covering the local IP range. For example (addresses purely illustrative), TVS1 might announce 10.10.1.1/32 while every machine in the cluster also announces 10.10.1.0/24. When TVS1 is healthy, traffic from the RS back to TVS matches the long-mask route and returns to TVS1. When TVS1 fails, the real server's response matches the short-mask route and reaches another TVS; since the cluster has already synchronized sessions, traffic that was originally on TVS1 can still be forwarded normally.
3. Session synchronization model
Session synchronization within the cluster is done core to core: new sessions created on core 1 are synchronized to core 1 of the other TVS machines, and so on for each core.
The RSS hash key must be identical on every TVS in the cluster. If the keys differ, then when TVS1 fails and its traffic lands on TVS3, RSS may not steer the packet to the expected CPU, the session cannot be found, and forwarding fails. Therefore, we modified the ixgbe driver to replace the function that randomly generates the RSS hash key.
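For illustration, a fixed RSS key can also be pinned from user space through DPDK, as sketched below. This is an alternative to the driver patch rather than TVS's actual change; the macro names follow older DPDK releases and the key bytes are arbitrary.

```c
/* Minimal sketch: configuring an identical RSS hash key cluster-wide via
 * DPDK; macro names (ETH_MQ_RX_RSS, ETH_RSS_IP) are from older releases. */
#include <rte_ethdev.h>

/* 40-byte key for the 82599; the same bytes must be used on every TVS. */
static uint8_t cluster_rss_key[40] = {
    0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
    0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
    0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
    0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
    0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
};

int configure_rss(uint16_t port_id, uint16_t nb_rx_q, uint16_t nb_tx_q)
{
    struct rte_eth_conf conf = {
        .rxmode = { .mq_mode = ETH_MQ_RX_RSS },
        .rx_adv_conf.rss_conf = {
            .rss_key     = cluster_rss_key,
            .rss_key_len = sizeof(cluster_rss_key),
            .rss_hf      = ETH_RSS_IP,   /* hash on source/destination IP */
        },
    };
    /* With the same key everywhere, RSS maps a given flow to the same
     * queue (and therefore the same core) on every TVS in the cluster. */
    return rte_eth_dev_configure(port_id, nb_rx_q, nb_tx_q, &conf);
}
```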
0x06 high performance implementation
1. Reduce context switching caused by scheduling
Normally, Linux manages all CPU resources: when a task is busy, the scheduler migrates work to other cores. As a result, ordinary Linux tasks directly affect packet forwarding and session creation performance, and the context switches caused by scheduling consume considerable resources.
Therefore, the cores used by TVS are isolated from the Linux scheduler, minimizing the impact of Linux tasks on TVS. At the same time, TVS worker threads are pinned to specific cores through CPU affinity, so they do not interfere with each other.
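A minimal sketch of pinning a worker thread to an isolated core is shown below. It assumes the core has already been removed from the general scheduler (for example via the isolcpus= kernel parameter), and the worker function is a placeholder.

```c
/* Minimal sketch: bind a worker thread to one isolated core so the Linux
 * scheduler never migrates it. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    /* the packet-processing loop would run here */
    return NULL;
}

int start_worker_on_core(int core_id)
{
    pthread_t tid;
    cpu_set_t set;

    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;

    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    /* Pin the worker to exactly one core. */
    if (pthread_setaffinity_np(tid, sizeof(set), &set) != 0) {
        fprintf(stderr, "failed to pin thread to core %d\n", core_id);
        return -1;
    }
    return 0;
}
```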
2. Speed up memory access
Memory access speed directly determines program performance. TVS makes each CPU access its nearest NUMA node, makes full use of hardware prefetching, and reduces cache misses, improving memory access efficiency.
NUMA-aware data structures: TVS runs on dual-socket servers with two CPUs. Each CPU accesses its local memory fastest; accessing remote memory across QPI is significantly slower. In packet receive/transmit scenarios, frequently accessing data across QPI causes a 15%-20% performance loss.
Hugepages: data structures allocated from hugepages stay physically contiguous, so CPU prefetching works well during traversal and lookup. At the same time, TLB/MMU lookups are reduced, lowering memory access cost and cache misses.
Cache-line alignment: per-core data structures must be aligned to cache lines. CPU reads and writes maintain cache coherence through the MESI protocol; if core 1 and core 2 share data that sits in the same cache line, they compete for it, which reduces prefetch efficiency and increases access time.
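The three points above can be combined as sketched below, assuming DPDK's allocator; the per_core_stats structure and its fields are illustrative only.

```c
/* Minimal sketch of NUMA-local, hugepage-backed, cache-line-aligned
 * per-core state using DPDK allocators. */
#include <stdint.h>
#include <rte_common.h>
#include <rte_memory.h>
#include <rte_malloc.h>
#include <rte_lcore.h>

/* Align the per-core structure so two cores never share a cache line. */
struct per_core_stats {
    uint64_t rx_pkts;
    uint64_t tx_pkts;
    uint64_t sessions;
} __rte_cache_aligned;

struct per_core_stats *alloc_core_stats(void)
{
    unsigned lcore  = rte_lcore_id();
    int      socket = rte_lcore_to_socket_id(lcore);

    /* rte_zmalloc_socket() carves the object out of hugepage memory on the
     * requested NUMA node, so access stays local and physically contiguous. */
    return rte_zmalloc_socket("per_core_stats",
                              sizeof(struct per_core_stats),
                              RTE_CACHE_LINE_SIZE,
                              socket);
}
```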
3. Bypass the kernel
Receiving and sending packets through the Linux kernel has serious performance bottlenecks: interrupt-driven packet reception, CPU copies of packets, the Netfilter framework, and a lengthy protocol stack all directly hurt forwarding performance, and debugging in kernel mode is troublesome.
Therefore, TVS uses Linux's UIO mechanism to map memory shared between user space and the kernel. Packets are copied by DMA directly into this shared memory, where the user-space forwarding program can read and write them directly, cutting out steps in the receive/transmit path and improving efficiency. Polling-based packet reception and DMA packet copies also eliminate most interrupt and CPU-copy overhead.
Based on these requirements, TVS chose Intel DPDK as an efficient and stable packet I/O platform.
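A minimal DPDK poll-mode receive/forward loop might look like the sketch below; port and queue setup, as well as the actual TVS processing logic, are omitted.

```c
/* Minimal sketch of a DPDK poll-mode receive/forward loop. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32

void rx_loop(uint16_t port, uint16_t queue)
{
    struct rte_mbuf *pkts[BURST];

    for (;;) {
        /* Poll the NIC queue; packets already sit in hugepage mbufs via DMA,
         * so no interrupt and no kernel copy is involved. */
        uint16_t n = rte_eth_rx_burst(port, queue, pkts, BURST);

        for (uint16_t i = 0; i < n; i++) {
            /* session lookup + FULLNAT rewrite would happen here */
        }

        if (n > 0) {
            uint16_t sent = rte_eth_tx_burst(port, queue, pkts, n);
            /* Free any packets the TX queue could not accept. */
            for (uint16_t i = sent; i < n; i++)
                rte_pktmbuf_free(pkts[i]);
        }
    }
}
```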
4. Eliminate lock contention
TVS adopts a per-core session design: each core has its own session table, so there is no contention between cores and every core's performance is fully used. A global session table has limitations: the more cores there are, the fiercer the lock contention and the worse the performance.
A session is stored in only one session table. In FULLNAT mode, if only RSS is used to distribute traffic, the client IP -> VIP flow and the RS IP -> local IP flow of the same session land on different cores; the session cannot be found and forwarding fails.
To ensure that a session's request and response traffic lands on the same core, TVS uses the Flow Director (FDIR) feature of the Intel 82599 NIC. Each core is assigned its own local IPs; when a session is created, the local IP is chosen only from that core's pool. The NIC's FDIR rules are configured so that packets destined to a given local IP are delivered to the core it was assigned to. This guarantees that request and response traffic of a session is processed on the same core.
FDIR has higher priority than RSS: if a packet matches an FDIR rule, it is delivered directly to the corresponding receive queue; if it matches no FDIR rule, it falls through to RSS for distribution. This both spreads request traffic evenly across all cores and steers response traffic back to the designated core, killing two birds with one stone. A minimal sketch of the per-core local IP selection is shown below.
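Below is a minimal sketch of the per-core local IP pool and the local-IP-to-core mapping that the FDIR rules encode. The structures and sizes are illustrative, and the actual 82599 FDIR rule programming (one rule per local IP, with the owning core's RX queue as the action) is omitted.

```c
/* Minimal sketch of per-core local IP pools for FULLNAT session creation;
 * FDIR rules (programmed elsewhere) steer each local IP back to its core. */
#include <stdint.h>

#define MAX_CORES          16
#define LOCAL_IPS_PER_CORE  4

struct local_ip_pool {
    uint32_t ip[LOCAL_IPS_PER_CORE];
    uint32_t next;                  /* round-robin cursor within the pool */
};

static struct local_ip_pool pools[MAX_CORES];

/* Pick a local IP for a new session; only this core's pool is used, so the
 * response to that local IP can be steered back to this core by FDIR. */
uint32_t pick_local_ip(unsigned core_id)
{
    struct local_ip_pool *p = &pools[core_id];
    uint32_t ip = p->ip[p->next % LOCAL_IPS_PER_CORE];
    p->next++;
    return ip;
}
```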
0x07 summary
The existing TVS architecture fully combines our network architecture with business requirements and fully meets the company's current needs. But TVS still has room to improve, for example: support for more protocols in the ALG module, better scheduling algorithms, lower forwarding latency (moving toward hardware forwarding), and preliminary scrubbing of malicious traffic.
References:
1. http://dpdk.org/
2. https://github.com/alibaba/LVS/
3. https://www.intel.com/content/www/us/en/embedded/products/networking/82599-10-gbe-controller-datasheet.html