Platform practice of nginx at JD (Jingdong)

Posted by barello at 2020-03-04

Author introduction

Wu Jianmiao currently works at the JD Finance Hangzhou R&D center, where he is responsible for the nginx and MQ projects. He has a strong interest in developing and optimizing high-performance servers and welcomes discussion (WeChat: wujm1230).

(The author has authorized the operation and maintenance team to publish this article.)

Nginx is an excellent HTTP and reverse proxy server that is widely used across JD's departments. In practice, however, it commonly runs into the following problems:

Configuration is complex and requires specialist knowledge.

Configuration files cannot be modified in batches, and configuration changes depend on a restart.

Different applications depend on different modules and configuration items, which makes management confusing.

The nginx instances of one application cannot be scaled out quickly in batches.

The root of all these problems is that nginx is a stand-alone system. Although it is modular and high-performance, every one of these problems is greatly magnified in a scenario like JD's, with large-scale nginx and business clusters. To address this, we designed and developed Jen (JD Extended Nginx). So far, Jen covers most of the core businesses of JD Finance, such as treasure taking bar, card supermarket, Baitiao, etc.

1ใ€ Overall structure

Figure 1: Jen structure

As shown in the figure above, operations staff configure everything through the web console. For configurations such as shunting and current limiting, the information is stored in the database, and the rules take effect once nginx has synchronized them through the RESTful API. For heavyweight operations such as smooth upgrades and restarts, the web console drives Ansible to operate nginx accordingly.

Figure 2: deployment of nginx and the web console across multiple machine rooms

Jen features:

Supports nginx auto-discovery, group management and status monitoring.

Provides a unified entry point: abstracted configuration simplifies operating and controlling the nginx cluster throughout its life cycle, with support for configuring rules and executing operations in batches.

Extends native nginx with shunting (traffic splitting) and current limiting (rate limiting). Rules are synchronized to memory in real time, with no need to modify configuration files or restart the nginx process.

1. Basic information

Everything displayed and operated on in the web console is derived from basic information, which falls into two main types:

Group information (business line, application, computer room, nginx IP)

Nginx attributes, such as upstream information, server name and listen port, which mainly come from the information nginx reports (via heartbeat) after reading nginx.conf.

Jen supports two ways of populating group information:

Call the RESTful API of an external service to import the complete basic information.

Manually group and edit the automatically discovered nginx instances.

Figure 3: relationship between groups

As shown in the figure above, grouping has four layers: business line, application, computer room and nginx. In a large-scale cluster, these relationships together with the nginx attributes support batch execution of all operations, such as batch modification of configuration files and batch upgrades and restarts, which frees up operations staff.
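To illustrate how the four-layer grouping can drive batch operations, here is a minimal Python sketch. The `NginxNode` type, the `select` helper and the sample values are hypothetical and only mirror the four layers named above; they are not Jen's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NginxNode:
    # The four grouping layers from the text (field names are assumptions).
    business_line: str
    application: str
    room: str
    ip: str


def select(nodes, **criteria):
    """Return the nodes whose fields equal every given criterion."""
    return [
        node for node in nodes
        if all(getattr(node, name) == value for name, value in criteria.items())
    ]


nodes = [
    NginxNode("finance", "baitiao", "room-a", "10.0.0.1"),
    NginxNode("finance", "baitiao", "room-b", "10.0.0.2"),
    NginxNode("finance", "cards", "room-a", "10.0.0.3"),
]

# Pick every nginx of one application in one computer room as a batch target.
targets = select(nodes, application="baitiao", room="room-a")
```

A batch operation would then iterate over `targets` instead of requiring the operator to list IPs by hand.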

2. Rule acquisition

After the user configures rules in the web console, a fully asynchronous module that we implemented on the nginx side periodically fetches from the web console the rules that belong to the current nginx. The rules are stored in memory and take effect immediately.

a) Each process stores its own copy of the rule information, which avoids lock contention caused by sharing resources among processes.

b) A version number guarantees the strict ordering of rules and heartbeats, so network factors such as packet loss and delay cannot cause version confusion. Moreover, when the rules have not changed, nginx does not need to repeatedly parse a large amount of rule information and waste CPU.
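The role of the version number can be sketched as follows. This is a minimal Python illustration, not Jen's nginx module: the `RuleStore` class and the response layout (`version`, `rules`) are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RuleStore:
    """Per-process copy of the rules (hypothetical structure)."""
    version: int = 0
    rules: dict = field(default_factory=dict)
    parse_count: int = 0  # how many payloads were actually applied

    def apply(self, response: dict) -> bool:
        """Apply one poll response; return True if the rules were replaced."""
        incoming = response["version"]
        # A delayed, duplicated or unchanged response carries a version that
        # is not newer than ours, so it is dropped without touching the rules.
        if incoming <= self.version:
            return False
        self.rules = dict(response["rules"])
        self.version = incoming
        self.parse_count += 1
        return True


store = RuleStore()
store.apply({"version": 2, "rules": {"/pay": "limit 100/s"}})
store.apply({"version": 1, "rules": {"/pay": "limit 10/s"}})   # stale, ignored
store.apply({"version": 2, "rules": {"/pay": "limit 100/s"}})  # unchanged, ignored
```

Because only a strictly newer version is applied, a late-arriving old response cannot overwrite newer rules, and an unchanged version costs one integer comparison rather than a full parse.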

3. Security

Jen supports three types of roles, each with different operation permissions (the default is the ordinary user role, which has no write permission). Every operation by any role on the web console is recorded, and an entry point for multi-dimensional operation log queries is provided to make auditing easy.

4. Monitoring

We collect and display a fairly comprehensive set of monitoring information, including:

a) Tengine's active health check module is extended to support average and current latency statistics for upstream servers.

b) Nginx liveness is monitored by maintaining a heartbeat with the web console.

c) TCP connection information, inbound and outbound traffic, QPS, counts of 1xx to 5xx responses and other metrics are monitored.

The above monitoring information supports grouped statistics (business line, application, computer room) and large-screen display, so that the relevant people (business and operations staff) can monitor application status in real time.

Two. Shunting (traffic splitting)

Concept: based on request characteristics (IP, or any keyword in the header), specific requests can be routed to one or more designated upstream servers, as shown in the following figure:

Figure 4: example of shunt

Shunting is mainly suited to gray (canary) releases, A/B testing and similar scenarios. We also extended the shunting function so that the web console can start and stop an upstream server with one click. When an application server needs maintenance or an upgrade, user requests can still be served normally.
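The matching logic behind shunting can be sketched in a few lines of Python. The `ShuntRule` type, its fields and the `route` function are hypothetical names for illustration; Jen implements this inside nginx, and the sketch assumes header names are already normalized.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShuntRule:
    upstreams: list                # servers that receive matching requests
    cidr: Optional[str] = None     # match on client IP range, if set
    header: Optional[str] = None   # match on a header, if set
    keyword: str = ""              # substring the header value must contain

    def matches(self, client_ip, headers):
        if self.cidr is not None:
            network = ipaddress.ip_network(self.cidr)
            if ipaddress.ip_address(client_ip) not in network:
                return False
        if self.header is not None:
            value = headers.get(self.header)
            if value is None or self.keyword not in value:
                return False
        return True


def route(rules, default_upstreams, client_ip, headers):
    """Return the upstreams of the first matching rule, else the default pool."""
    for rule in rules:
        if rule.matches(client_ip, headers):
            return rule.upstreams
    return default_upstreams


rules = [
    ShuntRule(upstreams=["10.0.0.9:8080"], cidr="192.168.1.0/24"),
    ShuntRule(upstreams=["10.0.0.10:8080"], header="User-Agent", keyword="beta"),
]
default_pool = ["10.0.0.1:8080", "10.0.0.2:8080"]
```

For a gray release, the rule's upstreams would be the servers running the new version, while all other traffic keeps going to the default pool.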

Three. Current limiting (rate limiting)

During big promotions such as JD's 618, shoppers fill their carts in advance and wait for the flash sale at midnight. In engineering terms, QPS is very low one second and very high the next. Heavy traffic means high machine load. If one or two machines of an application cannot hold up, the whole application cluster can avalanche.

Current limiting should not be done blindly. First, choose an appropriate algorithm (leaky bucket or token bucket) according to business characteristics. Second, evaluate the limiting parameters comprehensively, taking into account historical traffic, application capacity, marketing intensity and other factors. Finally, decide how to respond gracefully to users who are limited.
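As a reference point for the token bucket algorithm mentioned above, here is a minimal single-process Python sketch. The class and parameter names are illustrative, and the clock is passed in explicitly so the behaviour is deterministic; it does not reflect Jen's shared-memory implementation.

```python
class TokenBucket:
    """Token bucket: tokens refill at `rate` per second up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full, so an initial burst is allowed
        self.last = now

    def allow(self, now, cost=1.0):
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=2, capacity=3)
burst = [bucket.allow(now=0.0) for _ in range(4)]  # three pass, the fourth is limited
later = bucket.allow(now=0.5)                      # one token has been refilled
```

The capacity bounds how large a burst is tolerated, while the rate bounds sustained throughput. A leaky bucket would instead smooth the burst into a constant outflow, which is why the choice depends on business characteristics.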

Native nginx keeps state consistent across processes by storing the intermediate limiting information in shared memory. In Jen's initial design, the plan was to handle current limiting the same way as shunting: each process would store its own copy of the limiting rules and limit traffic only within that process. That approach would inevitably cause the following problems:

Each process would limit only its own traffic, and the inconsistent information would make the overall limit inaccurate.

For limits keyed on something like user ID, given JD's very large number of daily users, each process would need to allocate enough memory to avoid frequently evicting red-black tree nodes in the limiting algorithm, so the memory used by nginx would multiply with the number of processes.

Our approach:

Shared memory is allocated dynamically: nginx sets up a block of shared memory when it receives a current limiting rule.

Rules are shared: once a rule takes effect it is synchronized to all processes in real time, and an old version of a rule is deleted only after the traffic currently using it has moved to the new version, as shown in the following figure:

Figure 5: rule chain
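One way to picture the rule chain is with per-version reference counts. The Python sketch below is an assumption about how "delete old versions only after current traffic has moved on" could work; the source does not describe the mechanism, and the real implementation lives in nginx shared memory. All names are hypothetical.

```python
class RuleVersion:
    def __init__(self, version, rules):
        self.version = version
        self.rules = rules
        self.in_flight = 0  # requests currently using this version


class RuleChain:
    """Chain of rule versions, oldest first; the newest is always kept."""

    def __init__(self):
        self.chain = []

    def publish(self, version, rules):
        self.chain.append(RuleVersion(version, rules))
        self._reclaim()

    def acquire(self):
        """A new request always binds to the newest version."""
        if not self.chain:
            raise RuntimeError("no rules published yet")
        node = self.chain[-1]
        node.in_flight += 1
        return node

    def release(self, node):
        node.in_flight -= 1
        self._reclaim()

    def _reclaim(self):
        # Drop old versions that no request is using any more.
        newest = self.chain[-1]
        self.chain = [n for n in self.chain if n is newest or n.in_flight > 0]

    def versions(self):
        return [n.version for n in self.chain]


chain = RuleChain()
chain.publish(1, {"/pay": 100})
old_request = chain.acquire()        # still running on version 1
chain.publish(2, {"/pay": 50})
during = chain.versions()            # version 1 is kept while it is in use
chain.release(old_request)
after = chain.versions()             # version 1 is reclaimed
```

In this sketch a request finishes on the version it started with, so an update never changes the rules underneath an in-flight request.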

We extend the current limiting function:

Custom error pages are supported. Besides returning a static nginx page, a 302 redirect to an error page is supported, and it can point to any external link according to the configuration in the web console. However, a 302 redirect has a problem: both the URL and the content in the user's browser change, so the user has to re-enter the URL and request again, or repeat the previous steps. The poor experience may cause the user to abandon the purchase and go elsewhere. We therefore use nginx's subrequest mechanism to change the returned content while keeping the URL unchanged, so that a user who has been limited only needs to refresh the page to retry.

Figure 6: comparison of two error pages

The extended algorithm supports blocking for a period of time after the limit is triggered. For example, when limiting by IP, once an IP has triggered the limit, it is denied access for a period of time without the algorithm being re-evaluated.

Blacklist and whitelist functions are implemented as well, which avoids current limiting hitting innocent users in some complex scenarios (such as limiting by IP behind a NAT network).
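The two extensions above, the block period and the blacklist and whitelist, can be combined in a short Python sketch. It uses a simple fixed-window counter as a stand-in for the limiting algorithm, which is a simplification chosen for brevity; the class name and parameters are hypothetical.

```python
class PenaltyLimiter:
    """Fixed-window limiter with whitelist, blacklist and a block period."""

    def __init__(self, max_per_window, window, penalty,
                 whitelist=(), blacklist=()):
        self.max_per_window = max_per_window
        self.window = window      # window length in seconds
        self.penalty = penalty    # block duration after the limit triggers
        self.whitelist = set(whitelist)
        self.blacklist = set(blacklist)
        self.window_start = {}
        self.counts = {}
        self.blocked_until = {}

    def allow(self, key, now):
        if key in self.whitelist:      # never limited, e.g. a NAT gateway IP
            return True
        if key in self.blacklist:      # always rejected
            return False
        until = self.blocked_until.get(key)
        if until is not None:
            if now < until:
                return False           # still blocked, no counting needed
            del self.blocked_until[key]
            self.window_start.pop(key, None)
        start = self.window_start.get(key)
        if start is None or now - start >= self.window:
            self.window_start[key] = now
            self.counts[key] = 0
        self.counts[key] += 1
        if self.counts[key] > self.max_per_window:
            self.blocked_until[key] = now + self.penalty
            return False
        return True


limiter = PenaltyLimiter(max_per_window=2, window=1.0, penalty=10.0,
                         whitelist={"10.0.0.1"}, blacklist={"6.6.6.6"})
first = [limiter.allow("1.2.3.4", t) for t in (0.0, 0.1, 0.2)]
blocked = limiter.allow("1.2.3.4", 5.0)    # window has passed, block has not
recovered = limiter.allow("1.2.3.4", 11.0)
```

During the block period the request is rejected with a single dictionary lookup, which is the saving the text refers to when it says the algorithm is not re-evaluated.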

4ใ€ Operation and maintenance characteristics

The operation and maintenance features mainly refer to the installation, upgrade, configuration file modification, startup and shutdown of nginx. The biggest difference between the operation and maintenance features and the previous introduction is that the operation needs to be restarted, so it is a more appropriate idea to combine the third-party tool ansible (compared with the operation and maintenance tools such as puppet, the migration cost of ansible is relatively small).

In actual production, ansible and web are deployed on the same PC to avoid single point cluster deployment. DB storage is used instead of ansible local file storage for relevant data. Through this simple transformation, ansible and web "suite" can be easily expanded.

Figure 7: logic diagram of automatic operation and maintenance

As shown in the figure above, users operate the web console, which drives Ansible to upgrade and restart nginx. The web console is the unified entry point for nginx operations, which is the main point of building a platform: you can give up SSH, shell scripts and even a separate monitoring system, and be self-sufficient within Jen.

Jen stores the configuration files of all nginx instances, obtained by active pull, user import or manual configuration on the page. This not only reduces the management confusion caused by each application relying on different configuration items, but also enables convenient extensions for configuration files, such as history tracing, configuration comparison, configuration reuse and operation rollback.

When an operation is performed on the page, the web console reads Ansible's standard output and displays it on the page in real time. To let users follow the progress in a friendly way, we made two optimizations to Ansible:

The standard output is enriched and broken down to each step as far as possible.

The standard output is formatted so that the web console can easily read and present it.

Nginx is deployed on a large scale in production, and a large-scale failure is the last thing we want to see. For reliability, Jen therefore provides several safeguards:

1. Three layers of checks ensure that a process is restarted or updated only when everything is correct. An error at any stage does not affect the online service.

a) The first layer validates the form when it is filled in on the web console.

b) The second layer checks on the target machine during the operation, for example by running nginx -t first.

c) The third layer checks after execution, for example whether the port is listening and whether the number of processes is as expected.

2. Gray execution

a) Nginx instances are operated on one at a time. Any exception immediately interrupts the run so that manual intervention can begin.

b) Batch execution by percentage is supported. For example, 10% of the nginx instances in a computer room are upgraded first.
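The gray execution described above can be sketched as follows in Python. The `batches` and `rollout` functions and the `operate` callback are hypothetical; in Jen the per-host step would be an Ansible run followed by the checks listed earlier.

```python
def batches(hosts, percent):
    """Split hosts into consecutive batches of `percent` percent each."""
    # Integer ceiling, with at least one host per batch.
    size = max(1, (len(hosts) * percent + 99) // 100)
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def rollout(hosts, percent, operate):
    """Operate on hosts batch by batch; stop at the first failure.

    `operate(host)` returns True on success. Returns (done, failed_host),
    where failed_host is None if every host succeeded.
    """
    done = []
    for batch in batches(hosts, percent):
        for host in batch:
            if not operate(host):
                return done, host  # interrupt and leave the rest untouched
            done.append(host)
    return done, None


hosts = ["10.0.0.%d" % i for i in range(1, 21)]
plan = batches(hosts, 10)  # 20 hosts at 10% gives batches of 2

# Simulate an upgrade that fails on one host.
done, failed = rollout(hosts, 10, lambda host: host != "10.0.0.5")
```

Stopping at the first failure keeps the blast radius to the hosts already touched, and the untouched remainder continues to serve traffic on the old version.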

Five. Summary

The above summarizes some of JD's practices in building an nginx platform. Jen provides a unified entry point for controlling the entire nginx life cycle and supports batch modification of rules that take effect immediately. We hope this practical experience is helpful to readers.
