Design and Practice of WeChat's Massive-Scale Data Monitoring

Posted by deaguero at 2020-02-28

This article is based on the GOPS 2017 Shanghai talk "Design and Practice of WeChat's Massive-Scale Data Monitoring".

About the author

Chen Xiaopeng joined Tencent in 2008 and moved to the WeChat operations development team in 2012, where he was responsible for overhauling the operations monitoring system. He is the principal designer and developer of WeChat's current operations monitoring system.


This article shares the design and practice of WeChat's operations monitoring system. To set the stage, consider the current state of the WeChat backend shown in the figure below: facing an enormous call volume and complex call chains, the system cannot be maintained by manpower alone; it depends on a comprehensive, stable, and fast operations monitoring system.

Our operations monitoring system has three main functions:

The first is fault alerting;

The second is fault analysis and localization;

The third is driving automation strategies.

Today's talk consists of three parts:

The first is lightweight collection of monitoring data;

The second is the evolution of WeChat's data monitoring;

The third is the design of data storage for massive-scale monitoring and analysis.

1. Lightweight Monitoring Data Collection

First, look at a common data collection pipeline. Typically, data is collected from logs, aggregated and packed locally, and then sent to a global server for summarization.

For WeChat, however, 200W (2 million; W denotes 万, 10,000) calls per minute generate 200 billion monitoring data reports per minute, and that may be a conservative estimate.

In the early days we reported custom text logs, but with so many services and backend modules the number of log formats grew too fast to maintain. Moreover, CPU, network, storage, and the statistics pipeline were all under heavy pressure, making it hard to guarantee the stability of the monitoring system itself.

To achieve stable minute-level, and even second-level, data monitoring, we carried out a series of changes.

Internally, monitoring data processing has two steps:

The first is data classification;

The second is a customized processing strategy for each class.

We classify our data into three kinds:

The first is real-time fault monitoring and analysis;

The second is non-real-time data statistics, such as business reports;

The third is single-user exception analysis — for example, when a user reports a failure, that user's failure must be analyzed individually.

Below, non-real-time data statistics and single-user exception analysis are introduced briefly, followed by a closer look at real-time monitoring data processing.

1.1 Non-real-time data

For non-real-time data, we have a configuration management page.

Before reporting, users apply for a logid plus user-defined data fields. Instead of writing log files, reporting uses a shared-memory queue with batch packaging, which reduces disk I/O and the load on the log servers. Distributed statistics is then used, which is common practice.

1.2 Single-user exception analysis

For single-user exception analysis we focus on exceptions, so the reporting path is similar to the non-real-time path above.

A fixed format is used: logid + fixed data fields (server IP, return code, etc.). The volume reported is much larger than the non-real-time logs, so we report by sampling. Besides storing the data in TDW distributed storage, we also forward it to a separate cache that serves queries.

1.3 Real-time monitoring data

Real-time monitoring data is the key part of this talk, and it accounts for the vast majority of the 200 billion logs reported per minute.

To monitor in different directions there are many types of real-time monitoring data, with different formats, sources, and statistical methods. To make monitoring fast and stable, we classify the data, simplify and unify the formats, and then apply the optimal processing strategy to each simplified class.

We see our data as falling into the following classes:

Backend data monitoring, covering the monitoring data of WeChat backend services;

Terminal data monitoring — besides the backend, we also care about the app's behavior on the device: performance, exceptions, and network anomalies;

External monitoring service — we now host services provided by external developers such as merchants and mini programs; both we and those developers need to know what anomalies exist between their services and WeChat, so we also provide monitoring as an external service.

1.3.1 Backend data monitoring

We divide backend monitoring data into four layers, each with its own format and reporting method:

1. Hardware-level monitoring: server load, CPU, memory, I/O, network traffic, etc.

2. Process state: memory consumed, CPU, I/O, etc.

3. Inter-module calls: call chains between modules and call information between modules and machines — one of the key data sources for fault localization.

4. Business metrics: data monitoring at the overall business level.

These different types of data are simplified into the following formats for easier processing.

The bottom two layers use an IP + key format (after containers appeared, containerid + IP + key).

Module call information is extracted at the module level and shares the ID + key format with the business metrics.

Let's focus on the idkey data. idkey was the key monitoring data early on, yet it accounts for more than 90% of everything reported. As mentioned, text logs cannot deliver stable, fast reporting at this volume, so we built a very simple and fast reporting scheme that summarizes directly in memory. The figure below shows the scheme.

Each machine allocates two blocks of shared memory, each formatted as a uint32[MAX_ID][MAX_KEY] array. Two blocks are used to make the periodic data collection (once every 6 seconds) convenient.

There are only three reporting operations: accumulate, set value, and set maximum. Each is a single operation on a uint32, so the performance cost is tiny. The biggest advantage is that data is summarized in memory in real time: each collection extracts only about 1,000 records on average, which makes second-level statistics much easier.
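To make the scheme concrete, here is a minimal sketch of an idkey table and its three update operations, written in Python rather than the shared-memory implementation used in production; the sizes and names (`MAX_ID`, `MAX_KEY`) are illustrative assumptions.

```python
from array import array

MAX_ID, MAX_KEY = 4, 8  # illustrative sizes; the real table is far larger

# A flat uint32 array standing in for one shared-memory block
# (format: uint32[MAX_ID][MAX_KEY]).
table = array("I", [0] * (MAX_ID * MAX_KEY))

def _idx(id_, key):
    return id_ * MAX_KEY + key

def accumulate(id_, key, value):
    # Accumulate: add to the cell, wrapping as a uint32 would.
    table[_idx(id_, key)] = (table[_idx(id_, key)] + value) & 0xFFFFFFFF

def set_value(id_, key, value):
    # Set new value: overwrite the cell.
    table[_idx(id_, key)] = value & 0xFFFFFFFF

def set_max(id_, key, value):
    # Set maximum: keep the larger of the current and new values.
    i = _idx(id_, key)
    if value > table[i]:
        table[i] = value & 0xFFFFFFFF

def collect():
    # Periodic collection (every 6 s in the real system): extract only the
    # non-zero cells, which are already summarized in memory.
    return {(i // MAX_KEY, i % MAX_KEY): v for i, v in enumerate(table) if v}
```

Because every report is a single word-sized write, millions of reports per minute collapse into a table that the collector can scan in one pass.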

Another important kind of backend data is call-relation data, which plays a central role in fault analysis and localization.

Its format is shown above; from it we can locate the fault point (machine, process, interface) and the scope of impact. Its volume is second only to idkey — every backend call generates one record — so logging to files is impractical.

Inside each service we use another shared-memory scheme similar to idkey: a service with N workers gives each worker two small blocks of shared memory for reporting, and a collection thread packs the data and sends it out.

Reporting is done by the framework layer, so service developers do not need to add reporting code by hand (99% of WeChat services use the internally developed service framework).
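As a rough illustration of this per-worker summarize-then-collect pattern (the field names are hypothetical, and plain dictionaries stand in for the real shared-memory blocks):

```python
from collections import Counter

class WorkerBuffer:
    """Stands in for one worker's small shared-memory block."""
    def __init__(self):
        self.counts = Counter()

    def report_call(self, caller, callee, interface, retcode):
        # Each backend call adds one record; summarizing by key in
        # memory keeps the volume far below raw log lines.
        self.counts[(caller, callee, interface, retcode)] += 1

def collect_and_pack(workers):
    # The framework's collection thread merges every worker's buffer
    # and sends the packed summary upstream.
    merged = Counter()
    for w in workers:
        merged.update(w.counts)
        w.counts.clear()  # buffers are reset after each collection
    return dict(merged)
```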

1.3.2 Terminal data monitoring

With the backend covered, let's turn to terminal monitoring data. Here we care about the performance and exceptions of the WeChat app on the phone, the latency and failures of its calls to the WeChat backend, and network anomalies.

The log data generated on mobile devices is enormous; reporting it in full would put heavy pressure on both the terminal and the backend, so we do not report in full.

We use different sampling configurations for different data types and client versions, and the backend periodically pushes sampling policies to the terminals.

When a sampled record is produced, the terminal does not send it immediately; it stores the record temporarily and sends batches at intervals, minimizing the impact on the device.
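A minimal sketch of this sample-then-batch behavior (the class and parameter names are invented for illustration; the talk does not describe the real policy format):

```python
import random

class TerminalReporter:
    """Sampled, batched reporting, roughly as described above."""
    def __init__(self, sample_rate, batch_size, send_fn, rng=None):
        self.sample_rate = sample_rate  # pushed down by the backend policy
        self.batch_size = batch_size
        self.send_fn = send_fn          # uploads one packed batch
        self.buffer = []                # temporary local storage
        self.rng = rng or random.Random()

    def record(self, event):
        # Sample first, then stage locally instead of sending in real time.
        if self.rng.random() < self.sample_rate:
            self.buffer.append(event)
            if len(self.buffer) >= self.batch_size:
                self.flush()

    def flush(self):
        # Pack and send everything staged so far in one upload.
        if self.buffer:
            self.send_fn(list(self.buffer))
            self.buffer.clear()
```

In production the flush would be driven by a timer as well as by buffer size, so that rare events still leave the device within a bounded interval.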

1.3.3 External monitoring service

Finally, a brief look at our newest external monitoring service, which borrows from some cloud monitoring schemes. Users can configure dimension information and monitoring rules themselves.

The feature is now available in our merchant management interface and in the mini program developer tools, but user-defined reporting is not yet open; only some fixed data items collected by the backend are provided.

2. The Evolution of WeChat Data Monitoring

Having described how data is reported, we now turn to how the data is monitored.

2.1 Anomaly detection

Generally speaking, three methods are commonly used for anomaly detection:

The first is a fixed threshold. Our curves differ greatly even between morning and evening, so no single threshold fits; for us this applies only to a few scenarios.

The second is comparison with the same time on previous days. The problem is that our data is not the same at the same time every day — Monday and Saturday differ greatly — so accuracy can only be maintained by lowering sensitivity.

The third is comparison with adjacent recent points. In our data, adjacent points are not stable, especially at small magnitudes; again, accuracy can only be maintained by lowering sensitivity.

Since none of these three common methods suits our scenario well, we went on to improve the algorithms.

The first improved algorithm uses the mean and standard deviation: we compute the mean and standard deviation from the data at the same time of day over the past month, using many days of data to absorb day-to-day jitter.

This algorithm applies to a wide range of curves, but for curves with large fluctuations its sensitivity is relatively low and it easily misses anomalies.
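A compact sketch of this detector, assuming one historical sample per day at the same minute and a 3-sigma cutoff (the exact cutoff used in production is not stated in the talk):

```python
import math

def anomaly_by_stddev(history, current, k=3.0):
    """Flag `current` if it deviates from the historical values at the
    same time of day by more than k standard deviations.

    `history` holds one sample per past day at this minute; using many
    days absorbs day-to-day jitter, as described above.
    """
    n = len(history)
    mean = sum(history) / n
    var = sum((x - mean) ** 2 for x in history) / n
    std = math.sqrt(var)
    if std == 0:
        return current != mean  # a perfectly flat history tolerates no change
    return abs(current - mean) > k * std
```

A noisy curve inflates the standard deviation, which widens the tolerated band — exactly the low-sensitivity weakness noted above.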

Our second improved algorithm is polynomial-fit prediction, which suits smooth curves — effectively an improved version of the adjacent-point comparison.

However, if the data rises or falls steadily with no abrupt change, it is judged normal, so gradual anomalies are underreported.
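As a sketch of the fit-and-predict idea — here a degree-1 least-squares fit in pure stdlib standing in for the polynomial fit actually used, with `tolerance` as an assumed parameter:

```python
def predict_next_linear(recent):
    """Least-squares line through the recent points; returns the value
    predicted for the next point. A degree-1 stand-in for the
    polynomial fit."""
    n = len(recent)
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(recent) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, recent))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope * n + intercept  # extrapolate one step ahead

def anomaly_by_fit(recent, current, tolerance):
    # Flag when the observation strays too far from the fitted trend.
    return abs(current - predict_next_linear(recent)) > tolerance
```

Note how a steady ramp fits the line perfectly, so a gradual drift is always judged normal — the underreporting weakness just described.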

So although the two algorithms above improve considerably on the earlier methods, they still have defects. We are currently trying other algorithms, as well as combinations of several algorithms.

2.2 Monitoring configuration

Beyond the algorithms themselves, configuring monitoring items is also a problem. With so many services, we would need to hand-configure more than 300,000 monitoring items — observing each curve, choosing an algorithm and a sensitivity — and after a while the data changes and everything needs readjusting. That is not sustainable.

We are now trying to configure monitoring items automatically: using historical data and historical exception samples, extracting features, classifying the data, and then automatically applying the optimal monitoring parameters. We have achieved some results here, but it is not yet perfect and is still improving.

3. Data Storage Design for Massive-Scale Monitoring and Analysis

Having covered how data is collected and monitored, we finally introduce how it is stored.

For us, data storage is also critical. As mentioned, every minute of monitoring draws on a month of data. For fault analysis, for example, a module may need to read the call information, CPU, memory, network, and process data of all its machines; with many machines, a single read can exceed 50W (500,000) records × 2 days.

The read and write performance requirements on monitoring data storage are therefore very high.

On writes, the total incoming volume can exceed 200 million records per minute, so a single machine must sustain at least 500W (5 million) writes per minute to absorb it. On reads, the storage must support monitoring reads of 50W records × 2 days every minute.

In terms of data structure, our data is multi-dimensional — call-relation data, for example, has many dimensions — and we must support partial-match queries across the dimensions (client side, server side, module level, host level, etc.), not just simple key-value lookups.

Note that our multi-dimensional key is split into a main key and sub keys; the reason is explained later.

When overhauling the monitoring data storage we looked at open-source solutions, but at the time found nothing off the shelf that met both the performance and the data-structure requirements, so we developed our own time-series server.

First, writes. One record per minute would produce far too many records, so we cache data for a period of time first and then batch-merge it into one record per day — a common way to improve write performance. Our cache window is one hour.
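The cache-then-merge write path might look like this in miniature (assuming, for illustration, a dict keyed by minute inside each day's record; the on-disk layout is not described in the talk):

```python
from collections import defaultdict

class TimeSeriesWriter:
    """Cache minute points in memory, then merge each flush into one
    per-day record, instead of writing one record per minute."""
    def __init__(self):
        self.cache = defaultdict(dict)   # key -> {minute: value}
        self.store = defaultdict(dict)   # (key, day) -> {minute: value}

    def write(self, key, minute, value):
        # Writes only touch the in-memory cache.
        self.cache[key][minute] = value

    def flush(self, day):
        # In the real system the cache window is one hour; here we merge
        # everything cached so far into the day's single record.
        for key, points in self.cache.items():
            self.store[(key, day)].update(points)
        self.cache.clear()
```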

The heart of our key-value store is the key implementation. Keys reside in memory, and because the data volume is too large for one machine, we use a multi-machine cluster, routing both writes and queries by hash(main key).

Partial-match queries use a modified binary search to implement prefix matching. Query performance is very high — over 100W (1 million) queries per second — and higher still with a query-result cache.
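The partial-match lookup can be sketched with binary search over sorted multi-part keys — here tuples of (module, host, interface), chosen purely for illustration:

```python
from bisect import bisect_left

def prefix_query(sorted_keys, prefix):
    """Return all keys beginning with `prefix` (a tuple of leading
    dimensions). One binary search finds the first candidate; the
    matching keys then form a contiguous slice of the sorted list."""
    lo = bisect_left(sorted_keys, prefix)
    hi = lo
    while hi < len(sorted_keys) and sorted_keys[hi][:len(prefix)] == prefix:
        hi += 1
    return sorted_keys[lo:hi]
```

Because keys sharing a prefix are adjacent in sorted order, the search cost is logarithmic in the key count plus linear only in the number of matches.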

However, problems remained: hash(main key) distributes data unevenly, and with one record per day, the keys consume too much memory.

Because of these problems, we made a second round of improvements.

The second improvement splits key → value into key → ID → value, with an ID allocation service controlling how the value data is balanced. Key IDs are reallocated every 7 days to reduce memory consumption.
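A toy version of the key → ID indirection (the real allocation service also balances value data across machines; this sketch shows only dense ID assignment and the periodic compaction):

```python
class IdAllocator:
    """Maps long multi-dimensional keys to dense integer IDs, so values
    can be addressed compactly and balanced across machines."""
    def __init__(self):
        self.key_to_id = {}
        self.next_id = 0

    def get_id(self, key):
        # Assign the next dense ID on first sight of a key.
        if key not in self.key_to_id:
            self.key_to_id[key] = self.next_id
            self.next_id += 1
        return self.key_to_id[key]

    def reallocate(self, live_keys):
        # Periodic reallocation (every 7 days in the real system) drops
        # dead keys and compacts the ID space, reducing memory use.
        self.key_to_id = {}
        self.next_id = 0
        for key in sorted(live_keys):
            self.get_id(key)
```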

For storage, the biggest remaining issue is disaster recovery. Since this server monitors everything else, the demands on its own disaster tolerance are very high.

High disaster tolerance with strong data consistency is generally hard to achieve, but the WeChat backend has open-sourced its own PhxPaxos protocol framework, which makes data disaster recovery straightforward to implement.

In addition, PhxPaxos's multi-master feature improves concurrent read performance.
