The previous article covered the overall framework and basic way of thinking about business risk control. This article gives a systematic introduction to the "risk discovery" stage of risk control, to help you find anomalies quickly, stop the bleeding fast, and reduce the resulting business losses.
Before we start, let's look at how early warning systems are usually designed and implemented.
In most businesses, alerting treats "finding anomalies quickly" as the core problem to solve, and anomalies are found by configuring alert rules.
For example:
When the number of releases rises by more than 15% month on month and by more than 20% year on year, and the absolute value exceeds 3000, notify the relevant personnel by email, SMS, and phone call.
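The rule above can be written down as a small sketch. The names `ReleaseStats`, `growth`, and `should_alert` are made up for illustration, and "absolute value" is assumed here to mean the current release count; the original rule does not say whether it refers to the count or to the increase.

```python
from dataclasses import dataclass


@dataclass
class ReleaseStats:
    current: int                # release count in the current period
    previous_period: int        # count in the preceding period (month-on-month base)
    same_period_last_year: int  # count in the same period a year ago (year-on-year base)


def growth(current: int, base: int) -> float:
    """Relative growth of current over base; infinite if the base is zero."""
    if base == 0:
        return float("inf") if current > 0 else 0.0
    return (current - base) / base


def should_alert(stats: ReleaseStats,
                 mom_threshold: float = 0.15,
                 yoy_threshold: float = 0.20,
                 min_absolute: int = 3000) -> bool:
    """All three conditions must hold at once, as in the example rule."""
    return (growth(stats.current, stats.previous_period) > mom_threshold
            and growth(stats.current, stats.same_period_last_year) > yoy_threshold
            and stats.current > min_absolute)  # assumption: absolute value = current count
```

Every number in such a rule is a hand-picked constant, which is exactly what makes it rigid.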
This approach to early warning has two disadvantages:
1. The alert rules are rigid, and most thresholds are set by hand by operations or operations-and-maintenance staff. Alert rules are essentially the same as risk control rules: they need effective features, and their recall rate has to be maintained continuously. Yet most early warning systems rely on only a few weak features and on thresholds that almost never change. Moreover, the more indicators are monitored, the higher the operating cost and the harder the rules are to maintain.
2. This approach can only locate an "anomaly"; it cannot confirm a "risk". For example, if release volume suddenly rises on a given day, operators will certainly receive an alert email. After investigating, they may find that the business side was simply running a user-acquisition or promotional campaign. The alert only shows that today's release volume differs from normal. It cannot tell whether the anomaly comes from a large-scale attack by the black market (organized fraud) or from normal business growth.
Because of these two disadvantages, we usually watch the alerts closely when an early warning system first goes live. Over time, frequent false alarms and missed alarms lead us to ignore the alerts, and the early warning system loses most of its value.
From the example above, we can see that when we only consider how to find anomalies quickly, an early warning system rarely meets our expectations. To build an effective one, we must solve two core problems:
1. How to find anomalies quickly
2. How to define risks accurately
How to find anomalies quickly
I. Finding anomalies through changes in core indicators
Whatever work or project we are responsible for, the trend of the core indicators is what we watch most closely. When the curve of a core indicator changes significantly, we know that something has happened, so analyzing and forecasting indicator curves is our first method for finding anomalies.
To find outliers more accurately, we use algorithms to forecast core indicators and to locate outlier points and anomalous segments.
Technical framework of intelligent prediction and anomaly detection
Offline part:
In this part, we label the raw data of an indicator, extract the corresponding features, make predictions with a regression model, and use a classification model to produce the model output.
Real-time part:
The model generated offline is then applied to real-time monitoring data to mark outliers and anomalous segments.
With this method we can feed in the data of any core indicator. As long as the indicator is a periodic time series, we can accurately locate where anomalies occur.
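To make the offline and real-time split concrete, here is a deliberately simplified sketch. It does not reproduce the regression and classification models described above: it swaps in a per-slot mean and standard deviation as the "forecast", and a fixed deviation band as the "classifier". All function names and parameters (`period`, `k`, `min_std`, `min_length`) are assumptions made for illustration.

```python
from statistics import mean, pstdev


def fit_seasonal_baseline(history, period):
    """Offline step: per-slot mean and standard deviation over past cycles.

    history: list of numbers whose length is a multiple of `period`.
    Returns one (mean, std) tuple per slot in the cycle.
    """
    if period <= 0 or len(history) < 2 * period or len(history) % period:
        raise ValueError("need at least two full cycles of history")
    baseline = []
    for slot in range(period):
        values = history[slot::period]  # same slot across all past cycles
        baseline.append((mean(values), pstdev(values)))
    return baseline


def flag_outliers(baseline, observations, start_slot=0, k=3.0, min_std=1.0):
    """Real-time step: flag points more than k deviations away from the baseline."""
    flags = []
    for i, value in enumerate(observations):
        expected, std = baseline[(start_slot + i) % len(baseline)]
        # min_std keeps the band from collapsing when history was perfectly flat
        flags.append(abs(value - expected) > k * max(std, min_std))
    return flags


def anomalous_segments(flags, min_length=2):
    """Group consecutive flagged points into (start, end) index pairs, end inclusive."""
    segments, start = [], None
    for i, flagged in enumerate(flags + [False]):  # sentinel closes a trailing run
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            if i - start >= min_length:
                segments.append((start, i - 1))
            start = None
    return segments
```

In this sketch the baseline would be refit offline on a schedule, while `flag_outliers` and `anomalous_segments` run on each new batch of monitoring data.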
Let's look at the effect in actual work.
[Figures: ordinary anomaly; serious anomaly; abrupt-change anomaly]
In this way we can extend our monitoring coverage at almost no cost. We can keep subdividing an indicator down to the finest granularity, and we can add many other indicators to support monitoring. When an anomaly occurs, it is then easy to catch it and locate it at the finest granularity.
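As a rough illustration of locating an anomaly at a finer granularity, the sketch below ranks the segments of one indicator by how much they grew against a baseline. The function name `top_contributors` and the segment labels are hypothetical, and how the baseline is obtained is left open.

```python
def top_contributors(baseline_by_segment, current_by_segment, top_n=3):
    """Rank segments by absolute increase over the baseline.

    Both arguments map a segment label (e.g. "city=B") to a count.
    Segments missing on either side are treated as zero.
    """
    segments = set(baseline_by_segment) | set(current_by_segment)
    deltas = {
        seg: current_by_segment.get(seg, 0) - baseline_by_segment.get(seg, 0)
        for seg in segments
    }
    # Largest increase first; ties are broken by label for a stable order.
    return sorted(deltas.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

For example, with a baseline of `{"city=A": 100, "city=B": 200}` and current counts of `{"city=A": 105, "city=B": 900}`, the first entry returned is `("city=B", 700)`, which points at the segment driving the jump.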
II. Finding anomalies by clustering
In risk control, black-market operators and other attackers try to maximize profit by making the most of the resources they have, which may be a piece of content, an image, an IP address, or a mobile phone number. As a result, when a risk occurs there is usually relatively concentrated aggregation behavior: the same behavior track, the same technique, the same text content, and so on. Cluster analysis is therefore another important means of finding anomalies.
Resource-based clustering
When a resource such as a mobile phone number or an account appears frequently and performs certain business operations, we consider that there is some abnormal behavior.
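A minimal sketch of this idea is to count how often each resource appears in a batch of events and keep those above a threshold. The event layout (dicts with a `phone` or similar key), the function name, and the threshold are assumptions for illustration; a real system would also bound the count to a time window.

```python
from collections import Counter


def frequent_resources(events, resource_key, threshold):
    """Return {resource: count} for resources seen in at least `threshold` events.

    events: iterable of dicts; events without the given key are skipped.
    """
    counts = Counter(e[resource_key] for e in events if e.get(resource_key))
    return {resource: n for resource, n in counts.items() if n >= threshold}
```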
Content-based clustering
When a passage of text or a picture appears frequently in UGC content, we consider that there is some abnormal behavior behind it.
Behavior-based clustering
When a certain behavior track is repeated frequently, we also consider that there is some abnormal behavior behind it.
Clustering based on attributes or relations
Using community mining or unsupervised learning algorithms, entities are grouped into clusters. When a cluster becomes very large, abnormal behavior is very likely.
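For text, one simple way to sketch this is to normalize each post and group posts whose normalized form is identical. This only catches copies that differ in case, spacing, or punctuation; near-duplicates and images would need a similarity-preserving fingerprint, which is not shown here. The names `fingerprint` and `duplicate_content_groups` are made up for illustration.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text):
    """Hash of the text after lowercasing and dropping punctuation and whitespace."""
    normalized = re.sub(r"[\W_]+", "", text.lower())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def duplicate_content_groups(posts, min_size):
    """Group post ids by fingerprint and keep groups with at least `min_size` posts.

    posts: iterable of (post_id, text) pairs.
    """
    groups = defaultdict(list)
    for post_id, text in posts:
        groups[fingerprint(text)].append(post_id)
    return [ids for ids in groups.values() if len(ids) >= min_size]
```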
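As a sketch, a behavior track can be modeled as the ordered sequence of actions a user performs, and tracks shared by many distinct users can then be surfaced. The event layout `(user_id, timestamp, action)` and the function name are assumptions; real tracks would usually be cut into sessions and compared more loosely than by exact equality.

```python
from collections import defaultdict


def repeated_tracks(events, min_users):
    """Find action sequences shared by at least `min_users` distinct users.

    events: iterable of (user_id, timestamp, action) tuples.
    Returns {track: sorted list of user ids}, where track is a tuple of actions.
    """
    per_user = defaultdict(list)
    # Order each user's actions by time before building the track.
    for user_id, _timestamp, action in sorted(events, key=lambda e: (e[0], e[1])):
        per_user[user_id].append(action)
    users_by_track = defaultdict(set)
    for user_id, actions in per_user.items():
        users_by_track[tuple(actions)].add(user_id)
    return {track: sorted(users) for track, users in users_by_track.items()
            if len(users) >= min_users}
```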
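One simple stand-in for community mining is to link accounts that share any attribute value (device, IP, and so on) and take connected components with a union-find structure. This is only a sketch under that assumption; the input layout and the names `find` and `large_clusters` are hypothetical, and a production system may well use a different algorithm.

```python
from collections import defaultdict


def find(parent, x):
    """Return the root of x, compressing the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def large_clusters(accounts, min_size):
    """Cluster accounts that share any attribute value; keep the large clusters.

    accounts: {account_id: iterable of attribute values such as device ids or IPs}.
    Returns sorted clusters (each a sorted list of account ids) of size >= min_size.
    """
    parent = {account: account for account in accounts}
    first_owner = {}  # attribute value -> first account seen with it
    for account, attributes in accounts.items():
        for attribute in attributes:
            if attribute in first_owner:
                # Merge this account's cluster with the earlier owner's cluster.
                parent[find(parent, account)] = find(parent, first_owner[attribute])
            else:
                first_owner[attribute] = account
    clusters = defaultdict(list)
    for account in accounts:
        clusters[find(parent, account)].append(account)
    return sorted(sorted(c) for c in clusters.values() if len(c) >= min_size)
```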
Through clustering we can detect many anomalies that are hard to find through indicator monitoring. The methods above are not exhaustive; clustering along more dimensions may support further monitoring and analysis. This part matters a great deal for security risk discovery as a whole.
III. Finding anomalies by other means
Besides forecasting and anomaly extraction on indicator data, and locating anomalies through clustering, there are many other means of anomaly detection, such as a smooth feedback mechanism, or a front-end patrol or sampling mechanism. These mechanisms can effectively help us find anomalies quickly. Different means cover different areas, and they complement one another so that hidden anomalies do not end up affecting the business.
If you know other effective ways to find anomalies, feel free to message me privately so we can learn from each other.
The above covers methods for detecting and locating anomalies. Having an anomaly is only the first step of early warning. What we really care about is whether the anomaly is a normal business fluctuation or a major risk.
How to define risk accurately
We have obtained anomalies in various ways, so how do we define them effectively?
Definition requires people. Only human experience can effectively judge whether an anomaly is a risk. But if all of this definition work is handed to product operations or technical staff, the labor cost is huge, and the work gets interrupted or shelved in favor of other tasks. Since people have to be involved anyway, why not let reviewers or labelers, whose labor cost is lower, complete the definition of anomalies without even being aware of it? So we adopted the following relatively convenient approach.
Risk definition process
We collect all anomalies from the various sources and find the entities corresponding to each one. For an anomaly in information release, for example, we take the information released during the anomalous segment; for a cluster anomaly, we take all entities in the cluster. The anomalous entities are sampled in real time and sent directly to the review platform as review tasks. The reviewers review these entities without being aware of where they came from, and the review results give a clear judgment on the entities behind the anomaly. From this we can draw a conclusion such as: 80% of the information in the anomaly is fake and 20% is normal.
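The sampling step might look roughly like the sketch below. The task layout, the function name `build_review_tasks`, and the `seed` parameter are assumptions made for illustration; how tasks are actually pushed to the review platform is not covered here.

```python
import random


def build_review_tasks(anomaly_id, entity_ids, sample_size, seed=None):
    """Randomly sample entities from one anomaly and wrap them as review tasks.

    Each task keeps the anomaly id so results can be aggregated later; the
    reviewer-facing view would hide it so the task looks like any other.
    """
    rng = random.Random(seed)
    unique_ids = sorted(set(entity_ids))  # drop repeats before sampling
    chosen = rng.sample(unique_ids, min(sample_size, len(unique_ids)))
    return [{"anomaly_id": anomaly_id, "entity_id": e} for e in chosen]
```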
With this conclusion in hand, we can flexibly adjust the threshold on the ratio of normal to fake information according to the situation. When the ratio meets the threshold condition, we directly define the anomaly as a risk. Depending on the anomaly detection method used, we can even state clearly where the risk occurs and how exactly it manifests.
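The decision itself reduces to a ratio and a threshold, as in the sketch below. The default values for `ratio_threshold` and `min_reviews` are placeholders rather than recommended settings, and requiring a minimum number of reviews is an added assumption so that a handful of results does not decide the outcome.

```python
def fake_ratio(review_results):
    """Share of reviewed entities labeled fake.

    review_results: list of booleans, True meaning the reviewer marked it fake.
    """
    if not review_results:
        raise ValueError("no review results yet")
    return sum(review_results) / len(review_results)


def is_risk(review_results, ratio_threshold=0.5, min_reviews=20):
    """Define the anomaly as a risk once enough reviews agree it is mostly fake."""
    if len(review_results) < min_reviews:
        return False  # too few reviews to decide
    return fake_ratio(review_results) >= ratio_threshold
```

With 80 fake and 20 normal results out of 100 reviews, `fake_ratio` is 0.8 and `is_risk` returns `True` under the placeholder defaults.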
The above are some experiences and methods for building a risk control early warning system. Through the two core parts, "anomaly discovery" and "risk definition", we can generate reliable alerts. Because the system produces clear conclusions and the corresponding entity distribution, an intelligent analysis engine can be placed downstream of it. Based on the clear positive and negative samples the system outputs, the engine automatically combines existing risk control features for calculation and, after screening by recall rate, produces recommended or directly effective risk control strategies. I'll share that in a later article.