Author Zhao Nengwen
brief introduction
This paper will introduce a paper from KDD 2018, research track, feedback guided anomaly discovery via online optimization. Based on the idea of active learning, this paper proposes a feedback based anomaly detection framework, which integrates domain knowledge into anomaly detection algorithm successfully. According to the abnormal score ranking of the output of the abnormal detection model, the operation and maintenance personnel give feedback (normal or abnormal) to the samples at the top of the ranking, and continuously optimize the abnormal detection model according to each feedback result, so as to reduce the false alarm rate and the false alarm rate of the model.
background
Anomaly detection plays an important role in network intrusion detection, system stability and service reliability. Generally, the general anomaly detector is to identify outlier in statistical sense, and then give a ranking according to the output anomaly score. The higher the ranking, the more likely the sample is to be abnormal, but the statistical outlier is different from the exception in real business scenarios, which will lead to a high false alarm rate in the current anomaly detector. An effective way to reduce the false alarm rate is to integrate the domain knowledge of the operation and maintenance personnel into the detector.
Tagging exception is a method of integrating violence into domain knowledge. By tagging exception and normality, anomaly detection problem is transformed into classification problem. But tagging data is always a very difficult problem. 1) professionals with business background are needed; 2) time cost is high; 3) too much data needs to be tagged. Therefore, most Internet companies and operation and maintenance personnel are reluctant to label data. In the field of machine learning, the purpose of active learning is to make the best of the model performance by using as few labeled data as possible.
In this paper, based on the idea of active learning, the author proposes an interactive anomaly detection framework with the participation of operation and maintenance personnel. According to the feedback provided by the operation and maintenance personnel to the anomaly detection results, the anomaly detection model is updated iteratively to reduce the false alarm rate. In the traditional exception detection, the algorithm will output the rank of the exception score for the reference of the operation and maintenance personnel. The operation and maintenance personnel will check one by one according to the order of the exception score from high to low. If the business exceptions that the operation and maintenance personnel are interested in are ranked behind (as shown in Figure 1, the red dots are abnormal), it will take a very high time cost to find them.
Figure 1 Schematic diagram of abnormal verification by operation and maintenance personnel
In order to save troubleshooting time, in this paper, the operation and maintenance personnel will integrate their own domain knowledge into anomaly detection. For the initial ranking of abnormal scores, after giving the feedback of the top samples, the model will readjust the parameters according to the feedback, so as to output a new ranking of abnormal scores. The purpose is to make the operation and maintenance personnel interested in the abnormal ranking in the top position So as to save the time for troubleshooting. In addition, this paper transforms the feedback guided anomaly detection problem into an online convex optimization algorithm framework, defines a convex loss function for the feedback of operation and maintenance personnel, and derives a simple, efficient and effective optimization algorithm. Through the application of this framework to the isolation forest, the effectiveness of the framework is proved by the experiments in open datasets and a large-scale space security application.
Problem description
For M sample set D, it consists of normal sample set N and abnormal sample set A. generally speaking, the number of normal samples is far more than that of abnormal samples. Because the sample size of D is very large, it is unrealistic to search exceptions manually. The exception detector calculates an exception score for each sample, in order to make the exception sample rank in front of the normal sample, but no exception detector is perfect, and there are often cases where the normal sample ranks in front. In the later experiment, we use two evaluation indexes, one is how many abnormal samples are found after each feedback, and the other is how long it takes to manually check the first abnormal sample.
System framework
1. Online convex optimization
Adjusting the parameters of the anomaly detector according to the feedback of the operation and maintenance personnel can be converted into the problem of online convex optimization. Firstly, given a set of parameters WT and a convex loss function ft of the anomaly detector (it is assumed in the paper that the anomaly detector is a generalized linear model), the loss ft (WT) under WT is calculated. The optimization goal is to make regrett close to 0, where w * is the optimal parameter configuration of the detector.
2. Loss function
First define YT = 1 for exception, YT = - 1 for normal. Obviously, if the top rank feedback obtained is YT = 1, that is to say, the abnormal samples are ranked at the top, indicating that the current model is good and the value of loss function needs to be reduced; otherwise, if the feedback obtained is YT = - 1, indicating that the normal point is ranked at the top, then the model is problematic and the loss function needs to be increased.
• linear loss
Linear loss is a linear function of WT, which is obviously a convex function. Score is normalized. Obviously, if the feedback YT = 1, then ft (WT) < 0, the value of loss function will decrease, otherwise, YT = - 1, then YT (WT) > 0, the value of loss function will increase, which is consistent with our expectation.
• log likelihood loss
Where Z is the normalization constant:
• logistic loss
It is similar to linear loss, but the experiment proves that it is not as good as linear loss, so it is not explained in detail in the paper.
3. Image descent learning algorithm
Through the first two steps, the loss function and the optimization objective are defined. Now we need to use the optimization algorithm to get the optimal solution. The image learning algorithm based on gradient descent is adopted in this paper, and the details are shown in Figure 2.
Figure 2 image descent optimization algorithm
application
Apply the above framework to the isolation forest. In iforest, the path length from root node to leaf node reflects the abnormal degree of the sample. The shorter the path, the more likely it is to be abnormal. Based on this, w e can define the parameter W and the eigenvector Q (x). We represent the weight of edge e (or node) with we, and QE (x) represents whether the edge (node) passes from the root node to the leaf node. Then the exception score can be expressed as score (x; W) = - w * q (x). When the leaf node of a sample is far away from the root node, the path will pass through more sides, obviously score will be smaller (minus sign), which is consistent with the original intention of iforest. In the experimental initialization phase, the weight of all the edges is equal to 1. The iterative feedback process of the isolation forest is shown in Figure 3.
Figure 3 Schematic diagram of iterative feedback of isolation forest
Experiment
The experiment is mainly divided into two parts: one is the experiment on the open data set. Considering that there are few open data sets for anomaly detection, the classification data set is used here, and the categories that account for a very small proportion are regarded as the exception samples; the other is the intrusion detection data set in the space security of practical application, and the exception samples are the entities of malicious attacks.
The experimental results on the open dataset are shown in Figure 4. The abscissa is the number of feedback iterations K, and the ordinate is the number of exceptions found in the top-k. obviously, feedback guided anomaly detection (log likelihood, linear) can find more exceptions.
Figure 4 Comparison of the number of exceptions found in the open dataset
The results in the practical application of space security are shown in Figure 5. It can be seen that feedback guided anomaly detection can find more abnormal attack samples. In addition, the experiment also counts the time needed to find the first malicious attack sample, as shown in Figure 6. Obviously, feedback guided anomaly detection can find the first malicious attack sample earlier, which is of great significance to space security, intrusion detection, etc.
Figure 5 Comparison of the number of exceptions found in the attack data set
Figure 6 time required to discover the first malicious entity (exception)
In addition, the experimental part also analyzes the influence of learning rate on the experimental results, and whether the feedback samples are normal or abnormal on the experimental results. Due to the space limitation, it is not introduced here in detail. For those who are interested, please refer to the original text.
conclusion
Based on the idea of active learning, this paper proposes an anomaly detection system with artificial interaction and iterative optimization. Operation and maintenance personnel can provide feedback to the results of unsupervised anomaly detection, so as to optimize the anomaly detection model and reduce the false alarm rate. In this paper, the problem of iterative optimization is solved by online convex optimization and defining two loss functions. Experimental results on large-scale open datasets show that compared with the baseline algorithm without feedback and the active anomaly detection (AAD) algorithm proposed in the previous paper, the feedback guided anomaly detection significantly improves the effect of anomaly detection. In addition, when applied to large-scale spatial security data sets, the method in this paper can detect malicious attacks earlier and find more exceptions.
As for the future work, the assumption of anomaly detector in this paper is that it can be expressed as a generalized linear model, and the framework can be applied to any anomaly detector. In addition, the development of an interactive interface helps to explain the experimental results, and better understand the business exceptions and statistical exceptions.
Thesis title: feedback guided anomaly discovery via online optimization
Author: MD Amran Siddiqui, Alan fern, Thomas g. Dieterich (Oregon State University), Ryan Wright, alectherault, David W. Archer (Galois Inc.)
Due to the limitation of space, some of the contents are not described in detail. Interested readers can follow the reply after the official account: 2018103, download the original text.