Thoughts on semi-supervised learning and a security experiment

Posted by deaguero at 2020-04-01

Why use semi-supervised learning in some security scenarios?

Security scenario, security data, security algorithm

In most security scenarios, labeled sample data is scarce, for both black (malicious) and white (benign) samples. This scarcity directly limits how far machine learning can be applied, and it is one of the main obstacles to using machine learning in security practice today. Should we solve the problem or work around it? We can reason about this from the perspective of supervised, unsupervised, and semi-supervised learning.

Supervised learning requires a large number of attack samples and normal business samples. In reality, most teams have accumulated only a small number of attack samples, so the sample data problem has to be solved first. For complex and changeable security scenarios, one option is to find public datasets that match the specific business scenario and attack scenario. Such datasets are usually finished products: convenient, quick, and ready to use. The other option is DIY: label a batch of white samples from the business data, and collect attack payloads seen in the current business environment from various channels.

Both options have problems. Attack data tends to be similar, while business environments are complex and diverse, which can be summarized as "abnormal samples are always alike, normal samples are each different". The problem with the first option is that the business environment behind a public dataset rarely matches our own. The white samples therefore differ, which leads to a large gap between training and testing performance. The problem with the second option is the cost, representativeness, and timeliness of the white samples. How should white samples be selected? The possible ways are automatic filtering and manual labeling. Automatically filtered white samples may have black samples mixed in, which introduces label errors, and manual labeling is expensive. In addition, because the business environment and business data are complex and changeable, it is questionable whether the selected white samples are representative. All in all, in real business scenarios white samples are hard to select and to maintain, which confirms the opening point that in most cases only a small number of attack samples can be accumulated.

For fixed security scenarios, the disadvantages of both options are reduced but do not disappear. Therefore, in many complex and changeable security scenarios, supervised machine learning cannot solve the security problem well until the sample problem itself has been solved properly.

Unsupervised learning does not rely on labeled data at all, but it also fails to make use of the labels that already exist. I think its performance in real security scenarios may be limited, and it may be difficult to operate. Semi-supervised learning is a compromise between supervised and unsupervised learning. Compared with supervised learning, it uses both the existing labels and a large amount of unlabeled data at no additional labeling cost. Compared with unsupervised learning, it uses labeled black samples to assist classification, which, given that "abnormal samples are always alike", should improve model performance. Semi-supervised learning, which sidesteps the sample data problem, therefore seems closer to our actual security scenarios.

What do we need to do in a security experiment with semi-supervised learning?

Security scenario, security data, security algorithm, data mining, security algorithm

Take using semi-supervised learning to predict and identify Windows malware as an example. First, the problem to solve is predicting and identifying Windows malware. In detail, the security scenario is the prediction and identification of Windows malware, and the security data is a small number of black samples and a large number of unlabeled samples. Second, on the solution side, the task is to apply a semi-supervised algorithm to this scenario. In detail, this means analyzing the principles of different semi-supervised learning algorithms and selecting a specific method in light of the data mining results. Finally, from the perspective of data mining, we need data analysis and feature engineering, informed by the behavior patterns of Windows malware attacks, to support the security algorithm.

How should we carry out the security experiment with semi-supervised learning?

How do we do the three "whats" mentioned above? For the first point, the hardware and software configuration data of a Windows machine can be used to estimate the probability that the machine is infected by malware, and the dynamic behavior data produced by running a Windows binary executable in a sandbox can be used to identify Windows malware.

For the second point, PU learning (positive-unlabeled learning), an important branch of semi-supervised learning, is designed specifically for the case of a small number of positive samples and a large amount of unlabeled data. There are several ways to implement PU learning. Method 1 uses a standard classifier directly, treating positive samples as positive and unlabeled samples as negative. Method 2, PU bagging, combines all positive samples with a random subset of the unlabeled samples to create a training set, treats the positive and unlabeled samples as positive and negative respectively, trains a classifier, and applies it to the unlabeled samples that are not in the training set. The scores are recorded and the process is repeated. Method 3, the "two-step method", first identifies a subset of unlabeled samples that can be labeled negative with 100% confidence, then trains a standard classifier on the positive samples and these negative samples and applies it to the remaining unlabeled samples.

For the third point, data mining, we first study the principles and attack patterns of Windows malware to build up the necessary security knowledge, and then carry out targeted exploratory data analysis on the security data. Features can then be mined manually across multiple variables from a statistical learning perspective, or extracted automatically from text from a natural language processing perspective, in both cases guided by security knowledge.
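The PU bagging procedure described above can be sketched in a few lines. This is a minimal illustration, not the implementation used for the figures in this post: the base classifier is a toy nearest-centroid scorer chosen to keep the example dependency-free, the data is synthetic, and all names (`pu_bagging`, `positive_score`, `make_blob`) and parameters (`n_rounds`, the bag size) are assumptions made for the example.

```python
import math
import random

def centroid(rows):
    """Mean vector of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def positive_score(x, pos_center, neg_center):
    """Toy base classifier (an assumption): sigmoid of the distance to the
    negative centre minus the distance to the positive centre."""
    margin = math.dist(x, neg_center) - math.dist(x, pos_center)
    return 1.0 / (1.0 + math.exp(-margin))

def pu_bagging(positives, unlabeled, n_rounds=200, seed=0):
    """Return the average out-of-bag score of every unlabeled sample."""
    rng = random.Random(seed)
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    pos_center = centroid(positives)  # all positives are used in every round
    for _ in range(n_rounds):
        # Draw a random bag of unlabeled samples (with replacement, so
        # duplicates collapse) and treat them as negatives for this round.
        bag = {rng.randrange(len(unlabeled)) for _ in range(len(positives))}
        neg_center = centroid([unlabeled[i] for i in bag])
        # Score only the unlabeled samples that were left out of the bag.
        for i, x in enumerate(unlabeled):
            if i not in bag:
                totals[i] += positive_score(x, pos_center, neg_center)
                counts[i] += 1
    return [t / c if c else float("nan") for t, c in zip(totals, counts)]

def make_blob(rng, center, n):
    """Synthetic 2-D cluster with unit variance around `center`."""
    return [[rng.gauss(c, 1.0) for c in center] for _ in range(n)]

rng = random.Random(42)
positives = make_blob(rng, (3, 3), 30)    # labeled black samples
hidden_pos = make_blob(rng, (3, 3), 30)   # black samples hiding in the unlabeled set
negatives = make_blob(rng, (0, 0), 300)   # white samples, also unlabeled
unlabeled = hidden_pos + negatives        # first 30 entries are the hidden positives

scores = pu_bagging(positives, unlabeled)
```

In practice the centroid scorer would be replaced by a real classifier such as a decision tree; the bagging loop itself stays the same.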

Hands-on practice

Taking the second point, the algorithm, as an example, this post shows semi-supervised learning applied to security. Suppose we have already extracted features for a small number of black samples and a large number of unlabeled samples and can use them directly. Figure 1 shows the true distribution of the unlabeled data, seen from a "God's-eye view".

Figure 2 shows the predictions for the unlabeled data from method 1, the standard classifier. The predictions do not seem to come close to the true distribution of the unlabeled data in Figure 1, especially for the positive data.
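One way to see why method 1 struggles with the positive data is to run it on toy data where the true labels are known. The sketch below is an illustration under assumptions, not the experiment behind Figure 2: it uses a k-nearest-neighbour scorer (`knn_positive_fraction`, a hypothetical helper) on synthetic 2-D clusters, with every unlabeled sample given label 0 during training.

```python
import math
import random

def knn_positive_fraction(x, train_x, train_y, k=10):
    """Fraction of the k nearest training points that carry label 1."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(x, train_x[i]))[:k]
    return sum(train_y[i] for i in nearest) / k

def blob(rng, center, n):
    """Synthetic 2-D cluster with unit variance around `center`."""
    return [[rng.gauss(c, 1.0) for c in center] for _ in range(n)]

rng = random.Random(0)
labeled_pos = blob(rng, (3, 3), 30)
hidden_pos = blob(rng, (3, 3), 30)    # truly positive, but unlabeled
true_neg = blob(rng, (0, 0), 300)     # truly negative, also unlabeled
unlabeled = hidden_pos + true_neg

# Method 1: positives get label 1, every unlabeled sample gets label 0.
train_x = labeled_pos + unlabeled
train_y = [1] * len(labeled_pos) + [0] * len(unlabeled)

scores = [knn_positive_fraction(x, train_x, train_y) for x in unlabeled]
mean_hidden = sum(scores[:30]) / 30   # average score of the hidden positives
mean_neg = sum(scores[30:]) / 300     # average score of the true negatives
```

On this toy data the hidden positives still score higher than the true negatives, but their scores are pulled down because many of their nearest neighbours are other hidden positives that were trained with label 0.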

Figure 3 shows the result of method 2, PU bagging. It appears to be closer to Figure 1 than Figure 2 is.

Figure 4 shows the result of step 1 of method 3. The first round mainly selects a batch of reliable negative samples. Figure 5 shows the result of step 2. The effect is similar to that of method 1.
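The two steps can be sketched as follows. This is a minimal sketch under assumptions: the toy centroid scorer stands in for a real classifier, and "reliable negative" is approximated by taking the lowest-scoring fraction of the unlabeled data (`reliable_fraction`, a hypothetical parameter), which is only one of several possible selection rules.

```python
import math
import random

def centroid(rows):
    """Mean vector of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def positive_score(x, pos_center, neg_center):
    """Toy classifier (an assumption): higher means more positive-like."""
    margin = math.dist(x, neg_center) - math.dist(x, pos_center)
    return 1.0 / (1.0 + math.exp(-margin))

def two_step(positives, unlabeled, reliable_fraction=0.3):
    pos_center = centroid(positives)
    # Step 1: treat all unlabeled data as negative, score it, and keep the
    # lowest-scoring samples as reliable negatives.
    rough_neg_center = centroid(unlabeled)
    rough = [positive_score(x, pos_center, rough_neg_center) for x in unlabeled]
    order = sorted(range(len(unlabeled)), key=lambda i: rough[i])
    n_reliable = max(1, int(reliable_fraction * len(unlabeled)))
    reliable = set(order[:n_reliable])
    # Step 2: retrain on positives vs. reliable negatives only, then score
    # the remaining unlabeled samples.
    neg_center = centroid([unlabeled[i] for i in reliable])
    final = {i: positive_score(x, pos_center, neg_center)
             for i, x in enumerate(unlabeled) if i not in reliable}
    return reliable, final

def blob(rng, center, n):
    """Synthetic 2-D cluster with unit variance around `center`."""
    return [[rng.gauss(c, 1.0) for c in center] for _ in range(n)]

rng = random.Random(7)
positives = blob(rng, (3, 3), 30)
unlabeled = blob(rng, (3, 3), 30) + blob(rng, (0, 0), 300)  # first 30 are hidden positives

reliable, final = two_step(positives, unlabeled)
```

`reliable` holds the indices picked in step 1 and `final` maps each remaining unlabeled index to its step 2 score.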

Comparing the figures by eye may not be intuitive enough. A more direct evaluation uses the output probability values together with the ROC curve and the ROC AUC score. data1 (std) comes from the standard classifier, data2 and data3 come from PU bagging, and data4 comes from the two-step method. Under this evaluation, PU bagging > two-step method > standard classifier, and the overall performance is acceptable.
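For reference, the ROC AUC score can be computed directly from the scores and the true labels as the probability that a randomly chosen positive outranks a randomly chosen negative. The function below is a small dependency-free sketch of that definition (the name `roc_auc` and the example values are made up for illustration). It needs ground-truth labels for the unlabeled data, which are only available in a controlled experiment such as this one.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: three of the four positive/negative pairs are
# ranked correctly, so the score is 0.75.
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Because the score depends only on the ranking, any monotonically increasing rescaling of the model outputs leaves it unchanged.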

The semi-supervised algorithms should be compared not only with each other but also with an unsupervised algorithm. If an unsupervised algorithm performs well without using the existing labels, why not use it? I chose isolation forest for the experiment, which raised a problem. The prediction scores of the isolation forest lie in [-1, 1], while the probabilities from the semi-supervised algorithms above lie in [0, 1]. My idea was to divide the isolation forest score by 2 and add 0.5 to map it onto [0, 1], and then evaluate with the ROC curve and ROC AUC score. The result looks problematic: the ROC AUC score is only a little over 0.3. One workaround is to skip the probability output, predict labels directly, and compute accuracy with the accuracy score. The accuracy of the isolation forest model is 0.88, while that of the standard classifier from semi-supervised learning is 0.95. Under this evaluation, at least one semi-supervised algorithm outperforms the unsupervised algorithm.

If labeled black and white samples are sufficient, supervised learning probably performs best. I did not measure the gap between semi-supervised and supervised learning, and it is not easy to measure. One possible approach is to fix the semi-supervised samples and the model's accuracy, and then ask how many labeled samples supervised learning needs to reach the same accuracy.
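One possible explanation for an AUC below 0.5, which this post does not confirm, is the orientation of the score rather than its range. In scikit-learn, for example, `IsolationForest.decision_function` returns lower values for more abnormal samples, so a linear map onto [0, 1] yields the "probability of being normal", not the probability of being a black sample. The sketch below uses made-up scores to show the effect; whether this applies to the experiment above is an assumption.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rescale(score):
    """Map a score from [-1, 1] onto [0, 1], as described in the text."""
    return score / 2 + 0.5

labels = [1, 1, 0, 0, 0]              # 1 = black (anomalous) sample
raw = [-0.8, -0.1, 0.2, -0.3, 0.6]    # hypothetical scores, lower = more anomalous

as_is = [rescale(s) for s in raw]     # keeps the "higher = more normal" orientation
flipped = [rescale(-s) for s in raw]  # negate first so higher = more anomalous

auc_as_is = roc_auc(labels, as_is)
auc_flipped = roc_auc(labels, flipped)
```

When there are no tied scores, the two values sum to 1, so an AUC a little over 0.3 would correspond to a little under 0.7 after flipping the sign.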

Finally, if no security sample data has been accumulated at all, the business scenario is complex and changeable, and samples are hard to obtain, unsupervised learning is preferred. If some security sample data has been accumulated and the business scenario is complex and changeable, semi-supervised learning is preferred. If the business scenario is fixed and samples are easy to obtain, supervised learning is preferred, whether or not sample data has already been accumulated. In conclusion, semi-supervised learning may be better suited to building the initial machine learning model for security scenarios and complex business scenarios.