Riding the wave of AI, the SNG Business Security Center has applied machine learning to business security countermeasures on top of its existing security system, building an intelligent countermeasure system for security AI.
1. Overall introduction
Machine learning is gradually being applied across multiple business security services, with the goal of building a highly available, configurable, and extensible intelligent countermeasure system.
The intelligent countermeasure system has been deployed for online countermeasures in seven services, including QQ and Qzone. Its purpose is to establish a complete pipeline for applying machine learning to business security, covering unified access to business data, offline self-service model training, online model deployment, and online real-time prediction.
From the business perspective, the intelligent countermeasure system is an integrated online/offline machine learning system, decoupled from and independent of the external business services. The main process is as follows:
The online part receives classification request data from the business, extracts features, performs real-time prediction, and returns the results.
The offline part handles sample data processing, feature engineering, and model training.
2. Function details
2.1 online process
(1) Real-time classification prediction
The online part of the intelligent countermeasure system is accessed through a unified API: the business side calls the API to report data and obtain real-time classification predictions. The online architecture is shown in the following figure:
The configuration data of the online system is stored partly in local XML files and partly in a DB. The XML files are used for local configuration loading (including the model information associated with each business, online / dry-run mode, monitoring configuration, data reporting configuration, etc.); the DB stores the mapping between models, businesses, and features, and this part of the configuration is shared with model training.
Different classifiers are associated with different feature sets, so the feature set and feature order used in offline training must be exactly consistent with the features fed into online prediction; otherwise the prediction results will be skewed. To guarantee this consistency, a model-feature configuration table is maintained in the DB; the online system reads this table and performs feature extraction according to the configuration. The configuration is represented as follows:
The feature set for a given model is determined by the associations in the configuration table. In the offline process, when a new model training task is created, the corresponding records are generated automatically, so the feature set associated with a model stays consistent between online and offline.
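As an illustration, here is a minimal sketch of how the online side might read the model-feature configuration and extract features in the configured order; the table layout, column names, and extractor registry are hypothetical placeholders, not the actual schema.

```python
# Sketch: read the model-feature mapping from the configuration DB and extract
# features in exactly the configured order. Table/column names and the
# extractor registry below are hypothetical placeholders.
from typing import Callable, Dict, List

# Hypothetical per-feature extractors keyed by feature name.
FEATURE_EXTRACTORS: Dict[str, Callable[[dict], float]] = {
    "req_freq_1min": lambda req: float(req.get("req_freq_1min", 0)),
    "account_age_days": lambda req: float(req.get("account_age_days", 0)),
    "ip_risk_score": lambda req: float(req.get("ip_risk_score", 0.0)),
}

def load_feature_config(db_conn, model_id: int) -> List[str]:
    """Read the ordered feature list of a model from the configuration table."""
    cur = db_conn.cursor()
    cur.execute(
        "SELECT feature_name FROM model_feature_config "
        "WHERE model_id = %s ORDER BY feature_index",
        (model_id,),
    )
    return [row[0] for row in cur.fetchall()]

def extract_features(request: dict, feature_names: List[str]) -> List[float]:
    """Build the feature vector in the same order used during offline training."""
    return [FEATURE_EXTRACTORS[name](request) for name in feature_names]
```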
The real-time classification request traffic and the classification results are written to the database for unified storage, so that training data can later be extracted per business when training models offline.
(2) Model acquisition and loading
A machine learning online prediction system differs from a traditional backend service in that online prediction relies on model files trained offline. In the distributed backend service, to keep the service highly available and scalable, the offline-trained model files are stored centrally in the intelligent countermeasure system, and each online service node actively pulls the model files to local disk according to its configuration.
When the configuration changes or capacity is expanded with new machines, the operations are handled uniformly through the service release and configuration distribution tools.
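A minimal sketch of the pull-and-load step, assuming the model store is reachable as a shared directory and the service uses the XGBoost Python API; the paths and function names are illustrative, not the system's actual tooling.

```python
# Sketch: pull an offline-trained model file to local disk and load it for
# online prediction. Paths and names are illustrative assumptions.
import os
import shutil

import xgboost as xgb

def pull_model(model_store_dir: str, model_file: str, local_dir: str) -> str:
    """Copy the model file from the shared model store to the local machine."""
    os.makedirs(local_dir, exist_ok=True)
    src = os.path.join(model_store_dir, model_file)
    dst = os.path.join(local_dir, model_file)
    shutil.copyfile(src, dst)
    return dst

def load_model(local_model_path: str) -> xgb.Booster:
    """Load the XGBoost model file so the online service can predict with it."""
    booster = xgb.Booster()
    booster.load_model(local_model_path)
    return booster
```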
2.2 offline process
2.3 sample selection
Building on the data from existing business security attack models and the large accumulated database of malicious samples, high-quality labeled samples can be obtained during sample annotation. Besides using the existing malicious data as positive samples, an unsupervised clustering method is also used to identify malicious clusters and feed additional samples into supervised training, expanding the coverage of malicious behavior.
Using only the historical malicious data hit by the existing rule models as sample data narrows the coverage of the trained model, leaving it weak at recognizing new malicious behavior. Therefore, an unsupervised step is added to sample selection: by clustering the request traffic, new samples are discovered and used as training data for the supervised model. After the unsupervised step is added to the offline process, the overall flow changes as follows:
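As an illustration of the unsupervised step, here is a minimal sketch that clusters request features and treats clusters dominated by already-known malicious traffic as sources of new positive samples; the clustering algorithm, cluster count, and purity threshold are illustrative assumptions, not the system's actual settings.

```python
# Sketch: cluster the request traffic and harvest new malicious samples from
# clusters dominated by already-labeled malicious requests.
import numpy as np
from sklearn.cluster import KMeans

def find_candidate_malicious(X: np.ndarray,
                             known_malicious: np.ndarray,
                             n_clusters: int = 50,
                             purity: float = 0.8) -> list:
    """Return indices of unlabeled requests that fall into high-purity
    malicious clusters; these become extra positive training samples."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    candidates = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        if members.size == 0:
            continue
        # Share of the cluster already flagged malicious by the existing rules.
        ratio = known_malicious[members].mean()
        if ratio >= purity:
            candidates.extend(members[~known_malicious[members]].tolist())
    return candidates
```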
2.4 feature selection
There are three main methods in the feature selection process:
(1) Rule of thumb
Select based on business experience.
(2) Information gain
A labeled sample data set is extracted and the information gain of each feature is calculated; the features with high information gain are selected as input features for model training (see the sketch below).
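A minimal sketch of this step, using scikit-learn's mutual information estimator as the information-gain score; the top-k cutoff is an illustrative choice.

```python
# Sketch: score features by information gain (estimated via mutual information
# with the label) and keep the top-k as model inputs. k is illustrative.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_by_information_gain(X: np.ndarray, y: np.ndarray, k: int = 20) -> np.ndarray:
    """Return the indices of the k features with the highest information gain."""
    gains = mutual_info_classif(X, y, random_state=0)
    return np.argsort(gains)[::-1][:k]
```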
(3) Wrapper method
First, coarsely select a feature set and train a model on the training data set; the trained model outputs a weight for each feature (as shown in the figure below). Features are then re-selected according to these weights. On the filtered feature set (optionally adding other features), the above steps are iterated until a convergence condition is met, such as the accuracy on a validation data set no longer improving. This method essentially combines (1) and (2), and because it involves repeated model training it takes longer. A code sketch of this loop follows.
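A minimal sketch of the iterative loop described above, assuming XGBoost's scikit-learn interface; the pruning ratio, round limit, and stopping rule (validation accuracy no longer improving) are illustrative assumptions.

```python
# Sketch: wrapper-style feature selection. Train, read the model's feature
# weights, keep the highest-weight features, and repeat until validation
# accuracy stops improving. Thresholds below are illustrative.
import numpy as np
import xgboost as xgb
from sklearn.metrics import accuracy_score

def wrapper_select(X_tr, y_tr, X_val, y_val, keep_ratio=0.8, max_rounds=5):
    features = np.arange(X_tr.shape[1])       # start from the coarse feature set
    best_acc, best_features = 0.0, features
    for _ in range(max_rounds):
        model = xgb.XGBClassifier(n_estimators=100, max_depth=6, learning_rate=0.3)
        model.fit(X_tr[:, features], y_tr)
        acc = accuracy_score(y_val, model.predict(X_val[:, features]))
        if acc > best_acc:
            best_acc, best_features = acc, features
        else:
            break                              # convergence: no further improvement
        # Secondary selection: keep the highest-weight features for the next round.
        order = np.argsort(model.feature_importances_)[::-1]
        features = features[order[:max(1, int(len(features) * keep_ratio))]]
    return best_features, best_acc
```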
2.5 model training
Before model training, the data format must be converted: the supervised model part of the current intelligent countermeasure system uses the XGBoost algorithm, and the training data must be converted to LIBSVM format.
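A minimal sketch of the conversion, using scikit-learn's dump_svmlight_file to write LIBSVM-format text that XGBoost can read directly; the file names and placeholder data are illustrative.

```python
# Sketch: convert a labeled feature matrix to LIBSVM format and load it into
# an XGBoost DMatrix. The data here is a random placeholder.
import numpy as np
import xgboost as xgb
from sklearn.datasets import dump_svmlight_file

X = np.random.rand(1000, 20)            # feature matrix (placeholder)
y = np.random.randint(0, 2, size=1000)  # 0 = normal, 1 = malicious (placeholder)

# Write the training data in LIBSVM (svmlight) text format.
dump_svmlight_file(X, y, "train.libsvm", zero_based=True)

# XGBoost reads the LIBSVM file directly; the ?format=libsvm suffix makes the
# format explicit for recent XGBoost versions.
dtrain = xgb.DMatrix("train.libsvm?format=libsvm")
```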
There are already many articles about the XGBoost algorithm, so it is not introduced again here; this section focuses on setting the main parameters involved. XGBoost has three types of parameters to configure: general parameters, booster parameters, and task parameters.
- General parameters: control which booster is used in the boosting process; the common choices are the tree model and the linear model.
- Booster parameters: depend on which booster is chosen.
- Task parameters: set the learning objective and control how the training objective is evaluated.
The main parameters are as follows:
General Parameters:
(1) booster (default: gbtree)
- gbtree: tree-based model
- gblinear: linear model
Booster Parameters:
(1) eta (default: 0.3)
(2) max_depth (default: 6)
(3) subsample (default: 1)
Task Parameters:
(1) objective (default: reg:linear)
- reg:linear: linear regression.
- reg:logistic: logistic regression.
- binary:logistic: logistic regression for binary classification; returns the predicted probability (not the class).
- multi:softmax: multi-class classification using softmax; returns the predicted class (not a probability). In this case the additional parameter num_class must also be set.
- multi:softprob: same as multi:softmax, but returns the probability of each data point belonging to each class.
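Putting the parameters above together, here is a minimal training sketch using the native XGBoost API; the evaluation metric, boosting rounds, early stopping, and file names are illustrative choices rather than the system's actual configuration.

```python
# Sketch: configure the general / booster / task parameters listed above and
# train a binary classifier. Values shown are defaults or illustrative picks.
import xgboost as xgb

params = {
    # General parameters
    "booster": "gbtree",             # tree-based model (alternative: gblinear)
    # Booster parameters
    "eta": 0.3,                      # learning rate
    "max_depth": 6,                  # maximum tree depth
    "subsample": 1.0,                # row sampling ratio per tree
    # Task parameters
    "objective": "binary:logistic",  # binary classification, outputs a probability
    "eval_metric": "auc",            # illustrative evaluation metric
}

dtrain = xgb.DMatrix("train.libsvm?format=libsvm")
dval = xgb.DMatrix("val.libsvm?format=libsvm")

booster = xgb.train(params, dtrain, num_boost_round=200,
                    evals=[(dval, "validation")], early_stopping_rounds=20)

# After training, the model file would go into the model storage pool.
booster.save_model("model_v1.bin")
```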
Model storage: after training, the model is saved in the model storage pool, then pulled to local disk by the model pulling tool and loaded by the online service.
3. Configuration management
The existing configuration tables mainly include:
(1) Sample data configuration table: configures the raw data table information, the malicious data table, the positive/negative sample data tables, etc.;
(2) Model-feature selection configuration table: the selected model, the selected feature set, and the model's online status;
(3) Model training information configuration table: sampling ratio, amount of historical data sampled, training data table, etc.
These configuration tables are built to enable rapid online deployment of models and to keep the feature sets associated with a model consistent between online and offline. On top of them, a self-service training system has been built, which greatly reduces the manpower needed for offline model training.
4. Offline process self-service
Offline model training involves writing M SQL statements and N scripts and configuring several task flows across stages such as sample data processing, feature filtering, training data set generation, data format conversion, and model training. It is hard to get every stage exactly right, and a problem in any one of them can make the final trained model unusable online. To solve this, an offline self-service model training system was designed and developed: by simply configuring and submitting the relevant information on a front-end page, the background program automatically drives the sample, feature, and model training stages, while intermediate progress views and exception information are surfaced transparently.
After the information of the model to be trained is configured and submitted on the web page, the background logic adds the configuration to the relevant configuration tables and then drives the corresponding scripts to start processing (including sample data filtering, feature extraction, and construction of the training data set). Once the training data set is complete, the model training task is started automatically.
By integrating the offline model training functions, the overall time and manpower spent on model training are greatly reduced, and the consistency of the feature set between model training and online prediction is guaranteed.