0x00 KNN (k-nearest neighbor) algorithm
K-nearest neighbor algorithm is a classification algorithm which uses the distance between different eigenvalues to classify.
- Advantages: high precision, insensitive to abnormal values
Advantages: high precision, insensitive to abnormal values
- Disadvantages: high computational complexity, high spatial complexity, very sensitive to the local structure of data
Disadvantages: high computational complexity, high spatial complexity, very sensitive to the local structure of data
- Required data sources: numerical and nominal
Required data sources: numerical and nominal
- Usually K is an integer no more than 20
Usually K is an integer no more than 20
0x01 an example of KNN algorithm that is easy to understand
What kind of green circle (Blue Square) should be classified in the figure? Red triangle?) take the K samples closest to the green circle, which kind of samples accounts for the most proportion, and which kind will be divided. When k = 3, the proportion of red triangle is 2 / 3, and the green circle will be classified into red triangle. When k = 5, the proportion of blue square is 3 / 5, and the green circle will be classified into blue square.
Do the following for each point in the dataset for an unknown category property in turn:
- Calculate the distance between the points in the known category dataset and the currently unknown category points
Calculate the distance between the points in the known category dataset and the currently unknown category points
- Sort by increasing distance
Sort by increasing distance
- Select k points with the minimum distance from the current point
Select k points with the minimum distance from the current point
- Determine the occurrence frequency of the category of the first k points
Determine the occurrence frequency of the category of the first k points
- Return the category with the highest frequency of the first k points as the prediction classification of the current point
Return the category with the highest frequency of the first k points as the prediction classification of the current point
0x02 practical example of KNN algorithm
There are two types of movies, love movies and action movies. We classify the films by the number of times the fighting and kissing scenes appear in the films. And we have some data sets classified.
0x03 characterize data
Data characterization is used to calculate the distance between the points in the dataset and the currently unknown category points.
Take the number of fighting shots and kissing shots as coordinates.
When classifying a machine with unknown data: the coordinate of the green point in the coordinate system is (10,5)
Calculate the distance between the green point and each red point and each blue point:
The distance between points can be calculated according to Pythagorean theorem.
After calculation, we can get:
d1, d2, d3, d4, d5, d6.....
And so on, when there are n constants:
All the D obtained are sorted in ascending order. The first k points are selected. The category with high frequency of the first k points is the category of unknown quantity.
Numerical normalization
However, when the range of one kind of data in the data set is larger than that of other data, that is, the greater the difference between some numbers in the equation, the greater the impact on the calculation results.
For example:
In this case, if the difference between (3000-1000) and (5-2) is larger, the weight of (3000-1000) in the calculation result will be larger.
In most cases, we should give the same weight to each feature, so when there are multiple values, multiple values should be given the same weight. When processing the eigenvalues of different value ranges, the usual method is to normalize the values, for example, the value range is 0 ~ 1 or - 1 ~ 1.
newValue=(oldValue−min)/(max−min)
For example, in the above movie classification: when the number of fighting shots is 5, normalize 5 to get a new result:
0.2308=(5−2)/(15−2)
0x04 detect abnormal behavior operation
A 1.5W line of cmd.txt file, like:
- Every command executed by 1 behavior, a collection of every 100 behaviors.
Every command executed by 1 behavior, a collection of every 100 behaviors.
A sign.txt file (150 lines in total) that marks whether each set (every 100 lines) is abnormal. 0 is normal, 1 is abnormal.
- If there is at least one exception command in a set (every 100 lines), the whole set is marked as 1
If there is at least one exception command in a set (every 100 lines), the whole set is marked as 1
Step 1: data feature extraction
Since the tag file in the original data is a collection of 100 data, it is necessary to organize and extract the data of a.txt file. Collate all commands to obtain CMD list and frequency:
cmd_list = [['cpp','sh',...,'col'],['sh',...,'windows'],...['mailbox','ksh',...'ls']]
len(cmd_list) = 150
Use freqdist (statistical text word frequency) to sort all commands according to the frequency as follows:
frequency_list = FreqDist(list(cmd_list)).keys()frequency_list = ['vacation', 'sh', 'sendmail', 'cpp', 'xrdb', 'mkpts', 'test',....]
frequency_list = ['vacation', 'sh', 'sendmail', 'cpp', 'xrdb', 'mkpts', 'test',....]
Organize the marking results of the sample, because the first 5K of the sample behaves normally, so the marking results are all
y = list()
a = open('sign.txt')
for line in a.readlines():
line=line.strip('\n')
y.append(line)
Step 2: characterize the data
Use the word set to quantify the operation commands. The command data is converted to a specific value, and the frequency of the command occurrence is used as a reference to convert the command to a specific data feature.
user_cmd_feature = list()
for every_cmd_list in cmd_list:
v = [0]*len(frequency_list)
# 将按频率排序的所有的命令初始为0
for i in range(0, len(frequency_list)):
if list(frequency_list)[i] in every_cmd_list:
# 判断按频率排序的所有的命令是否在每组命令中出现过
v[i] += 1
user_cmd_feature.append(v)
The result format of V is shown in the figure. The value of V can be understood as the coordinate point of coordinate system.
Step 3: test the algorithm and use
Use Python library to pass in data for calculation.
#训练样本数
N = 90x_train = user_cmd_feature[:N]
y_train = y[:N]
# 训练数据集
x_test = user_cmd_feature[N:]
y_test = y[N:]
# 测试数据集
neigh = KNeighborsClassifier(n_neighbors=3)
# 此处设置k=3
neigh.fit(x_train, y_train)
y_predict = neigh.predict(x_test)
# 传入测试数据
print(y_predict)
# 测试数据计算结果
score=np.mean(y_test==y_predict)*100
print(score)
# 正确率
X? Train = user? CMD? Feature [: n] y? Train = y [: n]? Train data set
X? Test = user? CMD? Feature [n:] y? Test = y [n:]? Test data set
Neigh = kneighborsclassifier (n ﹐ neighs = 3) ﹐ set K = 3neigh.fit (x ﹐ train, y ﹐ train) here
Y × predict = neigh.predict (x × test) × incoming test data print (Y × predict) × test data calculation result
Score = NP. Mean (y'test = = y'predict) * 100 print (score) ා accuracy
0x05 summary
The key point of this case is to characterize the data, that is, to convert specific commands into specific values. In the case of detecting abnormal behavior operation, because it is the command executed by the dataset for a user's history, we use the frequency of command occurrence as the feature of data characterization to quantify the operation command. Of course, in practical application, not only frequency is the feature, but also similarity can be added to process.
0x06 reference
- Getting started with machine learning of web security [https://item.jd.com/12158965.html]
Getting started with machine learning of web security [https://item.jd.com/12158965.html]
- Deep learning [https://item.jd.com/12128543.html]
Deep learning [https://item.jd.com/12128543.html]
Donation benefits
Follow the official microblog of mlsrc, and you will get a chance to receive a classic bestseller in the field of deep learning
---Deep learning
Deadline: 18:00, November 15