Overview
The previous article mainly discussed how to detect webshells with traditional machine learning methods; in this chapter we will use deep learning to complete the same task.
A webshell is, in essence, a code file, which means it is text; the only difference is that the text is written in a computer language rather than a natural one. So we can ask whether NLP (natural language processing) can also be applied to classifying computer language.
I think the answer is yes.
Computer languages and natural languages are alike in some respects: for example, both have very clear grammatical features and conventions.
In natural language, sentences are built from components such as subject, predicate, object, attributive, adverbial and complement, and follow patterns such as subject-verb-object or subject-linking-verb-predicative.
A computer language likewise has variables, types, classes, objects, statements, function names, and so on.
In other words, computer languages also have their own set of grammatical rules and styles, and this observation is the starting point of this article.
This article uses the most common text-classification approach, word embedding plus an LSTM, to complete the task; the code is implemented in Python with Keras.
Without further ado, let's get to the code.
First, we import common libraries for data preprocessing and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import os
import subprocess
import re
%matplotlib inline
Read data
As in the previous article, we put the labels and the corresponding file paths of the dataset into a pandas.DataFrame object.
files_webshell = os.listdir("/webshell/project/php-webshell/")
files_common = os.listdir("/webshell/project/php-common")
labels_webshell = []
labels_common = []
for i in range(0,len(files_webshell)):
    labels_webshell.append(1)
for i in range(0,len(files_common)):
    labels_common.append(0)
Prepend the directory path to each file name, then concatenate the malicious and normal file lists, and likewise the two label lists.
for i in range(0,len(files_webshell)):
    files_webshell[i] = "/webshell/project/php-webshell/" + files_webshell[i]
for i in range(0,len(files_common)):
    files_common[i] = "/webshell/project/php-common/" + files_common[i]
files = files_webshell + files_common
labels = labels_webshell + labels_common
Convert them into a pandas.DataFrame:
datadict = {'label':labels,'file':files}
df = pd.DataFrame(datadict,columns=['label','file'])
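A quick sanity check on the assembled DataFrame (a minimal sketch; it assumes the directory paths above exist):
print(df.shape)
print(df['label'].value_counts())   # 1 = webshell, 0 = normal file
df.head()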
Because our sample size is small, and a single code file may contain a great many variables, strings, functions, classes and so on, we can expect that if we only do language processing at the source-code level (one-hot encoding, embedding, etc.) we will end up with a very large and very sparse matrix, since the vocabulary is huge and most words occur only rarely.
So how do we deal with this? The idea is to compile the code down to an intermediate or low-level representation. PHP opcode can be thought of as the analogue of x86 assembly: it lets us translate a large amount of source code into low-level code with a limited instruction set (just like MOV, JZ and JMP in assembly).
For PHP, "the best language in the world", this low-level code can be obtained through the VLD extension, so we compile every sample file into PHP opcode and save the result. This step takes quite a while.
def getopcode(x):
    try:
        # -dvld.active=1 dumps the opcodes, -dvld.execute=0 skips actually running the script
        cmd = "php -dvld.active=1 -dvld.execute=0 " + str(x)
        output = subprocess.getoutput(cmd)
        # opcode mnemonics are upper-case words, e.g. ASSIGN, ECHO, DO_FCALL
        oplist = re.findall(r'\s(\b[A-Z_]+\b)\s', output)
        print(str(x))
        return oplist
    except:
        print("error " + str(x))
        return None

df['opc'] = df['file'].map(lambda x: getopcode(x))
After that, because the LSTM model we use needs fixed-length input, we compute the length of each file's opcode sequence and store it. Before doing so, we drop the empty entries (files for which no opcode could be extracted).
df = df.dropna()
def getoplen(x):
    return len(x)

df['oplen'] = df['opc'].map(lambda x: getoplen(x))
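Before fixing an input length it is worth looking at how the opcode sequence lengths are distributed; a minimal sketch using the matplotlib import from above:
print(df['oplen'].describe())
plt.hist(df['oplen'], bins=50)
plt.xlabel('opcode sequence length')
plt.ylabel('number of files')
plt.show()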
OK, let's take a look at what the opcode output looks like. It really is quite similar to x86 assembly.
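For reference, the first few opcodes of a sample can be printed like this (the exact tokens depend on the PHP and VLD versions; the list in the comment is only illustrative):
print(df['opc'].iloc[0][:10])   # e.g. ['EXT_STMT', 'ASSIGN', 'EXT_STMT', 'ECHO', 'RETURN', ...]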
Next comes the data preparation for the model. We use the most common approach, word embedding + LSTM. The embedding is produced with the word2vec method, implemented using Python's gensim library.
(It really is comfortable to be writing Python again; the author has been tempted to jump into the Julia rabbit hole several times, but keeps coming back.)
We wrap everything in a function that turns the opc column into numeric matrices via the word2vec algorithm. Here the author chooses a vector length of 100 and saves the trained vectors to the file word2vec.txt.
from gensim.models import Word2Vec

def getw2v(opc_list, label_list):
    print(label_list[0:10])
    stop = []          # optional stop-word list for opcodes; left empty here
    w2v_list = []
    # collect every opcode sequence as a "sentence" for word2vec training;
    # 7789 is a hard-coded upper bound on the sample index (the size of the author's dataset),
    # and indices removed by dropna() simply raise and are skipped by the except
    for i in range(0, 7789):
        try:
            tmp = []
            name = opc_list[i]
            for j in range(0, len(name)):
                tmp.append(name[j])
            w2v_list.append(tmp)
        except:
            pass
    # train word2vec on the opcode "sentences" (default vector size is 100)
    model = Word2Vec(w2v_list, min_count=1)
    model.wv.save_word2vec_format('word2vec.txt', binary=False)
    label_vect = []
    wv_vect = []
    # turn every file into a fixed 100 x 100 matrix:
    # at most 100 opcodes, each replaced by its 100-dimensional word vector
    for i in range(0, 7789):
        try:
            name = opc_list[i]
            tmp = []
            vect = []
            for j in range(0, len(name)):
                if name[j] in stop:
                    continue
                tmp.append(model.wv[name[j]])
                if j >= 99:
                    break
            # pad short sequences with zero vectors up to length 100
            if len(tmp) < 100:
                for k in range(0, 100 - len(tmp)):
                    tmp.append([0] * 100)
            vect = np.vstack(tmp)
            wv_vect.append(vect)
            label_vect.append(label_list[i])
        except:
            pass
    wv_vect = np.array(wv_vect)
    label_vect = np.array(label_vect)
    return wv_vect, label_vect
w2v_word_list,label_list = getw2v(df['opc'],df['label'])
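A quick shape check on the result: each sample should be a 100 x 100 matrix (up to 100 opcodes, each a 100-dimensional word2vec vector).
print(w2v_word_list.shape)   # (number of usable samples, 100, 100)
print(label_list.shape)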
Next we split the data: part of it becomes the training set and the rest the test set (for this demo the validation set is omitted). Since the samples are ordered with webshells first and normal files after, we slice from both regions so that the training and test sets each contain both classes.
x_train = np.concatenate((w2v_word_list[0:2000],w2v_word_list[2500:7000]))
y_train = np.concatenate((label_list[0:2000] , label_list[2500:7000]))
x_test = np.concatenate((w2v_word_list[2000:2500] , w2v_word_list[7000:]))
y_test = np.concatenate((label_list[2000:2500] , label_list[7000:]))
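Note that these fixed slices rely on the samples being ordered by class; an alternative sketch (not part of the original workflow) is a shuffled, stratified split with scikit-learn:
from sklearn.model_selection import train_test_split
# 80/20 shuffled split; stratify keeps the webshell/normal ratio the same in both sets
x_train, x_test, y_train, y_test = train_test_split(
    w2v_word_list, label_list, test_size=0.2, random_state=42, stratify=label_list)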
Next we build the model with Keras. There is only one reason for choosing Keras: it is extremely convenient when the amount of code is small. Later on I will also use TensorFlow for more complex tasks.
Yes, the few lines of code below complete the modelling. A single-layer LSTM is used, whose output is passed through a sigmoid function to produce a value between 0 and 1; the optimizer is Adam.
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense,Embedding
from keras.layers import LSTM
model = Sequential()
# no Embedding layer is needed here: the word2vec vectors are fed in directly,
# so the input is a sequence of 100 opcodes, each a 100-dimensional vector
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2, input_shape=(100, 100)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
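With the input shape specified above, the architecture can be inspected before training:
model.summary()   # one LSTM layer (128 units) followed by a single sigmoid output unit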
Finally, we train and evaluate the model and obtain the following accuracy.
print ('now training....')
model.fit(x_train, y_train, epochs=50, batch_size=32)
print ('now evaling....')
score,acc = model.evaluate(x_test,y_test)
print (score,acc)
For comparison, we also try a stacked model with two LSTM layers; the first layer returns the full sequence so that the second can consume it.
model = Sequential()
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2, return_sequences=True, input_shape=(100, 100)))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print ('now training....')
model.fit(x_train, y_train, epochs=50, batch_size=32)
print ('now evaling....')
score,acc = model.evaluate(x_test, y_test)
print (score,acc)