Webshell detection based on deep learning (2)

Posted by punzalan at 2020-03-21


The previous article mainly covered how to detect webshells with machine learning methods. In this article, we use a deep learning method for the same task.

A webshell is, in essence, a code file, and a code file is text. The difference is that this text is written in a computer language. So we can ask whether NLP (natural language processing) can also be applied to classifying computer-language text.

I think it can.

Computer languages and natural languages are alike in some respects; for example, both have clearly defined grammatical features and standards.

In natural language, sentence elements include the subject, predicate, object, attributive, adverbial, and complement, and sentence patterns include subject + linking verb + predicative, subject + predicate + object, and so on.

A computer language likewise has variables, types, classes, objects, lifetimes, function names, and so on.

In other words, computer languages have their own grammatical norms and styles, and that is where the idea for this article comes from.

This article uses the most common text classification approach, word embeddings plus an LSTM, and the code is implemented in Python with Keras.

Without further ado, let's go straight to the code.

First, we import common libraries for data preprocessing and visualization.

    import os
    import re
    import subprocess

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import sklearn

    # Jupyter magic; omit this line outside a notebook
    %matplotlib inline

Read data

As in the previous article, we put the labels and the corresponding file paths of the dataset into a pandas.DataFrame object.

    files_webshell = os.listdir("/webshell/project/php-webshell/")
    files_common = os.listdir("/webshell/project/php-common")
    # Label webshells 1 and normal files 0
    labels_webshell = [1] * len(files_webshell)
    labels_common = [0] * len(files_common)

Prefix each file name with its directory, then concatenate the malicious and normal file lists and, in the same order, the two label lists.

    files_webshell = ["/webshell/project/php-webshell/" + f for f in files_webshell]
    files_common = ["/webshell/project/php-common/" + f for f in files_common]
    files = files_webshell + files_common
    labels = labels_webshell + labels_common

Convert them into a pandas.DataFrame.

    datadict = {'label': labels, 'file': files}
    df = pd.DataFrame(datadict, columns=['label', 'file'])
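The listing and labelling steps above could also be wrapped in one small helper. The following is only a sketch: the function name collect_labeled_files is hypothetical, and it assumes each directory holds one sample per regular file.

```python
import os

def collect_labeled_files(directory, label):
    """Return (label, full_path) pairs for every regular file in `directory`."""
    rows = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):  # skip sub-directories
            rows.append((label, path))
    return rows
```

Concatenating the rows for the webshell directory (label 1) and the normal directory (label 0) and passing them to pd.DataFrame(rows, columns=['label', 'file']) should give the same two columns as above.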

Because our sample size is very small and a single code file may contain many variables, strings, functions, classes, and so on, applying language processing such as one-hot encoding or embeddings directly at the source-code level would produce a very large and very sparse matrix (the vocabulary is huge and most words occur only rarely).

So how do we deal with this? One idea is to compile the code into intermediate or low-level code. PHP opcodes are comparable to x86 assembly: a large amount of source code is converted into low-level code drawn from a limited instruction set (like MOV, JZ, and JMP in assembly).
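To make the vocabulary argument concrete, here is a toy comparison. Everything in it is made up for illustration: the two PHP snippets are hypothetical, and the opcode lists are hand-written stand-ins rather than real compiler output.

```python
import re

# Two hypothetical PHP snippets that do the same thing with different names.
sources = [
    '<?php $cmd = $_GET["c"]; echo $cmd; ?>',
    '<?php $input = $_POST["q"]; echo $input; ?>',
]

# Hand-written opcode sequences for the same snippets (illustrative only,
# not real VLD output).
opcodes = [
    ["FETCH_R", "FETCH_DIM_R", "ASSIGN", "ECHO", "RETURN"],
    ["FETCH_R", "FETCH_DIM_R", "ASSIGN", "ECHO", "RETURN"],
]

def vocabulary(token_lists):
    """Set of distinct tokens; its size is the one-hot dimension."""
    return {tok for tokens in token_lists for tok in tokens}

# Crude source-level tokenizer: identifiers and keywords only.
source_tokens = [re.findall(r"[A-Za-z_]\w*", s) for s in sources]
source_vocab = vocabulary(source_tokens)
opcode_vocab = vocabulary(opcodes)
print(len(source_vocab), len(opcode_vocab))
```

Every new variable name or string enlarges the source-level vocabulary, while the opcode vocabulary is bounded by the instruction set.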

For PHP, "the best language", the underlying opcodes can be obtained through the VLD extension, so we compile all of our sample files into PHP opcodes and save them. This step takes a while.

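    import shlex

    def getopcode(x):
        try:
            # vld.active=1 turns on the opcode dump; vld.execute=0 stops the script from running
            cmd = "php -dvld.active=1 -dvld.execute=0 " + shlex.quote(str(x))
            output = subprocess.getoutput(cmd)
            # Opcode names are runs of capitals/underscores surrounded by whitespace
            oplist = re.findall(r'\s(\b[A-Z_]+\b)\s', output)
            print(str(x))
            return oplist
        except Exception:
            print("error" + str(x))
            return None

    df['opc'] = df['file'].map(getopcode)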

Next, because the LSTM model we use needs fixed-length input, we compute and store the opcode sequence length of each file. Before that, we drop the rows with empty data.

    df = df.dropna()
    df['oplen'] = df['opc'].map(len)
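One way the stored lengths could be used is to check how many files fit a candidate fixed length without truncation. This is a sketch only: the coverage helper and the sample lengths are hypothetical and not taken from the dataset.

```python
def coverage(lengths, max_len):
    """Fraction of files whose opcode sequence fits in max_len without truncation."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

oplens = [12, 40, 75, 98, 130, 260, 900]  # hypothetical values of df['oplen']
for max_len in (50, 100, 500):
    print(max_len, round(coverage(oplens, max_len), 2))
```

With real data, df['oplen'].tolist() would take the place of the hypothetical list.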

OK, let's see what the opcodes look like. They resemble x86 assembly.
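As a rough illustration of how the regular expression in getopcode pulls opcode names out of a dump, the sketch below runs it on a hand-written text that imitates the column layout of VLD output. The sample text is an assumption for illustration, not captured from a real run, and the length filter at the end is an optional extra that the original code does not apply.

```python
import re

# Hand-written text imitating the column layout of a VLD dump
# (an assumption for illustration, not captured from a real run).
sample = (
    "line     #* E I O op            operands\n"
    "   3     0  E >   ECHO          'hi'\n"
    "   4     1      > RETURN        1\n"
)

# Same pattern as getopcode(): runs of capitals/underscores between whitespace.
tokens = re.findall(r'\s(\b[A-Z_]+\b)\s', sample)

# The pattern also picks up single-letter column flags such as "E",
# so this optional filter keeps only tokens longer than one character.
opcodes = [t for t in tokens if len(t) > 1]
print(opcodes)
```

Whether the single-letter flags matter in practice depends on the real dump; they appear for every file, so they may simply act as constant noise.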

Next comes the data processing that feeds the model. We use the very common word embedding + LSTM approach. The embeddings are produced with word2vec, implemented with the Python gensim library.

(It really is comfortable to write Python again; the author has been tempted into the Julia pit several times and has always come back, lol.)

We wrap this in a function that runs the word2vec algorithm on the opc column and produces the numeric matrix for each file. Here the author uses a vector length of 100 and stores the learned vectors in the file word2vec.txt.

    from gensim.models import Word2Vec

    MAX_LEN = 100  # number of opcodes kept per file

    def getw2v(opc_list, label_list):
        # Use plain lists: the DataFrame index has gaps after dropna()
        opc_list = list(opc_list)
        label_list = list(label_list)
        # Train Word2Vec with each file's opcode sequence as one "sentence";
        # the default vector length is 100
        model = Word2Vec(opc_list, min_count=1)
        model.wv.save_word2vec_format('word2vec.txt', binary=False)
        dim = model.wv.vector_size
        wv_vect = []
        for name in opc_list:
            # Keep the first MAX_LEN opcodes and look up their vectors
            tmp = [model.wv[op] for op in name[:MAX_LEN]]
            # Pad short files with zero vectors up to MAX_LEN
            tmp.extend([[0] * dim] * (MAX_LEN - len(tmp)))
            wv_vect.append(np.vstack(tmp))
        return np.array(wv_vect), np.array(label_list)

    w2v_word_list, label_list = getw2v(df['opc'], df['label'])
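The truncate-and-pad step is the part that gives every file the same shape. It can be looked at in isolation with a toy embedding table in place of the trained Word2Vec model. In this sketch, pad_or_truncate and toy_vectors are hypothetical names, and the 3-dimensional vectors are made up.

```python
import numpy as np

def pad_or_truncate(vectors, max_len, dim):
    """Return a (max_len, dim) array: the first max_len rows kept, the rest zero-padded."""
    out = np.zeros((max_len, dim), dtype=np.float32)
    rows = np.asarray(vectors[:max_len], dtype=np.float32).reshape(-1, dim)
    out[:len(rows)] = rows
    return out

# Toy embedding table standing in for the trained Word2Vec vectors (made-up values).
toy_vectors = {"ECHO": [1.0, 0.0, 0.0], "RETURN": [0.0, 1.0, 0.0]}
seq = [toy_vectors[op] for op in ["ECHO", "ECHO", "RETURN"]]

short = pad_or_truncate(seq, max_len=5, dim=3)      # 3 rows padded with 2 zero rows
long_ = pad_or_truncate(seq * 4, max_len=5, dim=3)  # 12 rows cut down to 5
print(short.shape, long_.shape)
```

With max_len=100 and dim=100 this yields the (100, 100) matrix per file that the LSTM receives.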

Next we prepare the data for the model: part of the data is used as the training set and the rest as the test set (this demo omits the validation set).

    # The data is ordered webshells first, then normal files,
    # so each set takes one slice from each region
    x_train = np.concatenate((w2v_word_list[0:2000], w2v_word_list[2500:7000]))
    y_train = np.concatenate((label_list[0:2000], label_list[2500:7000]))
    x_test = np.concatenate((w2v_word_list[2000:2500], w2v_word_list[7000:]))
    t_test = np.concatenate((label_list[2000:2500], label_list[7000:]))
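The fixed slices above depend on knowing where one class ends and the other begins. A possible alternative, shown only as a sketch and not as the author's method, is to shuffle the sample indices with a fixed seed and then cut off a test fraction. The helper name shuffled_split and the 30% fraction are assumptions.

```python
import random

def shuffled_split(n_samples, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffled_split(10, test_fraction=0.3)
print(len(train_idx), len(test_idx))
```

NumPy arrays accept such index lists directly, for example w2v_word_list[train_idx] and label_list[train_idx].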

Next, we build the model with Keras. There is only one reason for choosing Keras: it is extremely convenient for small pieces of code. Later, I will also use TensorFlow for more complex tasks.

The following few lines of code complete the model. It is a single-layer LSTM whose output is passed through a sigmoid function to give a value between 0 and 1. The optimizer is Adam.

    from keras.models import Sequential
    from keras.layers import Dense
    from keras.layers import LSTM

    model = Sequential()
    # The inputs are already Word2Vec vectors, so no Embedding layer is added
    model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
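For readers who want to see what the last layer and the loss compute, here is a plain-Python sketch of the sigmoid function and the binary cross-entropy for a single sample. It is a textbook formulation, not Keras's internal code, and the clipping constant eps is an assumption chosen for illustration.

```python
import math

def sigmoid(z):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one sample: small when p is close to the true label y_true."""
    p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

print(sigmoid(0.0))
print(binary_crossentropy(1, 0.9), binary_crossentropy(1, 0.1))
```

A confident correct prediction gives a small loss and a confident wrong one gives a large loss, which is what the optimizer minimizes during training.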

Finally, we train the model and evaluate its accuracy on the test set, first for the single-layer model and then for a model with two stacked LSTM layers.

    print('now training....')
    model.fit(x_train, y_train, epochs=50, batch_size=32)
    print('now evaling....')
    score, acc = model.evaluate(x_test, t_test)
    print(score, acc)

    # Second model: two stacked LSTM layers
    model = Sequential()
    # return_sequences=True passes the full sequence on to the next LSTM layer
    model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
    model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    print('now training....')
    model.fit(x_train, y_train, epochs=50, batch_size=32)
    print('now evaling....')
    score, acc = model.evaluate(x_test, t_test)
    print(score, acc)
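The accuracy printed above comes from thresholding the sigmoid outputs. The sketch below shows that calculation by hand on made-up labels and probabilities, together with the recall on the webshell class, since a detector can look accurate while still missing webshells. The helper names, the 0.5 threshold, and the numbers are assumptions for illustration and are not results from the experiment.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    return sum(1 for t, p in zip(y_true, preds) if t == p) / len(y_true)

def webshell_recall(y_true, y_prob, threshold=0.5):
    """Share of true webshells (label 1) that are flagged as webshells."""
    flagged = [p > threshold for t, p in zip(y_true, y_prob) if t == 1]
    return sum(flagged) / len(flagged)

# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 1, 0, 0, 1]
y_prob = [0.9, 0.4, 0.2, 0.6, 0.8]
print(binary_accuracy(y_true, y_prob), webshell_recall(y_true, y_prob))
```

With a trained model, the probabilities would come from model.predict(x_test) flattened to one value per sample.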