*Original author: foxscheduler. This article is part of the FreeBuf original reward program; reproduction without permission is prohibited.
I. Preface
As we all know, deep learning has made great strides in computer vision, natural language processing, and other areas of artificial intelligence, and it has begun to find practical applications in security as well. The experiment in this article treats XSS detection as a text-classification problem and uses deep learning to detect XSS attacks. Because I am a beginner, my understanding of the algorithms themselves is inevitably imprecise, so this article tries to introduce them in a simple, accessible way without too many details, to avoid misleading anyone.
II. Dataset
Open datasets in the security field are scarce. The experimental data in this article consist of two parts: more than 40,000 malicious samples crawled from xssed serve as positive examples, and about 200,000 normal HTTP GET request records serve as negative examples. To keep the data safe to share, the host, path, and other identifying parts of each URL were removed, retaining only the payload.
The data above are stored in CSV after URL encoding. Because some of the original data were already URL-encoded, each record must be URL-decoded twice before use.
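For example, a minimal sketch of the double-decoding step (the sample string is hypothetical):

from urllib.parse import unquote

raw = "%253Cscript%253Ealert(0)%253C%252Fscript%253E"  # double-encoded sample
payload = unquote(unquote(raw))  # first pass: %25 -> %, second pass: %3C -> < etc.
print(payload)  # <script>alert(0)</script>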
Positive examples:
topic=http://gmwgroup.harvard.edu/techniques/index.php?topic=<script>alert(document.cookie)</script>
siteID=';alert(String.fromCharCode(88,83,83))//\';alert(String.fromCharCode(88,83,83))//";alert(String.fromCharCode(88,83,83))//\";alert(String.fromCharCode(88,83,83))//--></SCRIPT>">'><SCRIPT>alert(String.fromCharCode(88,83,83))</SCRIPT>
js='"--></style></script><script>alert(/meehinfected/)</script></title><marquee><h1>XSS:)</h1><marquee><strong><blink>XSSTEST</blink></strong></marquee><h1 >XSS </h1></marquee>
Negative examples:
_=1498584888937/&list=FU1804,FU0,FU1707,FU1708,FU1709,FU1710,FU1711,FU1712
hid=sgpy-windows-generic-device-id&v=8.4.0.1062&brand=1&platform=6&ifbak=0&ifmobile=0&ifauto=1&type=1&filename=sgim_privilege.zip
iid=11491672248&device_id=34942737887&ac=wifi&channel=huawei&aid=13&app_name=news_article&version_code=621&version_name=6.2.1&device_platform=android&ssmix=a&device_type=FDR-A03L&device_brand=HUAWEI&language=zh&os_api=22&os_version=5.1.1&uuid=860947033207318&openudid=fc19d05187ebeb0&manifest_version_code=621&resolution=1200*1848&dpi=240&update_version_code=6214&_rticket=1498580286466
III. Tokenization
Any text-classification approach must first decide how to tokenize the text. Looking at the examples above, how does a person recognize XSS? The parameters contain complete, executable HTML tags and DOM methods. The tokenizer therefore needs to support the following rules:
Quoted content, e.g. 'xss' in single or double quotes
http/https links
Tags, e.g. <script> and </script>
Unclosed tag openings, e.g. <h1
Parameter names, e.g. topic=
Function calls, e.g. alert(
Words made up of letters and digits
In addition, to keep the vocabulary small, numbers and hyperlinks are normalized: every number is replaced with "0" and every hyperlink with "http://u".
The implementation code is as follows:
import re
import nltk
from urllib.parse import unquote

def GeneSeg(payload):
    # Lowercase and URL-decode twice (some samples are double-encoded)
    payload = payload.lower()
    payload = unquote(unquote(payload))
    # Normalize: every number becomes "0", every hyperlink becomes "http://u"
    payload, num = re.subn(r'\d+', "0", payload)
    payload, num = re.subn(r'(http|https)://[a-zA-Z0-9\.@&/#!#\?]+', "http://u", payload)
    # Verbose regex: function calls, quoted strings, tags, parameter names, bare words
    r = '''
        (?x)[\w\.]+?\(
        |\)
        |"\w+?"
        |'\w+?'
        |http://\w
        |</\w+>
        |<\w+>
        |<\w+
        |\w+=
        |>
        |[\w\.]+
    '''
    return nltk.regexp_tokenize(payload, r)
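For instance, running GeneSeg on the first positive sample reproduces the first token list shown below:

print(GeneSeg("topic=http://gmwgroup.harvard.edu/techniques/index.php?topic=<script>alert(document.cookie)</script>"))
# ['topic=', 'http://u', '<script>', 'alert(', 'document.cookie', ')', '</script>']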
Positive examples after tokenization:
['topic=', 'http://u', '<script>', 'alert(','document.cookie', ')', '</script>']
['siteid=', 'alert(', 'string.fromcharcode(', '0','0', '0', ')', ')', 'alert(', 'string.fromcharcode(', '0', '0', '0', ')', ')','alert(', 'string.fromcharcode(', '0', '0', '0', ')', ')', 'alert(','string.fromcharcode(', '0', '0', '0', ')', ')', '>', '</script>','>', '>', '<script>', 'alert(', 'string.fromcharcode(', '0', '0','0', ')', ')', '</script>']
['js=', '>', '</style>', '</script>','<script>', 'alert(', 'meeh', 'infected', ')', '</script>','</title>', '<marquee>', '<h0>', 'xss', ')', '</h0>','<marquee>', '<strong>', '<blink>', 'xss', 'test','</blink>', '</strong>', '</marquee>', '<h0', '>','xss', ')', '</h0>', '</marquee>']
Negative examples after tokenization:
['_=', '0', 'list=', 'fu0', 'fu0', 'fu0', 'fu0','fu0', 'fu0', 'fu0', 'fu0']
['hid=', 'sgpy', 'windows', 'generic', 'device', 'id','v=', '0.0.0.0', 'brand=', '0', 'platform=', '0', 'ifbak=', '0', 'ifmobile=','0', 'ifauto=', '0', 'type=', '0', 'filename=', 'sgim_privilege.zip']
['iid=', '0', 'device_id=', '0', 'ac=', 'wifi','channel=', 'huawei', 'aid=', '0', 'app_name=', 'news_article','version_code=', '0', 'version_name=', '0.0.0', 'device_platform=', 'android','ssmix=', 'a', 'device_type=', 'fdr', 'a0l', 'device_brand=', 'huawei','language=', 'zh', 'os_api=', '0', 'os_version=', '0.0.0', 'uuid=', '0','openudid=', 'fc0d0ebeb0', 'manifest_version_code=', '0', 'resolution=', '0','0', 'dpi=', '0', 'update_version_code=', '0', '_rticket=', '0']
IV. Word Embeddings
The first step is to turn the tokenized text into something a machine learning algorithm can consume. The most common approach is one-hot encoding: the vocabulary is represented as a very long vector in which a single dimension is 1 and all others are 0, so "<script>" might be represented as [0,0,0,1,0,...,0]. The major problems with this scheme are that the resulting text vectors are extremely sparse, the words are mutually independent, and the model cannot capture any word semantics. Word embeddings instead learn, from the text itself, a dense vector per word that carries its semantic information, placing words in a space where semantically similar words sit close together. Such a space can express synonyms like "microphone" and "mike", and words like "cat", "dog", and "fish" cluster together.
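A toy sketch of one-hot encoding over a five-word vocabulary (the vocabulary here is hypothetical) makes the sparsity obvious:

import numpy as np

vocab = ["alert(", "<script>", "</script>", "topic=", "http://u"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(one_hot["<script>"])  # [0. 1. 0. 0. 0.] -- a single 1, everything else 0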
Here a word-embedding model is used to build a semantic model of XSS, so that the machine can make sense of HTML-like tokens such as <script> and alert(). The 3,000 most frequent words in the positive examples form the vocabulary; every other word is mapped to "UNK". The gensim module's Word2Vec class does the modeling, with a 128-dimensional word space.
Core code:
from collections import Counter
from gensim.models import Word2Vec

def build_dataset(datas, words):
    # Keep the (vocabulary_size - 1) most frequent words; everything else becomes "UNK"
    count = [["UNK", -1]]
    counter = Counter(words)
    count.extend(counter.most_common(vocabulary_size - 1))
    vocabulary = set(c[0] for c in count)  # set for fast membership tests
    data_set = []
    for data in datas:
        d_set = []
        for word in data:
            if word in vocabulary:
                d_set.append(word)
            else:
                d_set.append("UNK")
                count[0][1] += 1
        data_set.append(d_set)
    return data_set

data_set = build_dataset(datas, words)
# Train the word2vec model (note: gensim >= 4.0 renames size -> vector_size and iter -> epochs)
model = Word2Vec(data_set, size=embedding_size, window=skip_window, negative=num_sampled, iter=num_iter)
embeddings = model.wv
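Once trained, the embedding can be queried directly as a sanity check (assuming these tokens made it into the 3,000-word vocabulary):

print(embeddings.most_similar("<script>", topn=5))  # nearest tokens in the learned space
print(embeddings["alert("].shape)  # (128,) -- one 128-dimensional vector per word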
V. Data Preprocessing
With the word-vector model built, every payload can be represented as a sequence of word vectors. Combined with the previous steps, the complete preprocessing flow is: URL-decode twice → tokenize → map each token to the vocabulary (or "UNK") → look up its 128-dimensional word vector, padding or truncating each payload to a fixed length.
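A minimal sketch of that mapping step (the helper name and the fixed length are hypothetical; the author's actual preprocessing may differ):

import numpy as np

def to_matrix(tokens, embeddings, input_num=50, dims_num=128):
    # Look up each token's vector, falling back to "UNK" for out-of-vocabulary words
    vecs = [embeddings[w] if w in embeddings else embeddings["UNK"] for w in tokens]
    vecs = vecs[:input_num]  # truncate long payloads
    pad = [np.zeros(dims_num)] * (input_num - len(vecs))  # zero-pad short ones
    return np.array(vecs + pad)  # shape: (input_num, dims_num)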
Finally, all the data are randomly split into 70% training data and 30% test data, used to train and evaluate the three neural networks below:
from sklearn.model_selection import train_test_split
train_datas,test_datas,train_labels,test_labels=train_test_split(datas,labels,test_size=0.3)
VI. Multilayer Perceptron
An MLP consists of an input layer, an output layer, and one or more hidden layers. With TensorFlow as the backend, Keras makes a multilayer perceptron straightforward to implement. The model achieves 99.9% precision and 97.5% recall. The core code is as follows:
Model training:
import time
from keras.models import Sequential
from keras.layers import InputLayer, Dense, Dropout, Flatten
from keras.optimizers import Adam
from keras.callbacks import TensorBoard

# batch_size, log_dir and epochs_num are module-level configuration values
def train(train_generator, train_size, input_num, dims_num):
    print("Start Train Job! ")
    start = time.time()
    inputs = InputLayer(input_shape=(input_num, dims_num), batch_size=batch_size)
    layer1 = Dense(100, activation="relu")
    layer2 = Dense(20, activation="relu")
    flatten = Flatten()
    layer3 = Dense(2, activation="softmax", name="Output")
    optimizer = Adam()
    # Two ReLU hidden layers with dropout, then a 2-way softmax output
    model = Sequential()
    model.add(inputs)
    model.add(layer1)
    model.add(Dropout(0.5))
    model.add(layer2)
    model.add(Dropout(0.5))
    model.add(flatten)
    model.add(layer3)
    call = TensorBoard(log_dir=log_dir, write_grads=True, histogram_freq=1)
    model.compile(optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit_generator(train_generator, steps_per_epoch=train_size // batch_size, epochs=epochs_num, callbacks=[call])
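train_generator itself is not shown in the excerpt; a minimal sketch of what such a generator could look like (hypothetical helper):

import numpy as np

def batch_generator(datas, labels, batch_size):
    # Yield (batch, labels) pairs endlessly, as fit_generator expects
    while True:
        for i in range(0, len(datas), batch_size):
            yield np.array(datas[i:i + batch_size]), np.array(labels[i:i + batch_size])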
Testing:
import numpy as np
from keras.models import load_model
from sklearn.metrics import precision_score, recall_score

def test(model_dir, test_generator, test_size, input_num, dims_num, batch_size):
    model = load_model(model_dir)
    labels_pre = []
    labels_true = []
    batch_num = test_size // batch_size + 1
    steps = 0
    for batch, labels in test_generator:
        if len(labels) == batch_size:
            labels_pre.extend(model.predict_on_batch(batch))
        else:
            # Zero-pad the final, smaller batch up to the fixed batch size
            batch = np.concatenate((batch, np.zeros((batch_size - len(labels), input_num, dims_num))))
            labels_pre.extend(model.predict_on_batch(batch)[0:len(labels)])
        labels_true.extend(labels)
        steps += 1
        print("%d/%d batch" % (steps, batch_num))
    labels_pre = np.array(labels_pre).round()

    def to_y(labels):
        # Collapse one-hot labels [1,0] / [0,1] back to class ids 0 / 1
        y = []
        for i in range(len(labels)):
            if labels[i][0] == 1:
                y.append(0)
            else:
                y.append(1)
        return y

    y_true = to_y(labels_true)
    y_pre = to_y(labels_pre)
    precision = precision_score(y_true, y_pre)
    recall = recall_score(y_true, y_pre)
    print("Precision score is:", precision)
    print("Recall score is:", recall)
VII. Recurrent Neural Network
A recurrent neural network is a neural network that recurs over time steps and can capture contextual information in a sequence. It is likewise built with Keras. The final model achieves 99.5% precision and 98.7% recall. Core code:
Model training:
from keras.layers import LSTM

def train(train_generator, train_size, input_num, dims_num):
    print("Start Train Job! ")
    start = time.time()
    inputs = InputLayer(input_shape=(input_num, dims_num), batch_size=batch_size)
    # A single 128-unit LSTM reads the token-vector sequence in order
    layer1 = LSTM(128)
    output = Dense(2, activation="softmax", name="Output")
    optimizer = Adam()
    model = Sequential()
    model.add(inputs)
    model.add(layer1)
    model.add(Dropout(0.5))
    model.add(output)
    call = TensorBoard(log_dir=log_dir, write_grads=True, histogram_freq=1)
    model.compile(optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit_generator(train_generator, steps_per_epoch=train_size // batch_size, epochs=epochs_num, callbacks=[call])
The network structure can be visualized with TensorBoard by pointing it at the log directory used in the callback, e.g. tensorboard --logdir=<log_dir>.
VIII. Convolutional Neural Network
Compared with an MLP, a convolutional neural network (CNN) has far fewer parameters to train, reduces the amount of computation, and can extract deep features for analysis. Here a one-dimensional, VGG-style convolutional network is used, consisting of four convolutional layers, two max-pooling layers, and one fully connected layer. The final precision is 99.5% and the recall 98.3%. The core code is as follows:
from keras.layers import Conv1D, MaxPool1D

def train(train_generator, train_size, input_num, dims_num):
    print("Start Train Job! ")
    start = time.time()
    inputs = InputLayer(input_shape=(input_num, dims_num), batch_size=batch_size)
    # VGG-style stacks: two 64-filter convolutions, pool, two 128-filter convolutions, pool
    layer1 = Conv1D(64, 3, activation="relu")
    layer2 = Conv1D(64, 3, activation="relu")
    layer3 = Conv1D(128, 3, activation="relu")
    layer4 = Conv1D(128, 3, activation="relu")
    layer5 = Dense(128, activation="relu")
    output = Dense(2, activation="softmax", name="Output")
    optimizer = Adam()
    model = Sequential()
    model.add(inputs)
    model.add(layer1)
    model.add(layer2)
    model.add(MaxPool1D(pool_size=2))
    model.add(Dropout(0.5))
    model.add(layer3)
    model.add(layer4)
    model.add(MaxPool1D(pool_size=2))
    model.add(Dropout(0.5))
    model.add(Flatten())
    model.add(layer5)
    model.add(Dropout(0.5))
    model.add(output)
    call = TensorBoard(log_dir=log_dir, write_grads=True, histogram_freq=1)
    model.compile(optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit_generator(train_generator, steps_per_epoch=train_size // batch_size, epochs=epochs_num, callbacks=[call])
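After assembling the network, a model.summary() call inside train() is a quick way to confirm the layer stack (a hypothetical addition to the code above):

model.summary()  # prints each layer with its output shape and parameter count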
IX. Summary
This article has shown how to build an XSS semantic model on top of word embeddings and how to detect XSS attacks with an MLP, a recurrent neural network, and a convolutional neural network; all three achieved good results.
Experimental environment
Win7, 16 GB RAM
NVIDIA GeForce GTX 960, 4 GB video memory
Python environment: Python 3.5 with tensorflow, gensim, keras, numpy, and related packages
Code hosting address:
https://github.com/SparkSharly/DL_for_xss