HMM Learning Algorithms and Security (Part 2)

Posted by tetley at 2020-04-02


In the last part, we introduced the basic principles of HMM and a common parameter-based anomaly detection implementation. This time we change the approach: treat the machine as a novice white hat, train it to learn XSS attack syntax, and then let it find suspected attack entries that conform to that syntax in the access logs.

Through lexical segmentation, an attack payload can be transformed into an observation sequence.
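For example, a tokenized payload can be mapped to an integer observation sequence via a word-to-index table (a toy sketch; the index values here are made up, and the real table is built by the word-set model described below):

```python
# Hypothetical token-to-index table; in practice it is produced by the
# word-set model construction described in the next section.
tokens = ['<img', 'src=', 'x', 'onerror=', 'alert(8)', '>']
wordbag = {'<img': 0, 'src=': 1, 'onerror=': 2, 'alert(8)': 3, '>': 4}

# Tokens outside the table (such as the filename "x") are generalized to -1
observations = [wordbag.get(t, -1) for t in tokens]
print(observations)  # [0, 1, -1, 2, 3, 4]
```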

Word-set / bag-of-words model

The word-set and bag-of-words models are very common data-processing models in machine learning, used to characterize string data. The general idea: split each sample into words and count each word's frequency; select all or some of the words as keys of a hash table according to need; then number the keys in turn, so that the hash table can be used to encode strings.

Word-set model: a set of the words in a document. As in any set, each element is unique, so each word appears in the word set only once, regardless of how often it occurs.

Bag-of-words model: if a word appears more than once in a document, the number of occurrences is counted as well.

This chapter uses the word set model.

Assume the following dataset:

dataset = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
           ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
           ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
           ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
           ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
           ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]

First generate the glossary:

vocabSet = set()
for doc in dataset:
    vocabSet |= set(doc)
vocabList = list(vocabSet)

Generate word sets from the glossary:

# Word-set model
SOW = []
for doc in dataset:
    vec = [0] * len(vocabList)
    for i, word in enumerate(vocabList):
        if word in doc:
            vec[i] = 1
    SOW.append(vec)
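For comparison, a bag-of-words encoding counts occurrences instead of only marking presence. A minimal sketch with a made-up two-document dataset (note 'dog' appears twice in the second document):

```python
# Bag-of-words model: unlike the word set, repeated words are counted.
dataset = [['stop', 'posting', 'stupid'],
           ['dog', 'stupid', 'dog']]
# Sorted vocabulary: ['dog', 'posting', 'stop', 'stupid']
vocabList = sorted(set(w for doc in dataset for w in doc))

BOW = []
for doc in dataset:
    vec = [0] * len(vocabList)
    for word in doc:
        vec[vocabList.index(word)] += 1
    BOW.append(vec)

print(BOW[1])  # [2, 0, 0, 1] -- 'dog' counted twice
```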


The core code of the simplified word-set model is as follows:

fredist = nltk.FreqDist(tokens_list)  # word frequency of a single sample
keys = fredist.keys()
keys = keys[:max]  # keep only the top N frequent words; the rest are generalized

for localkey in keys:  # iterate over the deduplicated words
    if localkey in wordbag.keys():  # skip words already in the word bag
        continue
    else:
        wordbag[localkey] = index_wordbag
        index_wordbag += 1

Data processing and feature extraction

Common XSS attack payloads are listed below:




<IMG SRC="javascript:alert('XSS');">

<IMG SRC=javascript:alert("XSS")>

<IMG SRC=javascript:alert('XSS')>      

<img src=xss onerror=alert(1)>

<IMG """><SCRIPT>alert("XSS")</SCRIPT>">

<IMG SRC=javascript:alert(String.fromCharCode(88,83,83))>

<IMG SRC="jav ascript:alert('XSS');">


<BODY BACKGROUND="javascript:alert('XSS')">

<BODY ONLOAD=alert('XSS')>

The lexical segmentation needs to support the following rules:

- contents inside single or double quotation marks
- http/https links
- <> tags
- the start of a tag
- attribute tags
- the end of a tag
- function bodies
- character/number scalars

The code implementation example is as follows:

tokens_pattern = r'''(?x)
 "[^"]+"          # contents in double quotation marks
|'[^']+'          # contents in single quotation marks
|http://\S+       # http/https links
|</\w+>           # end of tag
|<\w+>            # complete tag
|<\w+             # start of tag
|\w+=             # attribute tag
|>                # close of tag
|\w+\([^<]+\)     # function body, e.g. alert(String.fromCharCode(88,83,83))
|\w+              # character/number scalar
'''
words = nltk.regexp_tokenize(line, tokens_pattern)

In addition, to reduce the vector space, numbers, characters and hyperlinks need to be normalized. The rules are as follows:

# Replace numeric constants with 8
line, number = re.subn(r'\d+', "8", line)

# Replace URLs with http://u
line, number = re.subn(r'(http|https)://[a-zA-Z0-9\.@&/#!#\?]+', "http://u", line)

# Strip comments
line, number = re.subn(r'\/\*.*?\*\/', "", line)

Examples of the effect of word segmentation after normalization are as follows:

# Original parameter value: "><img src=x onerror=prompt(0)>
# After segmentation:
['>', '<img', 'src=', 'x', 'onerror=', 'prompt(8)', '>']

# Original parameter value: <iframe src="x-javascript:alert(document.domain);"></iframe>
# After segmentation:
['<iframe', 'src=', '"x-javascript:alert(document.domain);"', '>', '</iframe>']

# Original parameter value: <marquee><h1>XSS by xss</h1></marquee>
# After segmentation:
['<marquee>', '<h8>', 'XSS', 'by', 'xss', '</h8>', '</marquee>']

# Original parameter value: <script>-=alert;-(1)</script>"onmouseover="confirm(document.domain);"</script>
# After segmentation:
['<script>', 'alert', '8', '</script>', '"onmouseover="', 'confirm(document.domain)', '</script>']

# Original parameter value: <script>alert(2)</script>"><img src=x onerror=prompt(document.domain)>
# After segmentation:
['<script>', 'alert(8)', '</script>', '>', '<img', 'src=', 'x', 'onerror=', 'prompt(document.domain)', '>']

Combined with the word set model, the complete process is as follows:

Training model

Input the normalized vector X and the corresponding length array X_lens. X_lens is needed because the parameter samples may have different lengths, so the sample boundaries must be passed in separately.

remodel = hmm.GaussianHMM(n_components=3, covariance_type="full", n_iter=100)
remodel.fit(X, X_lens)

Verification model

The whole system operation process is as follows:

In the verification stage, the trained HMM model computes the probability of each input observation sequence, which is used to judge whether the sequence is legitimate. The training set is 1,000 typical XSS attack logs; through word segmentation and word-set construction, 200 features are extracted, and all samples are encoded and serialized with these 200 features. Tested against 20,000 normal logs and 20,000 XSS attacks (encodings similar to JSFuck are not yet supported), the accuracy exceeds 90%. The core code of the verification step is as follows:

with open(filename) as f:
    for line in f:
        line = line.strip('\n')
        line = urllib.unquote(line)
        h = HTMLParser.HTMLParser()
        line = h.unescape(line)
        if len(line) >= MIN_LEN:
            # Same normalization as in training
            line, number = re.subn(r'\d+', "8", line)
            line, number = re.subn(r'(http|https)://[a-zA-Z0-9\.@&/#!#\?:]+', "http://u", line)
            line, number = re.subn(r'\/\*.*?\*\/', "", line)
            words = do_str(line)
            vers = []
            for word in words:
                if word in wordbag.keys():
                    vers.append([wordbag[word]])
                else:
                    vers.append([-1])  # generalize unseen words to -1
            np_vers = np.array(vers)
            pro = remodel.score(np_vers)
            if pro >= T:
                print "SCORE:(%d) XSS_URL:(%s) " % (pro, line)