HMM Learning Algorithms and Security (Part 2)

Posted by tetley at 2020-04-02


In the last part, we introduced the basic principles of HMM and a common parameter-based anomaly detection implementation. This time we change the approach: treat the machine as a novice white hat, train it to learn XSS attack syntax, and then let it find suspected attack entries that conform to that syntax in the access logs.

Through lexical segmentation, an attack payload can be transformed into an observation sequence.
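For example, a tokenized payload can be mapped to an integer observation sequence via a word-to-index table (a toy sketch; the index values here are made up, and the real table is built by the word-set model described below):

```python
# Hypothetical token-to-index table; in practice it is produced by the
# word-set model construction described in the next section.
tokens = ['<img', 'src=', 'x', 'onerror=', 'alert(8)', '>']
wordbag = {'<img': 0, 'src=': 1, 'onerror=': 2, 'alert(8)': 3, '>': 4}

# Tokens outside the table (such as the filename "x") are generalized to -1
observations = [wordbag.get(t, -1) for t in tokens]
print(observations)  # [0, 1, -1, 2, 3, 4]
```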

Word-set / bag-of-words model

The word-set and bag-of-words models are very common data-processing models in machine learning, used to characterize string data. The general idea: split each sample into words and count each word's frequency; select all or some of the words as keys of a hash table according to need; then number the keys in turn, so that the hash table can be used to encode strings.

Word-set model: a set of the words in a document. As in any set, each element is unique, so each word appears in the word set only once, regardless of how often it occurs.

Bag-of-words model: if a word appears more than once in a document, the number of occurrences is counted as well.

This chapter uses the word set model.

Assume the following dataset:

dataset = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
           ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
           ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
           ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
           ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
           ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]

First generate the glossary:

vocabSet = set()
for doc in dataset:
    vocabSet |= set(doc)
vocabList = list(vocabSet)

Generate word sets from the glossary:

# Word-set model
SOW = []
for doc in dataset:
    vec = [0] * len(vocabList)
    for i, word in enumerate(vocabList):
        if word in doc:
            vec[i] = 1
    SOW.append(vec)
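For comparison, a bag-of-words encoding counts occurrences instead of only marking presence. A minimal sketch with a made-up two-document dataset (note 'dog' appears twice in the second document):

```python
# Bag-of-words model: unlike the word set, repeated words are counted.
dataset = [['stop', 'posting', 'stupid'],
           ['dog', 'stupid', 'dog']]
# Sorted vocabulary: ['dog', 'posting', 'stop', 'stupid']
vocabList = sorted(set(w for doc in dataset for w in doc))

BOW = []
for doc in dataset:
    vec = [0] * len(vocabList)
    for word in doc:
        vec[vocabList.index(word)] += 1
    BOW.append(vec)

print(BOW[1])  # [2, 0, 0, 1] -- 'dog' counted twice
```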


The core code of the simplified word-set model is as follows:

fredist = nltk.FreqDist(tokens_list)  # word frequency of a single sample
keys = fredist.keys()
keys = keys[:max]  # keep only the top N frequent words; the rest are generalized

for localkey in keys:  # iterate over the deduplicated words
    if localkey in wordbag.keys():  # skip words already in the word bag
        continue
    else:
        wordbag[localkey] = index_wordbag
        index_wordbag += 1

Data processing and feature extraction

Common XSS attack payloads are listed below:




<IMG SRC="javascript:alert('XSS');">

<IMG SRC=javascript:alert("XSS")>

<IMG SRC=javascript:alert('XSS')>      

<img src=xss onerror=alert(1)>

<IMG """><SCRIPT>alert("XSS")</SCRIPT>">

<IMG SRC=javascript:alert(String.fromCharCode(88,83,83))>

<IMG SRC="jav ascript:alert('XSS');">


<BODY BACKGROUND="javascript:alert('XSS')">

<BODY ONLOAD=alert('XSS')>

The lexical segmentation needs to support the following rules:

- contents inside single or double quotation marks
- http/https links
- <> tags
- the start of a tag
- attribute tags
- the end of a tag
- function bodies
- character/number scalars

The code implementation example is as follows:

tokens_pattern = r'''(?x)
 "[^"]+"          # contents in double quotation marks
|'[^']+'          # contents in single quotation marks
|http://\S+       # http/https links
|</\w+>           # end of tag
|<\w+>            # complete tag
|<\w+             # start of tag
|\w+=             # attribute tag
|>                # close of tag
|\w+\([^<]+\)     # function body, e.g. alert(String.fromCharCode(88,83,83))
|\w+              # character/number scalar
'''
words = nltk.regexp_tokenize(line, tokens_pattern)

In addition, to reduce the vector space, numbers, characters and hyperlinks need to be normalized. The rules are as follows:

# Replace numeric constants with 8
line, number = re.subn(r'\d+', "8", line)

# Replace URLs with http://u
line, number = re.subn(r'(http|https)://[a-zA-Z0-9\.@&/#!#\?]+', "http://u", line)

# Strip comments
line, number = re.subn(r'\/\*.*?\*\/', "", line)

Examples of the effect of word segmentation after normalization are as follows:

# Original parameter value: "><img src=x onerror=prompt(0)>
# After segmentation:
['>', '<img', 'src=', 'x', 'onerror=', 'prompt(8)', '>']

# Original parameter value: <iframe src="x-javascript:alert(document.domain);"></iframe>
# After segmentation:
['<iframe', 'src=', '"x-javascript:alert(document.domain);"', '>', '</iframe>']

# Original parameter value: <marquee><h1>XSS by xss</h1></marquee>
# After segmentation:
['<marquee>', '<h8>', 'XSS', 'by', 'xss', '</h8>', '</marquee>']

# Original parameter value: <script>-=alert;-(1)</script>"onmouseover="confirm(document.domain);"</script>
# After segmentation:
['<script>', 'alert', '8', '</script>', '"onmouseover="', 'confirm(document.domain)', '</script>']

# Original parameter value: <script>alert(2)</script>"><img src=x onerror=prompt(document.domain)>
# After segmentation:
['<script>', 'alert(8)', '</script>', '>', '<img', 'src=', 'x', 'onerror=', 'prompt(document.domain)', '>']

Combined with the word set model, the complete process is as follows:

Training model

Input the normalized vector X and the corresponding length array X_lens. X_lens is needed because the parameter samples may have different lengths, so the sample boundaries must be passed in separately.

remodel = hmm.GaussianHMM(n_components=3, covariance_type="full", n_iter=100)
remodel.fit(X, X_lens)

Verification model

The whole system operation process is as follows:

In the verification stage, the trained HMM model computes the probability of each input observation sequence, which is used to judge whether the sequence is legitimate. The training set is 1,000 typical XSS attack logs; through word segmentation and word-set construction, 200 features are extracted, and all samples are encoded and serialized with these 200 features. Tested against 20,000 normal logs and 20,000 XSS attacks (encodings similar to JSFuck are not yet supported), the accuracy exceeds 90%. The core code of the verification step is as follows:

with open(filename) as f:
    for line in f:
        line = line.strip('\n')
        line = urllib.unquote(line)
        h = HTMLParser.HTMLParser()
        line = h.unescape(line)
        if len(line) >= MIN_LEN:
            # Same normalization as in training
            line, number = re.subn(r'\d+', "8", line)
            line, number = re.subn(r'(http|https)://[a-zA-Z0-9\.@&/#!#\?:]+', "http://u", line)
            line, number = re.subn(r'\/\*.*?\*\/', "", line)
            words = do_str(line)
            vers = []
            for word in words:
                if word in wordbag.keys():
                    vers.append([wordbag[word]])
                else:
                    vers.append([-1])  # generalize unseen words to -1
            np_vers = np.array(vers)
            pro = remodel.score(np_vers)
            if pro >= T:
                print "SCORE:(%d) XSS_URL:(%s) " % (pro, line)