Distributed webshell detection system based on machine learning

Posted by barello at 2020-03-09

This post covers the key step in detecting webshells with machine learning: extracting effective features. The quality of feature selection directly determines the final detection results.

Drawing on my own understanding of and research in security, and after a week of deliberation, reading relevant material, and weighing compatibility with the system, I have summarized the following 15 features, grouped under web log analysis, statistical analysis, and file attribute analysis:

1、 Analysis based on Web log

A web log is a file generated by web middleware (such as nginx, Apache, or IIS) to record users' access behavior. A standard web log is plain text with one record per line, each corresponding to one request from a client browser for a server resource. A typical record includes the source IP, access time, requested URL, submitted data, and other information. By analyzing log data we can not only detect suspicious vulnerability exploitation attempts, but also extract the access behavior of a specific IP toward the application within a specific period. In other words, our task is to tell apart the requests that target webshells from those that are normal.

Here is the format of a web log:

Fields in the log format that need explanation:

User-Agent: identifies the client, such as its browser and operating system.

Referer: the page from which the request was linked.
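To make the fields concrete, here is a minimal sketch of parsing one log record. It assumes the nginx / Apache "combined" log layout, which may not be the exact format used by the system described here; the sample record and the name parse_log_line are made up for illustration.

```python
import re
from typing import Optional

# Assumed layout: the nginx / Apache "combined" log format.
COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_log_line(line: str) -> Optional[dict]:
    """Return the fields of one log record, or None if the line does not match."""
    match = COMBINED_LOG.match(line)
    return match.groupdict() if match else None

# Hypothetical record, for illustration only.
sample = ('203.0.113.7 - - [09/Mar/2020:10:15:32 +0800] '
          '"GET /index.php?pid=12673 HTTP/1.1" 200 5120 '
          '"http://example.com/home.php" "Mozilla/5.0"')
record = parse_log_line(sample)
print(record["ip"], record["url"], record["referer"], record["user_agent"])
```

Lines that do not match the assumed layout come back as None, so a real pipeline would probably want to count and inspect them rather than drop them silently.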

Based on analysis of the web access log, five features can be mined from the log records produced when a webshell is accessed:

① Entropy of submitted data (POST / GET)

Requests generally submit data to the server, and webshells are no exception. If the submitted data is encrypted or encoded, however, its entropy increases. For a normal web business system, if the entropy of the data submitted to a certain URI is clearly larger than that of other pages, the source file behind that URI is suspicious. Data submitted to a webshell that encrypts its communication generally has higher entropy, which makes it detectable. For example, compare the following:

Normal page: "PID=12673&aut=false&type=low"

Webshell: "ac=ferf234cDV3T234jyrFR3yu4F3rtDW2R354"
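The entropy in question can be computed as Shannon entropy over the characters of the submitted data. The following is a minimal sketch; the function name shannon_entropy is my own, and treating each character as one symbol is an assumption (bytes or tokens would work as well).

```python
import math
from collections import Counter

def shannon_entropy(data: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # H = sum over symbols of -p * log2(p), where p is the symbol's relative frequency.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

print(shannon_entropy("aaaaaaaa"))   # a single repeated symbol carries no information
print(shannon_entropy("abcdabcd"))   # four equally likely symbols
```

For a single short request the estimate is noisy, so it is probably more robust to pool the data submitted to each URI over many requests before comparing URIs.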

② Access frequency of URI

Because a webshell is usually hidden, it is accessed only by the attacker, and to stay hidden it is accessed relatively rarely. Normal pages, by contrast, serve visitors and therefore have a relatively high access frequency. By computing the access frequency of each page, we can pick out the pages that are accessed relatively rarely; these have a higher probability of being webshells.

It is worth noting that a website already has a certain number of normal pages when it goes into operation, while a webshell usually appears some time later. Therefore, when computing the access frequency of a page, the period from the first visit to the last visit of that page should be used as the statistical interval, and the number of visits per unit of time within that interval is taken as the access frequency. Note also that access frequency alone can only single out abnormal files. It cannot establish that a file is a webshell, because some normal pages also have a low access frequency, such as back-end management pages or test pages left behind by technical staff during the early stage of site construction.

Here, f(a) denotes the computed access frequency of page a, t_first(a) the time page a was first visited, t_end(a) the time page a was last visited, and count_fe(a) the number of visits to page a between t_first(a) and t_end(a).

Therefore, the access frequency of page a is calculated as follows:

f(a) = count_fe(a) / (t_end(a) - t_first(a))

The time unit can be hours, days, weeks, months, etc., depending on the traffic volume of the website.
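As a rough sketch of this calculation: the function below groups visits by URI and divides the visit count by the length of the first-to-last-visit interval. The name access_frequency and the rule of treating intervals shorter than one unit as one full unit are my own assumptions, not part of the original design.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def access_frequency(visits, unit=timedelta(days=1)):
    """visits: iterable of (uri, datetime). Returns {uri: visits per unit of time}."""
    times = defaultdict(list)
    for uri, ts in visits:
        times[uri].append(ts)
    freq = {}
    for uri, stamps in times.items():
        # Length of the interval from first to last visit, expressed in units.
        span = (max(stamps) - min(stamps)) / unit
        # Assumption: an interval shorter than one unit counts as one unit,
        # which also avoids dividing by zero for pages visited only once.
        freq[uri] = len(stamps) / max(span, 1.0)
    return freq

base = datetime(2020, 3, 1)
visits = [("/index.php", base + timedelta(hours=12 * i)) for i in range(9)]
visits += [("/shell.php", base + timedelta(days=2)),
           ("/shell.php", base + timedelta(days=4))]
freq = access_frequency(visits)
print(freq)
```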

③ Whether there is a referer field in the request header

In essence, this feature is file association detection based on file access. The in-degree of a web page file measures whether visitors arrive at this page from other pages. Similarly, the out-degree of a web page file measures whether visitors go from this page to other pages. Normal website pages link to each other, so they have a nonzero in-degree and out-degree, while a webshell usually has no hyperlinks to or from other pages of the site. It is an isolated page, and its in-degree and out-degree are usually 0.

The Referer field in the HTTP header indicates which page the request came from, so it can be used to obtain the in-degree and out-degree of each file.
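A minimal sketch of deriving the two degrees from (requested path, Referer) pairs follows. The function name link_degrees is hypothetical, and the sketch assumes every Referer points at a page of the same site; external referers are not filtered out.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def link_degrees(requests):
    """requests: iterable of (path, referer). Returns {path: (in_degree, out_degree)}."""
    incoming = defaultdict(set)   # page -> pages that visitors came from
    outgoing = defaultdict(set)   # page -> pages that visitors went to
    pages = set()
    for path, referer in requests:
        pages.add(path)
        if not referer or referer == "-":
            continue  # no Referer header was logged for this request
        source = urlsplit(referer).path
        if source and source != path:
            pages.add(source)
            incoming[path].add(source)
            outgoing[source].add(path)
    return {p: (len(incoming[p]), len(outgoing[p])) for p in pages}

degrees = link_degrees([
    ("/index.php", "-"),
    ("/news.php", "http://example.com/index.php"),
    ("/index.php", "http://example.com/news.php"),
    ("/shell.php", "-"),
])
print(degrees)
```

Pages whose pair is (0, 0) are the isolated ones worth a closer look.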

④ Frequency of key in submitted data (post / get)

This is a rather ingenious and very valuable feature. The data submitted by users generally takes the form of key-value pairs. For a given web business system, the set of keys used to submit data is fixed and known. So, just as with the URI access frequency above, we can count how often each key occurs. Because a webshell is generally unrelated to the source code of the web business system, the keys in the data submitted to it are also quite different. This makes it possible to tell whether request data was sent to a normal page or to a webshell.

⑤ Number of pages associated with key in request data (post / get)

Similarly, we can count the number of pages associated with each key. This is also a clever and valuable feature. Because a webshell is generally unrelated to the source code of the web business system, the keys in its submitted data are quite different, and such a key may be associated only with the webshell page itself. The page associated with such a key is therefore more likely to be a webshell. This looks similar to ④, but it measures a different dimension.

However, features ④ and ⑤ also have to account for cases such as the following:

(1) A key used by the webshell may coincide with a key submitted to normal pages, so the frequency of that key is high and it is associated with two or more pages.

(2) A key may be used by only one page, yet that page is visited very often. The login page is one possible case: the submitted key is, for example, the string "password", and because it is the login page the number of visits is relatively large.

Combining the two similar dimensions, a key may have the following four possible situations:

a. High frequency, many associated pages - (normal page)

b. Low frequency, many associated pages - (such as the first verified key)

c. High frequency, few associated pages - (such as the login page)

d. Low frequency, few associated pages - (possibly a webshell)
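Both dimensions can be collected in one pass over the requests. The sketch below is only illustrative: key_statistics is a hypothetical name, and it assumes the submitted data is available as a URL-encoded string (a query string or a form-encoded POST body).

```python
from collections import Counter, defaultdict
from urllib.parse import parse_qsl

def key_statistics(requests):
    """requests: iterable of (path, submitted_data). Returns {key: (frequency, page_count)}."""
    frequency = Counter()        # feature 4: how often each key occurs
    pages = defaultdict(set)     # feature 5: which pages each key is submitted to
    for path, data in requests:
        for key, _value in parse_qsl(data, keep_blank_values=True):
            frequency[key] += 1
            pages[key].add(path)
    return {key: (frequency[key], len(pages[key])) for key in frequency}

stats = key_statistics([
    ("/list.php", "pid=1&type=low"),
    ("/view.php", "pid=2"),
    ("/list.php", "pid=3&type=high"),
    ("/shell.php", "ac=ZXZhbA"),
])
print(stats)
```

Keys that land in case d (low frequency, few pages) are the candidates; where exactly to draw the thresholds is left to the model or to tuning.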

2、 Analysis based on statistics

⑥ Index of coincidence (IC)

Definition: let X be a ciphertext string of length n, that is, X = x1x2x3...xn, where each xi is a ciphertext character. The index of coincidence of X is defined as the probability that two randomly chosen elements of X are the same.

Because encryption makes the content of a file more random (its characters are drawn from a much wider range of the ASCII or extended ASCII table), the index of coincidence of an encrypted file is lower. IC is therefore one way to judge whether a file is encrypted. An encrypted file inside a normal web business system is generally a webshell trying to evade detection.
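The definition translates directly into a formula: if a symbol occurs c times in a text of length n, it contributes c(c-1) ordered pairs out of n(n-1). A minimal sketch, assuming the file is treated as a sequence of bytes (the function name is my own):

```python
from collections import Counter

def index_of_coincidence(data: bytes) -> float:
    """Probability that two positions drawn without replacement hold the same byte."""
    n = len(data)
    if n < 2:
        return 0.0
    counts = Counter(data)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

print(index_of_coincidence(b"aaaa"))   # every pair matches
print(index_of_coincidence(b"abcd"))   # no pair matches
print(index_of_coincidence(b"aabb"))
```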

⑦ Information entropy of files

Information entropy is an abstract mathematical concept. It is defined in terms of the probability of occurrence of specific information (the probability of discrete random events), and can be understood by analogy with entropy as a measure of disorder in chemistry. The more ordered a system is, the lower its information entropy; conversely, the more chaotic a system is, the higher its information entropy. Information entropy can therefore serve as a measure of how ordered a system is.

An encrypted or encoded webshell contains a large amount of random content or special characters and uses a wider range of ASCII characters, so its entropy is higher. This helps distinguish a normal file from a webshell.

⑧ Longest word in file

The longest word is the longest uninterrupted string in the file being examined. Encrypted or encoded webshell code is often stored as one very long string. For example, Base64 encoding, which is common in current webshells, produces an extra-long string with no whitespace. Normal business code is usually written neatly, and its longest string (usually a function name) is relatively short. This helps distinguish webshells from normal business code.

Note that to feed this feature to a machine learning method it has to be vectorized. How should the longest word be compared with normal code? Here we use a variance-style calculation. Variance measures how far a random variable deviates from its mathematical expectation (i.e. its mean). So the deviation of the longest word's length from the average string length in each file can serve as a good feature.
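One possible reading of this feature is the squared deviation of the longest word's length from the mean word length of the same file. The sketch below implements that reading; it is an interpretation rather than a confirmed formula, the function name is hypothetical, and splitting words on whitespace is an assumption.

```python
import re

def longest_word_features(source: str):
    """Return (length of longest word, squared deviation from the mean word length)."""
    words = re.findall(r"\S+", source)   # a "word" is any run of non-whitespace
    if not words:
        return 0, 0.0
    lengths = [len(w) for w in words]
    longest = max(lengths)
    mean = sum(lengths) / len(lengths)
    return longest, (longest - mean) ** 2

print(longest_word_features("ab cd abcdefgh"))
```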

⑨ Compression ratio of files

The compression ratio of a file = the compressed size of the file / the original size of the file.

The essence of compression is to exploit the imbalance in the distribution of characters: high-frequency characters are assigned short codes and low-frequency characters long codes, which shortens the overall length. A Base64-encoded file is restricted to a fixed alphabet of 64 ASCII characters whose distribution tends to be much more even, which leaves less imbalance to exploit, so its compression ratio is larger. In practice, most webshells use encryption or encoding for better concealment, and this feature can be used to detect them.
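The ratio is easy to measure with a general-purpose compressor. The sketch below uses zlib purely as an example; the original does not say which compressor is used, and the two inputs are made up for illustration.

```python
import base64
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size; 0.0 for empty input."""
    if not data:
        return 0.0
    return len(zlib.compress(data)) / len(data)

# Repetitive "business code" versus Base64 text wrapping random bytes.
repetitive = b"echo 'hello';\n" * 200
encoded = base64.b64encode(os.urandom(2000))

print(compression_ratio(repetitive))
print(compression_ratio(encoded))
```

The repetitive input shrinks to a small fraction of its size, while the Base64 text stays much closer to its original size, which is the gap this feature relies on.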

3、 Analysis based on file attributes

⑩ File creation time

On a web server, the creation times of dynamic script (source code) files generally take only a few distinct values, mostly concentrated around the time the source was deployed. If the creation time of a file is abnormal, for example the web source code was deployed in 2014 and a file created in 2016 suddenly appears, that file may be a webshell. Of course, the business system may also have added new source files or been redeployed, but the administrator should be aware of such cases.

⑪ File modification time

Like the creation time, the modification time records when the file was last modified. On a web server, the modification times of deployed dynamic script (source code) files are generally the same, concentrated around the time of source code deployment. If the modification time of a file changes and the file is not a configuration-related file, it may well have been modified into a webshell.
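A rough sketch of flagging files whose modification time falls far from the deployment time follows. It assumes the median modification time approximates the deployment time, and the one-week tolerance is an arbitrary placeholder; both are my assumptions. It looks only at st_mtime, because on Linux st_ctime is the inode change time rather than the creation time.

```python
import os
import tempfile
from statistics import median

def mtime_outliers(root: str, tolerance_seconds: float = 7 * 24 * 3600):
    """Return files under root whose modification time is far from the median."""
    mtimes = {}
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mtimes[path] = os.stat(path).st_mtime
    if not mtimes:
        return []
    deployed_at = median(mtimes.values())  # assumed stand-in for deployment time
    return sorted(p for p, t in mtimes.items()
                  if abs(t - deployed_at) > tolerance_seconds)

# Throwaway directory: three files "deployed" together, one added 30 days later.
with tempfile.TemporaryDirectory() as root:
    deployed = 1_400_000_000  # arbitrary epoch timestamp
    for name in ("index.php", "list.php", "view.php"):
        path = os.path.join(root, name)
        open(path, "w").close()
        os.utime(path, (deployed, deployed))
    late = os.path.join(root, "upload.php")
    open(late, "w").close()
    os.utime(late, (deployed + 30 * 24 * 3600,) * 2)
    suspicious = [os.path.basename(p) for p in mtime_outliers(root)]
print(suspicious)
```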

⑫ File permissions

File permissions generally include readable (r), writable (w), and executable (x). Any administrator with some operations experience knows that permissions should be as low as possible. For deployed dynamic script (source code) files, the permissions are generally uniform. If the permission of a file suddenly changes, or a file appears whose permissions differ from those of most business code, it is more likely to be a webshell. For example, a file written through MySQL's outfile function generally has permission 666, while normal business code may be 644, so such a file is more suspicious.
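One simple way to turn this into a check is to find the most common permission mode under the web root and list the files that deviate from it. The sketch below assumes a POSIX file system; the function name and the sample file names are hypothetical.

```python
import os
import stat
import tempfile
from collections import Counter

def permission_outliers(root: str):
    """Return (path, mode) for files whose mode differs from the most common one."""
    modes = {}
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            modes[path] = stat.S_IMODE(os.stat(path).st_mode)
    if not modes:
        return []
    common_mode, _count = Counter(modes.values()).most_common(1)[0]
    return sorted((p, oct(m)) for p, m in modes.items() if m != common_mode)

# Throwaway directory: three files with mode 644 and one with mode 666.
with tempfile.TemporaryDirectory() as root:
    for name, mode in (("index.php", 0o644), ("list.php", 0o644),
                       ("view.php", 0o644), ("dump.php", 0o666)):
        path = os.path.join(root, name)
        open(path, "w").close()
        os.chmod(path, mode)
    odd = [(os.path.basename(p), m) for p, m in permission_outliers(root)]
print(odd)
```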

⑬ File owner

Similar to file permissions, the file owner indicates which system user the file belongs to. For normal business code, the owner is usually set when the administrator deploys the code, for example to a dedicated user. If a file with a different owner suddenly appears, it looks suspicious. For example, a webshell uploaded through the web back end is owned by the web server's user, such as apache or nginx; if the normal business code belongs to another user, that file is more suspicious.

⑭ Proportion of dangerous functions in the file

This is essentially signature matching. First, a blacklist of relatively dangerous functions is defined, covering file operations, database operations, system command execution, and encryption / decryption / obfuscation encoding. Normal business code also performs such operations, but a webshell, with its rich functionality, contains proportionally more dangerous functions. So we define the feature as a ratio:

V = number of functions in the file that belong to the dangerous function set / number of all functions in the file

To get better results, preprocessing should filter irrelevant functions out of the full set of functions, and a blacklist function library needs to be built. For a scripting language such as PHP there are not many dangerous functions, so the workload is modest.
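A minimal sketch of computing V for PHP source follows. The blacklist, the ignore list, and the regex-based call extraction are all illustrative assumptions: a real blacklist would need curating, and a regex does not understand comments, strings, or function definitions the way a proper PHP tokenizer would.

```python
import re

# Hypothetical blacklist; a real one would need to be curated.
DANGEROUS_FUNCTIONS = {
    "eval", "assert", "system", "exec", "shell_exec", "passthru", "popen",
    "base64_decode", "gzinflate", "str_rot13", "fwrite", "file_put_contents",
}
# Keywords that are followed by "(" but are not function calls (the filtering step).
IGNORED = {"if", "elseif", "while", "for", "foreach", "switch", "catch",
           "function", "array", "isset", "empty", "list"}

CALL_PATTERN = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*\(")

def dangerous_ratio(source: str) -> float:
    """V = dangerous calls / all calls, using a naive regex to find calls."""
    calls = [name.lower() for name in CALL_PATTERN.findall(source)]
    calls = [name for name in calls if name not in IGNORED]
    if not calls:
        return 0.0
    return sum(1 for name in calls if name in DANGEROUS_FUNCTIONS) / len(calls)

print(dangerous_ratio("<?php eval(base64_decode($_POST['ac'])); ?>"))
print(dangerous_ratio("<?php $d = file_get_contents($f); system($d); ?>"))
```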

⑮ Fuzzy hash matching based on file similarity

Like the previous feature, this one is based on signature matching. The difference is that it uses fuzzy hashing to match files by text similarity and produces a similarity score. In engineering terms, we first need to build a fuzzy hash library from all known webshell samples. We then compare every source file on the web server against that library and define a threshold; files whose similarity falls within the threshold range may be webshells. Here we plan to use the open source software ssdeep directly. ssdeep is an application built on a fuzzy hash algorithm and can be used to detect webshells. It judges the similarity of files by computing context-triggered piecewise hashes.