JSP/JSPX webshell detection based on machine learning

Posted by santillano at 2020-04-11

In the earlier post "AST-based webshell detection", I proposed that the abstract syntax tree can be used to detect webshells; how to apply that idea to JSP/JSPX webshell detection is the focus of this article. To briefly review why AST-based detection works: there are obvious differences in syntactic structure between a webshell and a normal file. A typical trojan, for example, does little more than receive a parameter and execute it as a command, so its syntax tree is very simple, while the syntax tree of a normal page is far more complex. Distinguishing webshells by syntactic structure is therefore a reasonable choice. The weakness of this approach is that it is hard to analyze concrete parameter values: faced with `eval('1111');` versus `eval(file_put_contents('shell.php', '<?php phpinfo();?>'));`, the AST alone has great difficulty telling the payloads apart.

In a Java environment, a typical attack goes like this: exploit a web vulnerability to plant a webshell, then use that shell to carry out post-exploitation. Broadly speaking, an attacker who wants RCE in a Java environment will upload an arbitrary file, or deserialize and execute commands (among other techniques; only the mainstream ones are listed here). An arbitrary-file upload generally leaves behind an executable JSP or JSPX file; JSPX is usually just a way to bypass filters on JSP uploads, so it is handled here as well. Returning to the detection principle: the syntactic-structure features of JSP files are actually more pronounced than those of PHP. Normal JSP files are mostly used for page display, while a webshell JSP executes commands from passed-in parameters, so the two differ fundamentally in syntax structure. The detection principle therefore carries over, and it is practical.

We know that the most important step in machine learning is feature engineering, and there is no ready-made tool that parses JSP files into an abstract syntax tree. The biggest difference between JSP and PHP files is that a PHP file can be executed directly, while a JSP file must first be compiled by the middleware before it can run. So the problem really becomes: in Tomcat and similar middleware servers, what stages does a JSP file go through? The JSP file is first translated by the JSP parser (Jasper) into a .java servlet source file; that .java file is then compiled into a .class file; and finally the resulting Java bytecode is loaded and executed. The key step in this process is the translation into a .java file.

File compilation

As a detection program, we cannot rely on Tomcat to trigger compilation and then grab the compiled Java files afterwards: that would mean running a Tomcat server and letting it automatically load everything under the web directory, which is clearly unacceptable for automated detection. For batch detection we need our own compilation tool. Fortunately, the class Tomcat uses to compile JSP programs (org.apache.jasper.servlet.JspServlet) can be found and reused, although this turned out to be a fairly deep pit. The compile command is given directly here.
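The exact command from the original post did not survive, so the following is a minimal sketch of how such batch compilation could be driven. It assumes a local Tomcat installation whose `lib/` directory provides Jasper, and uses `org.apache.jasper.JspC`, the command-line/Ant entry point to the same Jasper compiler that JspServlet invokes at runtime; the paths are placeholders.

```python
import glob
import os
import subprocess

def build_jspc_command(tomcat_home: str, webapp_dir: str, out_dir: str) -> list:
    """Build the java invocation for Jasper's batch compiler (JspC).

    JspC recursively translates every .jsp/.jspx under webapp_dir into
    .java servlet sources in out_dir -- exactly the artifact we want to
    feed to the AST parser.  tomcat_home is a placeholder for a local
    Tomcat installation.
    """
    # Jasper and its dependencies ship in $TOMCAT_HOME/lib plus bin/tomcat-juli.jar.
    classpath = os.pathsep.join(
        glob.glob(os.path.join(tomcat_home, "lib", "*.jar"))
        + [os.path.join(tomcat_home, "bin", "tomcat-juli.jar")]
    )
    return [
        "java", "-cp", classpath,
        "org.apache.jasper.JspC",
        "-webapp", webapp_dir,   # directory holding the collected .jsp/.jspx samples
        "-d", out_dir,           # where the generated .java files land
    ]

def compile_samples(tomcat_home: str, webapp_dir: str, out_dir: str) -> None:
    subprocess.run(build_jspc_command(tomcat_home, webapp_dir, out_dir), check=True)
```

Because JspC walks the whole webapp directory, pointing `-webapp` at a folder of collected samples yields one generated `.java` file per sample without ever starting a server.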

By running the command above, we can convert an entire directory of webshell files into Java files. It turns out that the same module also handles the compilation of JSPX files, so JSPX webshells can be detected here as well.

Constructing the abstract syntax tree

I haven't found any tool that builds an abstract syntax tree for JSP directly, but for Java AST construction there are at least some options. Here I use JavaParser to generate the abstract syntax tree. JavaParser is quite simple to use: the dependency is pulled in directly through Maven by adding the corresponding POM entry.

After going through most of the tutorials available online, I decided to parse the compiled files statically from a file stream.

The MethodVisitor class here is what actually classifies the syntax-structure node types: a function call becomes a MethodCall token, a comment becomes a Comment token, and so on. Running this visitor over a file yields a global sequence of syntax-structure nodes. Along the way, not every syntactic feature is kept, and some are processed further; for a function call, for example, we may additionally extract the function name.

After getting the sequence, we need a model to transform it into a matrix for later training. For this token-stream representation I use the TF-IDF model. Its main idea: if a word or phrase appears frequently in one document but rarely in the rest of the corpus, it is considered to have good discriminating power and is suitable for classification. The model combines term frequency (TF) with inverse document frequency (IDF), and the final score is simply the product of the two.
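As a minimal, dependency-free sketch of that product (in practice one would usually reach for scikit-learn's TfidfVectorizer, which applies a smoothed IDF), the classic TF-IDF score over the node-type sequences looks like this:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Classic TF-IDF: tf(t, d) * log(N / df(t)).

    docs is a list of token sequences (one per compiled .java file);
    returns one {token: score} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scored = []
    for doc in docs:
        counts = Counter(doc)
        scored.append({
            tok: (cnt / len(doc)) * math.log(n_docs / df[tok])
            for tok, cnt in counts.items()
        })
    return scored

# Toy corpus: a webshell-like node sequence vs. a template-like one.
docs = [
    ["MethodInvocation", "exec", "MethodInvocation", "getRuntime"],
    ["ClassDeclaration", "MethodDeclaration", "Literal", "Literal"],
]
scores = tf_idf(docs)
```

`exec` occurs only in the first document, so it scores 0.25 * log 2 there, while any token shared by both documents would score zero under this unsmoothed IDF — exactly the "good ability of classification" the model is after.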

Then, through this model, we can transform each file's AST node sequence into a matrix of uniform width, label the black and white samples, and run supervised training. The black samples come from open-source repositories on GitHub. The white samples were harder to obtain: even after crawling a large number of open-source CMS projects, white samples remained scarce, for a simple reason: a given CMS only ships a limited number of JSP files. So the one regret here is the data volume: 632 black samples and 470 white samples. Finally, algorithm selection: drawing on previous detection experience, three algorithms were shortlisted, namely XGBoost, random forest, and MLP. After a long period of tuning and comparison, the optimal parameters for each algorithm were determined.
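The training code itself is not shown in the post; the following is a minimal scikit-learn sketch of the pipeline, with synthetic data standing in for the real 632/470 samples and no claim to reproduce the author's tuned parameters. The TF-IDF matrix feeds a random forest here; XGBoost or an MLP would slot into the same place.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for real node sequences (space-joined tokens):
# webshells invoke exec/getRuntime, normal pages are mostly literals.
black = ["MethodInvocation exec MethodInvocation getRuntime"] * 40
white = ["ClassDeclaration MethodDeclaration Literal Literal"] * 40
corpus = black + white
labels = [1] * len(black) + [0] * len(white)

# Token sequences -> TF-IDF matrix (one row per compiled .java file).
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(corpus)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels
)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

On real samples the interesting part is exactly the tuning glossed over above: with only ~1100 files, cross-validation rather than a single split is what makes the comparison between the three algorithms meaningful.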

Detection results (screenshots in the original post): random forest results; XGBoost results; MLP results.

Looking back, the overall implementation idea is relatively simple; it is the handful of pits along the way that are genuinely annoying. This article can only serve as a basic approach to detecting JSP/JSPX webshells, and the hard parts were bypassed. To really improve detection accuracy, I think semantic analysis of the parameters is essential! If anything above is incorrect, please point it out~