Discussion on dynamic crawlers and de-duplication

Posted by santillano at 2020-03-09

Author: fr1day @ 0keeteam

0x01 introduction

With the rise of Web 2.0, pages rely more and more on Ajax. Traditional crawlers are based on static analysis, so they cannot accurately capture Ajax requests or dynamic updates to the page, and they increasingly fall short of what is needed. Web 2.0 crawlers based on dynamic parsing emerged in response: they parse the page source with a browser engine and simulate user operations, which effectively solves these problems. This article analyzes in detail how to write such a crawler with PhantomJS + Python and how to de-duplicate its results.

0x02 PhantomJS

PhantomJS is a headless WebKit browser that exposes a JavaScript API. Because it has no visual interface, it is much faster than an ordinary WebKit browser. It also provides many monitoring and event interfaces, which make it easy to manipulate DOM nodes and simulate user operations.

Next, a simple example shows the difference between a dynamic crawler and a traditional one. Goal: load a page and collect all the <a> tags in it.

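// example.js
var page = require('webpage').create();

// Print every alert() raised inside the page to the PhantomJS console.
page.onAlert = function (message) {
    console.log(message);
    return true;
};

// Runs when the page calls window.callPhantom(): dump every href, then exit.
page.onCallback = function () {
    page.evaluate(function () {
        var atags = document.getElementsByTagName("a");
        for (var i = 0; i < atags.length; i++) {
            if (atags[i].getAttribute("href")) {
                alert(atags[i].getAttribute("href"));
            }
        }
    });
    phantom.exit();
};

// Target URL omitted.
page.open("", "get", "", function (status) {
    // Wait 10 seconds for Ajax content to load, then signal PhantomJS.
    page.evaluateAsync(function () {
        if (typeof window.callPhantom === 'function') {
            window.callPhantom();
        }
    }, 10000);
});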

The results are as follows:

/.mine
/cmd2/controls/signin
/cmd2/controls/getcode
/download.html
/.blog
/.mine
/.at
/~发现推荐.findbbs
/help.html
/江南水乡.bbs/7313348
/摄情男女.bbs/7313242
/欢乐一家亲.bbs/7313356
/深夜食堂.bbs/7313168
/家有熊孩子.bbs/7313321
/乐淘亲子营.bbs/7313320
... (omitted) ...
/婚礼记.bbs/7277165
/不知道的事情.bbs/7277164
/不知道的事情.bbs/7277162
/婚礼记.bbs/7277160
/不知道的事情.bbs/7277016
/cmd2/controls/mailpost/内容举报
download.html

The static crawling code is as follows:

import requests
import pyquery

# Target URL omitted.
res = requests.get("")
pq = pyquery.PyQuery(res.content)
# Print every <a> href found in the static HTML.
for count, a in enumerate(pq("a")):
    print("[%d]: %s" % (count, pq(a).attr("href")))

The static crawl returns nothing.

These examples clearly show that dynamic analysis captures more results than static analysis. The difference arises because the page loads its data through Ajax requests, and all the <a> tags are inserted into the page dynamically. Static analysis can do nothing in this situation, whereas dynamic analysis based on a browser engine handles it easily.

But the drawbacks of dynamic analysis are equally clear: it consumes far more system resources and time, it has some baffling pitfalls, and it is more complex and time-consuming to write (it requires some understanding of front-end programming).

Of course, PhantomJS is not the only dynamic parser. Others include PyQt with its WebKit engine (PhantomJS itself is built on Qt's WebKit), CasperJS, which wraps PhantomJS, and SlimerJS, which is based on Firefox's Gecko engine. Because there is no unified standard, the quality of the API implementations varies across these parsers, each has its own pitfalls, and there is no "best" solution.

0x03 trigger event and page monitoring

The example above covers a common crawler scenario: data loaded through Ajax after the page has loaded. In reality the scenarios are often more complex and require user interaction first, such as clicking a button to jump to another page, or scrolling to the bottom of the page to load the next page of data. We need a new approach to simulate normal user operations. So how do we abstract user interaction into code?

In essence, a user operation triggers events bound to DOM nodes, so simulating user operations reduces to triggering node events. Events can have many kinds of results, but a crawler only needs to care about two: 1. whether a new node is added (<a>, <iframe>, etc.); 2. whether a new request is initiated (Ajax requests, page jumps, etc.). After this simplification, we need to solve the following problems:

1. How do we get the bound events?

2. How do we trigger an event?

3. How do we get the result of triggering an event?

Our solutions are as follows:

1. How do we get the bound events? When JavaScript binds an event, it calls addEventListener. By hooking addEventListener before any code in the page executes (in PhantomJS, inside onInitialized), we can capture which DOM nodes have events bound to them.

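// Record every (event type, element) pair registered through addEventListener.
window.EVENT_LIST = [];
_addEventListener = Element.prototype.addEventListener;
Element.prototype.addEventListener = function (type, listener, useCapture) {
    EVENT_LIST.push({"event": type, "element": this});
    _addEventListener.apply(this, arguments);
};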

2. How do we trigger an event? JavaScript provides the dispatchEvent function, which triggers a specified event on a specified DOM node. We already collected the event list in the previous step.

// Fire every recorded event on the element it was bound to.
for (var i in EVENT_LIST) {
    var evt = document.createEvent('CustomEvent');
    evt.initCustomEvent(EVENT_LIST[i]["event"], true, true, null);
    EVENT_LIST[i]["element"].dispatchEvent(evt);
}

Besides events bound through addEventListener, some inline scripts cannot be captured by hooking addEventListener. For example:

<div id="test" onclick="alert('hello')"></div>

The solution is to traverse all nodes and execute the values of all their on* attributes.

function trigger_inline() {
    var nodes = document.all;
    for (var i = 0; i < nodes.length; i++) {
        var attrs = nodes[i].attributes;
        for (var j = 0; j < attrs.length; j++) {
            var attr_name = attrs[j].nodeName;
            var attr_value = attrs[j].nodeValue;
            // Inline handlers such as onclick="...".
            if (attr_name.substr(0, 2) == "on") {
                console.log(attr_name + ' : ' + attr_value);
                eval(attr_value);
            }
            // javascript: URLs in src/href attributes.
            if (attr_name in {"src": 1, "href": 1} && attr_value.substr(0, 11) == "javascript:") {
                console.log(attr_name + ' : ' + attr_value);
                eval(attr_value.substr(11));
            }
        }
    }
}

3. How do we get the result of triggering an event? HTML5's MutationObserver can detect whether the DOM of the page has changed, but PhantomJS does not support it (see the PhantomJS issue on support for Mutation Observers). The workaround is to listen for the DOMNodeInserted event. There are two ways to capture Ajax requests: onResourceRequested captures non-main-frame requests, but valid requests have to be filtered out with regular expressions; hooking XMLHttpRequest.open and XMLHttpRequest.send captures the request content precisely.

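// Collect the URL of every node inserted into the DOM.
document.addEventListener('DOMNodeInserted', function (e) {
    var node = e.target;
    if (node.src || node.href) {
        LINKS_RESULT.push(node.src || node.href);
    }
}, true);

// Remember the method and URL passed to open() so send() can report them.
_open = XMLHttpRequest.prototype.open;
XMLHttpRequest.prototype.open = function (method, url) {
    if (!this._url) {
        this._url = url;
        this._method = method;
    }
    _open.apply(this, arguments);
};

// Report every Ajax request, including its body, before sending it.
_send = XMLHttpRequest.prototype.send;
XMLHttpRequest.prototype.send = function (data) {
    window.$Result$.add_ajax(this._url, this._method, data);
    _send.apply(this, arguments);
};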

To summarize: before the page loads, three interfaces need to be hooked: addEventListener, XMLHttpRequest.open and XMLHttpRequest.send. After the page loads, collect all <a>, <iframe> and <form> tags, start listening for DOM node insertions, and trigger all events. Finally, output the results.
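On the Python side, the crawler has to launch PhantomJS and read back what the page script collected. The following is only a sketch of that glue, not the author's implementation: the script name crawler.js and the "one JSON object per output line" format are assumptions made for illustration.

```python
import json
import subprocess


def parse_crawler_output(stdout_text):
    """Parse one JSON object per line; skip anything that is not a JSON object."""
    results = []
    for line in stdout_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            item = json.loads(line)
        except ValueError:
            # PhantomJS may also print warnings or page console output; ignore it.
            continue
        if isinstance(item, dict):
            results.append(item)
    return results


def crawl_page(url, script="crawler.js", timeout=60):
    """Run the (hypothetical) PhantomJS script on one URL and return its results."""
    proc = subprocess.run(
        ["phantomjs", script, url],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return parse_crawler_output(proc.stdout)
```

Keeping the parsing separate from the process call makes the fragile part (mixed output from the page and from PhantomJS itself) easy to check without a browser.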

With basic dynamic crawling in place, a few small tips improve the crawler's stability and efficiency: fill in forms automatically (in some cases an empty parameter prevents the form from being submitted); block unnecessary resources (jpg, png, css, mp4, etc.); block navigation after the page has loaded (to prevent jumps caused by triggered events); hook functions that block the page (alert, prompt); and propagate triggered events down to child nodes (this works around non-standard front-end code that binds events to overly broad DOM nodes, but in testing it hurts efficiency considerably).

0x04 de-duplication

De-duplication is the core of a crawler and the part that most affects its efficiency and results. If de-duplication is too loose, the crawler cannot finish a site with many pages. If it is too strict, the crawl yields too few results, which hurts subsequent scanning.

De-duplication generally has two parts: task-queue de-duplication and result-queue de-duplication. The difference is that the task queue changes constantly (it grows and shrinks), while the result queue only grows. Task-queue de-duplication has to run repeatedly during the scan: after a page is crawled, the links obtained are de-duplicated before being added to the task queue for the next crawl (or once per completed depth level). The result queue is de-duplicated after all tasks finish, so it does not affect the crawler's running efficiency, only the final output. The two can use the same strategy or different ones (task-queue de-duplication can be adjusted to the current task volume).

We list the crawler's de-duplication requirements one by one:
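The split between the two queues can be sketched as a small breadth-first loop. This is an illustration under stated assumptions, not the author's code: fetch_links, task_key and result_key are hypothetical callables supplied by the caller, so the two de-duplication strategies can differ.

```python
from collections import deque


def crawl(start_url, fetch_links, task_key, result_key, max_depth=5):
    """Breadth-first crawl with separate task-queue and result-queue de-duplication."""
    seen_tasks = {task_key(start_url)}
    queue = deque([(start_url, 0)])
    crawled = []
    while queue:
        url, depth = queue.popleft()
        crawled.append(url)
        if depth >= max_depth:
            continue
        for link in fetch_links(url):
            # Task-queue de-duplication: runs every time new links arrive.
            key = task_key(link)
            if key not in seen_tasks:
                seen_tasks.add(key)
                queue.append((link, depth + 1))
    # Result-queue de-duplication: runs once, after all tasks have finished.
    unique = {}
    for url in crawled:
        unique.setdefault(result_key(url), url)
    return list(unique.values())
```

For example, the task key can be the raw URL (loose, so nothing is missed while crawling) while the result key ignores the query string (strict, so the final report stays short).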

1. Basic: filter out URLs that do not belong to the target site

2. Basic: remove exact duplicate URLs and URLs that differ only in parameter order

3. Intermediate: analyze parameters and collapse enumerated values, e.g. page.php?id=1, page.php?id=2

4. Intermediate: support de-duplication of pseudo-static URLs

5. Advanced: de-duplicate odd URLs, e.g. test.php?a=1?b=2?from=233, test.php?a=1?b=3?from=test

6. Advanced: dynamically adjust the de-duplication strength according to the current task volume

The first two basic requirements are simple to implement: extract the domain name and the parameter list, compare them, and a single loop solves the problem.

The third requirement means matching parameter values by type, such as int, hash, Chinese, URL-encoded, etc. Note that English parameter values cannot simply be replaced by pattern matching. For example:

Here the m, c and a parameters stand for the module, controller and action. They are "functional parameters" and must be preserved. In most cases the value of a functional parameter is alphabetic (a meaningful word), but sometimes it is numeric or a mix of letters and digits. So what should the strategy be?

The current solution is fairly crude: leave purely alphabetic values alone, and apply "elastic de-duplication" to mixed alphanumeric values according to the task volume (see requirement 6). For example:
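A minimal sketch of these two basic requirements, using only the standard library. The function names and the example.com host used below are illustrative assumptions, not part of the original script.

```python
from urllib.parse import parse_qsl, urlsplit


def basic_key(url, target_host):
    """Return a de-duplication key, or None for URLs outside the target site."""
    parts = urlsplit(url)
    if parts.hostname != target_host:
        return None
    # Sorting the parameters makes ?a=1&b=2 and ?b=2&a=1 compare equal.
    params = tuple(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return (parts.path or "/", params)


def dedup_basic(urls, target_host):
    """Keep the first URL for each key, dropping off-site URLs and duplicates."""
    seen = set()
    kept = []
    for url in urls:
        key = basic_key(url, target_host)
        if key is None or key in seen:
            continue
        seen.add(key)
        kept.append(url)
    return kept
```

Calling dedup_basic(urls, "example.com") on a list that contains the same page with reordered parameters, plus a link to another host, keeps a single copy of the page and drops the foreign link.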

# Before de-duplication:
# Processing:
{"m": "home", "c": "test", "id":"{{int}}"}
{"m": "home", "c": "test", "id":"{{int}}"}
{"m": "home", "c": "test", "type":"friday"}
{"m": "home", "c": "test", "type":"{{hash}}"}
{"m": "home", "c": "test", "type":"{{hash}}"}
# Result:

The fourth requirement is de-duplication of pseudo-static URLs. First we define the path de-duplication strategy: split the path on / and pass each segment to the function that processes parameter values (matching segments are replaced with the designated placeholder, non-matching ones are returned unchanged), then de-duplicate on the rewritten URL. Of course, some pseudo-static URLs look like this:
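The value-matching step can be sketched roughly as follows. The {{int}} and {{hash}} placeholders come from the example above; the {{urlencoded}} and {{chinese}} placeholder names, and the choice of MD5/SHA-1 lengths for detecting hashes, are assumptions made for this sketch.

```python
import re
from urllib.parse import unquote

# Assumption: a "hash" is a hex string of MD5 (32) or SHA-1 (40) length.
HASH_RE = re.compile(r"[0-9a-fA-F]{32}|[0-9a-fA-F]{40}")
CJK_RE = re.compile(r"[\u4e00-\u9fff]")


def classify_value(value):
    """Replace a parameter value with a type placeholder when it matches a known type."""
    if re.fullmatch(r"\d+", value):
        return "{{int}}"
    if HASH_RE.fullmatch(value):
        return "{{hash}}"
    if "%" in value and unquote(value) != value:
        return "{{urlencoded}}"
    if CJK_RE.search(value):
        return "{{chinese}}"
    # Purely alphabetic and mixed values are left alone at this stage.
    return value


def normalize_params(params):
    """Apply classify_value to every value of a parameter dict."""
    return {name: classify_value(value) for name, value in params.items()}
```

Two URLs are then treated as duplicates when their normalized parameter dicts are equal, which is why functional parameters such as m and c survive while id values collapse.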

htttp:// htttp://

The strategy above handles these too crudely. What can we do? Read on.

The fifth requirement concerns odd URLs. The de-duplication strategies so far work by analyzing and replacing parameter values and path names, but these odd URLs do not follow the usual conventions at all, so an exceptional measure is needed: replace certain interfering characters before splitting the parameters and the path. For example:
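Before moving on, the basic path strategy from requirement 4 (split on /, normalize each segment) might look like the sketch below. Treating a trailing extension such as .html separately is an assumption of this sketch, as are the example.com URLs in the usage note.

```python
import re
from urllib.parse import urlsplit


def normalize_segment(segment):
    """Replace a numeric or hash-like path segment with a placeholder, keeping any extension."""
    stem, dot, ext = segment.rpartition(".")
    if not dot:
        # No extension: the whole segment is the stem.
        stem, ext = segment, ""
    if re.fullmatch(r"\d+", stem):
        stem = "{{int}}"
    elif re.fullmatch(r"[0-9a-fA-F]{32}", stem):
        stem = "{{hash}}"
    return stem + dot + ext


def path_key(url):
    """Build a de-duplication key from the normalized path of a URL."""
    path = urlsplit(url).path
    return "/".join(normalize_segment(s) for s in path.split("/"))
```

With this, http://example.com/news/2020/123.html and http://example.com/news/2019/456.html share one key, while /about/team is left unchanged.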

# Before processing
# After replacement
{{int}}?from=te?user={{int}}
{{int}}?from=te?user={{mix_str}}
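One possible reading of this step is a regular-expression pass over the raw path-and-query string, applied before any splitting on ? or &. The sketch below follows that reading; the token rule (any alphanumeric run containing a digit) and the sample URLs in the usage note are assumptions, and only the {{int}} and {{mix_str}} placeholders come from the example above.

```python
import re


def pre_replace(path_and_query):
    """Replace digit-bearing tokens in the raw string before it is split into parts."""
    def repl(match):
        token = match.group(0)
        if token.isdigit():
            return "{{int}}"
        if any(ch.isdigit() for ch in token):
            return "{{mix_str}}"
        # Purely alphabetic tokens (paths, parameter names, words) are kept.
        return token

    return re.sub(r"[A-Za-z0-9]+", repl, path_and_query)
```

Because the replacement never depends on where the separators are, a URL such as /u/1024?from=te?user=42 becomes /u/{{int}}?from=te?user={{int}} even though its query string is malformed.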

The sixth requirement is to adjust the de-duplication strategy automatically according to the current task volume. In some cases the routines above do not work well, for example: 今天是阴天 ...

When user names are user-defined and there are thousands of users, the de-duplication strategies above all fail. Where does the problem lie?

The solution for requirement 3 is admittedly crude, since it ignores all purely English strings, but there is no better general solution. For this special case only, we add another pass: find the parameters that occur too often, then force de-duplication on those specific parameters. The new strategy is as follows: the first pass only preprocesses, analyzing the current parameter lists and counting. The second pass decides, from the counts, whether a given parameter must be force de-duplicated. For example:
# Result of the first pass
{
    md5(name+m): {count: 3, "name": ["friday", "test", "{{int}}"], "m": ["read"]},
}

When the number of URLs sharing a parameter list exceeds a threshold, and the number of distinct values of one of its parameters also exceeds a threshold, that parameter is force de-duplicated: its purely English string values are replaced with {{STR}}.

That method is somewhat convoluted. A cruder alternative is not to examine individual parameters at all and only check whether the number of tasks in the queue exceeds a threshold. Once it does, forced de-duplication kicks in: any URL with the same parameter list or root path is dropped outright, which may wrongly discard many pseudo-static URLs.

With these six requirements implemented, we have a simple and effective de-duplication script. The flow chart is as follows:
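The two-pass idea can be sketched as below. This is an illustration only: it keys the statistics on the sorted tuple of parameter names instead of an MD5 of them, and the threshold values are arbitrary assumptions.

```python
import re
from collections import defaultdict


def force_dedup(param_dicts, url_threshold=3, value_threshold=3):
    """Two-pass forced de-duplication over a list of parameter dicts."""
    # Pass 1: count URLs per parameter list and collect distinct values per parameter.
    stats = defaultdict(lambda: {"count": 0, "values": defaultdict(set)})
    for params in param_dicts:
        entry = stats[tuple(sorted(params))]
        entry["count"] += 1
        for name, value in params.items():
            entry["values"][name].add(value)

    # Pass 2: replace alphabetic values of over-represented parameters, then de-duplicate.
    seen, kept = set(), []
    for params in param_dicts:
        entry = stats[tuple(sorted(params))]
        normalized = {}
        for name, value in params.items():
            forced = (
                entry["count"] > url_threshold
                and len(entry["values"][name]) > value_threshold
                and re.fullmatch(r"[A-Za-z]+", value) is not None
            )
            normalized[name] = "{{STR}}" if forced else value
        fingerprint = tuple(sorted(normalized.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(normalized)
    return kept
```

A parameter such as m, which takes a single value across all URLs, stays intact, while a name parameter with many distinct alphabetic values collapses to {{STR}} once both thresholds are exceeded.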

0x05 comparison

To test the basic functionality and efficiency of our dynamic crawler (hereinafter KSpider), we compare it with the crawler of the WVS scanner (hereinafter WVSSpider), which is also based on dynamic analysis.

First we test the basic crawling functionality. The AISec vulnerability scanner test platform provides several demos, and the crawl results are as follows:

# Note: WVSSpider cannot set the crawl depth or thread count; it aggregates URLs with the same path into a SiteFile.
WVSSpider # wvs_console.exe /Crawl /SaveLogs /SaveFolder C:\Users\xxx\Desktop /ExportXML
Request Count: 31, SiteFile Count: 11, Time Count: 23
KSpider # python {"depth": 5, "thread_count": 5}
Request Count: 23, Result Count: 18, Time Cost: 33
KSpider Basic # python {"depth": 5, "thread_count": 5, "type": "basic"}
Request Count: 11, Result Count: 8, Time Cost: 1

The first two crawlers each capture five key requests:

Basic <a> tag:
JS auto-parsing:
JS auto-parsing + FORM:
JS auto-parsing + AJAX request:
Event trigger + DOM change:

The static crawler is very fast but captures only the first of the five requests. It finds the third, a POST request, through form analysis, but because the <input> tags inside the <form> are generated dynamically by JavaScript (code below), it misses the request's actual parameters.

<form method="post" name="form1" enctype="multipart/form-data" action="post_link.php">
<script>
document.write('<input type="text" name="i'+'d" size="30" value=1><br>');
document.write('<input type="text" name="m'+'sg" size="30" value="abc">');
</script>
<input type="submit" value="提交" name="B1">
</form>

Next comes the efficiency test. The target is Baidu Tieba. The results are as follows:

WVSSpider # wvs_console.exe /Crawl /SaveLogs /SaveFolder C:\Users\xxx\Desktop /ExportXML
Request Count: 201, SiteFile Count: 101, Time Count: 220
KSpider # python {"depth": 5, "thread_count": 10}
Request Count: 410, Result_length: 535, Time_Cost: 339

As the site grows more complex, WVSSpider's request count stays relatively stable, while KSpider, running with 10 threads, also completes the crawl within 6 minutes, which is acceptable.

The analysis shows that although WVSSpider is fast and efficient overall, it has some shortcomings: the crawl depth cannot be specified; it is not cross-platform; it de-duplicates pseudo-static URLs poorly (as shown in the figure below, 43 such SiteFiles, accounting for 42%); and it splits some URLs in its results (for example, a URL ending in un=111 is split into two SiteFiles, /home and /home/main), so the actual number of URLs scanned is lower than the reported count.

Because the target site has so many URLs, coverage is hard to measure precisely. We use a script to make a simple comparison of the results from WVSSpider and KSpider. Excluding static resources, KSpider covers 98% of WVSSpider's results (that is, 98% of what WVSSpider captured was also captured by KSpider), while WVSSpider covers only 38% of KSpider's results.
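The comparison script itself is not shown, but the coverage figure described here amounts to a set intersection. A minimal sketch, assuming both crawlers' results are available as collections of URL strings and taking the static-resource extensions from the list mentioned earlier (jpg, png, css, mp4):

```python
from urllib.parse import urlsplit

# Assumption: these extensions define "static resources" for the comparison.
STATIC_EXTENSIONS = (".jpg", ".png", ".css", ".mp4")


def is_static(url):
    """Return True when the URL path ends with a static-resource extension."""
    return urlsplit(url).path.lower().endswith(STATIC_EXTENSIONS)


def coverage(reference, candidate):
    """Fraction of non-static URLs in `reference` that also appear in `candidate`."""
    ref = {u for u in reference if not is_static(u)}
    cand = {u for u in candidate if not is_static(u)}
    if not ref:
        return 0.0
    return len(ref & cand) / len(ref)
```

Note that the measure is asymmetric: coverage(wvs_results, kspider_results) answers how much of WVSSpider's output KSpider also found, and swapping the arguments answers the opposite question. In practice the URLs would first be normalized with the same de-duplication key so that trivially different forms compare equal.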

0x06 summary

Besides de-duplication and dynamic analysis, a few other small techniques help the crawler's coverage and subsequent scanning tasks: fuzzing common paths, extracting information from robots.txt, identifying sensitive information during the crawl, building a profile of the site, and so on.

This article has covered the problems likely to arise when implementing a dynamic crawler and their solutions. Good code is not written overnight; it needs continuous optimization and tuning. We will consider open-sourcing it later. Feedback and discussion are welcome.
