Analysis of Malicious Internet Crawlers in the First Half of 2018: A Panoramic View of Crawlers and Anti-Crawlers

Posted by trammel at 2020-03-08


Apart from the contest between security experts and hackers, the fiercest battlefield on the Internet is probably that of crawlers and anti-crawlers. According to statistics, crawler traffic already exceeds the traffic of requests made by real humans. The Internet is full of crawlers of all kinds, and users of every size, in the cloud and in traditional industries alike, are being targeted by crawler enthusiasts. Where do these crawlers come from? Whose data is being crawled? Where will the data be used?

Recently, Tencent Cloud released a series of security research reports covering the first half of 2018. The series focuses on the most common security threats faced by cloud users, uses statistics to reveal the current state of attacks, and traces attacks back to their sources to profile the attackers, so that enterprises and other users have something to follow when responding to attacks and can rely on it as security guidance. For this report, Yunding Lab captured a large volume of crawler request traffic and real source IPs through its deployed threat awareness system, and analyzed Internet crawler behavior based on the hundreds of millions of crawler requests captured in the first half of 2018.

1、 Basic concepts

1. What is a crawler?

Crawlers originated with search engines. A crawler is a program that automatically fetches information from the Internet according to certain rules.

A search engine is a well-behaved crawler. It crawls all the pages of a website, gives other users a fast way to search for and visit them, and brings traffic to the website. For this reason, the industry agreed on the robots protocol, a gentlemen's agreement that lets search engines and the websites they index coexist harmoniously.

This win-win situation was quickly destroyed by some people. Like other technologies, crawlers are a double-edged sword, and they stopped behaving like "gentlemen". Especially in recent years, the rise of "big data" has led many companies to crawl other companies' data without restraint, and "malicious crawlers" began to flood the Internet.

This report focuses on "malicious crawlers", not on search engine crawlers or legal crawlers.

2. Classification of crawlers

By function, crawlers can be divided into web crawlers and interface crawlers.

Web crawler: mainly search engine crawlers, which traverse and crawl pages by following the hyperlinks on them.

Interface crawler: obtains a large amount of data by precisely constructing the request data for a specific API.

By authorization, crawlers can be divided into legal crawlers and malicious crawlers.

Legal crawler: a crawler that crawls web pages in compliance with the robots protocol, crawls open network interfaces, or crawls interfaces for which authorization has been purchased. This kind of crawler usually does not need to consider anti-crawler measures or other adversarial work.

Malicious crawler: a crawler that analyzes and constructs parameters on its own in order to crawl or submit data on non-public interfaces. It obtains, in bulk, data that the other party does not want acquired, and it may severely degrade the performance of the other party's servers. There is usually a fierce fight between such crawlers and anti-crawler measures.
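As a rough illustration of what "conforming to the robots protocol" means in practice, the sketch below uses Python's standard urllib.robotparser to check a URL before fetching it. The robots.txt content, the user agent name and the URLs are hypothetical examples, not taken from the report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (an assumption for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /api/private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A well-behaved crawler checks before every fetch and skips disallowed paths.
    for url in ("https://example.com/news/1",
                "https://example.com/api/private/users"):
        print(url, is_allowed(ROBOTS_TXT, "example-bot", url))
```

A legal crawler in the sense above would skip the second URL. A malicious crawler simply ignores this check, which is why robots.txt is only a gentlemen's agreement and not a defense.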

3. Where does the data come from?

Crawlers don't produce data; they just move it. To study crawlers, you have to study where the data comes from. Small companies in particular often need more external data to support business decisions, and how to obtain valuable data from the vast Internet is a question many companies keep asking. Generally speaking, there are several major data sources:

• User data generated by enterprises

For example, BAT (Baidu, Alibaba, Tencent) and similar companies have huge user bases, and their users generate a large amount of raw data every day.

This also includes PGC (professionally generated content) and UGC (user-generated content), such as news, self-media, microblogs and short videos.

• Public data of government and institutions

For example, public information and data from statistics bureaus, industry and commerce administrations, intellectual property offices, and banking and securities institutions.

• Third-party database purchases

There are many commercial and academic databases on the market, such as Bloomberg, CSMAR, Wind and CNKI (HowNet). They are generally purchased in the name of an organization, such as a consulting company, a university or a research institution.

• Network data obtained by crawlers

Data obtained with crawler technology, by crawling web pages or by calling public and non-public interfaces.

• Data exchange between companies

Different companies exchange data with each other to fill the gaps in their own data.

• Data theft by commercial spies or hackers

User data of other companies is obtained through insiders, through unconventional means such as hiring hackers to carry out targeted intrusions, or by buying it on the underground black market. Insider leaks are far more common than hacking.

2、 Targets of malicious crawlers

Of the data sources summarized above, third-party database purchases and data theft do not involve crawlers. The real targets of malicious crawlers are mainly the data of Internet companies and relevant government departments.

Overall distribution by industry

By tagging the large volume of captured malicious crawler traffic, we ranked the top 10 industries by malicious crawler traffic. The details are as follows:

The statistics show that travel has the highest share of malicious crawler traffic, ahead of e-commerce and social networking, followed by reviews, operators, public administration and others. The sections below analyze the industries one by one:

1. Travel

Real-time train ticket information

Malicious crawler visits to train ticketing platforms account for nearly 90% of crawler traffic in the travel industry. On analysis, this is reasonable. Hundreds of cities and thousands of trains make up the domestic railway network, and combining stations with train numbers yields a very large data set. With the rapid shift from manual ticketing to Internet ticketing, there are more and more third-party agents and ticket-grabbing service providers, and any one of them needs a large crawler cluster to keep its data refreshed in real time. As a result, train ticket sites have become the most heavily crawled business.
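To get a feel for why combining stations produces such a large query space, here is a back-of-the-envelope sketch. All the numbers (500 stations, a 30-day booking window, 24 refreshes per day) are hypothetical assumptions chosen for illustration; the report only says "hundreds of cities and thousands of trains".

```python
def ordered_station_pairs(num_stations: int) -> int:
    """Number of (origin, destination) pairs with origin != destination."""
    return num_stations * (num_stations - 1)


def daily_queries(num_stations: int, days_ahead: int, refreshes_per_day: int) -> int:
    """Queries per day needed to refresh every pair for every bookable date."""
    return ordered_station_pairs(num_stations) * days_ahead * refreshes_per_day


if __name__ == "__main__":
    # Hypothetical inputs, not figures from the report.
    stations, days_ahead, refreshes = 500, 30, 24
    print(ordered_station_pairs(stations))                # 249500
    print(daily_queries(stations, days_ahead, refreshes))  # 179640000
```

Under these assumed numbers, a single service provider that wanted hourly-fresh data for every station pair would issue on the order of 180 million queries per day, which is consistent with the report's point that real-time refreshing requires a large crawler cluster.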

Real-time air ticket information

Air tickets account for 8.77% of malicious crawler traffic in the travel industry; the crawlers mainly target the real-time ticket prices of the major airlines.

Real-time public transport information

Crawlers mainly target the GPS information of public transport within cities.

Real-time shared bicycle information

Crawlers mainly target real-time information on shared bicycles around a specific area.

Information about hotel vacancies

Hotel crawling accounts for a relatively small share and mainly targets hotel prices; it is negligible compared with the transportation categories.

2. Social networking

Because most domestic social platforms are purely app-based, and some do not offer a web version at all, the social crawlers captured mainly target microblog platforms, crawling user information and published content.

3. E-commerce

Crawlers in the e-commerce industry mainly crawl product information, prices and similar data. Because of the difference in business models, C2C e-commerce has a large number of small and medium-sized sellers and far more products than B2C e-commerce. C2C therefore accounts for nearly 90% of malicious crawler traffic in e-commerce, and B2C for about 10%.

4. O2O

In the O2O industry, malicious crawlers mainly target review and group-buying companies. Most of what they crawl is shops' review feeds and star ratings, which account for more than 90% of the total.

5. Public administration

Malicious crawlers in public administration mainly target conventional business information such as court documents, intellectual property, enterprise information and credit information. Another popular target is registration platforms; judging from the data, this traffic appears to come from the slot-grabbing services offered around some registration platforms.

6. Telecom operators

Malicious crawler traffic against telecom operators mainly targets queries for the operators' various Internet-plan SIM cards. Because these cards offer good value for money, related industrial chains have grown up online for swiping, grabbing and buying cards on behalf of others.

There are many tools online that use crawler technology to search for vanity phone numbers. The user selects a type of SIM card, and the tool then keeps crawling the phone numbers on sale to find ones that match the desired vanity-number pattern. The following is a screenshot of one such scanning tool, in which dozens of different Internet-plan cards can be selected:

(screenshot of a mobile phone number scanning tool)

7. Self-media

According to these statistics, self-media crawlers mainly target WeChat subscription account keyword searches and article access, which account for 64.91% and 20.73% of the total respectively; other self-media platforms account for about 14.36%.

8. Maps

Map crawlers are fairly conventional; they mainly crawl detailed information about the merchants around a given location.

9. SEO

Malicious SEO crawlers repeatedly search for related terms in order to influence search engine rankings.

10. News

Malicious news crawlers mainly crawl news from aggregation news apps and major portals. They mostly target search engines' news platforms and the data of aggregation apps; traditional portals are crawled less.

11. Other

Other main areas visited by crawlers include news, recruitment, Q&A, encyclopedias, logistics, classified information and novels; they are not listed one by one.

3、 Crawler source IP distribution

1. Country distribution

Of the crawler source IPs captured for this half-year's statistics, the vast majority, more than 90%, come from China, followed by the United States, Germany, Japan and other countries.

2. Domestic distribution

Within China, the traffic comes mainly from Beijing, Tianjin, Hebei, Shanghai and other provinces and municipalities; these four regions account for more than 70% of domestic malicious crawler traffic. This is not because the crawler authors are from these areas, but because a large number of crawlers are deployed in rented IDC machine rooms, most of which are located in developed provinces and cities.

3. Network distribution

The figure shows the network distribution of malicious crawler source IPs. More than half come from the networks of domestic telecom operators, and most of those come from the operators' IDC machine rooms. Among cloud computing companies, the main domestic cloud providers all appear.

Overall, most malicious crawlers come from IDC machine rooms. As malicious programs move to the cloud, cloud computing companies should detect and deal with the abuse of cloud resources in a timely manner.

4、 The fight between crawlers and anti-crawlers

Since this is one of the fiercest battlefields of Internet confrontation, any discussion of crawlers has to mention anti-crawlers. When anti-crawler engineers shut crawlers out, crawler engineers do not simply accept defeat; they quickly develop a variety of counter-anti-crawler techniques.

1. Who are the opponents

The fight between crawlers and anti-crawlers has a long history. To do anti-crawler work well, we need to know who the opponents are so that we can choose the right strategies. The opponents of anti-crawler engineers usually fall into the following categories:

• New graduates

There is usually a peak in crawler activity around March every year, which is related to that year's graduates (bachelor's, master's and doctoral). They need data to support their theses; their crawlers are simple and crude, ignore the load on the server, and appear in unpredictable numbers.

• Small start-up companies

Start-ups lack data of their own, so in order to survive they crawl other companies' data. However, they usually do not last long and are easily driven off by anti-crawler measures.

• Established business competitors

These are the biggest opponents of anti-crawler work. They have both money and technology, and if necessary they will use distributed crawling, machine rooms in multiple provinces, ADSL and other means to crawl over the long term. If the two sides keep fighting, the final result may be that they settle into a balance.

• Out-of-control crawlers

Many crawlers are forgotten by their programmers once they are running on a server. They may not have been able to fetch any data for a long time, yet they keep consuming server resources until the server hosting the crawler expires.

2. Technical confrontation

Just like security experts and hackers, crawler engineers and anti-crawler engineers have a love-hate relationship: each move is met with a countermove, and both sides' techniques spiral upward. After several rounds of technical upgrades, the commonly used anti-crawler measures and the corresponding countermeasures are as follows:

• CAPTCHA (verification code)

CAPTCHAs are the most commonly used anti-crawler measure, but simple ones can be recognized automatically by machine learning, usually with an accuracy of 50% or higher.

Complex CAPTCHAs are solved by submitting them to specialized CAPTCHA-solving platforms, where human workers type in the answers. Depending on the complexity of the CAPTCHA, the workers charge 1-2 cents per code on average. These are therefore also easy to bypass, leaving the data easy to crawl.

• IP blocking

This is the most effective scheme, but also the one most prone to blocking legitimate users by mistake. The strategy rests on the premise that IP addresses are scarce. Today, pools of hundreds of thousands of IPs can be obtained at low cost by buying proxy pools or using dial-up VPS hosts, so simple IP blocking strategies are becoming less and less effective.
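The weakness described above can be sketched with a minimal per-IP sliding-window limiter. This is an illustrative sketch, not the report's or any vendor's implementation; the class name, the threshold of 10 requests per 60 seconds and the documentation-range IP addresses are all assumptions.

```python
from collections import defaultdict, deque


class IpRateLimiter:
    """Sliding-window limiter: reject an IP once it exceeds max_requests per window."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def count_allowed(limiter: IpRateLimiter, ips: list, num_requests: int) -> int:
    """Send num_requests, 0.1 s apart, rotating through ips round-robin."""
    allowed = 0
    for i in range(num_requests):
        if limiter.allow(ips[i % len(ips)], now=i * 0.1):
            allowed += 1
    return allowed


if __name__ == "__main__":
    single_ip = ["203.0.113.1"]
    ip_pool = [f"198.51.100.{n}" for n in range(1, 21)]  # hypothetical 20-IP pool
    print(count_allowed(IpRateLimiter(10, 60.0), single_ip, 100))
    print(count_allowed(IpRateLimiter(10, 60.0), ip_pool, 100))
```

With a single IP, only the first 10 of the 100 requests get through. With the 20-IP pool, each address sends just 5 requests and stays under the threshold, so all 100 succeed. Lowering the threshold to compensate is what makes the scheme prone to blocking real users who share an address.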

• Slider verification

Slider verification incorporates machine learning; the user only needs to drag a slider instead of reading letters that are sometimes too distorted for the human eye to make out. However, because some vendors implement the checking algorithm rather simply, a fairly simple simulated slide is often enough to bypass it, and the data is maliciously crawled.
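To illustrate why an overly simple check is easy to bypass, the sketch below assumes, purely for illustration, that the weak check only looks at how long the slide took, and contrasts it with a slightly stricter check on speed variation. The function names and thresholds are hypothetical and do not describe any real vendor's algorithm.

```python
from statistics import pstdev

# A track is a list of (timestamp_seconds, x_position_pixels) samples.


def naive_timing_check(track, min_s=0.3, max_s=5.0) -> bool:
    """Weak check: accept any slide whose total duration looks plausible."""
    duration = track[-1][0] - track[0][0]
    return min_s <= duration <= max_s


def speed_variation_check(track, min_stdev=20.0) -> bool:
    """Stricter check: human drags speed up and slow down, so speed should vary."""
    speeds = [(x1 - x0) / (t1 - t0)
              for (t0, x0), (t1, x1) in zip(track, track[1:])]
    return pstdev(speeds) >= min_stdev


def scripted_track(distance=200.0, duration=1.0, steps=20):
    """A crude simulated slide: perfectly constant speed."""
    return [(duration * i / steps, distance * i / steps) for i in range(steps + 1)]


def eased_track(distance=200.0, duration=1.0, steps=20):
    """A slide that accelerates and then decelerates (smoothstep easing)."""
    track = []
    for i in range(steps + 1):
        u = i / steps
        track.append((duration * u, distance * (3 * u * u - 2 * u ** 3)))
    return track


if __name__ == "__main__":
    for name, track in (("scripted", scripted_track()), ("eased", eased_track())):
        print(name, naive_timing_check(track), speed_variation_check(track))
```

The constant-speed slide passes the duration-only check but fails the speed-variation check. The eased slide passes both, which also shows the limit of the stricter check: a crawler can generate eased tracks just as easily, so each added feature only raises the cost of simulation.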

• Correlating request context

Anti-crawler systems can judge whether a request comes from a real person by checking, through a token or the context of the network requests, whether the client has gone through the complete flow. However, for technicians with protocol analysis skills, fully simulating the flow is not too difficult.
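One common way to tie a request to an earlier step is a signed, short-lived token. The sketch below is a minimal illustration under assumptions: the secret key, the function names, the token layout and the 300-second lifetime are all hypothetical, and a real system would bind more context than a session ID.

```python
import hashlib
import hmac

SECRET_KEY = b"hypothetical-server-secret"  # assumption: known only to the server


def issue_token(session_id: str, issued_at: int) -> str:
    """Issued when the client completes the earlier step, e.g. loads the page."""
    message = f"{session_id}:{issued_at}".encode()
    signature = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return f"{issued_at}:{signature}"


def verify_token(session_id: str, token: str, now: int, max_age: int = 300) -> bool:
    """Accept the follow-up data request only if the token is genuine and fresh."""
    try:
        issued_at_text, signature = token.split(":", 1)
        issued_at = int(issued_at_text)
    except ValueError:
        return False  # malformed token
    expected = issue_token(session_id, issued_at).split(":", 1)[1]
    fresh = 0 <= now - issued_at <= max_age
    # compare_digest avoids leaking information through comparison timing.
    return fresh and hmac.compare_digest(signature.encode(), expected.encode())


if __name__ == "__main__":
    token = issue_token("session-1", issued_at=1000)
    print(verify_token("session-1", token, now=1010))  # same session, fresh
    print(verify_token("session-2", token, now=1010))  # token from another session
    print(verify_token("session-1", token, now=2000))  # expired
```

A crawler that calls the data interface directly has no valid token and is rejected. As the text notes, though, a crawler author who analyzes the protocol can simply request the earlier step first and replay the token, so this measure raises the cost of crawling without ruling it out.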

• JavaScript-based computation

Simple crawlers cannot execute JavaScript. If some intermediate results have to be parsed and computed by a JS engine, attackers cannot crawl with plain requests. However, crawler developers can still embed a JS engine module of their own, or directly use PhantomJS or another headless browser to do the parsing automatically.

• Increasing the cost of data acquisition

When facing professional players, the only option is to raise the other side's labor cost, for example through code obfuscation, dynamic encryption schemes and fake data, exploiting the fact that development is faster than analysis to wear down the other side's will. If the other side refuses to let go, the fight can only continue until one side gives up because of machine or labor costs.

When the confrontation reaches this stage, as in security confrontation, the technical struggle enters a "balance period". By now the anti-crawler engineers have fended off most of the low-level players, and the remaining high-level crawler engineers tacitly keep their crawl rate low enough not to put much pressure on the servers. The two sides are like Tai Chi practitioners doing push hands. How can this balance be broken next?

5、 A new approach to the fight: cloud AI anti-crawler

With the rise of cloud computing, a third party has gradually joined the fight between crawlers and anti-crawlers. Cloud computing companies can provide anti-crawling capabilities directly to enterprises, turning the battle from a 1v1 between anti-crawler and crawler into a 2v1 of enterprise plus cloud provider against the crawler, and strengthening enterprises' anti-crawling capabilities.

In recent years in particular, AI technology has kept making breakthroughs and has offered new ways to solve many problems. From this perspective, Yunding Lab analyzed massive amounts of real malicious crawler traffic with deep learning. It believes that introducing AI into the anti-crawler field can be an excellent complement to existing measures, and that this will be the trend in the field.

To this end, Tencent Cloud Website Manager (WAF) and Yunding Lab have built a more general crawler identification model based on massive real crawler traffic, and it has proved very effective. In the future, they will be committed to opening their strongest identification capability to enterprises.
