About anti-crawlers, this one article is enough

Posted by tzul at 2020-03-22

Statement: this is a CSDN original article by the author; reprinting in any form without permission is prohibited. Editor's note: this article is from Cui Guangyu, R&D manager of Ctrip's hotel R&D department, who presented it in the third session of Ctrip's technology micro-sharing series; what follows is an edited summary of the talk. Courseware: the slides and video are shared alongside. Responsible editor: Qian Shuguang, who focuses on architecture and algorithms; for reports or contributions, email [email protected].

Have you ever been plagued by crawlers? Does the word "crawler" make your skin crawl a little? Be patient: with a bit of work, you can let them win in name but lose in substance.

1、 Why anti-crawler

1. Crawlers account for a high proportion of total PV, which wastes server money (especially in March)

What is this "March crawler" phenomenon? Every March we see a peak in crawler traffic.

At first we were puzzled. Then one April we deleted a URL, and a crawler kept hitting it anyway, producing a flood of errors, and the testers started pestering us about it. We ended up shipping a release just for that crawler, restoring the deleted URL.

But one of our team members wasn't satisfied. We couldn't stop the crawler, and we even had to ship a release for its sake, which was frankly embarrassing. So I came up with an idea: the URL can stay up, but I will never serve real data from it.

So we published a static file. The errors stopped, but the crawler didn't, which means the other side never noticed that everything was fake. This incident taught us a lot, and it became the core of our anti-crawler approach: change.

Later a student applied for an internship. Looking at her resume, we saw she had crawled Ctrip, and during the interview I confirmed she was the one we had published that fake page for in April. But since she was a girl with good skills, we hired her anyway; she has almost converted to full-time now.

Later, in a discussion, she mentioned that many master's students crawl OTA data for sentiment analysis when writing their theses. The thesis is due in May, and, well, you know how it goes: the early months are all Dota and LoL, so by March it's getting late; scramble for data, analyze it in April, hand in the thesis in May.

That's the rhythm.

2. Resources the company offers for free query get scraped in bulk, costing us competitiveness and therefore revenue.

OTA prices can be queried without logging in; that is a bottom line. If login were required, we could ban accounts to make the other side pay a price, which is what many sites do. But we cannot force users to log in. Without anti-crawling measures, competitors can copy our information in bulk, and our competitiveness drops sharply.

If competitors can scrape our prices, users will eventually learn they only need to visit the competitor, with no reason to come to Ctrip at all. That is bad for us.

3. Are crawlers breaking the law? If so, can we sue for damages? That would make money.

I consulted our legal department specifically on this, and it turns out this is still a gray area in China: a lawsuit might succeed, or it might go nowhere. So technical measures remain our last line of defense.

2、 What kinds of crawlers are there

1. Very junior fresh graduates

The March crawlers we mentioned at the start are a typical example. Fresh graduates' crawlers are usually simple and crude: they ignore server load, their numbers are unpredictable, and they can easily take a site down.

By the way, crawling Ctrip will no longer get you an offer here. As everyone knows, the first person to compare a beautiful woman to a flower is a genius. The second one... well, you know.

2. Small, low-level startups

There are more and more startups now, and who knows who talked them into it. Once they start, they realize they have no idea what to do, but big data is hot, so big data it is.

The analysis program is nearly written when they discover they have no data on hand.

What to do? Write a crawler. So countless small crawlers, for the sake of their company's survival, scrape data around the clock.

3. Carelessly written, runaway crawlers that nobody stops

Sometimes as much as 60% of the visitors to Ctrip's review pages are crawlers. We block them outright, and they keep crawling tirelessly anyway.

What does that mean? They can't actually get any data: apart from the HTTP status code being 200, everything they receive is wrong, yet the crawler never stops. These are probably small crawlers hosted on some server, long since abandoned, still diligently at work.

4. Organized business competitors

This is the toughest opponent. They have skills, money, and clear goals. If they fight you, you have no choice but to fight back.

5. Search engines gone wild

Don't assume search engines are always well-behaved. They too have episodes where a burst of crawling degrades server performance, with request volumes indistinguishable from a network attack.

3、 What are crawlers and anti-crawlers

Because anti-crawling is still a relatively new field, some definitions had to be made by ourselves. Our internal definitions are in the shared slides; the principle that matters most is this:

Keep in mind that human cost is also a resource, and a more important one than machines. By Moore's law, machines keep getting cheaper; by the industry's trend, programmers' salaries keep getting more expensive. So making the other side's programmers work overtime is the real win; machine cost is comparatively cheap.

4、 How to write a simple crawler

To build anti-crawler defenses, we first need to know how to write a simple crawler.

The crawler information you can find online today is very limited, usually just a snippet of Python code. Python is a fine language, but it is not the best choice against a site with anti-crawler measures.

More ironically, the Python crawler code you find usually ships with a Lynx user agent. With a user agent like that, what do you expect the defender to do with you?

Writing a crawler usually involves a few standard steps.

For example, open the Ctrip production URL directly. On the details page, click "OK" to load the price. Assuming the price is what you want, which of the captured network requests returns it?

The answer is surprisingly simple: just sort the captured requests by the amount of data transferred, in descending order. However obfuscated the other URLs are, developers won't pad them with extra data for no reason.
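The "sort by size" trick can be sketched in a few lines. The captured requests and URLs below are invented for illustration; in practice they would come from the browser's network panel or a proxy log.

```javascript
// Toy illustration of the "sort by payload size" trick: given a list of
// captured network requests (URL + response size in bytes), the data
// endpoint almost always floats to the top once sorted by size descending.
const captured = [
  { url: "/static/logo.png", bytes: 4120 },
  { url: "/api/hotel/price?id=123", bytes: 58211 },
  { url: "/js/track.min.js", bytes: 1988 },
  { url: "/api/session/ping", bytes: 112 },
];

function likelyDataEndpoints(requests, topN = 1) {
  return [...requests]
    .sort((a, b) => b.bytes - a.bytes) // largest responses first
    .slice(0, topN)
    .map((r) => r.url);
}

console.log(likelyDataEndpoints(captured)); // the price API is first
```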

5、 How to write an advanced crawler

So what makes a crawler "advanced"? The usual claims are as follows:

1. Distributed

Some textbooks tell you that crawlers must be distributed across many machines for crawling efficiency. That is a complete lie. The only real function of distribution is to dodge IP bans. Blocking by IP is the ultimate weapon and works very well; of course, accidentally hitting real users along the way feels just "great".
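As a rough sketch of what "distributed" really buys, here is a toy round-robin proxy pool; the proxy addresses are placeholders, not real services.

```javascript
// Minimal sketch of why "distributed" crawling is about IPs, not speed:
// a round-robin pool that hands out a different source address per request,
// so a per-IP threshold on the server side never trips.
class ProxyPool {
  constructor(proxies) {
    this.proxies = proxies;
    this.i = 0;
  }
  next() {
    const p = this.proxies[this.i % this.proxies.length];
    this.i += 1;
    return p;
  }
}

const pool = new ProxyPool(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]);
// Each request goes out through a different address; the pool wraps around.
const used = [pool.next(), pool.next(), pool.next(), pool.next()];
console.log(used);
```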

2. Simulating JavaScript

Some tutorials claim that simulating JavaScript to scrape dynamic pages is an advanced technique. In fact it is a very simple job. If the target has no anti-crawler measures, you can hit the Ajax endpoint directly without caring how the JS processes it. If the target does have anti-crawler measures, the JavaScript will be very complex, and the real work is analyzing it, not merely simulating it.

In other words: this should be a basic skill.

3. PhantomJS

This is an extreme example. PhantomJS was meant for automated testing, but many people use it as a crawler because it works so well. It has one fatal flaw, though: efficiency. Besides, PhantomJS can also be detected; for various reasons we won't go into that here.

6、 Advantages and disadvantages of different crawlers

The lower-level a crawler is, the easier it is to block, but the better its performance and the lower its cost. The more advanced a crawler, the harder it is to block, but the worse its performance and the higher its cost.

When the crawler's cost is high enough, we can stop blocking it. Economics calls this the marginal effect: once the cost is high enough, there isn't much profit left in it for the crawler.

Comparing the resources on both sides, fighting unconditionally is simply not cost-effective. There is a sweet spot; beyond that point, let them crawl in peace. After all, we fight crawlers for business reasons, not for pride.

7、 How to design an anti-crawler system (conventional architecture)

A friend once proposed such an architecture to me, in three short statements (the diagram is in the slides):

At the time I thought it sounded reasonable, worthy of an architect, a way of thinking quite different from ours. Then we actually tried it, and it was wrong. Because:

If we could reliably identify crawlers, why would we need all the rest? We could do whatever we wanted with them. And if we can't identify a crawler, whom exactly are we "handling appropriately"?

Two of the three statements are nonsense, only one is useful, and no concrete implementation is given. So what use is this architecture?

There is a problem of architect worship these days: many small startups recruit developers under the title "architect". The title offered is "junior architect", but architect is itself a senior position, so why would there be a junior one? It's like saying "junior general" or "junior commander-in-chief".

Then you join the company and find ten people: one CTO and nine architects, and you are probably the junior architect while the rest are senior. Still, "junior architect" isn't the worst of it; some small startups even hire a CTO just to write code.

Traditional anti-crawler methods

1. Count accesses in the backend; if a single IP exceeds a threshold, block it.

It works well, but it has two flaws. First, it is very easy to hit ordinary users by mistake. Second, IPs are cheap: a few tens of yuan can buy hundreds of thousands of them. So overall it is rather weak, though against the March crawlers it is very effective.
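A minimal sketch of this counter, assuming an in-memory store and invented threshold values; a production system would keep the counts in something like Redis.

```javascript
// Classic defense sketch: count hits per IP in a sliding time window and
// reject once a threshold is crossed. Window and limit values are made up.
class IpThrottle {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.hits = new Map(); // ip -> array of request timestamps
  }
  allow(ip, now = Date.now()) {
    // keep only timestamps still inside the window, then record this hit
    const recent = (this.hits.get(ip) || []).filter(
      (t) => now - t < this.windowMs
    );
    recent.push(now);
    this.hits.set(ip, recent);
    return recent.length <= this.limit;
  }
}

const throttle = new IpThrottle(3, 60000); // 3 requests per minute
const verdicts = [1, 2, 3, 4].map(() => throttle.allow("1.2.3.4", 1000));
console.log(verdicts); // the fourth request from the same IP is rejected
```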

2. Count accesses in the backend; if a single session exceeds a threshold, block it.

This looks more advanced but actually works worse, because sessions cost nothing at all: the crawler just requests a new one.

3. Count accesses in the backend; if a single user agent exceeds a threshold, block it.

This is the nuclear option, something like a broad-spectrum antibiotic. It works surprisingly well, but the collateral damage is enormous, so use it with great care. So far we have only ever temporarily banned Firefox on the Mac.

4. Combinations of the above

Combinations are more powerful and have a lower false-positive rate; they work well against low-level crawlers.
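One way such a combination might look, with invented thresholds and weights, is to block only when several weak signals agree:

```javascript
// Rough sketch of combining the three counters: each signal is noisy on its
// own, so a request is blocked only when at least two of them agree, which
// lowers the false-positive rate. All thresholds here are illustrative.
function shouldBlock({ ipHits, sessionHits, uaHits }) {
  let score = 0;
  if (ipHits > 100) score += 1;     // noisy alone (shared NAT, offices)
  if (sessionHits > 50) score += 1; // sessions are free to rotate
  if (uaHits > 10000) score += 1;   // UA bans hurt real users badly
  return score >= 2; // require agreement before blocking
}

console.log(shouldBlock({ ipHits: 150, sessionHits: 80, uaHits: 5 }));
console.log(shouldBlock({ ipHits: 150, sessionHits: 10, uaHits: 5 }));
```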

From the above you can see that crawler versus anti-crawler is a game, and the pay-to-win players are the strongest.

Because the methods above are only moderately effective, relying on JavaScript is more dependable.

Someone may object: if it's done in JavaScript, can't a crawler skip the front-end logic and call the service directly? How is that reliable? Well, I'm guilty of a misleading title. JavaScript is more than a front-end language, and skipping the front end does not mean skipping JavaScript. In other words: our server is written in Node.js.

A question to ponder: when we write code, what code do we fear most? What code is hardest to debug?

Eval is notorious for being inefficient and unreadable. That is exactly what we need.
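A toy illustration of why eval helps here: the decoding expression arrives as a string that the server can vary on every page load, so a crawler cannot hard-code the logic. The offset scheme below is purely illustrative, not Ctrip's actual mechanism.

```javascript
// The server ships an encoded value plus a decoding expression as a string.
// Because the expression changes per response, a crawler must actually run
// (or fully analyze) the JS instead of pattern-matching the page.
const offset = 917;            // chosen per-response by the server
const encodedPrice = 1217;     // real price + offset
const decoder = `encodedPrice - ${offset}`; // arrives inside the page's JS

const price = eval(decoder);   // direct eval sees the local scope
console.log(price); // 300
```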

JS has no goto statement, so we have to implement goto ourselves.
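The standard workaround, and the one obfuscators use to scramble control flow, is a loop over a switch where each case names the next "label" to jump to. A minimal sketch:

```javascript
// goto emulated as a state machine: the while/switch pair lets execution
// jump between labeled blocks in any order, so the source reads out of
// sequence, which is exactly what makes obfuscated code painful to follow.
function runScrambled() {
  const out = [];
  let label = "start";
  while (label !== "done") {
    switch (label) {
      case "start":
        out.push("a");
        label = "third"; // "goto third"
        break;
      case "second":
        out.push("b");
        label = "done";
        break;
      case "third":
        out.push("c");
        label = "second"; // jumps backwards in source order
        break;
    }
  }
  return out.join("");
}

console.log(runScrambled()); // "acb"
```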

Current minify tools usually just rename identifiers to short names like a, b, c, d, which does not meet our needs. We can minify to something more useful, such as Arabic. Why? Because Arabic is sometimes written left to right, sometimes right to left, and sometimes seems to stack from bottom to top. Unless the other side hires an Arab programmer, it is a headache to read.

Which bugs are hardest to fix? The ones that are hard to reproduce. So our code should be full of nondeterminism, different on every load.

You can download the demo code itself, which is easier to understand directly. Here is a brief outline of the ideas:

Pure JavaScript anti-crawler demo: change the link address, so the other side scrapes a wrong price. Simple, but easy to spot if they come and check by hand.

Pure JavaScript anti-crawler demo: change the key. Simple and harder to notice; it makes the other side scrape a deliberately wrong price.

Pure JavaScript anti-crawler demo: change the key dynamically. This drops the cost of rotating the key to zero, so it is even cheaper to run.

Pure JavaScript anti-crawler demo: change the key in a very complex way. This makes the logic genuinely hard to analyze; combined with the browser detection mentioned later, it becomes very hard to crawl.
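As a sketch of the "change key" family of demos (the key names and decoy scheme here are invented, not the actual demo code): the response carries the price under several keys, most of them decoys with wrong values, and only the obfuscated client JS knows which key is real.

```javascript
// The server picks a random "real" key per response; the other keys hold
// plausible-looking but wrong prices. A crawler that grabs any fixed key
// gets a fake price; the hint would be buried in obfuscated JS in practice.
function makeResponse(realPrice) {
  const keys = ["p1", "p2", "p3"];
  const realKey = keys[Math.floor(Math.random() * keys.length)];
  const body = {};
  for (const k of keys) {
    // decoys are offset so they can never collide with the real price
    body[k] =
      k === realKey
        ? realPrice
        : realPrice + 100 + Math.floor(Math.random() * 400);
  }
  return { body, hint: realKey };
}

const { body, hint } = makeResponse(300);
console.log(body[hint]); // only the hinted key holds the true price, 300
```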

That's all there is to it.

Remember the marginal effect we mentioned: this is a good place to stop. Investing more manpower beyond this point isn't worth it, unless you have a special opponent; but then it's about dignity, not business.

8、 Caught one, now what

There are some divergent ideas too. For example, could we put SQL injection payloads in the response? After all, they struck first. But legal never gave a concrete answer on this, and it wasn't easy to explain the question to her either, so for now it remains a thought experiment.

1. Technical suppression

As Dota AI players know, there is a -de command: each time the AI is killed, its experience multiplier increases. Kill the AI too many times early on, and it comes back fully equipped and unkillable.

The right approach is to suppress the opponent's level without killing them outright. Anti-crawling is the same: don't go too hard at the start and force the other side into an all-out fight with you.

2. Psychological warfare

Provocation, pity, mockery, trolling.

I won't spell these out; you get the spirit.

3. Going easy on purpose

This may be the highest level.

Programmers don't have it easy, crawler authors least of all. Take pity on them and leave them a bite to eat. Who knows, in a few days you might switch to writing crawlers yourself, precisely because you did anti-crawling so well.