New Explorations in Data Security Protection for Internet Companies

Posted by fierce at 2020-02-27

In recent years, the data security situation has grown increasingly serious, with data security incidents emerging one after another. Under these circumstances, Internet companies have basically reached a consensus: although attacks cannot be completely prevented, the bottom line is that sensitive data must not be leaked. In other words, a server being compromised is an acceptable loss, but sensitive data being dragged away is not, because a leak of sensitive data does significant reputational and economic damage to the company.

In the data security field, both the data security life cycle proposed by traditional theory and the solutions offered by security vendors face difficulties in practical deployment. The core problem is poor operability in the face of massive data volumes and complex application environments.

For example, the data security life cycle proposes that data should first be classified and graded, and only then protected. But Internet companies mostly grow wildly first and discover data security problems only after they have scaled up. By then a large stock of legacy data has already accumulated, and tens of thousands of new data tables are being created every day. How can classification and grading be achieved under these conditions? Manual sorting is obviously unrealistic: it can never catch up with the rate of data growth.

Likewise, the data audit solutions provided by security vendors are hardware appliances built for traditional relational databases. What is the audit scheme for Hadoop? Facing massive data volumes, many companies simply cannot afford to buy that many hardware boxes.

Internet companies therefore urgently need approaches that fit their own characteristics. To that end, the information security center of Meituan has carried out some concrete explorations. Mapped to the IT level, these mainly cover application systems and the data warehouse, discussed separately below.

1. Application systems

Application-system security has two parts. The first is defending against external attacks. Most companies have security awareness, but awareness does not equal capability, and this defense is the basic skill of a responsible enterprise. The traditional problems include broken access control, enumeration and traversal, SQL injection, insecure configuration, and outdated vulnerable versions, all covered in the OWASP Top 10 risks. In practice they are addressed mainly through SDL, security operations and maintenance, red-blue team exercises and similar means, with solutions delivered in product form. They are not covered in detail here.

1.1 Credential stuffing and crawlers

Under the new situation, account cracking and crawling have also become problems. Account cracking means credential stuffing or weak-password guessing: credential stuffing probes with account/password pairs leaked elsewhere and, on success, steals the user's data and funds; weak-password attacks simply guess trivial passwords. The industry keeps exploring countermeasures, including device fingerprinting, complex CAPTCHAs, human-machine identification, and IP reputation, trying to mitigate the problem through multiple channels. But the black market keeps upgrading its side of the arms race too, with one-click "new device" tools, emulators, IP proxies, and imitation of human behavior, so this is a constant confrontation.
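One of the simplest signals in that mitigation stack is a per-IP failed-login rate check. The sketch below is illustrative only: the window size and threshold are assumptions, and a real system would combine this with device fingerprinting, CAPTCHAs, and IP reputation rather than rely on it alone.

```python
import time
from collections import defaultdict, deque

class StuffingDetector:
    """Flag IPs whose failed-login rate inside a sliding window is too high."""

    def __init__(self, window_seconds=60, max_failures=10):
        self.window = window_seconds
        self.max_failures = max_failures
        self.failures = defaultdict(deque)  # ip -> timestamps of failed logins

    def record_failure(self, ip, now=None):
        now = time.time() if now is None else now
        q = self.failures[ip]
        q.append(now)
        # Drop events that have fallen out of the sliding window.
        while q and now - q[0] > self.window:
            q.popleft()

    def is_suspicious(self, ip):
        return len(self.failures[ip]) > self.max_failures
```

A rule this simple is trivially evaded with IP proxies, which is exactly why the text describes it as one channel among many in a continuing arms race.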

For example, some companies examine changes in the accelerometer and other sensors when a user logs in, because tapping a phone screen inevitably produces changes in angle and gravity readings. If these sensors show no change while the user is tapping, a script is probably in use. Adding another dimension, such as the phone's recent battery-level changes, can help confirm whether this is a real person's phone or a fraud workshop's phone. In practice, however, once attackers discovered that a company used this kind of strategy, it was easily defeated: all of this sensor data can be forged, and a large number of such tools are openly for sale on e-commerce marketplaces.

Crawler confrontation is another new problem. An earlier article reported that for some companies, more than 75% of data-access traffic comes from crawlers. Crawlers bring no business value, yet they consume significant resources and create data-leakage risk.

After the rise of Internet finance, crawling changed again: from unauthorized crawling to "authorized" crawling. For example, Xiao Zhang is short of money and applies for a small loan on an Internet finance company's website. The company does not know whether Xiao Zhang is creditworthy or able to repay, so it asks him to hand over the account passwords for his shopping site, email, or other applications, and crawls his daily consumption data as input to a credit score. To obtain the loan, Xiao Zhang provides the passwords, which constitutes authorized crawling. This is a big change from unauthorized crawling: the finance company gains access to far more sensitive information, which not only increases the resource burden on the crawled sites but may also lead to leakage of user passwords.

Fighting crawlers is likewise a comprehensive topic with no single technical solution for everything. Besides device fingerprinting, IP reputation, and the other means mentioned earlier, solutions include machine learning models that distinguish normal from abnormal behavior, and association models. But it too is an arms race: the black market gradually probes the defenses and learns to simulate human behavior. In the future this will be machine-versus-machine confrontation, and cost will decide who wins.

1.2 Watermarking

In recent years the industry has seen incidents in which sensitive internal documents and screenshots were leaked. Some caused media hype and public-opinion damage to the company involved, so there must be a way to trace the source of such outbound behavior. To achieve robustness, watermarking techniques include spatial filtering, Fourier transforms, resistance to geometric deformation, and so on. Put simply, watermarking is a technology that transforms identifying information into the carrier so that it can be recovered even under poor conditions.
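A minimal way to see the embed/recover idea is an invisible text watermark: a user ID encoded as zero-width characters and attached to a document, so a leaked copy identifies who received it. This sketch is far weaker than the robust techniques named above (it does not survive screenshots or retyping); the encoding scheme here is an illustrative choice, not a production design.

```python
ZW0, ZW1 = "\u200b", "\u200c"  # zero-width space / zero-width non-joiner

def embed(text, user_id):
    """Append user_id as an invisible bit string of zero-width characters."""
    bits = "".join(f"{ord(c):08b}" for c in user_id)
    mark = "".join(ZW0 if b == "0" else ZW1 for b in bits)
    return text + mark  # a real scheme would scatter the mark through the text

def extract(text):
    """Recover the embedded ID by collecting zero-width characters."""
    bits = "".join("0" if c == ZW0 else "1" for c in text if c in (ZW0, ZW1))
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))
```

The document looks unchanged to a reader, while `extract` recovers the ID from any copy-pasted excerpt that retains the marked span.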

1.3 Data honeypot

A data honeypot is a fake data set planted to capture visitors and reveal attack behavior. Foreign vendors have built corresponding products, which can be roughly understood as adding a "Trojan horse" to a data file: whenever the file is opened, a record of the visitor is sent back to the server, and through this "Trojan" the attacker's details can be traced. We have done something similar; unfortunately, the file sat there for a long time and nobody accessed it. That lack of visitors is related to how we position honeypots: at this stage we prefer to treat them as an experimental gadget rather than adopt them at scale, because the "Trojan horse" itself may carry certain risks.

1.4 Big data behavior audit

The emergence of big data makes richer auditing possible: abnormal behavior can be uncovered by correlating multiple data sources. Traditional security-audit vendors have made some attempts here, but objectively these remain fairly basic and cannot handle the behavior auditing of large Internet companies in complex environments. This is not a criticism of those vendors; it is a matter of business, and business pursues profit. It simply means Internet companies have to do more themselves.

For example, to catch insiders, multiple data sources can be correlated using rules such as "shares a device with a known-bad account". Extending the idea, more insider-detection rules that fit a company's own data can be derived along the major directions of information flow, logistics, and capital flow.
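The shared-device rule can be sketched directly. Field names and inputs here are illustrative; in practice the join would run over warehouse tables and cover many more dimensions than device ID.

```python
from collections import defaultdict

def flag_by_shared_device(logins, bad_users):
    """Return accounts that share at least one login device with a
    known-bad account.

    logins: iterable of (user, device_id) pairs
    bad_users: set of accounts already marked bad
    """
    device_users = defaultdict(set)
    for user, device in logins:
        device_users[device].add(user)

    suspects = set()
    for users in device_users.values():
        if users & bad_users:           # a bad account used this device
            suspects |= users - bad_users
    return suspects
```

Accounts flagged this way are leads for investigation, not verdicts, which is why the text frames this as one rule among many derived from the company's own data.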

In addition, anomalies can be found through UEBA (User and Entity Behavior Analytics), which requires instrumenting every link to collect event data, with a rule engine, data platform, and algorithm platform on the back end to support it.

Take a common clustering setup: people whose behavior does not match the majority's may be anomalous. A concrete scenario: normal user behavior is to open the page, browse products, then log in and place an order; abnormal behavior might be to log in first, then change the password, and finally pick a new store and spend a large coupon. Each data field here can yield multiple derived variables, and from those variables an anomaly judgment can ultimately be made.
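Before any clustering, event sequences are usually reduced to simple derived flags. The toy function below hand-codes the two patterns from the example above; the event names and rules are illustrative assumptions, standing in for the many variables a real model would derive.

```python
def session_flags(events):
    """Derive anomaly flags from an ordered list of session event names."""
    flags = set()
    # Normal flow browses before logging in; skipping straight to login
    # suggests a scripted or hijacked session.
    if "login" in events and "browse" not in events[:events.index("login")]:
        flags.add("login_before_browsing")
    # Password change immediately monetized with a coupon is the classic
    # account-takeover pattern from the text.
    if "change_password" in events and "use_coupon" in events:
        if events.index("change_password") < events.index("use_coupon"):
            flags.add("password_change_then_coupon")
    return flags
```

In a real pipeline these boolean flags, along with counts, timings, and amounts, become the feature vector that a clustering or scoring model consumes.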

Another example is the association model. A criminal gang is usually connected through shared dimensions: IP, device, WiFi MAC address, GPS location, shipping address, capital flow, and so on. Combined with other data, a group can be linked together; once one member of the gang is marked black, the rest of the relationship circle is scored and downgraded according to the strength of each relationship.
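The score-and-downgrade step can be sketched as risk propagating from a marked account in proportion to how many identifiers are shared. The per-identifier weight of 0.3 is an illustrative assumption; real models would weight each dimension differently and propagate over a full graph.

```python
def risk_scores(profiles, bad_account, weight=0.3):
    """Score accounts by overlap with a black-marked account's identifiers.

    profiles: {account: set of identifier values (device, MAC, address, ...)}
    """
    bad_ids = profiles[bad_account]
    scores = {}
    for account, ids in profiles.items():
        if account == bad_account:
            scores[account] = 1.0  # the marked account itself
        else:
            # More shared identifiers -> stronger relationship -> higher risk.
            scores[account] = min(1.0, weight * len(ids & bad_ids))
    return scores
```

Accounts scoring above some threshold would then be downgraded (e.g. stronger verification, withheld coupons) rather than blocked outright.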

UEBA depends on sufficient data. The data can come from external suppliers: Tencent and Alibaba, for example, both offer external data services, including IP-reputation judgments, and using them achieves a kind of joint defense. It can also come from inside: an Internet company usually has several business lines serving the same customer. Which data can actually be put to use depends on the data sensitivity of the security personnel.

1.5 Data desensitization

Application systems always hold a great deal of sensitive user data. Desensitization is split between external and internal systems: external desensitization mainly defends against credential stuffing and crawlers, while internal desensitization mainly prevents insiders from leaking information.

External desensitization can be layered. By default, key information such as bank card numbers, ID numbers, mobile numbers, and addresses is forcibly masked, replacing the key positions with ****, so that even a successful credential-stuffing attack or crawler cannot obtain the full values. But some customers legitimately need to view or modify their own complete information, and this calls for layered protection based mainly on trusted-device judgment: on a familiar device, one unobstructed click reveals the value; on an unfamiliar device, strong verification is pushed first.
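The forced-masking rules are simple substitutions. The helpers below assume mainland-China formats (11-digit mobile numbers, 18-character ID numbers) and are illustrative, not an exhaustive rule set.

```python
import re

def mask_phone(phone):
    """Keep the 3-digit prefix and 4-digit suffix; mask the middle."""
    return re.sub(r"^(\d{3})\d{4}(\d{4})$", r"\1****\2", phone)

def mask_id_card(id_card):
    """Keep the 6-digit region code and 4-char suffix; hide the birth
    date and sequence digits in between."""
    return re.sub(r"^(.{6}).{8}(.{4})$", r"\1********\2", id_card)
```

Values that do not match the expected format pass through unchanged here; a production system would instead treat a non-matching value in a sensitive field as a signal worth investigating.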

Meituan's daily business adds another wrinkle: the delivery rider must be able to reach the buyer, for instance when the rider cannot find the exact location, which requires at least the address and mobile number. We explored protections for both. For the phone number we use a virtual-number relay: the rider gets a temporary forwarding number to contact the buyer, and the real number stays invisible. For the address, the system displays it as an image, and once the order is completed the address is no longer visible.

Internal desensitization proceeds in several steps. The first is detecting sensitive information in internal systems, which can be obtained either from logs or from front-end JS. Each approach has trade-offs. The log approach depends on the company having a unified log standard; otherwise every system has its own logs, and integration is long and heavy. The front-end JS approach is lighter, but its performance impact on the business must be considered.

The purpose of detection is to continuously track changes in sensitive information, because internal systems in a complex environment are constantly being upgraded. Without continuous monitoring, desensitization becomes a one-off campaign with no guarantee of sustainability.

After detection comes desensitization itself. The process requires clear communication with the business side about which fields must be fully masked and which only partially. Where the application's permission system is well standardized, role-based desensitization can be considered. For example, risk-control case handlers genuinely need a user's complete bank card number, so their role can be granted immunity; customer-service staff do not need the complete value, so for them it is forcibly masked. Between immunity and full masking sits a third layer, semi-desensitization: when necessary, the user can click to view the complete number, and the click is recorded.

Desensitization as a whole needs a global view: how many pieces of sensitive user information are accessed each day, how many are masked, and why the rest are not. This makes change trackable end to end, with the goal of continuously driving down the access rate of sensitive information. When the view fluctuates abnormally, it means the business has changed, and the cause needs to be tracked down.

2. Data warehouse

The data warehouse is the core of the company's data; a problem here means enormous risk. Managing a data warehouse is a long-term, incremental construction process in which the security link is only a small part, most of it being data governance. This article focuses on tool building for the security link: data desensitization, privacy protection, big data behavior audit, the asset map, and the database scanner.

2.1 Data desensitization

Data warehouse desensitization means deforming sensitive data in order to protect it, mainly so that data analysts and developers can explore unknown data safely. In practice there are several forms, including obfuscating or replacing the data so it remains usable without exposing the real values. But obfuscation and replacement carry real cost: at the data volumes of a large Internet company they are very expensive. The common practical approach is to mask a relatively simple part, for example displaying a mobile number as 139****0011. The rule is simple and still provides a useful degree of protection.

In some scenarios, however, simple masking cannot meet business requirements, and other means are needed: tokenization for card numbers, bucketing for range data, case diversity, even Base64 masking for images. Different services therefore have to be provided for different scenarios, as a trade-off among cost, efficiency, and usability.
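Tokenization can be sketched as a keyed hash that derives a stable, card-shaped token, plus a vault that maps tokens back for the few roles needing originals. The key, the digit-shaping trick, and the in-memory vault are illustrative assumptions, not a PCI-grade design.

```python
import hmac
import hashlib

class Tokenizer:
    """Deterministic tokenization sketch for card numbers (PANs)."""

    def __init__(self, key: bytes):
        self.key = key
        self.vault = {}  # token -> original; a real vault is a secured store

    def tokenize(self, pan: str) -> str:
        digest = hmac.new(self.key, pan.encode(), hashlib.sha256).hexdigest()
        # Map hex characters to digits so the token keeps a card-like shape.
        token = "".join(str(int(c, 16) % 10) for c in digest[:len(pan)])
        self.vault[token] = pan
        return token

    def detokenize(self, token: str) -> str:
        return self.vault[token]
```

Because the mapping is deterministic per key, joins and deduplication still work on tokenized columns, which is the main reason tokenization beats plain masking in analytical scenarios.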

The relationship between the original table and the desensitized table also needs consideration. One copy of the original data must exist; on top of it, copying out a separate desensitized table and applying visual desensitization to the original are two schemes with different costs. Copying out a desensitized table is the more thorough way, but it means every sensitive table is duplicated, which is a storage-cost issue. Visual desensitization, on the other hand, dynamically masks data at presentation time through rules, achieving the effect at lower cost but with the possibility of being bypassed.

2.2 Privacy protection

In the privacy-protection field, several methods have been proposed, including k-anonymity, edge anonymity, and differential privacy. For example, some companies strip sensitive information from a data set, release it, and run algorithm competitions on it; they must then consider how different data, once aggregated, can still be linked back to an individual. In production, Google's DLP API is the main industry application seen so far, but it is complex to use and aimed at single scenarios. The key difficulty of privacy protection is large-scale engineering. Against the backdrop of the big data era these are still new topics, and no complete method yet solves all the adversarial problems of privacy protection.
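Of the notions named above, k-anonymity is the easiest to check mechanically: every combination of quasi-identifiers must occur at least k times, otherwise rare combinations can single out an individual even with direct identifiers removed. A minimal checker, with illustrative column names:

```python
from collections import Counter

def violates_k_anonymity(rows, quasi_ids, k):
    """Return the quasi-identifier combinations occurring fewer than k times.

    rows: list of dicts (records); quasi_ids: column names to check.
    An empty result means the data set satisfies k-anonymity for these columns.
    """
    counts = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return [combo for combo, n in counts.items() if n < k]
```

In practice the violating groups are then generalized (coarser zip codes, age ranges) or suppressed until the check passes, which is the engineering loop that is hard to run at warehouse scale.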

2.3 Big data asset map

The asset map is a platform for analyzing and visualizing the data assets of the big data platform. The most common need: department A applies for department B's data, and as the data's owner, department B naturally wants to know how department A will use it and whether it will flow on to anyone else. An asset map is needed to track the flow and usage of data assets. From another angle, the security department needs to know what highly sensitive data assets exist on the platform, how they are used, and who holds which permissions. A visual asset map is therefore built from metadata, data lineage, and operation logs. The map alone is not enough: it must also support timely alerting, permission revocation, and other interventions.

2.4 Database scanner

The database scanner scans the big data platform's data to discover sensitive data so that the corresponding protection mechanisms can be applied. A large Internet company may directly generate up to tens of thousands of new tables per day, from which still more tables are derived. Traditional doctrine says the first step of data security is classification and grading, but that step is very hard to carry out: with a massive stock of tables, how? Manual combing is obviously unrealistic and can never catch up with the creation rate. Automated tools are needed to tag and grade the data. The database scanner finds the basic highly sensitive data through regular expressions, covering regular fields such as mobile numbers and bank cards; irregular fields require machine learning plus manual labeling to confirm.
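The regular-expression pass can be sketched as sampling values from a column and voting on a label. The patterns below are simplified assumptions (real bank-card detection would add a Luhn check, and the 80% threshold is arbitrary); columns that match nothing would fall through to the ML-plus-manual path described above.

```python
import re

# Simplified patterns for mainland-China formats; illustrative only.
PATTERNS = {
    "mobile": re.compile(r"^1[3-9]\d{9}$"),
    "bank_card": re.compile(r"^\d{16,19}$"),
    "id_card": re.compile(r"^\d{17}[\dXx]$"),
}

def classify_column(samples, threshold=0.8):
    """Label a column by the fraction of sampled values matching a pattern."""
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in samples if pattern.match(v))
        if samples and hits / len(samples) >= threshold:
            return label
    return None  # unrecognized: hand off to ML + manual labeling
```

Run daily over newly created tables, a scanner like this turns classification and grading from a manual campaign into a continuous pipeline, which is the point of the section.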

To sum up, data security becomes more and more important once a business reaches a certain scale. Tool building at the micro level supports it while minimizing business disruption and improving efficiency. At the macro level, beyond the data security of one's own systems, the security of partners, portfolio companies, logistics, riders, merchants, outsourcers, and other organizations also affects one's own: when the lips are gone, the teeth grow cold. Given today's uneven security levels across organizations, the mature Internet companies are called on to take more responsibility, helping partners raise their security level and building joint defenses.

About the author

About Meituan security

Most core members of the Meituan security department have many years of hands-on experience in the Internet and security fields, and many have taken part in building the security systems of large Internet companies. The team includes global security-operations talent with attack-and-defense experience at the scale of a million IDC servers, CVE "excavation experts", speakers invited to top international conferences such as Black Hat, and, of course, many excellent operations colleagues.

At present, the Meituan security department works across penetration testing, web protection, binary security, kernel security, distributed development, big data analysis, and security algorithms, as well as global compliance and privacy-policy formulation. We are building an adaptive security system for a mobile office network spanning a million-server IDC and hundreds of thousands of terminals, built on a zero-trust architecture across multiple cloud infrastructures. It covers the network layer, the virtualization/container layer, the server software layer (kernel and user space), the language VM layer (JVM / JS V8), the web application layer, and the data access layer, and builds a fully automatic security-event awareness system on "big data + machine learning", striving toward the industry's most advanced built-in security architecture and defense-in-depth system.

As Meituan develops rapidly and business complexity keeps increasing, the security team faces more opportunities and challenges. We hope to implement more security projects that represent industry best practice, provide a broad development platform for more security practitioners, and offer more opportunities for continued exploration of emerging security fields.

Recruitment Information

The Meituan security department is recruiting for many roles, including Web & binary attack and defense, backend & system development, and machine learning & algorithms. If you want to join us, please send your resume to [email protected]

For specific position information, please refer to the MTSRC homepage: