Fully Understanding the Security Design of Google's Infrastructure

Posted by tzul at 2020-02-25

Google's technical infrastructure underpins consumer services such as Search, Gmail, and Photos, as well as enterprise offerings such as G Suite and the Google Cloud Platform. It is the foundation of Google's data centers and the security bedrock of all Google network services. Based on Google's original whitepaper, FreeBuf briefly analyzes the security design of this infrastructure, which protects Google's global information systems across several layers: secure deployment of services, secure storage of end-user data, secure inter-service communication, secure communication with end users, and safe operational management. This article walks through these layers one by one, covering the physical security of Google's data centers, the hardware and software security foundations, technical constraints on services, and operational security.

Security design of the low-level infrastructure

Physical infrastructure security

Google's data centers incorporate multiple layers of physical security, including biometric identification, metal detection, video surveillance, vehicle barriers, and laser-based intrusion detection, with strictly limited access. Because Google also hosts some servers in third-party data centers, it deploys these Google-controlled measures on top of the protections provided by the facility operator, ensuring that physical security remains fully under Google's control.

Hardware design deployment

Google's data center network consists of many thousands of servers whose motherboards and network equipment are custom-designed by Google. The relevant components and their suppliers must pass strict security vetting and background review. Google has also designed its own security chips, deployed widely across servers and peripherals, which provide an effective means of identifying and authenticating legitimate Google hardware.

Secure boot and server identity

To ensure that servers boot the correct software stack, Google uses a chain of trust covering low-level components such as the BIOS, bootloader, kernel, and base operating system image, each of which carries a cryptographic signature that is validated on every boot and update. All of these components are built, controlled, and hardened by Google. As its hardware evolves, Google keeps strengthening this chain: depending on the server generation, the root of trust for secure boot may reside in a lockable firmware chip, a security microcontroller, or a Google-designed security chip.

Each server in the data center also has its own unique identity, tied to the hardware root of trust and the software it booted with; this identity is used to authenticate API calls to low-level management services on the machine. Google has additionally built automated systems that keep software and hardware up to date, detect and diagnose problems, and, where necessary, automatically remove failed machines from service.
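
The boot chain described above can be sketched in a few lines. This is a deliberately simplified toy: it uses an HMAC with a hard-coded key as a stand-in for the asymmetric signatures real firmware verifies, and all component names and key values are illustrative, not Google's actual design.

```python
import hmac, hashlib

# Hypothetical root key; real hardware verifies asymmetric signatures rooted
# in a lockable firmware chip, security microcontroller, or security chip.
ROOT_KEY = b"simulated-hardware-root-of-trust"

def sign(image: bytes) -> bytes:
    """Stand-in for the cryptographic signature attached to each boot component."""
    return hmac.new(ROOT_KEY, image, hashlib.sha256).digest()

def verify_boot_chain(chain) -> bool:
    """Check every stage's signature before handing control to it; halt on failure."""
    for name, image, signature in chain:
        if not hmac.compare_digest(sign(image), signature):
            raise RuntimeError("boot halted: bad signature on " + name)
    return True

# BIOS -> bootloader -> kernel -> base system image, each signed at build time.
stages = [(name, image, sign(image)) for name, image in [
    ("BIOS", b"bios-v2"), ("bootloader", b"bl-v7"),
    ("kernel", b"linux-5.x"), ("system image", b"prod-img")]]
print(verify_boot_chain(stages))  # True
```

The key property is that a single tampered stage anywhere in the chain halts the boot, which is what makes the verified machine identity trustworthy afterwards.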

Secure service deployment

This section covers basic security for services running on the infrastructure. Thousands of servers handle and back these service requests, which range from Gmail's SMTP service and distributed data storage to YouTube video transcoding and sandboxed customer applications. Everything that runs on the infrastructure is scheduled by a cluster orchestration service called Borg.

Service identity, integrity, and isolation

For application-layer communication between internal services, Google relies on cryptographic authentication and authorization, which provide strong access control at the level of management and services. Although Google does not rely primarily on internal network segmentation or firewalls as its main security mechanisms, it does apply ingress and egress filtering at various points in the network to prevent attacks such as IP spoofing, while preserving network performance and availability as far as possible.

Each service running on the infrastructure has an associated service account identity, and it presents the corresponding cryptographic credentials when creating a service or issuing an RPC request. These identities are used both for authenticating inter-service communication and for restricting which data and methods specific clients may access.
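
The idea of a service presenting a checkable identity credential with each RPC can be sketched as follows. This is a toy assuming a shared HMAC key; Google's real mechanism uses certificate-style credentials issued by internal infrastructure, and the service names here are made up.

```python
import hmac, hashlib

# Hypothetical infrastructure signing key; in practice an internal issuing
# authority signs credentials rather than sharing a symmetric key.
INFRA_KEY = b"simulated-infrastructure-signing-key"

def issue_service_credential(service_account_id: str) -> dict:
    """Mint a credential binding a service account identity to a checkable signature."""
    sig = hmac.new(INFRA_KEY, service_account_id.encode(), hashlib.sha256).hexdigest()
    return {"id": service_account_id, "sig": sig}

def verify_service_credential(cred: dict) -> str:
    """Callee side: authenticate the RPC caller before doing any work."""
    expected = hmac.new(INFRA_KEY, cred["id"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, cred["sig"]):
        raise PermissionError("unauthenticated RPC caller")
    return cred["id"]

cred = issue_service_credential("gmail-frontend")
print(verify_service_credential(cred))  # gmail-frontend
```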

Google's source code is stored in a central repository where both current and past versions can be audited. The infrastructure can also be configured to require that a service's code pass designated security audit, verification, and source inspection procedures. Code reviews require inspection and approval from at least one engineer other than the author, and modifications to a system's code must additionally be approved by that system's administrators. These mandatory requirements limit the ability of an insider or attacker to make malicious changes to source code and provide a traceable forensic trail from a running service back to its source.

To protect other services running on the same machine, Google applies a range of sandboxing and isolation techniques, including ordinary user separation, language-based and kernel-based sandboxes, and hardware virtualization. In general, riskier workloads receive more layers of isolation, for example converting complex user-supplied file formats, or running user-supplied code in products such as Google App Engine and Google Compute Engine. As an extra security boundary, especially sensitive services, such as the cluster orchestration service and the key management service, run exclusively on dedicated machines.

Inter-service access management

Every service can use the access management features of the infrastructure to specify exactly which other services may communicate with it. For example, a service can be configured to accept API calls only from a specific whitelist of other services, or only from whitelisted service account identities. The infrastructure then enforces these restrictions automatically.
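
A minimal sketch of such automatic enforcement follows; the allow-list contents and service names are invented for illustration, and in the real infrastructure this check happens inside the RPC layer rather than in application code.

```python
# Hypothetical per-service allow-list; the infrastructure enforces it automatically.
SERVICE_ACLS = {
    "contacts-service": {"gmail-frontend", "calendar-frontend"},
}

def dispatch_rpc(target: str, caller_id: str, method: str) -> str:
    """Reject the call before dispatch unless the caller is whitelisted for the target."""
    if caller_id not in SERVICE_ACLS.get(target, set()):
        raise PermissionError(caller_id + " may not call " + target + "." + method)
    return target + "." + method + " handled for " + caller_id

print(dispatch_rpc("contacts-service", "gmail-frontend", "ListContacts"))
```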

Google engineers accessing services are likewise issued identities, so services can be configured to allow or deny their access. All identity types (machine, service, and employee) live in a global namespace maintained by the infrastructure.

For internal identities, Google provides a rich set of identity management workflows, including approval chains, logging, and notifications. For example, identities can be assigned to access control groups via a system that lets one engineer propose a change and requires another to approve it. This system scales securely to the thousands of services running on the infrastructure. In addition to automatic API-level access control, the infrastructure also provides access control lists (ACLs) and group-database services so that services can implement custom, fine-grained access control where necessary.

Encryption of inter-service communication

Beyond the RPC authentication and authorization capabilities described above, the infrastructure also provides cryptographic privacy and integrity for RPC data on the network. Because other application-layer protocols such as HTTP are encapsulated inside this internal RPC mechanism, they benefit from the same protections. In essence, this provides application-layer isolation and removes any security dependency on the network path: even if the network is tapped or a network device is compromised, encrypted service communication keeps the data confidential and trustworthy.

Services can configure the level of cryptographic protection for each RPC; for example, low-value data exchanged inside a data center may be configured for integrity protection only. To counter sophisticated adversaries who may tap internal WAN links, the infrastructure automatically encrypts all RPC traffic that travels between data centers, without any additional configuration by services. Google has also begun deploying hardware cryptographic accelerators that will let it extend this default encryption to all RPC traffic inside its data centers.
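
The distinction between "integrity only" and "encrypted" protection levels can be sketched as below. This is a toy: the SHA-256 counter keystream stands in for a real authenticated cipher, and the session key is hard-coded for illustration, whereas real systems negotiate one per connection.

```python
import hmac, hashlib
from enum import Enum

class Protection(Enum):
    INTEGRITY = 1   # cheaper: tamper detection only, payload stays in the clear
    ENCRYPTED = 2   # confidentiality plus integrity

# Hypothetical per-channel session key; real systems negotiate one per connection.
SESSION_KEY = b"simulated-per-connection-session-key"

def _keystream(nonce: bytes, length: int) -> bytes:
    """Toy SHA-256 counter keystream; a stand-in for a real AEAD cipher."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(SESSION_KEY + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def protect(payload: bytes, level: Protection, nonce: bytes) -> dict:
    body = payload
    if level is Protection.ENCRYPTED:
        body = bytes(a ^ b for a, b in zip(payload, _keystream(nonce, len(payload))))
    tag = hmac.new(SESSION_KEY, nonce + body, hashlib.sha256).digest()
    return {"level": level, "nonce": nonce, "body": body, "tag": tag}

def unprotect(msg: dict) -> bytes:
    tag = hmac.new(SESSION_KEY, msg["nonce"] + msg["body"], hashlib.sha256).digest()
    if not hmac.compare_digest(tag, msg["tag"]):
        raise ValueError("RPC payload failed integrity check")
    body = msg["body"]
    if msg["level"] is Protection.ENCRYPTED:
        body = bytes(a ^ b for a, b in zip(body, _keystream(msg["nonce"], len(body))))
    return body

msg = protect(b"low-value metrics", Protection.INTEGRITY, nonce=b"n1")
assert msg["body"] == b"low-value metrics"  # readable on the wire, but tamper-evident
```

Either level detects tampering; only the encrypted level also hides the payload from an eavesdropper, which is why it is the default across WAN links.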

End-user data access management

A typical Google service is written to do something for an end user. For example, the Gmail service interacts with the infrastructure on the user's behalf, such as calling the Contacts service API to access the end user's address book.

As described in the previous sections, the Contacts service can be configured to accept RPC requests only from designated services such as Gmail. This, however, is still a very coarse-grained control: within that permission, the Gmail service would be able to request any user's contacts at any time. Since the Gmail service makes RPC requests to the Contacts service on behalf of a particular end user, the infrastructure lets Gmail attach an "end-user permission ticket" to the RPC. This ticket proves the identity of the specific end user, allowing the Contacts service to return data only for the user named in the ticket.

To issue these end-user permission tickets, Google runs a central user identity service. After an end user logs in, the identity service authenticates them by various means, such as the user's password, cookies, or OAuth tokens, and every subsequent request from the client to Google must also carry such authentication.

When a service receives an end-user credential, it passes the credential to the central identity service for verification. If it checks out, the identity service returns a short-lived end-user permission ticket that can be used for RPCs related to that user's request. In our example, the Gmail service obtains such a ticket and passes it to the Contacts service as part of the RPC call; the ticket then authorizes requests made on behalf of that particular end user.
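
A short-lived, signed end-user ticket can be sketched like this. The signing key, ticket format, and 300-second lifetime are all invented for illustration; the point is only that the ticket names a specific user, is verifiable by the downstream service, and expires quickly.

```python
import base64, hashlib, hmac, json, time

# Hypothetical signing key held by the central identity service.
IDENTITY_KEY = b"simulated-central-identity-service-key"
TICKET_TTL_SECONDS = 300   # made-up value: tickets are short-lived by design

def issue_ticket(user: str, now: float = None) -> str:
    """Issued only after the central service has verified the user's credential."""
    now = time.time() if now is None else now
    claims = base64.urlsafe_b64encode(
        json.dumps({"user": user, "exp": now + TICKET_TTL_SECONDS}).encode())
    sig = base64.urlsafe_b64encode(
        hmac.new(IDENTITY_KEY, claims, hashlib.sha256).digest())
    return (claims + b"." + sig).decode()

def verify_ticket(ticket: str, now: float = None) -> str:
    """Called by a downstream service (e.g. Contacts) before serving user data."""
    now = time.time() if now is None else now
    claims_b64, sig_b64 = ticket.encode().split(b".")
    expected = base64.urlsafe_b64encode(
        hmac.new(IDENTITY_KEY, claims_b64, hashlib.sha256).digest())
    if not hmac.compare_digest(expected, sig_b64):
        raise PermissionError("forged end-user permission ticket")
    claims = json.loads(base64.urlsafe_b64decode(claims_b64))
    if claims["exp"] < now:
        raise PermissionError("end-user permission ticket expired")
    return claims["user"]

ticket = issue_ticket("alice", now=1000.0)
print(verify_ticket(ticket, now=1100.0))  # alice
```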

Secure data storage

Encryption at rest

The infrastructure runs a variety of storage services, such as Bigtable and Spanner, along with a central key management service. Most applications access physical storage indirectly through these storage services, which can be configured to encrypt data using keys from the central key management service before it is ever written to a physical device. The key management service supports automatic key rotation and provides extensive audit logging and per-user identity verification.
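
The key-rotation bookkeeping can be sketched as follows. This toy records a key version with each stored chunk so that old data stays readable after rotation; the XOR "cipher" is purely illustrative and the class names are invented, while real storage uses proper authenticated encryption behind a hardened key management service.

```python
import os

class CentralKMS:
    """Toy versioned keystore; the real system is a hardened central service."""
    def __init__(self):
        self._keys = {1: os.urandom(32)}
        self.current_version = 1

    def rotate(self):
        """Automatic rotation: new writes pick up the new key version."""
        self.current_version += 1
        self._keys[self.current_version] = os.urandom(32)

    def key(self, version: int) -> bytes:
        return self._keys[version]

def write_chunk(kms: CentralKMS, data: bytes) -> dict:
    v = kms.current_version
    key = kms.key(v)
    # Toy XOR "cipher" purely for illustration; production storage uses real AEAD.
    ciphertext = bytes(b ^ key[i % 32] for i, b in enumerate(data))
    return {"key_version": v, "ciphertext": ciphertext}

def read_chunk(kms: CentralKMS, record: dict) -> bytes:
    key = kms.key(record["key_version"])  # old versions stay readable after rotation
    return bytes(b ^ key[i % 32] for i, b in enumerate(record["ciphertext"]))

kms = CentralKMS()
old = write_chunk(kms, b"mailbox chunk")
kms.rotate()
new = write_chunk(kms, b"another chunk")
assert read_chunk(kms, old) == b"mailbox chunk" and new["key_version"] == 2
```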

Spanner is a scalable, multi version, global distributed, synchronous replication database developed by Google. It is the first system to distribute data around the world and support external consistent distributed transactions.

Performing encryption at the application layer lets the infrastructure isolate itself from threats at lower storage layers, such as malicious disk firmware, while still acting as one more layer in a defense-in-depth design. Google also enables hardware encryption on its mechanical and solid-state drives and tracks each drive through its lifecycle. Before an encrypted storage device can leave Google's custody for replacement or disposal, it must pass a multi-step wiping process that includes two independent verifications; devices that cannot complete this process are physically destroyed.

Data deletion

At Google, deleting data usually begins not with outright removal but with marking the data as "scheduled for deletion." This makes it possible to recover from deletions triggered unintentionally, whether by a customer or by an internal bug or process error. Data marked for deletion is then removed according to each service's specific policy. When an end user deletes their account, the infrastructure notifies the services handling that user's data to purge it. After a Google account and its Gmail data are deleted, the system removes all data associated with the account, and the account can no longer be used with Google's services.
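
The mark-then-purge flow above can be sketched in a few lines. The 30-day grace period and service names are invented for illustration; the point is that a pending mark is recoverable while a completed purge is not.

```python
import time

# Hypothetical retention policy: deleted Gmail data is purged after a 30-day grace period.
RETENTION_SECONDS = {"gmail": 30 * 86400}

pending_deletions = {}

def mark_deleted(service: str, account: str, now: float = None) -> None:
    """Step 1: mark data 'scheduled for deletion' rather than removing it outright."""
    now = time.time() if now is None else now
    pending_deletions[(service, account)] = now + RETENTION_SECONDS[service]

def restore(service: str, account: str) -> bool:
    """Accidental deletions can be undone while the mark is still pending."""
    return pending_deletions.pop((service, account), None) is not None

def purge(now: float = None):
    """Step 2: permanently remove data whose grace period has elapsed."""
    now = time.time() if now is None else now
    expired = [k for k, due in pending_deletions.items() if due <= now]
    for k in expired:
        del pending_deletions[k]
    return expired

mark_deleted("gmail", "alice", now=0)
print(purge(now=86400))        # [] -- still inside the grace period
print(purge(now=30 * 86400))   # [('gmail', 'alice')] -- gone for good
```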

Security design of network communication

This section describes how Google secures communication over the Internet. As mentioned earlier, the infrastructure consists of a large number of physical machines interconnected across LANs and WANs. Internally it uses a private IP address space, which makes it easier to defend against attacks such as denial of service (DoS), since only a subset of machines is exposed directly to external Internet traffic.

Google Front End service

A service inside the infrastructure that wants to be reachable on the Internet registers itself with the Google Front End (GFE) service. GFE ensures that all TLS connections are terminated with correct certificates and follow best-practice security policies, and it also applies protections against DoS attacks. GFE forwards requests onward using the RPC security protocol described earlier. In effect, any internal service that publishes itself externally uses GFE as a smart reverse-proxy front end, which provides public DNS and IP hosting, DoS protection, and TLS termination. Like any other service, GFE instances run on the infrastructure and scale to handle very large request volumes.

DoS attack defense

The sheer scale of Google's infrastructure lets it simply absorb many DoS attacks, and Google additionally maintains multiple tiers of cascading DoS protection to reduce the risk to any service behind a GFE. When Google's backbone delivers an external connection to a data center, the connection passes through several layers of hardware and software load balancers, which report statistics about incoming traffic in real time to a central DoS monitoring service. When the monitoring service detects a DoS attack, it instructs the load balancers to drop or throttle the suspicious traffic immediately.

At the next layer, GFE instances also report information about the requests they receive to the central DoS monitoring service, including application-layer details that the network-layer load balancers cannot see. If an attack is detected there, the monitoring service can likewise instruct the GFE instances to drop or throttle the offending traffic.
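
The drop-or-throttle decision a load balancer applies to a flagged source is commonly implemented as a token bucket, sketched below. The rate and capacity numbers are made up, and this is a generic rate-limiting technique rather than a description of Google's actual implementation.

```python
import time

class TokenBucket:
    """Per-source throttle of the kind a load balancer can apply when the central
    DoS monitor flags a source; rate and capacity values here are made up."""
    def __init__(self, rate: float, capacity: float, now: float = None):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic() if now is None else now

    def allow(self, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        # Refill tokens for the time elapsed, capped at burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False   # instruct the balancer to drop or throttle this request

bucket = TokenBucket(rate=5, capacity=5, now=0.0)
decisions = [bucket.allow(now=0.0) for _ in range(8)]
print(decisions.count(True))  # 5: the burst is served, the excess is shed
```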

User authentication

After DoS protection, the next line of defense is Google's central identity service, which users normally encounter as the Google login page. Beyond asking for a username and password, the service also intelligently challenges users based on risk signals, such as whether they have logged in from the same device or a similar location before. Once authenticated, the identity service issues credentials such as cookies and OAuth tokens that can be used for subsequent requests.

Users can also strengthen their logins with second factors such as one-time passwords (OTPs) or phishing-resistant security keys. In addition, Google worked with the FIDO Alliance on the Universal 2nd Factor (U2F) open standard, and users can buy U2F-compatible USB security keys such as the YubiKey for more secure sign-in.

Operational security

Secure software development

Beyond the security mechanisms described so far, Google provides libraries that prevent developers from introducing certain classes of security bugs in the first place; for example, for web applications it maintains libraries and frameworks that eliminate XSS vulnerabilities by construction. Google also runs a large arsenal of automated tools for finding security defects, including fuzzers, static analysis tools, and web security scanners. As a final check, Google performs manual security reviews of code, ranging from quick triage for simple defects to in-depth risk discovery, carried out by a team of experts in web security, cryptography, and operating system security.
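
To make the fuzzing idea concrete, here is a minimal random fuzz loop over a toy parser. Both the parser and the harness are invented for illustration; real fuzzers such as coverage-guided tools are far more sophisticated, but the principle of hammering a parser with random inputs and treating any unexpected exception as a bug is the same.

```python
import random

def parse_record(data: bytes) -> bytes:
    """Toy length-prefixed parser standing in for code under test."""
    if len(data) < 1:
        raise ValueError("empty record")
    n = data[0]
    if len(data) - 1 < n:
        raise ValueError("truncated payload")
    return data[1:1 + n]

def fuzz(iterations: int = 1000, seed: int = 0) -> bool:
    """Throw random inputs at the parser; ValueError counts as graceful rejection,
    while any other exception would surface here as a crash worth investigating."""
    rng = random.Random(seed)
    for _ in range(iterations):
        blob = bytes(rng.randrange(256) for _ in range(rng.randrange(16)))
        try:
            parse_record(blob)
        except ValueError:
            pass  # expected rejection of malformed input
    return True

print(fuzz())  # True
```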

Google also runs a Vulnerability Reward Program that pays for bugs found in its applications and infrastructure; so far, Google has paid out millions of dollars under this program. It invests heavily in finding 0-day exploits and other security issues in both the software it uses and its upstream dependencies. The Heartbleed vulnerability, discovered by a Google engineer, is the best-known example; Google's security team has also been the most active submitter of CVEs, and has found and fixed vulnerabilities in the Linux KVM hypervisor.

Employee device and credential security

Google invests heavily in protecting its employees' devices and credentials from compromise, theft, and other illegitimate insider activity; this is a critical part of keeping the infrastructure operating safely. Sophisticated phishing campaigns have targeted Google employees for years; to counter them, Google has required employees to replace phishable OTP second factors with U2F-compatible USB security keys.

Google has also invested heavily in monitoring the client devices employees use to operate the infrastructure, and it runs security scans over the programs those clients install and download, their browser extensions, and the content they access.

Being on the corporate LAN does not by itself grant access privileges at Google. Google uses application-level access management controls that expose internal applications only to specific users coming from correctly managed devices and from expected networks and geographic locations. (See BeyondCorp for details.)
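
The BeyondCorp-style decision, authorizing on user and device state rather than network location, can be sketched as follows. The device inventory, policy table, and group names are all invented for illustration.

```python
# Hypothetical device inventory and per-app policy; names are illustrative only.
MANAGED_DEVICES = {
    "laptop-123": {"disk_encrypted": True, "patched": True},
    "laptop-456": {"disk_encrypted": True, "patched": False},
}
APP_POLICY = {"internal-tools": "eng"}   # app -> required employee group

def authorize(user_groups: set, device_id: str, app: str) -> bool:
    """Grant access based on user AND device state, never on network location."""
    device = MANAGED_DEVICES.get(device_id)
    if device is None or not all(device.values()):
        return False   # unmanaged or out-of-policy device
    return APP_POLICY.get(app) in user_groups

print(authorize({"eng"}, "laptop-123", "internal-tools"))  # True
print(authorize({"eng"}, "laptop-456", "internal-tools"))  # False: unpatched device
```

Note that nothing in the check asks which network the request arrived from; a request from the corporate LAN on an unmanaged laptop is still denied.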

Internal risk elimination and control

Google strictly limits the number of employees with administrative access to the infrastructure and actively monitors their activity. Furthermore, for many specific tasks it works to eliminate the need for privileged access altogether, accomplishing them instead in automated, safe ways; this includes requiring two-party approval for certain sensitive actions and introducing limited APIs that allow debugging without exposing sensitive information. Google employees' access to end-user information is logged through low-level infrastructure hooks, and the security team monitors all access patterns in real time, investigating anomalous or suspicious events.

Intrusion detection

Google operates mature data processing pipelines that integrate host-based, network-based, and service-based intrusion detection signals. Rules and detection logic built into these pipelines alert operational security engineers to possible incidents in a timely manner, and Google's incident response team is on call around the clock. Google's internal operations teams also conduct regular Red Team exercises to measure and improve the effectiveness of its detection and response mechanisms.

Security design of the Google Cloud Platform (GCP)

Here, we take the Google Compute Engine (GCE) service as an example to briefly describe the security design and enhancements of the Google Cloud Platform (GCP).

Google Compute Engine (GCE) is Google's IaaS (Infrastructure as a Service) product. It lets customers run Linux virtual machines on Google's servers and tap into large-scale computing capacity. At its I/O conference, Google claimed the Compute Engine service is more cost-effective than competing products, delivering 50% more compute per dollar. Behind the service sits a very large fleet of Linux virtual machines; at launch, Google quoted a figure of 771,886 processor cores backing the service.

GCE enables customers to run their own Linux virtual machines on Google's infrastructure for powerful data computing. The GCE implementation consists of several logical components, most notably the management control plane and the virtual machines themselves. The control plane exposes the external API surface and orchestrates tasks such as virtual machine creation and migration; because it runs as a variety of services on the infrastructure, it automatically benefits from foundational integrity features such as secure boot.

Because the GCE control plane exposes its API through GFE, it inherits the same DoS protection and SSL/TLS connection security as any GFE-fronted service. Customers can likewise choose to put their own virtual machine workloads behind the Google Cloud Load Balancer, which is built on GFE and can mitigate many types of DoS attack. End-user authentication to the GCE control-plane API goes through Google's centralized identity service, which provides extra protections such as hijacking detection; authorization is handled by the central Cloud IAM service.

Identity and Access Management (IAM): IAM lets users grant permissions on Google Cloud resources according to predefined IAM roles, for example allowing another user to act on all resources in a project as an Owner, Editor, or Viewer.
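
A role-based check of this kind can be sketched as below. The primitive role names (Owner, Editor, Viewer) come from the text, but the permission sets, project name, and member emails are invented simplifications, not IAM's real permission model.

```python
# Hypothetical simplification of IAM: the primitive role names are real,
# but the permission sets below are invented for illustration.
ROLE_PERMISSIONS = {
    "viewer": {"get", "list"},
    "editor": {"get", "list", "create", "update"},
    "owner":  {"get", "list", "create", "update", "delete", "setIamPolicy"},
}

project_bindings = {
    "demo-project": {"alice@example.com": "owner", "bob@example.com": "viewer"},
}

def check_permission(project: str, member: str, permission: str) -> bool:
    """Resolve the member's role on the project, then test the permission."""
    role = project_bindings.get(project, {}).get(member)
    return role is not None and permission in ROLE_PERMISSIONS[role]

print(check_permission("demo-project", "bob@example.com", "list"))    # True
print(check_permission("demo-project", "bob@example.com", "delete"))  # False
```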

Network traffic within the control plane, and from the control plane through GFE to other services, is automatically authenticated and encrypted whenever it moves between data centers. Each virtual machine (VM) runs alongside an associated virtual machine manager (VMM) service instance, and the infrastructure gives these two entities separate identities: one for calls the VMM makes on its own behalf, and one for calls it makes on behalf of the customer's VM. This separation makes it possible to further limit the trust placed in calls coming from the VMM.

GCE persistent disks are encrypted at rest with keys protected by Google's central key management system, which supports automatic key rotation and system auditing. VM isolation is built on the open-source KVM stack with hardware-assisted virtualization, and Google has subjected the core KVM code to extensive security testing, including fuzzing, static analysis, and manual review; as noted above, Google has recently reported and disclosed a number of KVM vulnerabilities.

Finally, Google's operational security controls remain a key part of ensuring that data access follows policy. As part of the Google Cloud Platform, GCE's handling of customer data follows the GCP use policy: Google will not access or use customer data, except as necessary to provide services to customers.

*Reference source: compiled by FreeBuf editor clouds; reprinted from the original source.