Virtualization technology

Posted by tzul at 2020-03-07

Editor's note: on November 21, 2014, the author shared some views on virtualization technology at the Alibaba Technology Club's virtualization technology exchange meeting and at the "Cloud 3.0" conference. This article is the organized and expanded version of those talks, offered here for discussion. (Warning: long read.)

Everyone is familiar with virtualization technology; most of us have used virtual machine software such as VMware and VirtualBox. Some people think that virtualization only became popular with the cloud computing trend of recent years, and that a decade ago it was just a toy for desktop users to try out other operating systems. That is not the case. As long as multiple tasks run on a computer at the same time, there is a need for task isolation. Virtualization is the technology that makes each task appear to have the whole computer to itself and isolates tasks from each other's effects. As early as the 1960s, when computers were still giants, virtualization technology had begun to develop.

IBM 7044

Ancient history: hardware virtualization

The IBM M44/44X of 1964 is considered the world's first system to support virtualization. Using special hardware and software, it could virtualize multiple then-popular IBM 7044 mainframes on a single physical machine. The virtualization method was very primitive: like a time-sharing system, each virtual IBM 7044 had exclusive use of the hardware resources during its time slice.

It is worth mentioning that this research prototype not only opened the era of virtualization technology but also introduced the important concept of "paging" (each virtual machine uses virtual addresses, which requires a layer of mapping from virtual addresses to physical addresses).

In the era before the concept of a "process" had been invented, multitasking operating systems and virtualization technology were hard to tell apart, because a "virtual machine" was simply a task. There was also no dominant architecture like Intel x86 at the time; each mainframe vendor went its own way, and compatibility with other architectures was out of the question. This kind of "task level" or "process level" virtualization lives on conceptually in today's operating system level virtualization, represented by LXC and OpenVZ.

This technology relies mainly on customized hardware to realize virtualization and is known as "hardware virtualization". In this "software-defined" era, the mainframe has declined, and most hardware-dependent virtualization has gone into the history museum as well. Most of the virtualization technologies we see today are primarily software, assisted by hardware. The classification shown in the figure below has no strict boundaries, and a single virtualization solution may combine several of the techniques in the figure, so do not take the categories too literally.

From simulation execution to binary translation

As mentioned earlier, in the mainframe era every major manufacturer had its own architecture and instruction set. Why was there no software to translate between instruction sets? After all, Cisco got its start by making various network devices and protocols compatible with each other. (There is a piece of gossip here: the couple who founded Cisco reportedly wanted to send love letters over the computer network, but the networks used all kinds of different equipment, so they invented a router compatible with multiple protocols.)

Cisco's first router

Translating between network protocols or between instruction sets is mechanical, lengthy and tedious work; it takes great talent and care to get it right and to cover every corner case. Instruction sets are especially troublesome: instruction boundaries (where one instruction starts and where it ends) are not known in advance, privileged operations (such as a reboot) may be executed, and dynamically generated data can be executed as code (in the von Neumann architecture, data and code share one linear memory space). Static translation between instruction sets, done before the binary runs, therefore cannot work well in general.

The simplest and crudest solution is "simulated execution". Allocate a large array as the memory of the "virtual machine", take the instruction set manual, write a switch statement with countless cases, determine which instruction is to be executed next, and simulate it according to the manual. This certainly works, but the efficiency is unflattering: the virtual machine is at least an order of magnitude slower than the physical one. So-called dynamically typed languages such as Python, PHP and Ruby mostly run this way after being compiled into intermediate code, which is why they are not fast. Bochs, the well-known x86 emulator, uses simulated execution. It is slow, but it has good compatibility and is less prone to security vulnerabilities.
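The "large array plus a big switch" idea can be sketched in a few lines of Python. This is only a toy: the instruction set (LOADI, ADD, SUBI, STORE, JNZ, HALT), the register count and the function name `run` are all made up for illustration and do not correspond to any real architecture or to how Bochs is implemented.

```python
# Toy simulated execution: fetch one instruction, decode it with an
# if/elif chain (Python's stand-in for a switch), and simulate its effect.
# The instruction set here is hypothetical.

def run(program, memory_size=16):
    memory = [0] * memory_size   # the "large array" acting as guest memory
    regs = [0, 0, 0, 0]          # four general-purpose guest registers
    pc = 0                       # guest program counter
    while True:
        op, a, b = program[pc]   # fetch
        pc += 1
        if op == "LOADI":        # regs[a] = immediate value b
            regs[a] = b
        elif op == "ADD":        # regs[a] += regs[b]
            regs[a] += regs[b]
        elif op == "SUBI":       # regs[a] -= immediate value b
            regs[a] -= b
        elif op == "STORE":      # memory[b] = regs[a]
            memory[b] = regs[a]
        elif op == "JNZ":        # jump to instruction b if regs[a] != 0
            if regs[a] != 0:
                pc = b
        elif op == "HALT":
            return regs, memory
        else:
            raise ValueError(f"unknown instruction {op!r}")

# Guest program: sum 5 + 4 + 3 + 2 + 1 and store the result in memory[0].
PROGRAM = [
    ("LOADI", 0, 0),   # r0 = 0 (accumulator)
    ("LOADI", 1, 5),   # r1 = 5 (counter)
    ("ADD",   0, 1),   # r0 += r1
    ("SUBI",  1, 1),   # r1 -= 1
    ("JNZ",   1, 2),   # loop back to the ADD while r1 != 0
    ("STORE", 0, 0),   # memory[0] = r0
    ("HALT",  0, 0),
]

regs, memory = run(PROGRAM)
print(memory[0])
```

Every guest instruction costs a fetch, a chain of comparisons and several Python operations on the host, which is where the order-of-magnitude slowdown comes from.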

But we should not give up on translation just because it is hard. For a loop that runs 100 million times and contains only operations like addition and subtraction, translating it into machine code and executing it directly is surely faster than simulating it. The key is to put the word "dynamic" in front of binary translation: while the program is running, the parts that can be translated are translated into machine code of the target architecture and executed directly (and cached for reuse); the parts that cannot be translated, typically privileged instructions, trap into the emulator, which simulates them and then goes on translating the code that follows. The JIT (just-in-time) compilation used today by Java, .NET, JavaScript and others follows a similar pattern.
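A minimal sketch of that translate, cache and trap loop follows. It assumes a made-up instruction set (LOADI, ADD, SUBI, JNZ, plus a "privileged" OUT and HALT), and it "translates" guest code into Python functions rather than real machine code, so it shows the control flow of dynamic binary translation and not its speed. The names `translate_block` and `run_translated` are hypothetical.

```python
# Toy dynamic binary translation. Straight-line guest code is translated
# once into a host function and cached by its starting address; privileged
# instructions are never translated and are emulated by the dispatcher.

def translate_block(program, start):
    """Translate guest instructions from `start` up to the next jump or
    privileged instruction. The result returns the next guest pc."""
    lines = ["def block(regs):"]
    pc = start
    while True:
        op, a, b = program[pc]
        if op == "LOADI":
            lines.append(f"    regs[{a}] = {b}")
        elif op == "ADD":
            lines.append(f"    regs[{a}] += regs[{b}]")
        elif op == "SUBI":
            lines.append(f"    regs[{a}] -= {b}")
        elif op == "JNZ":        # a jump ends the block
            lines.append(f"    return {b} if regs[{a}] != 0 else {pc + 1}")
            break
        else:                    # privileged: hand control back to the dispatcher
            lines.append(f"    return {pc}")
            break
        pc += 1
    namespace = {}
    exec("\n".join(lines), namespace)   # "emit" the translated code
    return namespace["block"]

def run_translated(program):
    regs, cache, output, pc = [0, 0, 0, 0], {}, [], 0
    while True:
        op, a, b = program[pc]
        if op == "OUT":          # privileged: emulate, then continue after it
            output.append(regs[a])
            pc += 1
        elif op == "HALT":
            return output, len(cache)
        else:
            if pc not in cache:  # translate on first use, reuse afterwards
                cache[pc] = translate_block(program, pc)
            pc = cache[pc](regs)

PROGRAM = [
    ("LOADI", 0, 0), ("LOADI", 1, 5),
    ("ADD", 0, 1), ("SUBI", 1, 1), ("JNZ", 1, 2),
    ("OUT", 0, 0), ("HALT", 0, 0),
]
output, blocks_translated = run_translated(PROGRAM)
print(output, blocks_translated)
```

The loop body executes several times but is translated only once; that reuse of cached translations is what lets a real translator approach native speed on hot loops.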

Some will ask: if I virtualize the same architecture, for example a 64-bit x86 system on a 64-bit x86 host, is instruction translation still needed? Yes. For example, the virtual machine may read the privileged registers GDTR, LDTR, IDTR and TR without triggering a processor exception (so the virtual machine manager cannot catch the access), yet the values of these registers need to be "forged" for the virtual machine. This requires replacing each instruction that reads such a register with a call into the virtual machine manager before it executes.

Unfortunately, in the mainframe era, Fabrice Bellard, the genius who made dynamic binary translation work well, had only just been born (1972).

Fabrice Bellard

Fabrice Bellard's QEMU (Quick Emulator) is the most popular virtualization software based on dynamic binary translation. It can emulate x86, x86_64, ARM, MIPS, SPARC, PowerPC and other processor architectures, and run operating systems written for these architectures without modification. When we enjoy audio and video, and when we smoothly run virtual machines of various architectures, we should not forget Fabrice Bellard, the creator of FFmpeg and QEMU.

Take the Intel x86 architecture we know best as an example. It has four privilege levels, 0 to 3. In general, the operating system kernel (privileged code) runs in ring 0 (the highest privilege level), while user processes (unprivileged code) run in ring 3 (the lowest privilege level). (Even if you are root, your process runs in ring 3! Only code that has entered the kernel runs in ring 0. Supreme power means great responsibility, and kernel programming comes with many restrictions, which we will not complain about here.)

With a virtual machine, the guest OS runs in ring 1 and the host operating system (VMM, virtual machine monitor) runs in ring 0. For example, when a Linux virtual machine is installed on Windows, the Windows kernel runs in ring 0, the guest Linux kernel runs in ring 1, and applications inside the Linux system run in ring 3.

When the virtual machine system tries to execute a privileged instruction, the virtual machine manager (VMM) catches it immediately (ring 0 outranks ring 1, after all), simulates the execution of that privileged instruction, and then returns to the virtual machine system. To improve the performance of system calls and interrupt handling, dynamic binary translation is sometimes used to replace these privileged instructions with calls to the VMM API before they run. If every privileged instruction is simulated seamlessly, the virtual machine system behaves just as if it were running on a physical machine and cannot tell that it is running in a virtual machine at all. (Of course, in practice there are still some flaws.)
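The trap-and-emulate cycle described above can be modelled roughly as follows. This is a sketch under loud assumptions: a Python exception stands in for the CPU fault raised when deprivileged guest code issues a privileged instruction, the mnemonics CLI, STI and HLT are used only as labels, and `guest_step`, `vmm_run` and the `vcpu` dictionary are hypothetical names.

```python
# Toy trap-and-emulate: the guest kernel runs deprivileged, so privileged
# instructions fault; the VMM catches the fault, applies the effect to the
# guest's *virtual* CPU state, and resumes the guest.

PRIVILEGED = {"CLI", "STI", "HLT"}

class Trap(Exception):
    """Stands in for the hardware exception delivered to ring 0."""
    def __init__(self, op):
        super().__init__(op)
        self.op = op

def guest_step(instr):
    """Pretend hardware executing one guest instruction outside ring 0."""
    if instr in PRIVILEGED:
        raise Trap(instr)
    # Unprivileged instructions run directly; nothing to emulate.

def vmm_run(guest_code):
    vcpu = {"interrupts_enabled": True, "halted": False}
    traps = 0
    for instr in guest_code:
        try:
            guest_step(instr)
        except Trap as trap:
            traps += 1
            if trap.op == "CLI":      # guest asked to disable interrupts
                vcpu["interrupts_enabled"] = False
            elif trap.op == "STI":    # guest asked to enable interrupts
                vcpu["interrupts_enabled"] = True
            elif trap.op == "HLT":    # guest halted its virtual CPU
                vcpu["halted"] = True
                break
    return vcpu, traps

vcpu, traps = vmm_run(["MOV", "CLI", "ADD", "HLT", "STI"])
print(vcpu, traps)
```

Only the virtual CPU's flags change; the real machine's interrupt state is never touched by the guest, which is the isolation the VMM is there to provide.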

Ask the virtual machine system and CPU for help

Although dynamic binary translation is much faster than simulated execution, it still falls well short of physical machine performance, because every privileged instruction has to take a detour through the virtual machine manager (simulated execution). There are two ways to make virtual machines fast:


Since the difficulty and the performance bottleneck of dynamic binary translation both lie in simulating all those miscellaneous privileged instructions, can we modify the kernel of the virtual machine system so that those privileged instructions are easier to handle? After all, in most cases we do not need to "hide" the virtualization layer from the virtual machine; we only need to provide the necessary isolation between virtual machines without too much performance overhead.

The prefix para- in paravirtualization means "with" or "alongside". That is, the relationship between the virtual machine system and the virtualization layer (host system) is no longer a strict master-subordinate one but one of mutual trust and cooperation; the virtualization layer has to trust the virtual machine system to a certain extent. On x86, both the virtualization layer and the guest OS of the virtual machine run in ring 0.

The kernel of the virtual machine system has to be specially modified so that privileged instructions become calls to the virtualization layer's API. In modern operating systems these architecture-dependent privileged operations are already encapsulated (for example in the arch/ directory of the Linux kernel source), so modifying the guest kernel source is simpler than binary translation.

Compared with full virtualization based on binary translation, paravirtualization sacrifices generality in exchange for performance: any operating system can run unmodified on a full virtualization platform, whereas each paravirtualized operating system kernel has to be modified by hand.

Hardware assisted virtualization

For the same function, a dedicated hardware implementation is almost always faster than a software one; this is close to a golden rule. In virtualization, what Bill cannot manage alone, Andy naturally has to help with. (Bill is Bill Gates of Microsoft; Andy is Andy Grove of Intel.)

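The idea of hardware helping software to virtualize is not new either. As early as 1974, the famous paper "Formal Requirements for Virtualizable Third Generation Architectures" (Popek and Goldberg) proposed three basic conditions for a virtualizable architecture:

Equivalence: a program running under the virtual machine manager behaves essentially the same as when running directly on the physical machine.

Efficiency: the great majority of instructions execute directly on the hardware without intervention by the virtual machine manager.

Resource control: the virtual machine manager can fully control all system resources.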

On top of the original four privilege levels, Intel added a "root mode" dedicated to the virtual machine manager (VMM). Like the fearless Monkey King who still cannot escape the Buddha's palm, the guest cannot escape root mode. Although the virtual machine system runs in ring 0, the privileged instructions it executes are still caught automatically by the CPU (triggering an exception) and trap into the virtual machine manager in root mode. After handling them, the VMM uses the VMLAUNCH or VMRESUME instruction to return to the virtual machine system.

To let paravirtualized guests replace CPU exception trapping with explicit API calls, Intel also provides a "system call" from non-root mode to root mode: the VMCALL instruction.

It looks great, doesn't it? In fact, on the first CPUs to support hardware assisted virtualization, this approach was not necessarily better than dynamic binary translation, because hardware assisted virtualization has to switch to root mode for each privileged instruction and return after handling it. This mode switch is like switching between real mode and protected mode: many registers must be initialized and state must be saved and restored, not to mention the impact on the TLB and caches. However, the engineers at Intel and AMD have not been idle. On newer CPUs, the performance overhead of hardware assisted virtualization is lower than that of dynamic binary translation.

Finally, a practical tip: most BIOSes have a switch for hardware assisted virtualization. Only when options such as Intel Virtualization Technology are enabled can virtual machine managers use hardware acceleration. So when using virtual machines, do not forget to check the BIOS.
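On a Linux host, one quick sanity check before going into the BIOS is to look for the `vmx` (Intel VT-x) or `svm` (AMD-V) flag in /proc/cpuinfo. The helper below is a small sketch; the function name `virtualization_flags` and the sample text are made up for illustration. Note the assumption: the flag tells you what the kernel reports for the CPU, and depending on the kernel version it may or may not reflect the firmware switch, so treat a missing flag as a hint rather than a verdict.

```python
# Look for hardware virtualization flags in /proc/cpuinfo-style text.
# "vmx" is Intel VT-x, "svm" is AMD-V.

def virtualization_flags(cpuinfo_text):
    """Return the subset of {"vmx", "svm"} listed on any "flags" line."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            found |= {"vmx", "svm"} & set(value.split())
    return found

# Shortened, made-up sample in the /proc/cpuinfo format.
SAMPLE = "processor\t: 0\nflags\t\t: fpu vme de pse vmx sse2\n"
print(virtualization_flags(SAMPLE))

# On a real Linux machine:
# with open("/proc/cpuinfo") as f:
#     print(virtualization_flags(f.read()))
```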

CPU virtualization is not everything

Everything said so far may give a false impression: that once CPU instructions are virtualized, everything is fine. The CPU is the brain of the computer, but the other components cannot be ignored. They include memory, hard disks, video cards, network cards and other peripherals that live alongside the CPU day and night.

Memory virtualization

When discussing the IBM M44/44X, the ancestor of virtualization, I mentioned that it introduced the concept of "paging". That is, each task (virtual machine) appears to have the memory space to itself, and the paging mechanism maps the memory addresses of different tasks to physical memory. If physical memory runs short, the operating system swaps the memory of infrequently used tasks out to external storage such as disk and loads it back when those tasks need to run (of course, this mechanism was invented later). In this way, developers do not need to think about how large physical memory is or whether the memory addresses of different tasks will conflict.

Today every computer we use has a paging mechanism. What an application (user-mode process) sees is one vast virtual memory, as if it had the whole machine to itself. The operating system sets up the mapping between the virtual memory of user-mode processes and physical memory (as shown in the VM1 box in the figure below); the MMU (memory management unit) in the CPU translates the virtual addresses in instructions into physical addresses by looking up this mapping (the so-called page table) while the user-mode program is running.

With virtual machines, things get much more complicated. To isolate virtual machines from each other, the guest operating system must not see the real physical memory directly. The red part in the figure above, the "machine memory" (MA), is managed by the virtualization layer, while the "physical memory" (PA) seen by the virtual machine is itself virtualized. This forms a two-level address mapping.

Before Intel's Nehalem architecture (thanks to Jonathan for the correction), the memory management unit (MMU) only knew how to address memory through the classic segmentation and paging mechanisms and was unaware of the virtualization layer. The virtualization layer therefore has to "compress" the two-level address mapping into a single-level mapping, as shown by the red arrow in the figure above. The approach of the virtualization layer is as follows:

Some people may worry that the extra level of mapping will slow down memory access. In fact, whether or not second level address translation (SLAT) is enabled, the translation lookaside buffer (TLB) stores the mapping from virtual address (VA) directly to machine address (MA). If the TLB hit rate is high, the extra level of translation does not significantly affect memory access performance.
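The two-level mapping and its "compression" can be illustrated with plain dictionaries. This is a simplified sketch, not how any particular hypervisor lays out its tables: page tables are flat maps from page number to page number, and the names `build_shadow_table`, `translate` and `p2m` are hypothetical.

```python
# Toy model of two-level address mapping:
#   guest page table:  virtual page (VA) -> guest "physical" page (PA)
#   VMM table (p2m):   guest "physical" page (PA) -> machine page (MA)
# The VMM composes the two into one table (VA -> MA) that the MMU can use.

PAGE_SIZE = 4096

def build_shadow_table(guest_table, p2m):
    """Compress VA -> PA and PA -> MA into a single VA -> MA mapping."""
    return {vpn: p2m[ppn] for vpn, ppn in guest_table.items() if ppn in p2m}

def translate(va, table, tlb):
    """Translate a virtual address to a machine address.
    The TLB caches VA -> MA directly, so a hit never sees both levels."""
    vpn, offset = divmod(va, PAGE_SIZE)
    if vpn not in tlb:             # TLB miss: consult the table once
        tlb[vpn] = table[vpn]
    return tlb[vpn] * PAGE_SIZE + offset

guest_table = {0: 7, 1: 3}         # set up by the guest OS
p2m = {3: 42, 7: 19}               # set up by the virtualization layer
shadow = build_shadow_table(guest_table, p2m)

tlb = {}
ma = translate(1 * PAGE_SIZE + 5, shadow, tlb)   # VA in virtual page 1
print(shadow, ma, tlb)
```

After the first access the TLB holds the final VA to MA result, which is why the cost of the extra level is paid only on misses.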

Device virtualization

Components other than the CPU and memory are collectively called peripherals. Communication between the CPU and peripherals is I/O. Each virtual machine needs a hard disk, a network connection, and perhaps a video card, mouse, keyboard, optical drive and so on. If you have used virtual machines, these settings will be familiar.

Virtual device configuration in Hyper-V

There are generally three ways of device virtualization:

Let's take the network card as an example to see how the above three device virtualization methods work:

The I/O MMU (IOMMU) is a hardware component that lets peripheral devices and virtual machines communicate directly, bypassing the virtual machine manager (VMM) and improving I/O performance. Intel calls it VT-d and AMD calls it AMD-Vi. It translates the physical addresses (PA) in device registers and DMA requests into machine addresses (MA), and delivers interrupts generated by the device to the right virtual machine.

Practical tip again: just like hardware assisted virtualization, the I/O MMU has a switch in the BIOS, so do not forget to turn it on.

Device virtualization also comes in "full virtualization" and "paravirtualization" flavors. If the driver in the virtual machine system can be modified, it can call the API of the virtual machine manager (VMM) directly and avoid the cost of emulating a hardware device. Modifying the driver inside the virtual machine to improve performance in this way counts as paravirtualization.

Installing additional drivers in the virtual machine system also enables some interaction between the host and the virtual machine (such as a shared clipboard or drag-and-drop file transfer). These drivers install hooks in the virtual machine system and call the API provided by the virtual machine manager to interact with the host system.

Install additional drivers in the virtual machine

To sum up (have a cup of tea and let it sink in):

Operating system level virtualization

Most of the time we do not need to run an arbitrary operating system in the virtual machine; we just want some degree of isolation between tasks. With the virtualization technologies described above, each virtual machine is an independent operating system with its own task scheduling, memory management, file system and device drivers, and it also runs a number of system services (flushing disk buffers, logging, scheduled tasks, the SSH server, time synchronization). All of these consume system resources (mainly memory), and the two layers of task scheduling and device drivers, in the virtual machine and in the virtual machine manager, add time overhead as well. Can virtual machines share one operating system kernel and still maintain a certain degree of isolation?

The idea is not new. To make development and testing easier, Version 7 Unix introduced the chroot mechanism in 1979. chroot makes a process treat a specified directory as its root directory, so that all of its file system operations are confined to that directory and cannot harm the host system. Although chroot has classic escape vulnerabilities and does not isolate processes, the network or other resources, it is still used today to provide a clean environment for building and testing software.
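The confinement chroot provides can be pictured with a toy path resolver. This is only a model of the idea, not the system call: real code would call `os.chroot()` (which needs root privileges), and the model ignores symlinks, the current working directory and the escape tricks mentioned above. The function name `resolve_in_jail` and the directory `/srv/jail` are hypothetical.

```python
# Toy model of chroot path resolution: every path the jailed process uses
# is interpreted relative to the jail directory, and ".." cannot climb
# above the jail's root.
import posixpath

def resolve_in_jail(jail, path):
    """Map a path as seen by the jailed process to a path on the host."""
    # Normalising against "/" first drops any ".." that would go above
    # the (new) root, just as ".." at "/" stays at "/".
    inside = posixpath.normpath(posixpath.join("/", path))
    return posixpath.normpath(jail + inside)

print(resolve_in_jail("/srv/jail", "/etc/passwd"))
print(resolve_in_jail("/srv/jail", "../../etc/passwd"))
```

Both calls end up under the jail directory, so the process never sees the host's real /etc/passwd.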

To be a real virtualization solution, file system isolation is not enough. Two other important aspects are:

This approach is called "operating system level virtualization" or "task level virtualization". Since the kernel support that Linux Containers (LXC) rely on entered the mainline as of Linux 3.8, operating system level virtualization has also been known as "containers". To distinguish them from virtualization schemes in which each virtual machine is a complete operating system, the isolated processes (process groups) are usually called "containers" rather than "virtual machines". Because there is no extra operating system kernel layer, a container is lighter and faster than a virtual machine, with lower memory and scheduling costs. More importantly, access to disks and other I/O devices does not go through a virtualization layer, so that source of performance loss is avoided.

Operating system level virtualization on Linux did not start with LXC; on the contrary, LXC is a typical case of "the waves of the Yangtze behind pushing the waves ahead", the newcomer overtaking its predecessors. Linux-VServer, OpenVZ and Parallels Containers are all solutions that implement operating system level virtualization in the Linux kernel. Although the newcomer has won more attention, the elder OpenVZ has more "enterprise level" features than LXC:

OpenVZ architecture

Docker, the good housekeeper of containers

A good horse deserves a good saddle, and containers as good as Linux's naturally need a good housekeeper. Where do system images come from? How are images version-controlled? Docker is the good housekeeper of Linux containers. It is said that even Microsoft has taken a liking to Docker and intends to develop containers that support Windows.

Docker was really born for system operations and maintenance: it greatly reduces the cost of installing and deploying software. Software installation is a hassle because:

Software conflicts with other software. For example, program A depends on glibc 2.13 while program B depends on glibc 2.14; script A requires Python 3 while script B requires Python 2; the Apache and nginx web servers both want to listen on port 80. Installing conflicting software on the same system easily leads to confusion, as with DLL hell in the early days of Windows. The solution to software conflicts is isolation: allow multiple versions to coexist on the system and provide a way to find the matching version. Let's see how Docker solves these two problems:

Package all of the software's dependencies and its runtime environment into one image, instead of using complex scripts to "install" the software into an unknown environment;

The relationship between Docker, LXC and kernel components

Docker's virtualization is based on libcontainer (the figure above is older, from the time when Docker was still based on LXC). In fact, both libcontainer and LXC are built on mechanisms provided by the Linux kernel, such as cgroups resource accounting, chroot file system isolation and namespace isolation.

How does freeshell start

We know that freeshell uses OpenVZ virtualization, and people often ask whether they can change the kernel, or where a freeshell system boots from. In fact, like a normal Linux system, freeshell starts from /sbin/init inside the virtual machine (check for yourself: is it process 1?), followed by the startup sequence of the virtual machine system itself.

One corner of freeshell control panel

In fact, different types of virtualization technologies start to boot virtual machine systems from different places:

OpenStack, a cloud operating system

Everything above concerns specific virtualization technologies. But for end users to use virtual machines, there also needs to be a platform that manages them, a so-called "cloud operating system". OpenStack is currently the hottest cloud operating system.

OpenStack cloud operating system

OpenStack (or any serious cloud operating system) virtualizes the various resources in the cloud. The component that manages computing resources is called Nova.

OpenStack control panel (Horizon)

The following two figures show Nova's architecture with Docker and with Xen as the virtualization solution, respectively.

Docker called by the OpenStack Nova compute component

Using Xen as OpenStack's compute virtualization solution


The need for task isolation gave rise to virtualization. We all want tasks to be isolated as thoroughly as possible without losing too much performance. From simulated execution, which offers the strongest isolation but is too slow to be practical, to the dynamic binary translation and hardware assisted virtualization used by modern full virtualization, to paravirtualization, which modifies the virtual machine system, and on to container based operating system level virtualization with a shared kernel, performance has always been the primary driving force behind each wave of virtualization technology. Still, when choosing a virtualization technology, we need to balance isolation against performance according to actual needs.

The long-established computing virtualization technology, together with the relatively new storage and network virtualization technologies, forms the cornerstone of the cloud operating system. As we enjoy the seemingly inexhaustible computing resources of the cloud, if we can peel back the layers of packaging and understand the essence of computing, perhaps we will cherish those resources more, admire the grandeur and delicacy of the edifice of virtualization technology, and revere the computing masters who led us into the cloud era.