Editor's note: on November 21, 2014, the author shared some views on virtualization technology at Alibaba Tech Club's virtualization technology exchange and the "Cloud 3.0" conference. This post is an edited and expanded version of that talk, offered for discussion. (Long read ahead.)
Everyone is familiar with virtualization technology; most of us have used virtual machine software such as VMware and VirtualBox. Some think virtualization only became popular with the recent wave of cloud computing, and that a decade ago it was just a toy for desktop users who wanted to try other operating systems. Not quite. As long as a computer runs multiple tasks at the same time, there is a need for task isolation, and virtualization is the technology that makes each task appear to have the whole machine to itself while isolating tasks from one another. As early as the 1960s, when computers were still giants, virtualization technology began to develop.
IBM 7044
Ancient history: hardware virtualization
The IBM M44/44X of 1964 is considered the first system to support virtualization. Using special hardware and software, it could present several virtual copies of the then-popular IBM 7044 mainframe on a single physical machine. Its virtualization method was primitive: like a time-sharing system, each virtual IBM 7044 had exclusive use of the hardware resources during its own time slice.
It is worth mentioning that this research prototype not only opened the era of virtualization but also introduced the important concept of "paging" (each virtual machine needs to use virtual addresses, which requires a layer of mapping from virtual addresses to physical addresses).
In an era when the concept of a "process" had not yet been invented, multitasking operating systems and virtualization were hard to tell apart, because a "virtual machine" simply was a task. There was also no dominant architecture like Intel x86 back then; every mainframe vendor ran its own show, incompatible with everyone else's. This kind of task-level (or process-level) virtualization survives to this day as operating-system-level virtualization, represented by LXC and OpenVZ.
This early virtualization relied mainly on custom hardware and is known as "hardware virtualization." In today's "software-defined" era, mainframes have declined and most virtualization that depends on hardware has entered the museum as well. The virtualization technologies we see today are mostly software, with hardware in a supporting role. The classification in the figure below has no strict boundaries; a single virtualization solution may use several of the techniques shown at the same time, so don't sweat the details.
From emulation to binary translation
As mentioned earlier, in the mainframe era every major vendor had its own architecture and instruction set. Why was there no translation software between instruction sets? Keep in mind that Cisco got its start by being compatible with all kinds of network devices and protocols (there is gossip here too: the couple who founded Cisco wanted to deliver love letters over the computer network, but the networks used all sorts of incompatible equipment, so they invented a router compatible with multiple protocols).
Cisco's first router
Translating between network protocols, like translating between instruction sets, is mechanical, lengthy, and tedious work; it takes real talent and care to get it right and cover every corner case. The trouble with instruction sets is that instruction boundaries (where one instruction ends and the next begins) are not known in advance, privileged operations (such as a reboot) may be executed, and dynamically generated data can be executed as code (in the von Neumann architecture, data and code share the same linear memory space). Static translation of binary code before it runs therefore cannot be done well.
The simplest, crudest solution is "emulation": open up a big array as the "virtual machine's" memory, take out the instruction-set manual, write a switch statement with countless cases, work out which instruction is to be executed next, and emulate it according to the manual. This certainly works, but the efficiency is nothing to write home about: the virtual machine runs at least an order of magnitude slower than the physical one. The so-called "dynamic languages" such as Python, PHP, and Ruby are compiled to intermediate code and then largely executed this way, which is why they are not fast. Bochs, the famous x86 emulator, also works by emulation; it is slow, but its compatibility is good and it rarely suffers security vulnerabilities.
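To make "a switch statement with countless cases" concrete, here is a minimal sketch of an emulator loop in C. The instruction set below (OP_LOAD_IMM, OP_ADD, and so on) is invented purely for illustration and does not correspond to any real architecture.

```c
/* A minimal sketch of "emulation": a toy CPU with a handful of made-up
 * opcodes, interpreted one instruction at a time by a big switch. */
#include <stdint.h>
#include <stdio.h>

enum { OP_HALT = 0, OP_LOAD_IMM = 1, OP_ADD = 2, OP_PRINT = 3 };

int main(void) {
    uint8_t mem[256] = {            /* "guest memory" holding the program */
        OP_LOAD_IMM, 0, 40,         /* r0 = 40      */
        OP_LOAD_IMM, 1, 2,          /* r1 = 2       */
        OP_ADD, 0, 1,               /* r0 = r0 + r1 */
        OP_PRINT, 0,                /* print r0     */
        OP_HALT,
    };
    uint32_t regs[4] = {0};
    size_t pc = 0;

    for (;;) {                      /* fetch-decode-execute loop */
        uint8_t op = mem[pc++];
        switch (op) {
        case OP_LOAD_IMM: { uint8_t r = mem[pc++]; regs[r] = mem[pc++]; break; }
        case OP_ADD:      { uint8_t a = mem[pc++], b = mem[pc++]; regs[a] += regs[b]; break; }
        case OP_PRINT:    { printf("%u\n", regs[mem[pc++]]); break; }
        case OP_HALT:     return 0;
        default:          fprintf(stderr, "bad opcode %u\n", op); return 1;
        }
    }
}
```

Every single guest "add" costs a fetch, a decode, and a jump through the switch, which is exactly where the order-of-magnitude slowdown comes from.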
You can't give up eating for fear of choking: instructions being hard to translate is no reason not to translate them. A for loop that runs 100 million times, if it only does things like addition and subtraction, will surely run faster translated into machine code and executed directly than under emulation. The word "dynamic" is put in front of "binary translation" because the translation happens while the program runs: the parts that can be translated are translated into machine code for the target architecture and executed directly (and cached for reuse), while the parts that cannot, such as privileged instructions, trap into the emulator, after which translation continues with the code that follows. The JIT (just-in-time) compilation used today by Java, .NET, JavaScript, and others follows a similar pattern.
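As a flavor of what "translate into machine code and execute directly" means, here is a toy just-in-time code generator, a hedged sketch rather than QEMU's actual implementation: it emits x86-64 machine code that adds a constant to its argument into an executable buffer and then calls it directly. It assumes a little-endian System V AMD64 host (e.g. Linux).

```c
/* Toy JIT: generate "lea eax, [rdi + imm]; ret" at run time and call it. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

typedef int (*add_fn)(int);

static add_fn emit_add_const(uint8_t *buf, int32_t imm) {
    /* lea eax, [rdi + imm32] ; ret   --> returns (arg + imm) */
    uint8_t code[] = { 0x8D, 0x87, 0, 0, 0, 0, 0xC3 };
    memcpy(code + 2, &imm, sizeof imm);    /* patch the 32-bit immediate */
    memcpy(buf, code, sizeof code);
    return (add_fn)buf;
}

int main(void) {
    /* Executable buffer that will hold the generated code. */
    uint8_t *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    add_fn add42 = emit_add_const(buf, 42);
    printf("%d\n", add42(100));            /* prints 142 */

    munmap(buf, 4096);
    return 0;
}
```

A real dynamic binary translator does the same kind of thing at much larger scale: it translates whole basic blocks of guest code, caches them keyed by the guest program counter, and falls back to emulation when it meets an instruction it cannot translate.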
Some will ask: if I am only virtualizing the same architecture, say a 64-bit x86 system on a 64-bit x86 host, do I still need instruction translation? Yes. For example, the guest may read privileged state such as the descriptor-table registers (GDTR, LDTR, IDTR) and the task register (TR) without triggering a processor exception, so the virtual machine monitor cannot catch such behavior, yet the values of these registers have to be "forged" for the guest. This requires replacing the instructions that read these registers with calls into the virtual machine monitor before they execute.
Unfortunately, back in the mainframe era, Fabrice Bellard, a genius capable of doing dynamic binary translation well, had only just been born (in 1972).
Fabrice Bellard
Fabrice Bellard's QEMU (Quick EMUlator) is the most popular virtualization software based on dynamic binary translation. It can emulate x86, x86-64, ARM, MIPS, SPARC, PowerPC, and other processor architectures, and run unmodified operating systems built for them. When we enjoy music and video, and when we smoothly run virtual machines of all kinds of architectures, we should not forget Fabrice Bellard, creator of both FFmpeg and QEMU.
Take the Intel x86 architecture we know best as an example. It has four privilege levels, ring 0 through ring 3. Normally the operating system kernel (privileged code) runs in ring 0, the highest privilege level, while user processes (unprivileged code) run in ring 3, the lowest. (Even if you are root, your processes are in ring 3! Code runs in ring 0 only after entering the kernel. Supreme power comes with great responsibility: kernel programming is hemmed in by restrictions that people love to complain about.)
With a virtual machine in the picture, the guest OS runs in ring 1, while the host side, the virtual machine monitor (VMM), runs in ring 0. For example, if you install a Linux virtual machine on Windows, the Windows kernel runs in ring 0, the virtualized Linux kernel runs in ring 1, and the applications inside the Linux guest run in ring 3.
When the guest needs to execute a privileged instruction, the virtual machine monitor (VMM) catches it at once (ring 0 outranks ring 1, after all!), emulates that privileged instruction, and then returns control to the guest. To speed up system calls and interrupt handling, dynamic binary translation is sometimes used to replace these privileged instructions, before they run, with calls into the virtual machine monitor's API. If every privileged instruction is emulated seamlessly, the guest behaves exactly as it would on a physical machine and cannot tell that it is running inside a virtual machine. (Of course, in practice there are still a few tells.)
Asking the guest OS and the CPU for help
Although dynamic binary translation is much faster than pure emulation, there is still a big gap to physical-machine performance, because every privileged instruction has to take a detour through the virtual machine monitor (emulated execution). There are two ways to make virtual machines faster:
- Let the guest operating system help: so-called "paravirtualization," or OS-assisted virtualization;
- Let the CPU help: hardware-assisted virtualization.
The two approaches are not mutually exclusive. Many modern virtualization solutions, such as Xen and VMware, use both at the same time.
Paravirtualization
Since the difficulty, and the performance bottleneck, of dynamic binary translation lies in emulating all those miscellaneous privileged instructions, can we modify the guest kernel so that those privileged instructions behave better? After all, in most cases we do not need to "hide" the existence of the virtualization layer from the guest; we only need the necessary isolation between virtual machines without paying too much performance overhead.
The prefix para- in "paravirtualization" means "with" or "alongside." In other words, the guest system and the virtualization layer (the host system) no longer keep each other strictly at arm's length but trust and cooperate with each other, and the virtualization layer has to trust the guest to a certain extent. On x86, both the virtualization layer and the guest OS run in ring 0.
The guest kernel needs special modification, replacing privileged instructions with calls to the virtualization layer's API. In modern operating systems these architecture-specific privileged operations are already encapsulated (for example under the arch/ directory of the Linux kernel source), so compared with binary translation, modifying the guest kernel source is relatively simple.
Compared with full virtualization based on binary translation, paravirtualization trades generality for performance: any operating system can run unmodified on a full-virtualization platform, whereas every paravirtualized operating system needs its kernel modified by hand.
Hardware assisted virtualization
For the same functionality, a dedicated hardware implementation is almost always faster than a software one; this is close to a golden rule. When virtualization is something Bill can't quite manage, he naturally asks Andy for help. (Bill being Bill Gates of Microsoft, Andy being Andy Grove of Intel.)
The idea of hardware helping software virtualize is not new either. As early as 1974, the famous paper "Formal Requirements for Virtualizable Third Generation Architectures" proposed three basic conditions for a virtualizable architecture:
- Equivalence: the virtual machine monitor provides a virtual environment essentially identical to the real machine;
- Efficiency: in the worst case, programs running in the virtual machine are not much slower than on the physical machine;
- Resource control: the virtual machine monitor fully controls all system resources; a program in a virtual machine cannot access resources not allocated to it, and when necessary the monitor can reclaim resources it has already allocated to a virtual machine.
The early x86 instruction set did not satisfy these conditions, which is why the virtual machine monitor had to resort to complex dynamic binary translation. Since the main overhead of binary translation is "trapping" the guest's privileged instructions, the most useful thing the CPU can do is take over that trapping. Intel's solution, built around "virtual machine control structures," is VT-x, launched in the winter of 2005; AMD, unwilling to be left behind, soon followed with the similar AMD-V, built around "virtual machine control blocks."
On top of the original four privilege levels, Intel added a "root mode" dedicated to the virtual machine monitor (VMM); like the Monkey King who cannot leap out of the Buddha's palm, the guest can never escape root mode. Although the guest system still runs in ring 0 (of non-root mode), the privileged instructions it executes are automatically caught by the CPU (they raise an exception) and trap into the virtual machine monitor in root mode; after handling them, the monitor returns to the guest with the vmlaunch or vmresume instruction.
To make it easier for paravirtualization to replace exception trapping with explicit API calls, Intel also provides a "system call" from non-root mode into root mode: the vmcall instruction.
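For a sense of what such a call looks like from the guest side, here is a hedged sketch of a hypercall wrapper built on vmcall. The register convention shown (call number in rax, arguments in rbx and rcx, result back in rax) follows the KVM hypercall ABI as an example; other hypervisors define their own conventions, and this code only makes sense inside a guest kernel, since executing vmcall from an ordinary user process simply faults.

```c
/* Hedged sketch of a paravirtual hypercall wrapper using vmcall.
 * Register convention modeled on the KVM hypercall ABI; other
 * hypervisors differ. Guest-kernel context only. */
static inline long hypercall2(unsigned long nr,
                              unsigned long arg1, unsigned long arg2)
{
    long ret;
    asm volatile("vmcall"
                 : "=a"(ret)                      /* result comes back in rax */
                 : "a"(nr), "b"(arg1), "c"(arg2)  /* nr in rax, args in rbx, rcx */
                 : "memory");
    return ret;
}
```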
It sounds great, doesn't it? In fact, on the CPUs that first supported hardware-assisted virtualization, this approach was not necessarily faster than dynamic binary translation, because every privileged instruction forces a switch into root mode and back. This mode switch is a bit like switching between real mode and protected mode: many registers have to be initialized, state has to be saved and restored, not to mention the impact on the TLB and caches. But Intel's and AMD's engineers have not been idle, and on newer CPUs hardware-assisted virtualization now outperforms dynamic binary translation.
Finally, something practical: most BIOSes have a switch for hardware-assisted virtualization, and only when options such as "Intel Virtualization Technology" are enabled can the virtual machine monitor use hardware acceleration. So before using virtual machines, don't forget to check the BIOS.
CPU virtualization is not everything
Everything said so far may leave a false impression: virtualize the CPU's instructions and all is well. The CPU is the computer's brain, but the other components cannot be ignored either: the memory, hard disks, graphics cards, network cards, and other peripherals that live alongside the CPU day and night.
Memory virtualization
When discussing the ancestor of virtualization, the IBM M44/44X, I mentioned that it introduced the concept of "paging": each task (virtual machine) appears to have the memory space all to itself, and the paging mechanism maps the memory addresses of different tasks onto physical memory. If physical memory runs short, the operating system swaps the memory of rarely used tasks out to disk or other external storage and loads it back when those tasks need to run again (that part of the mechanism was invented later, of course). Developers therefore never need to worry about how big physical memory is, or whether the addresses of different tasks will collide.
Every computer we use today has paging. An application (a user-space process) sees a vast virtual address space, as if it had the whole machine to itself. The operating system sets up the mapping between each user process's virtual memory and physical memory (as in the VM1 box in the figure below), and the MMU (memory management unit) in the CPU translates the virtual addresses in instructions into physical addresses at run time by consulting that mapping (the so-called page tables).
With virtual machines, things get a lot messier. To keep virtual machines isolated, a guest operating system must not see physical memory directly. The red part in the figure above, "machine memory" (MA), is managed by the virtualization layer, while the "physical memory" (PA) the guest sees is itself virtualized, which creates a two-level address mapping.
Before Intel's Nehalem architecture (thanks to Jonathan for the correction), the MMU only knew how to translate addresses according to the classic segmentation and paging mechanisms and knew nothing about the virtualization layer. The virtualization layer therefore had to "flatten" the two levels of mapping into one, as shown by the red arrows in the figure above. It works roughly as follows:
- When switching to a virtual machine, the shadow page table that the virtualization layer maintains for that guest is installed as the physical machine's page table;
- If the guest operating system tries to modify the CR3 register that points to the page table (for example when switching page tables between processes), the access is intercepted and redirected to the shadow page table (via dynamic binary translation, or, in hardware-assisted schemes, by trapping into the virtualization layer when the exception fires);
- If the guest operating system tries to modify the contents of a page table, for example mapping a new physical page (PA) into a process's virtual memory (VA), the virtualization layer intercepts it, allocates space in machine memory (MA) for that physical address (PA), and maps the virtual address (VA) to the machine address (MA) in the shadow page table;
- If the guest accesses a page that has been swapped out to external storage, a page fault is raised, and the virtualization layer is responsible for delivering that page fault to the guest operating system.
Detouring through the virtualization layer for every page-table operation is no small overhead. As part of hardware-assisted virtualization, Intel introduced EPT (Extended Page Tables) starting with the Nehalem architecture, and AMD introduced NPT (Nested Page Tables), giving the memory management unit (MMU) second-level address translation (SLAT): the virtualization layer gets a second set of page tables of its own. The hardware first walks the original page table to translate a virtual address (VA) into a physical address (PA), then walks the new table to translate the PA into a machine address (MA). Shadow page tables are no longer needed.
Some may worry that the extra level of mapping slows memory access down. In fact, whether or not second-level address translation (SLAT) is enabled, the TLB (translation lookaside buffer) caches mappings straight from virtual addresses (VA) to machine addresses (MA); as long as the TLB hit rate is high, the added level of translation does not noticeably hurt memory access performance.
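To make the VA to PA to MA chain concrete, here is a deliberately simplified model in C: each "page table" is just a flat array of frame numbers indexed by page number, with 4 KiB pages. Real x86-64 paging uses four-level tree-structured tables walked by hardware; the point here is only the nesting of the two translations.

```c
/* Simplified model of two-level (nested) address translation. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NPAGES     16

/* guest page table: guest virtual page -> guest "physical" frame */
static uint64_t guest_pt[NPAGES]  = { [0] = 3, [1] = 7, [2] = 5 };
/* nested (EPT/NPT-like) table: guest physical frame -> machine frame */
static uint64_t nested_pt[NPAGES] = { [3] = 9, [5] = 2, [7] = 11 };

static uint64_t translate(uint64_t va) {
    uint64_t off = va & (PAGE_SIZE - 1);
    uint64_t pa  = (guest_pt[va >> PAGE_SHIFT]  << PAGE_SHIFT) | off;  /* VA -> PA */
    uint64_t ma  = (nested_pt[pa >> PAGE_SHIFT] << PAGE_SHIFT) | off;  /* PA -> MA */
    return ma;
}

int main(void) {
    uint64_t va = (1u << PAGE_SHIFT) + 0x123;   /* page 1, offset 0x123 */
    printf("VA 0x%llx -> MA 0x%llx\n",
           (unsigned long long)va, (unsigned long long)translate(va));
    return 0;  /* page 1 -> frame 7 -> machine frame 11, so MA = 0xB123 */
}
```

The TLB short-circuits exactly this chain: once a VA has been resolved to its MA, the combined mapping is cached and neither table needs to be walked again until the entry is evicted.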
Device virtualization
Components other than the CPU and memory are collectively called peripherals, and the communication between the CPU and peripherals is I/O. Each virtual machine needs its own hard disk, network, and even graphics card, mouse, keyboard, and optical drive. If you have used virtual machines, these configuration items should look familiar.
Virtual device configuration in Hyper-V
There are generally three ways of device virtualization:
- Virtual devices, shared use
- Direct allocation, exclusive access
- Virtual multiple "small devices" with the help of physical devices
Let's take the network card as an example to see how the above three device virtualization methods work:
- The most classic approach: emulate, for each virtual machine, a network card unrelated to the physical device; when the guest accesses it, execution traps into the virtual machine monitor (VMM). This virtual NIC has two ends, A and B: end A sits in the guest and end B in the virtual machine monitor (the host). Packets sent from end A come out at end B, and vice versa. Inside the monitor, a virtual switch forwards traffic between each guest's B end and the physical NIC. Obviously, this software-only virtual switch becomes the performance bottleneck of the system.
- The most brute-force approach: assign a real physical NIC to each virtual machine and map the NIC's PCIe address space into the guest. When the guest accesses that PCIe address space, a CPU that supports I/O virtualization uses its IOMMU to translate the physical address (PA) into the machine address (MA); a CPU that does not will raise an exception and trap into the virtual machine monitor, which translates the address in software and issues the real PCIe request. With CPU support for I/O virtualization, the guest's access to the NIC bypasses the virtualization layer entirely, so performance is high, but every virtual machine needs a physical NIC of its own: a game for the wealthy.
- The most fashionable approach: the physical NIC itself supports virtualization and can expose multiple virtual functions (VFs). The virtualization manager assigns each virtual function to a virtual machine, and with a dedicated driver the guest sees an independent PCIe device. When the guest accesses the NIC, addresses are again translated directly by the CPU's IOMMU without passing through the virtualization layer. A simple switch inside the physical NIC forwards traffic between the virtual functions and the cable plugged in outside. A single NIC can reach high performance this way, but it takes a high-end NIC, a CPU that supports I/O virtualization, and a customized driver in the guest system.
The IOMMU is the hardware component that lets peripherals and virtual machines communicate directly, bypassing the virtual machine monitor (VMM) and improving I/O performance. Intel calls it VT-d; AMD calls it AMD-Vi. It translates the physical addresses (PA) in device registers and DMA requests into machine addresses (MA) and delivers the interrupts a device generates to the right virtual machine.
A practical tip again: like hardware-assisted virtualization, the IOMMU has a switch in the BIOS; don't forget to turn it on.
Device virtualization also comes in "full" and "para" flavors. If the drivers inside the guest can be modified, they can call the virtual machine monitor's (VMM's) API directly and skip the cost of emulating real hardware. Modifying guest drivers to improve performance in this way counts as paravirtualization.
By installing additional drivers in the guest system, some host-guest interaction can also be provided (such as a shared clipboard, drag and drop, and file transfer). These drivers hook into the guest system and call APIs provided by the virtualization manager to complete the interaction with the host.
Install additional drivers in the virtual machine
To sum up, you can pour a cup of tea and let it sink in:
- By how the virtualization layer is implemented: hardware virtualization versus software virtualization; hardware virtualization has left the stage of history;
- By whether the guest operating system must be modified: full virtualization versus paravirtualization;
- By how privileged instructions are handled: emulation (slow), binary translation (QEMU), and hardware-assisted virtualization (KVM and the like).
Operating system level virtualization
Much of the time we do not want to run an entire operating system inside the virtual machine; we just want some degree of isolation between tasks. In the virtualization technologies above, each virtual machine is an independent operating system with its own task scheduling, memory management, file system, device drivers, and so on, plus a fair number of system services (flushing disk buffers, logging, scheduled jobs, an SSH server, time synchronization). All of this consumes resources (mainly memory), and the two layers of task scheduling and device drivers, one in the guest and one in the virtual machine monitor, add time overhead as well. Could virtual machines share the operating system kernel while keeping a reasonable degree of isolation?
As it happens, different streams flow into the same channel. To make development and testing easier, the Seventh Edition of UNIX introduced the chroot mechanism in 1979. Chroot makes a process treat a specified directory as its root directory, so that all of its file system operations stay inside that directory and cannot harm the host system. Although chroot has classic escape vulnerabilities and does not isolate processes, the network, or other resources, it is still used today as a clean environment for building and testing software.
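Here is a minimal sketch of using chroot(2) in that "clean build environment" spirit. The jail directory /srv/jail is a hypothetical path; chroot needs root privileges, and the chdir("/") right after it is what defeats the classic escape trick of keeping a working directory outside the new root.

```c
/* Confine a process's view of the file system with chroot(2). */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    if (chroot("/srv/jail") != 0) {      /* make /srv/jail the new root */
        perror("chroot");
        return EXIT_FAILURE;
    }
    if (chdir("/") != 0) {               /* move inside the new root */
        perror("chdir");
        return EXIT_FAILURE;
    }
    /* From here on, "/" refers to /srv/jail on the host; exec a shell
     * or a build script that can only see files inside the jail. */
    execl("/bin/sh", "sh", (char *)NULL);
    perror("execl");                     /* only reached if exec fails */
    return EXIT_FAILURE;
}
```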
To become a real virtualization solution, file system isolation alone is not enough. Two other important pieces are:
- Namespace isolation of processes, the network, IPC (inter-process communication), users, and so on. Inside, the "virtual machine" sees only its own processes, can use only its own virtual network interfaces, its inter-process communication does not leak outside, and its UIDs/GIDs are independent of those outside.
- Resource limiting and accounting. A runaway program inside must not be able to hog all of the physical machine's CPU, memory, disk, and other resources, so the resources a virtual machine uses have to be measurable and limitable.
These two things are what the BSD and Linux communities have been building since the turn of the century. On Linux, the isolation is provided by namespaces: when creating a process, new namespaces are requested through flags to the clone system call (a minimal sketch follows below). Resource limiting and accounting are handled by cgroups, whose interface is exposed through a virtual file system (cgroupfs, typically mounted under /sys/fs/cgroup).
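Here is a minimal sketch of the namespace side: clone(2) with a few CLONE_NEW* flags puts the child into new PID, UTS, and mount namespaces, so inside it sees itself as PID 1 and can change its hostname without affecting the host. It must run as root (or with the right capabilities); resource limits would be applied separately through cgroups.

```c
/* A tiny "container": a child process in fresh PID, UTS, and mount namespaces. */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];    /* stack for the cloned child */

static int child_main(void *arg) {
    (void)arg;
    sethostname("container", 9);         /* only visible in the new UTS ns */
    printf("inside: pid=%ld\n", (long)getpid());   /* prints pid=1 */
    execl("/bin/sh", "sh", (char *)NULL);          /* run a shell inside */
    perror("execl");
    return 1;
}

int main(void) {
    int flags = CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS | SIGCHLD;
    pid_t pid = clone(child_main, child_stack + sizeof(child_stack),
                      flags, NULL);
    if (pid == -1) { perror("clone"); return EXIT_FAILURE; }
    waitpid(pid, NULL, 0);               /* wait for the "container" */
    return EXIT_SUCCESS;
}
```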
It is called "operating system level virtualization" or "task level virtualization". Since Linux containers (LxC) has been included into the kernel mainline since Linux version 3.8, operating system level virtualization is also called "container". In order to distinguish from the virtualization scheme that virtual machine is a complete operating system, the processes (process groups) that are executed in isolation are often not called "virtual machines", but "containers". Because there is no extra layer of operating system kernel, the container is lighter and faster than the virtual machine, and the memory and scheduling costs are also smaller. More importantly, access to disk and other I/O devices do not need to go through the virtualization layer, and there is no performance loss.
The operating system level virtualization on Linux does not start from LxC, on the contrary, LxC is a typical example of "the Yangtze River waves behind push the waves ahead". Linux vserver, OpenVZ and parallel containers are all solutions to realize operating system level virtualization in the Linux kernel. Although the younger generation has won more attention, as an elder, OpenVZ has more "enterprise level" functions than LxC:
- Per-container disk quotas can be accounted and enforced, implemented through directory-level disk space accounting (this is also the main reason freeshell uses OpenVZ rather than LXC);
- It supports checkpointing and live migration;
- Inside an OpenVZ container, commands such as free -m and df -h report the container's own memory and disk quota, whereas inside LXC they show the physical machine's capacity;
- It supports swap space, and when OOM (out of memory) occurs, processes are killed just as on a physical machine, whereas LXC simply refuses to allocate more memory.
However, because OpenVZ sticks to the RHEL line, and RHEL 6 still ships the old 2.6.32 kernel while OpenVZ has not yet caught up with the freshly released RHEL 7, the OpenVZ kernel now looks rather ancient: even newer versions of systemd will not run on it, let alone all the shiny new features of the 3.x kernels.
OpenVZ architecture
Docker, a good housekeeper for containers
A good horse deserves a good saddle, and something as good as Linux containers naturally needs a good housekeeper too. Where do system images come from? How are images version-controlled? Docker is that good housekeeper for Linux containers. It is said that even mighty Microsoft has shown Docker its favor and intends to support Windows containers for developers.
Docker really grew out of systems operations: it greatly reduces the cost of installing and deploying software. Installing software is a hassle for two reasons:
- Software has dependencies. On Linux, for example, programs depend on the standard C library glibc, the cryptography library OpenSSL, or a Java runtime; on Windows they depend on the .NET Framework or Flash Player. If every piece of software bundled all of its dependencies it would be far too bloated, so finding and installing dependencies is a discipline of its own, and a distinguishing feature of each Linux distribution.
- Software has conflicts. Program A depends on glibc 2.13 while program B depends on glibc 2.14; script A needs Python 3 while script B needs Python 2; the Apache and nginx web servers both want to listen on port 80. Installing conflicting software in the same system always invites chaos, like the DLL hell of early Windows. The cure for conflicts is isolation: let multiple versions coexist in the system and provide a way to find the matching one. Let's see how Docker tackles these two problems:
- It packages all of a piece of software's dependencies and its runtime environment into one image, instead of using complex scripts to "install" the software into an unknown environment;
- Because the image contains all the dependent packages, Docker images are layered: an application image is generally built on top of a base system image, and only the incremental part needs to be transferred and stored;
- Docker uses container-based virtualization to run each piece of software in its own container, avoiding file system path conflicts and runtime resource conflicts between different programs.
Docker's layered image structure relies on AUFS (Another Union File System) on Linux. AUFS can merge a base directory A and an incremental directory B into a single mounted directory C: the files of both A and B are visible in C (on conflict, B wins), and modifications made through C are written to B. When a Docker container starts, a fresh incremental directory is created and mounted together with the Docker image as the base directory; everything the container writes goes into the incremental directory and the base is never modified. This provides a kind of "version control" for the file system at fairly low cost and also shrinks what has to be distributed (only the increment needs to be shipped, since most people already have the base image). A sketch of the union-mount idea follows below.
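Since AUFS itself never entered the mainline kernel, this hedged sketch uses overlayfs, an analogous union file system that ships with modern Linux kernels, to show the idea: a read-only lower layer (the "base image"), a writable upper layer (the "increment"), and a merged view. All directory paths are hypothetical and it must run as root.

```c
/* Union mount: read-only base + writable increment = merged view. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>

int main(void) {
    /* lowerdir = read-only base layer, upperdir = writable layer,
     * workdir = scratch space overlayfs needs, /mnt/merged = the
     * combined view that a container would see as its root. */
    const char *opts =
        "lowerdir=/var/lib/demo/base,"
        "upperdir=/var/lib/demo/upper,"
        "workdir=/var/lib/demo/work";

    if (mount("overlay", "/mnt/merged", "overlay", 0, opts) != 0) {
        perror("mount overlay");
        return EXIT_FAILURE;
    }
    printf("union mount ready: writes to /mnt/merged land in the upper layer\n");
    return EXIT_SUCCESS;
}
```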
The relationship between Docker, LXC, and kernel components
Docker's virtualization is based on libcontainer (the figure above is a bit dated; back then Docker was still based on LXC). In fact, both libcontainer and LXC build on mechanisms provided by the Linux kernel: cgroups for resource accounting, chroot for file system isolation, namespaces for isolation, and so on.
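To round out the earlier namespace example, here is a hedged sketch of the cgroups side, using the cgroup v1 layout of that era: create a memory cgroup, cap it at 256 MiB, and move the current process into it. The group name "demo" is arbitrary; it requires root and a mounted /sys/fs/cgroup/memory hierarchy.

```c
/* Cap this process (and its children) at 256 MiB via a memory cgroup (v1). */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_file(const char *path, const char *value) {
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(EXIT_FAILURE); }
    fprintf(f, "%s", value);
    fclose(f);
}

int main(void) {
    const char *dir = "/sys/fs/cgroup/memory/demo";
    if (mkdir(dir, 0755) != 0) perror("mkdir (may already exist)");

    /* limit the group to 256 MiB of memory */
    write_file("/sys/fs/cgroup/memory/demo/memory.limit_in_bytes",
               "268435456");

    /* add ourselves to the group: every child we spawn inherits the cap */
    char pid[32];
    snprintf(pid, sizeof pid, "%d\n", getpid());
    write_file("/sys/fs/cgroup/memory/demo/tasks", pid);

    printf("now running under a 256 MiB memory cgroup\n");
    return 0;
}
```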
How does freeshell start
We know that freeshell uses OpenVZ virtualization, and people often ask whether they can change the kernel, or where the freeshell system boots from. In fact, just like an ordinary Linux system, freeshell starts from /sbin/init inside the virtual machine (check whether it is process 1!), followed by the guest system's own startup sequence.
A corner of the freeshell control panel
In fact, different kinds of virtualization technology begin booting the guest system from different places:
- From an emulated BIOS, supporting MBR, EFI, PXE, and other boot methods, as in QEMU and VMware;
- From the kernel, with the kernel supplied from outside the virtual machine image, as in KVM and Xen;
- From the init process, where the virtual machine is a container sharing the host's kernel and runs the usual system services according to the operating system's boot sequence, as in LXC and OpenVZ;
- Running only a specific application or service, also container-based, as in Docker.
OpenStack, a cloud operating system
Everything above concerns the concrete techniques of virtualization. But for end users to actually use virtual machines, there also has to be a platform that manages them, a "cloud operating system." OpenStack is currently the hottest cloud operating system.
OpenStack cloud operating system
OpenStack (like any serious cloud operating system) virtualizes the various resources in the cloud. Its compute management component is called Nova.
- Compute: the virtualization discussed in this article is compute virtualization. OpenStack can use many virtualization back ends, such as Xen, KVM, QEMU, and Docker. The management component Nova decides, based on the load of each physical node, which physical machine a virtual machine should be scheduled onto, and then calls the back end's API to create, delete, power on, or power off virtual machines.
- Storage: if virtual machine images could only be stored locally on compute nodes, data redundancy and virtual machine migration would suffer. Clouds therefore generally use storage systems that are logically centralized but physically distributed and independent of the compute nodes; compute nodes usually access data disks over the network.
- Network: each tenant should have its own virtual network, and keeping different tenants' virtual networks from interfering with one another on the physical network is the business of network virtualization; see my other post, "An outlook on network virtualization technology."
Besides Nova, the core virtualization manager, OpenStack has many other components: the image service Glance, object storage Swift, block storage Cinder, virtual networking Neutron, the identity service Keystone, the dashboard Horizon, and so on.
OpenStack dashboard (Horizon)
The two figures below show Nova's architecture with Docker and with Xen, respectively, as the virtualization back end.
Docker driven by OpenStack's Nova compute component
Xen as OpenStack's compute virtualization back end
Epilogue
The need for task isolation gave rise to virtualization, and we all want to isolate tasks completely without giving up too much performance. From emulation, which isolates best but is too slow to be practical, to the dynamic binary translation and hardware-assisted virtualization used by modern full virtualization, to paravirtualization with modified guest systems, and on to operating-system-level, container-based virtualization with a shared kernel, performance has always been the primary force driving each wave of virtualization technology. Still, when choosing a virtualization technology, we have to find our own balance between isolation and performance according to actual needs.
The venerable technology of compute virtualization, together with the newer storage and network virtualization, forms the cornerstone of the cloud operating system. As we enjoy the seemingly inexhaustible computing resources in the cloud, if we can peel back the layers of wrapping and understand the essence of computing, perhaps we will treasure those cloud resources more, admire the grandeur and delicacy of the edifice of virtualization technology, and salute the computer scientists who led us into the cloud era.
References
- Understanding Full Virtualization, Paravirtualization, and Hardware Assist, VMware, http://www.vmware.com/files/pdf/VMware_paravirtualization.pdf
- Virtualization technology, IBM developerWorks, http://www.ibm.com/developerworks/cn/linux/l-cn-vt/
- Formal Requirements for Virtualizable Third Generation Architectures, Communications of the ACM, 1974
- Assorted images from around the web (sources not individually credited)