Ele.me: The Evolution of Our Operations Infrastructure

Posted by barello at 2020-02-27

This article is based on a talk given by Xu Wei, senior operations manager at Ele.me, at the 10th Meizu Open Day.

Editor: Cynthia

Hello, everyone. Let me briefly introduce myself. I'm Xu Wei, and I'm in charge of infrastructure operations and development. Earlier in my career I worked at PPTV, Ctrip, Youzu, and other companies, so I'm an operations veteran.

Founded in 2008, Ele.me began to see large-scale, explosive business growth at the end of 2014. In 2015-2016, Ele.me entered a period of rapid development, with business volume and server count growing by dozens of times. Growth on this scale inevitably brings many challenges. Through the evolution of Ele.me's operations infrastructure, this article shares the measures and ideas we used to deal with the challenges of each period.

1. The 1.0 Era

In the 1.0 era, from 2014 to 2015, the business grew rapidly. At that stage we thought more about what the business needed than about long-term architecture. Each person or team was responsible for their own piece of work and focused entirely on meeting business needs. As you can imagine, because so little was thought through, this process accumulated a lot of technical debt, the so-called "pain points".

Network pain

The network pain points were mainly:

● no standardization: IPs were assigned haphazardly. Some servers had two or even three IPs; some used NIC bonding, some didn't;

● frequent attacks: with the business growing fast, we faced large numbers of attacks, which could cause downtime;

● insufficient bandwidth convergence ratio: traffic was too heavy; in high-bandwidth scenarios such as caching, a switch uplink or a server's Gigabit NIC would fill up quickly;

● lack of monitoring: when something went wrong, the technical team didn't know; riders or users would report that they couldn't place orders, and complaints of all kinds came back through customer service;

● single points: there were single points of failure everywhere, from the overall architecture down to individual services and even individual machines;

● unstable link quality was also a problem.

Server pain

The server (resource) pain points were mainly:

● slow server delivery: our peak weekly delivery from last year to this year was 3,700+ logical servers. On average, thousands of units were delivered and recycled every month, which demands high efficiency;

● lack of asset management: no standards, high maintenance cost. In that period of unchecked growth, servers had to be bought as fast as possible, with no thought for how many there were or what their configurations were. There was no standardization: one machine might have an SSD and another not. Maintenance costs were very high;

● no guarantee of delivery quality: everything was installed by hand. At the end of 2015, whenever a batch of machines was purchased, a temporary team was assembled to install them together. Because it was manual work, it was slow, delivery quality could not be guaranteed, and checking the results was difficult.

Lack of basic services

The lack of basic services is mainly reflected in:

● monitoring: Zabbix was our earliest monitoring tool. Because configurations differed, some hard disks were not monitored, IOPS metrics were missing, and business-layer monitoring was far from complete;

● load balancing: each business team set up one or two servers on its own and hung an Nginx in front as a reverse proxy, done casually;

● centralized file storage: each server stored many files locally, which caused many problems for infrastructure management. In theory, SOA services should be stateless, with nothing but code on the local disk. But when a failure occurred, the business couldn't pinpoint the problem because monitoring was immature, so they had to look at the logs. That gets complicated: some say logs must be kept for a week, some say a month, and sometimes logs grow by tens of gigabytes a day. Add a hard disk? Who purchases and manages it? How do you keep it standardized? Centralized logging and centralized file storage both exist to solve the standardization problem.

In short, basic services were chaotic as well.

2. What We Did

What do you do in the face of so many problems? For operations, three things are enough.

The first is standardization. From hardware to network to operating system to technology stack, covering software installation, log storage paths, naming, code deployment, and monitoring, a systematic set of standards must be established top to bottom. With standards, you can automate with code; with standardization and automation together, you get a virtuous circle.

The second is process. A process turns many requirements into standardized, step-by-step procedures.

The third is platform: build platforms to realize standardization and automation.

As I understand it, operations needs to manage two life cycles.

The first is the resource life cycle: purchase, racking, deployment, code rollout, fault handling, server recycling, decommissioning, and so on.

The second is the application life cycle: development, testing, release, change, decommissioning, recycling, and so on.

2.1 standardization

On standardization, one point I often emphasize: give users multiple-choice questions, not open-ended ones.

For example, a user often says: I want a machine with 24 cores, 32 GB of RAM, and a 600 GB hard disk. At that point you should tell the user: I have four models, A, B, C, and D, one each for compute, storage, memory, and high I/O; which one do you want? This matters a great deal. Many users are simply used to asking for odd machines: one with a 200 GB disk, another with 250 GB. Such arbitrary demands are very hard to satisfy without standardization. Our server models are unified; we offer a fixed set of models, talk with users, collect their requirements, and try to identify their real needs.
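The "choices, not open questions" idea can be sketched in code. The catalog below is hypothetical (model names and sizes are invented), but it shows how a free-form request gets mapped onto the smallest standard model that covers it:

```python
# Hypothetical sketch: map free-form hardware requests onto a fixed server
# catalog, so users pick from standard models instead of one-off specs.
STANDARD_MODELS = {
    "A-compute": {"cores": 32, "mem_gb": 64,  "disk_gb": 600,  "ssd": False},
    "B-storage": {"cores": 16, "mem_gb": 64,  "disk_gb": 8000, "ssd": False},
    "C-memory":  {"cores": 24, "mem_gb": 256, "disk_gb": 600,  "ssd": False},
    "D-high-io": {"cores": 24, "mem_gb": 128, "disk_gb": 1200, "ssd": True},
}

def nearest_model(cores, mem_gb, disk_gb):
    """Return the smallest standard model that covers the requested spec."""
    candidates = [
        name for name, spec in STANDARD_MODELS.items()
        if spec["cores"] >= cores and spec["mem_gb"] >= mem_gb
        and spec["disk_gb"] >= disk_gb
    ]
    if not candidates:
        return None  # no standard model fits; talk to the user
    # Prefer the machine that still fits with the fewest cores, then least RAM.
    return min(candidates, key=lambda n: (STANDARD_MODELS[n]["cores"],
                                          STANDARD_MODELS[n]["mem_gb"]))
```

The point of the design is that the requester never specifies raw hardware; anything the catalog cannot cover becomes a conversation about the real need rather than a custom purchase.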

When purchasing a model, we also specify customizations, for example whether to disable power-saving mode. Every vendor still has quirks, including drive-letter drift on pass-through cards, and we have to work out how to automate around them so that machines come online automatically when they arrive.

Server manufacturing and racking also need customization. We organize resources into modules; the smallest module is three cabinets, and the number of servers per cabinet is fixed. At production time, if I purchase a thousand servers, I tell the manufacturer in advance which machine room, cabinet, and U-slot each of the thousand servers will occupy, and the manufacturer customizes accordingly. After the machines are delivered and racked, the manufacturer or service provider connects power, the operating system installs automatically, even the network configures itself: every layer is standardized.

2.2 Process + Automation

The picture above shows our workflow engine.

The resource life cycle contains many processes, such as server requests (physical machine requests, virtual machine requests, cloud service requests, and so on) and a large number of states, including recycling. Behind each process is automation: user input is standardized, users answer multiple-choice questions (which model, which configuration, how many), the forms are filled in, and the backend executes automatically.
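As a rough illustration of what sits behind such a workflow engine, here is a minimal state machine for a server request. The state names are invented, but the idea is that every transition is an explicit, automated, auditable step:

```python
# Hypothetical sketch of resource-lifecycle states in a workflow engine:
# a request moves through a fixed set of states, and only the transitions
# listed here are legal; everything else is rejected.
ALLOWED = {
    "requested":    {"approved", "rejected"},
    "approved":     {"provisioning"},
    "provisioning": {"in_service", "failed"},
    "failed":       {"provisioning"},   # automated retry
    "in_service":   {"recycling"},
    "recycling":    {"recycled"},
}

class ServerRequest:
    def __init__(self):
        self.state = "requested"
        self.history = ["requested"]    # audit trail of every transition

    def advance(self, new_state):
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.history.append(new_state)
```

Because illegal transitions raise immediately, a request can never silently skip a step such as approval or recovery verification.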

2.3 Automation + Platform

● automatic installation and initialization of physical servers. With thousands of servers, can you install them all in a day? 360 once installed as many as 5,000 servers in a day; our record is 2,500 physical servers in a day.

● automated bring-up of network equipment.

● resource management platform: all resources are managed in a unified way, as the management backend for the resource delivery process.

● distributed file system, mainly used for database backup and image processing

● centralized log platform: all logs go into ELK instead of staying on the servers.

2.4 Private Cloud Platform (ZStack)

In the early days of unchecked growth, virtual machines were created by hand. That raises a question: how do you know which machine still has room? For example, if one physical machine can hold six virtual machines and five have already been created, how do I know where capacity remains? A service also needs to be spread over ten physical machines, even across cabinets, so that a single physical machine or cabinet failure doesn't take down the whole application. That is virtual machine resource scheduling, and for it we chose ZStack.

Why ZStack?

The three popular open-source private cloud stacks were OpenStack, CloudStack, and ZStack.

On the principle that simpler is better, we ruled out OpenStack: it is too heavy, and no one had the time to get their arms around a system that big. The feedback we gathered from the industry was, on the whole, not great either.

ZStack's founder was a core CloudStack developer. At the time, the CloudStack community was no longer actively maintained, and CloudStack did not support CentOS 7.

So we chose ZStack. Back then ZStack still had plenty of bugs, but it was relatively simple, and we were confident we could make it work well.

ZStack's characteristics: simple, stateless, and API-driven.

It is relatively simple: after installation, it just runs. Of course, using it well is a little harder. It has a central architecture, entirely message-based. Our ZStack deployment barely uses the stock UI; we built custom interfaces behind it, and the front-end workflow calls the back-end APIs automatically, synchronizing through messages. ZStack now manages more than 6,000 virtual machines for us.

3. The 2.0 Era

In the 1.0 era, we did the standardization and automation work needed to keep things running smoothly. Since 2016, we have been in the 2.0 era, which has pain points of its own: what is your SLA? Do you have data? You say efficiency is high; how do you prove it? Is 1,000 a day high? How is that measured? In IT, everything except God should speak with data; everything should be quantifiable and measurable.

4. What We Did

In this period, we took two measures to address the pain points: refined operations and data-driven operations. Note that operations (O&M) and business operations are different things.

4.1 Refined Operations

Refined operations includes the following aspects:

● continuous upgrades of the network architecture

● establishing server performance baselines

● server delivery quality verification (no delivery if below baseline)

● hardware failure repair automation

● network traffic analysis

● server restart automation

● bug fixes: power-saving mode, NIC bonding...

Continuous upgrades of the network architecture

In the early days, we had one data center, with a Huawei S5700 series switch at the core. What does that mean? During a traffic burst, this device caused a P0-level incident for us. We then redefined our network standards and carried out a long series of network upgrades, covering the core, load balancing, aggregation-to-core bandwidth, and overall network architecture optimization.

There are also the links between IDCs. The earliest inter-IDC links were VPNs. Now we use bare fiber within a city and transmission circuits between cities. This too requires continuous investment.

Network optimization

As shown in the figure, with IDCs in Beijing and Shanghai, we pulled bare fiber and leased lines for office-to-IDC access, all with plenty of headroom. This also covers links to third-party payment providers such as Alipay and WeChat Pay.

Server performance baselines and delivery quality verification

Whether a delivered server is good enough must be proven with data. Every one of our servers has a baseline: for a compute model, for example, compute power, I/O capacity, and NIC packets-per-second can all be tested. A performance test runs at delivery time, and the server is accepted only if it reaches the baseline; otherwise, it is not delivered.
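A delivery gate like this is easy to sketch. The baseline numbers below are invented placeholders; the point is that acceptance is a mechanical comparison of measured benchmarks against the model's baseline:

```python
# Hypothetical sketch of a delivery gate: a server is accepted only if every
# measured benchmark reaches the baseline defined for its model.
BASELINES = {
    # model -> minimum acceptable benchmark results (illustrative numbers)
    "A-compute": {"cpu_score": 20000, "disk_iops": 5000,  "nic_pps": 1_000_000},
    "D-high-io": {"cpu_score": 15000, "disk_iops": 80000, "nic_pps": 1_000_000},
}

def verify_delivery(model, measured):
    """Return (ok, failures); failures lists every metric below baseline."""
    baseline = BASELINES[model]
    failures = [metric for metric, minimum in baseline.items()
                if measured.get(metric, 0) < minimum]
    return (not failures, failures)
```

Because the check names the exact failing metrics, a rejected server goes back to the vendor with evidence rather than an argument.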

Network traffic analysis

Early on, we hit a case where the bandwidth on an aggregation-to-access fiber ran full. The early bandwidth convergence ratio was insufficient: the link had four 10G ports, and because of the traffic-hashing algorithm, one of the four filled up. We need to know which business the traffic belongs to and how the key nodes are loaded, and raise an alarm when there is a problem.
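Tracing a saturated link back to its owners boils down to aggregating flow records into per-service totals. A minimal sketch, assuming flow records have already been collected (e.g. via NetFlow/sFlow) and tagged with a service name:

```python
# Hypothetical sketch: aggregate sampled flow records into per-service byte
# totals, so a saturated uplink can be traced to its top talkers.
def top_talkers(flows, n=3):
    """flows: iterable of (service, n_bytes). Returns top-n services by volume."""
    totals = {}
    for service, n_bytes in flows:
        totals[service] = totals.get(service, 0) + n_bytes
    # Sort descending by total bytes and keep the heaviest n services.
    return sorted(totals.items(), key=lambda kv: -kv[1])[:n]
```

In practice the same aggregation, run continuously with thresholds, is what drives the "alert when a key link fills up" behavior described above.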

Hardware failure repair automation

We now have a large fleet of servers, with possibly dozens of failures every week. How do we learn of a failure immediately and repair it quickly without affecting the business?

There are other tasks too, such as automatic server restarts, which are hard on operations staff. If there is a fault in the middle of the night, or a server hangs, it needs a restart. Logging in through the remote management card and typing a password every time is too primitive; restarts must be automated.

Generally speaking, automatic server repair comes down to a few steps:

● fault detection

● fault notification: user, IDC, supplier

● fault repair

● repair verification

● fault analysis

The first step is fault detection. How do we find resource faults? Monitoring: in-band, out-of-band, and log-based monitoring from multiple directions. All monitoring alarms are gathered in one place for initial aggregation, then fed into this system.

This figure is from September 19; you can see there are many faults. Once a fault is found, it must be reported, and notification itself is complicated: SMS, phone, or an internal tool? We notify through multiple channels. Some users are very cooperative: they say, don't email me, I have an API for receiving messages automatically. For example, if one of their servers fails, we send them a message automatically; on receiving it, the business shuts the machine down, even completes a series of data operations, and then returns a message to the repair system: this server can now be serviced. On receiving that message, we inform the IDC and the supplier: in such-and-such machine room, cabinet, and U-slot, the server with this serial number has this problem; please come on site at this time. At the same time we tell the IDC, through several channels: a person with this ID number will bring this equipment for on-site maintenance at this location.

We have tens of thousands of servers, with only two people handling this operations work. The suppliers' repair and fault handling is still manual. After finishing, they log in to our external system and send us a message saying the server has been repaired. Our program automatically checks whether the fault has recovered; if so, it notifies the user that the resource is repaired, and the user pulls the server back into service. Meanwhile, all the fault information enters my database and is analyzed automatically. We can see which server brands are weak and which models or components fail most often, which is a useful reference when choosing suppliers and models.
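The repair workflow above can be sketched as a pipeline with pluggable notification and verification callbacks. Everything here is illustrative, not Ele.me's actual system; the callback names are invented:

```python
# Hypothetical sketch of the automated repair pipeline: detect -> notify the
# owner (who drains the machine and acks) -> notify IDC/vendor with the exact
# location -> verify recovery -> record the fault for supplier analysis.
def handle_fault(fault, notify_owner, notify_vendor, is_recovered, record):
    """fault: dict with at least a 'serial' key. Returns the steps executed."""
    steps = []
    notify_owner(fault)                 # owner shuts down / drains traffic
    steps.append("owner_notified")
    notify_vendor(fault)                # vendor gets room/cabinet/U-slot/serial
    steps.append("vendor_notified")
    if is_recovered(fault["serial"]):   # automatic post-repair check
        steps.append("verified")
        record(fault)                   # feed the failure-analysis database
        steps.append("recorded")
    return steps
```

Keeping notification, verification, and recording as injected callbacks is what lets two people run the loop over tens of thousands of servers: each channel (SMS, API, vendor portal) is just another plug-in.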

Refined operations is full of details like these bug fixes, and the devil is in the details. Early on we were bitten by power-saving mode, and by NIC problems. From hardware up to services, there are even more bugs in code.

Data center visualization

We have many cabinets and machine rooms, and their data is collected and displayed by an automated system.

Operations must weigh three things: quality, efficiency, and cost. The figure above shows one module containing many cabinets. These cabinets draw a lot of power, which reflects cost. A large number of our cabinets show yellow, and yellow is an alarm: a cabinet drawing 4000W or 5000W. We try to use resources fully at near-optimal cost, so our cabinets are deliberately high-power; for example, we pack a lot of equipment into a 47U cabinet.

4.2 Data-Driven Operations

Everything in IT has to speak with data.

Asset inventory

Assets include: how many servers we have, which machine rooms they are distributed across, how many cabinets there are, the servers' brands and models, and which are occupied and which are idle.

Network traffic analysis

Network traffic analysis asks: where does the traffic come from? For example, here is an anomaly that I know was caused by cross-city transmission bandwidth. As everyone knows, cross-city bandwidth is very expensive; a 10G link can suddenly fill up, and expanding it takes three months. The whole business would be seriously affected, so we must know immediately who is using the traffic.

Where the servers are

We have bought all these machines, so we need to know who is using them. The line in the figure is resource utilization, here for the big data department; you can see that big data's utilization is very high, while other departments' is not. With this data, I send each department a report: how much you spent, how many servers you used, their types, their distribution, and their utilization. That is a business-operations mindset, not an O&M mindset.

Resource delivery SLA

How is our workload measured? We have delivered a lot of servers: when, what models, and at what efficiency? It must be measurable. At year-end KPI review, saying "our department did a lot of work, this project and that project" is nonsense. Just state how many projects were completed this year, how many resources were deployed, and what the efficiency was. Our average delivery time used to be two hours; now it is 20 minutes, and next it will be five.

Cost accounting

We spent a lot of money this year: where did it go? Who spent it? What was bought, for whom, and how is it performing? We can see the composition of these costs from every dimension, and even compare our costs against our competitors'.

Supplier quality evaluation

For example, when did each component fail, and what was the failure rate? The report is delivered automatically to procurement as the technical score for purchasing, with no human involved in the process. On the quality side, if a vendor's quality declines over a period, after-sales management can be applied to them.
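The supplier scoring described here reduces to computing per-vendor failure rates from the recorded faults. A minimal sketch with invented data shapes (the real report would break this down further by model and component):

```python
# Hypothetical sketch: aggregate recorded faults into a per-vendor failure
# rate that can feed procurement's technical score automatically.
def failure_rates(fleet, faults):
    """fleet: {vendor: server_count}; faults: iterable of {"vendor": ...}.

    Returns {vendor: faults / servers} for every vendor in the fleet.
    """
    counts = {}
    for f in faults:
        counts[f["vendor"]] = counts.get(f["vendor"], 0) + 1
    return {vendor: counts.get(vendor, 0) / total
            for vendor, total in fleet.items()}
```

Normalizing by fleet size matters: a vendor with more faults in absolute terms may still be the more reliable one if it supplies most of the fleet.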

5. Summary

This talk has mainly covered resource life cycle management. Most of the content focuses on the underlying resources, but the ideas apply to every module: logs, the various operations systems, monitoring, and so on.

Finally, some thoughts.

Simple and usable. I have been doing Internet operations for nearly ten years. Early on, a failed server had to be examined one by one by someone who knew it well. Now the approach is: one server fails, pull it out, and another quickly takes its place. Cost is controllable, quality is controllable, and efficiency comes first. For that, things must be simple and usable.

This saying was popularized by Baidu, and it is a principle I follow in operations: keep everything simple. Some open-source solutions have lots of amazing features, but do you really need them? Do they really help you? What is their real core value? All software should aim for high cohesion and low coupling, and avoid hard dependencies.

Standardize whatever can be standardized; automate whatever can be automated. Standardization is certainly the future. With the growth of Alibaba Cloud and Tencent Cloud, many small companies will move to the cloud. What can I do for the hybrid-cloud architecture and operations of the future? How do we achieve rapid scaling and elastic computing? Elastic computing involves capacity planning, load testing, and many other points, and the cornerstone of all of them is standardization and automation.

Try not to reinvent the wheel. Developers love building wheels; they feel that joining a company and writing nothing from scratch makes them look unskilled or unproductive. But using software well matters more than the software being "good". Tools are not inherently good or bad; what matters is whether you use them well and do the right thing at the right time. Don't be prejudiced against a tool; it is just a brick. Our job is to assemble the bricks and use them well.

Done first, then done well (long live 80 points). Every beginning is hard, and the first step is to ship. Don't insist that a thing must be grand, with every aspect designed perfectly. What matters is whether it can actually land; landing is everything. Then someone asks: what about all the rough parts? Ship first, then optimize gradually.

With the Internet developing so fast, applications must iterate quickly; our company releases hundreds of times a day, so developing and updating quickly is vital. The process is necessarily a spiral, sometimes even two steps forward and one step back. Don't argue about whose architecture is best: the best architecture is the one that fits you, and what fits you differs by stage, which calls for constant refactoring and iteration.

Existing users matter as much as new users. This comes from Amazon. There is a famous Amazon story: a customer once wanted to move a service to Amazon, a deal expected to be worth tens of millions of dollars. Amazon evaluated what transformation the service would need and what impact that transformation would have on the stability of the existing business. After the assessment went up through the layers, the final conclusion was: we don't want this customer, because after taking them on, existing users could no longer be guaranteed.

This matters a great deal. Our log system went live in September 2016. At launch, the peak request rate was 80,000/s; by October it had reached 800,000/s, and users were still asking to be onboarded. At that point the hardware, architecture, and everything else were at their limits. I explained the situation, won myself a one-month buffer, and made a lot of technical changes. Now the peak is more than 2.6 million log entries per second, all collected, transmitted, stored, and analyzed in real time. There has to be a balance: serve existing users well while gradually onboarding new ones. Of course, you can't simply say "no way", or your reputation is gone.

Embrace change; don't have a glass heart. I have worked for many years at several companies, and every company changes differently at different stages. In my team, for example, developers do essentially no routine operations work because standardization is done well; at the bottom are hardware, software, and operating system experts, and the rest are programmers. But how should the many traditional ops people grow? Start learning to code; there has to be change and growth. My goal for the team this year is to "eliminate" our team in 2017, meaning unattended operations: spend only 10-20% of your time and energy on background bug fixes, and 80% on value output. That goal is now being realized.

Q & A

Q: Do you provide guidance on the server requirements of the business unit?

A: Our servers are divided into several types: compute, storage, memory, and high-I/O (dedicated to databases). The storage machines are basically identical, and the compute machines are virtualized. For stateless SOA services, for example, the model requested is basically the same for every business. When a new business comes up, there is an architecture review: is the architecture certified, is there data governance, is there a logging specification, is the overall design reasonable, and are the resource requests reasonable?

Q: Is the unified log platform built in-house?

A: We have a centralized management center and wrote a set of scripts to install Flume automatically; we also did secondary development on Flume. For example, Java produces multi-line logs (stack traces), which need line-merging; some business exception log entries are 2 MB, which we can't carry, so we detect and discard them; and some directories contain multiple logs, which also need handling.

Q: How do users view and index logs per business? For example, as the business owner, can I see only my own business's logs?

A: We haven't done that yet; as you know, Kibana's permission management is very weak. I should point out that we have another system: an SDK embedded in the business code sends the stack's call-chain relationships into it, and it displays the call list. That system integrates with our log system, so when users look at a call chain they can pull up the related log information. Not many people use Kibana directly now; it is just a back-end service.

Q: You just mentioned that there are performance baseline runs and stress tests. How do you do the stress testing?

A: We use the I/O stress tools provided by the system, testing each model's basic indicators: CPU capacity, IOPS, and PPS.

Q: Speaking of logs: when my application exceeds 100,000 requests, writing directly to Flume maxes out the server's CPU. My current practice is to write the logs locally first and then push them upstream.

A: Flume is a good tool, but its performance is not strong. The highest we measured in stress tests was 20,000 entries/second, and in practice it can't sustain even that, with occasional hangs and full sinks along the way. You need a lot of monitoring, and you should run Flume distributed: for example, one Flume agent per business server, reading files directly and shipping over TCP or UDP. It depends on the business's tolerance: some of our businesses cannot lose logs, and we measured 30%-50% loss over UDP, so if log completeness matters, UDP is not recommended.

Flume 1.7 supports offset-based log tailing, and it supports multiple files. The simpler and more standardized, the better, because your disk space, and even your I/O, can become a problem.

Q: How do your development and operations teams work? Are they combined?

A: I have discussed this with Alibaba too; they made three explorations. The first: a development team builds tools for operations. Many companies do this, but it has a flaw: ops says, "what are you building? I don't want to use it," and dev says, "ops does nothing but raise demands all day." The relationship sours, the final product goes unused, and the whole exercise is pointless.

Alibaba's second attempt: developers operate their own services, and the ops team does other things (in a company that size, there is always something to do). The drawback: developers don't understand O&M deeply. O&M systems are very different from online business systems: their reliability requirements are comparatively relaxed, but the demands on quality and efficiency are high. Developers who don't understand operations are prone to causing all kinds of failures.

So Alibaba made a third attempt: merging development with operations.

At Ele.me, some parts have full-time, dedicated development and operations roles, while other teams combine development and operations.

This article is reproduced from the official account "msup".

