My Five Years as a Ph.D. Student

Posted by punzalan at 2020-03-02


In August 2012, I landed at Pittsburgh airport with one suitcase. I hadn't found a place to live, and I didn't know how to get to CMU. I was confused about the future, but full of optimism. Now, having just finished the last talk of my Ph.D., I sit in the same airport, waiting for my departing flight.

Looking back, these were not just five years of hustle; they were five years of self-discovery and growth. Here I try to record the main things I did in that time and how I felt about them, in the hope that they inspire you.

Year 0: 3/11-8/12

I first applied to Ph.D. programs in the United States in 2011, but none of the offers I received came with a particularly suitable advisor, so I headed north to Beijing instead. I joined Baidu's sponsored-search department to work on click prediction for advertising: using machine learning to predict whether a user will click an ad. This was two years before the term "big data" became popular, but Baidu's data back then would count as big even now. My task was to use hundreds of machines efficiently to train models quickly on tens of terabytes of data.

The algorithm in production at the time was based on L-BFGS, so I wondered whether I could switch to a faster-converging algorithm. I found a promising one within a few days. The implementation, however, ran into all sorts of problems: performance, convergence, stability. Worse, all we had was a bare Linux box and a very old version of GCC, so much of it had to be written from scratch. System optimization, algorithm changes, and online experiments ate enormous amounts of time. After about a year, it went live on the full advertising traffic.
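The production system was vastly bigger, but the core recipe, fitting a logistic click model with L-BFGS, can be sketched on toy data. This is only an illustration: scipy's off-the-shelf optimizer stands in for the from-scratch implementation described above, and the data is synthetic.

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for the setup described above: logistic regression for
# click prediction, fit with L-BFGS. Synthetic data, single machine.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
true_w = rng.normal(size=10)
y = (rng.random(1000) < 1 / (1 + np.exp(-X @ true_w))).astype(float)

def loss_and_grad(w):
    p = 1 / (1 + np.exp(-(X @ w)))
    # negative log-likelihood and its gradient
    nll = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)
    return nll, grad

res = minimize(loss_and_grad, np.zeros(10), jac=True, method="L-BFGS-B")
print(res.success, res.fun)
```

The interesting part in production was never this loop itself but making it run across hundreds of machines without the communication cost dominating.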

In retrospect, I spent that whole year polishing details, sometimes writing thousands of lines of code for a 5% performance gain. All of this made the algorithm overly complex and over-engineered. But digging into every detail greatly improves personal ability, and many of the problems I ran into became the seeds of later research directions. I wrote up some of the algorithmic thinking elsewhere. Deep learning was just emerging then, and I felt it should be the future of large-scale machine learning, but it was several years before we really followed up on it.

In mid-December 2011, on a sudden impulse, I sent applications to CMU and MIT again, and to my surprise received an offer from CMU. One day I had dinner in the Baidu canteen with Kaige (Yu Kai) and Tongge (Zhang Tong). I said I had accepted the CMU offer but was still torn about going. They immediately said: Alex Smola is about to join CMU, we'll introduce you.

I remember starting to pack only the day before I left. I went to the company for a meeting in the morning, left at noon, said goodbye to friends, and headed to the airport. The weather in Beijing was so fine that day that I couldn't remember the haze had been off the charts the day before.

Year 1: 9/12-8/13

The main business of the first year was getting familiar with the environment and taking classes. CMU's coursework is relatively heavy: Ph.D. students must take eight courses, each with a huge workload, and must also serve as a teaching assistant for two courses, which is even more tiring than taking them.

The most useful course that year was Advanced Distributed Systems. I had taken plenty of high-quality courses back in the ACM class at Shanghai Jiao Tong University, and pure knowledge-transfer courses generally don't help me much. But this course was mainly reading papers and then discussing them; it conveyed not just knowledge but an understanding of design philosophy. As everyone knows, system design is an art rather than a science, an embodiment of the designer's aesthetics and philosophy. At the same time, the history of the systems world is made of wave after wave of trends, and it is very worthwhile to understand how that history develops and how its patterns repeat.

The course was taught that year by Hui Zhang (a legend who was already teaching at CMU in his twenties; his students include Ion Stoica, who in turn advised Matei, the author of Spark). He had an excellent big-picture view and was very good at elaborating the "why". It was through this course that I gained a clear understanding of distributed systems. Two years later I discovered that one of my own papers had made it onto the course's reading list; a small achievement.

Beyond classes, the more important thing was research. When I arrived at CMU, Alex was still at Google and had little funding, so he entrusted me to Dave Andersen. Thus I had two advisors: one for machine learning, one for distributed systems.

The first half-year was a process of getting to know each other. We talked for an hour every week, over video because Alex wasn't there. Alex's connection was often bad, he has a mixed German and Australian accent, and his thinking leaps around, so I frequently couldn't follow what he said and could only grin along foolishly. Fortunately, Dave kept typing out what Alex had just said, which got me through those early meetings.

The two advisors had very different styles. Alex is extremely quick: you say one thing, and he has already thought out the next ten; keeping up with him is hard. When he raises a problem, he usually already has several solutions in mind. Proving that my idea was better than his was not easy, and took a great deal of communication and supporting experimental data. It took me about two years to establish that in certain directions my plans were usually better, after which he became much less hands-on.

Dave doesn't throw out many ideas, but he helps you truly understand a thing and explain it clearly. Since my research was mainly machine learning, for the first two years I was essentially teaching Dave machine learning, trying to avoid formulas as much as possible.

My first research project was on partitioning data and computation to reduce network traffic in machine learning. Alex showed his strength: within a few minutes he had boiled the problem down to an optimization problem, and then each of the three of us proposed a solution. I ran the experiments and found Dave's algorithm worked best. Over the next two months I did a lot of optimization, added some theoretical analysis, and wrote the paper.

Unfortunately, the idea seemed to be a bit ahead of its time. Although we improved the writing over and over and submitted it several times, the reviewers either simply didn't understand it or didn't think it mattered. Academia had started hyping "big data" by then, but I think most people didn't really get it, or their "big data" was still a few gigabytes: copying it onto a USB stick is a matter of a dozen minutes.

Of everything I did at CMU, I consider this work among the most useful, yet it is the only piece that was never published.

Richard Peng, who shared an office with me, worked on theory. I often discussed problems with him, and some of those ideas grew into a joint project. The rough idea was to apply fast graph-compression algorithms to low-rank matrix approximation. The work amounted to thirty pages of formulas without a single experiment; I mainly treated it as recreation between bouts of coding, but we got lucky and it was accepted at FOCS.

Frankly, I don't particularly like pure theory. In proving a bound, for example, many terms simply get thrown away, which leaves me feeling the bounds are extremely loose. For systems people, the constants are what you ultimately fight over. I didn't find the sweeping approximations in this work practical, so I decided I should do something more applied in the future.

At CMU I fell back into the seven-days-a-week rhythm I'd had at Baidu, spending at least eighty hours a week at school. When tired, I went to the gym, usually around midnight. I was hardly alone; everyone worked hard. A busy gym in the small hours, offices occupied at 3 a.m., Chinese and Indian students everywhere. My roommate at the time, Tian Yuandong, spent far more time at school than I did.

I read a lot about optimization during that period. The book that inspired me most was Bertsekas's late-1980s book on parallel and distributed computation. It can be regarded as a summary of the research of the golden generation of MIT's control community, and it still hasn't gone out of date.

Inspired by it, I turned to asynchronous algorithms: in a distributed setting, give up guarantees on the freshness of data in exchange for system performance. Building on the algorithm I had done at Baidu, I made some improvements, added theoretical analysis, and submitted to NIPS.
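The flavor of the idea can be shown in a minimal single-process sketch: workers update a shared parameter vector without any locks, so each gradient step may be computed against slightly stale weights, yet the whole thing still converges. Python threads stand in for distributed workers here; this is an illustration of lock-free, stale-read updates, not the algorithm from the paper.

```python
import threading
import numpy as np

# Least-squares on synthetic data; four "workers" do SGD on a shared
# weight vector with no synchronization at all.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true

w = np.zeros(5)  # shared parameters, updated without locks

def worker(rows, lr=0.01, epochs=30):
    for _ in range(epochs):
        for i in rows:
            err = X[i] @ w - y[i]    # may read stale w
            w[:] -= lr * err * X[i]  # lock-free in-place update

threads = [threading.Thread(target=worker, args=(range(k, 2000, 4),))
           for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(np.round(w, 2))
```

Some updates race and get lost, but near the optimum the gradients vanish, so the staleness stops mattering; that intuition is what the theoretical analysis has to make precise.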

After the NIPS submission, I went to intern at Google Research. Google Brain had not been around long then. On the 42nd floor, "the answer to the universe", big names such as Jeff Dean, Geoffrey Hinton, and Prabhakar Raghavan were crowded together; their combined paper citations ran past 800,000.

Alex told me to read Jure Leskovec's papers and learn how to tell a story. I tried using some user GPS data inside Google to model user behavior, but when I wrote it up I couldn't reproduce Jure's sense of narrative; I found I simply wasn't made of that material. Because it used user data, the article also collided with the post-Snowden awareness of privacy, and Google only allowed it to be published after half the results were deleted. Somewhat dispiriting.

But I spent most of my time at Google reading internal code and documentation. Google's infrastructure is excellent and its documentation thorough. Even though I learned nothing directly applicable, it at least opened my eyes.

Year 2: 9/13-8/14

That semester Tuomas Sandholm, another legend, offered his mechanism design course. (Recently, for instance, his Texas hold'em program beat professional players, and an earlier company of his sold for hundreds of millions.) But I never fully understood the course, and never even finished the course project I had promised. For the next two years, whenever I ran into Tuomas he would ask whether there was any progress; I could only spot him from afar and take a detour.

The NIPS submission was rejected, and the reviews showed the reviewers didn't understand the difference between threads and processes, which was deflating. Soon afterwards, a much simpler paper with a similar idea from the lab next door was accepted, even as an oral, so the pressure at the time was considerable. Alex consoled me that this sort of thing happens all the time, not to take it to heart, and gave many examples from his own career.

After mulling it over: a good article naturally needs enough "substance", that is, enough information. But an article that gets accepted must, roughly, satisfy the following formula: the information in the article divided by its readability must not exceed the reviewer's expertise times the time the reviewer spends on it.

At machine learning conferences, the flood of submissions naturally drags down the average reviewer's level, and many reviewers spend only half an hour to an hour on a paper. So the right-hand side of the formula is usually small, and we cannot control it.

If a paper doesn't carry much information, say an improvement on prior work or some simple new idea, the inequality holds easily. For information-dense papers, you must raise readability: a clear problem setting, sufficient context and explanation, and so on. The NIPS paper, and the one rejected earlier, failed exactly because we assumed the reviewers had enough expertise and crammed in too much substance, leaving everyone bewildered.

The formula applies to published articles too, as a predictor of citations. You often see substance-dense papers cited by few, while contemporaneous work that considers only a simple special case gets cited widely.

For the next six months I mainly worked on a general-purpose distributed machine learning framework, so that I could run future experiments on it. It was named Parameter Server, keeping the name Alex had proposed in 2010. I spent a great deal of time on interface design, implemented several versions, and ran some industrial-scale experiments.
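The interface at the heart of such a framework is small: the server holds the model; workers pull the current weights, compute gradients on their shard of the data, and push updates back. The sketch below is a toy to convey that shape; the names `ToyServer`, `pull`, and `push` are illustrative, not the actual Parameter Server or ps-lite API.

```python
import numpy as np

class ToyServer:
    """Holds the model; applies gradients pushed by workers."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.w.copy()           # workers fetch current weights

    def push(self, grad):
        self.w -= self.lr * grad       # apply a worker's gradient

def worker_step(server, X, y):
    w = server.pull()                          # fetch latest weights
    grad = X.T @ (X @ w - y) / len(y)          # gradient on this shard
    server.push(grad)

# Least-squares problem split across 4 workers' data shards.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
w_true = np.array([1.0, -2.0, 3.0])
y = X @ w_true
server = ToyServer(3)
shards = np.array_split(np.arange(400), 4)
for _ in range(200):
    for rows in shards:
        worker_step(server, X[rows], y[rows])
print(np.round(server.w, 2))
```

Everything hard about the real system lives outside this sketch: asynchrony, fault tolerance, sharding the parameters themselves across server nodes, and keeping network traffic down.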

What really ate my time, though, was writing the paper. The goal was OSDI, one of the two top systems conferences. We expected the reviewers to be in the same state as Dave two years earlier, without much background in machine learning or mathematics, so we used as few formulas as possible. I spent an entire month on the paper, fourteen pages packed with text and figures. The effort was not wasted: the paper was accepted. Then came weeks of preparing the conference talk. Compared with my earlier habit of a week on a paper and two or three days on a talk, both my writing and my presenting leveled up considerably this time. The formulas and theorems that didn't make it in went into the next NIPS submission, which, luckily, also got in.

With the paper out, I felt settled enough to do things more freely.

Over winter break I went back to Baidu to see Kaige and Tongge. Tongge said he had a recent idea, so we quickly ran the experiments, wrote it up, and submitted to KDD. At the same time, one of Alex's students also submitted to KDD an idea Alex had long wanted me to work on; I had dismissed it as a little trick not worth my time, and it ended up winning best paper. On presentation day, my session hall was nearly deserted while the one next door was packed. For a long time that left me wondering whether I'd have done better just following my advisor's lead.

Back then Kaige was running Baidu's "Young Marshal" program, and I joined when it seemed like a good fit. By this time Kaige was charging into deep learning with a large group of people, and naturally I jumped into that pit too. After trying several ideas, I decided that building a distributed deep learning framework suited me best. I chose cxxnet as the starting point, mainly because I knew Tianqi well, and began running some AlexNet-scale experiments.

It was the Young Marshal program that got me started on deep learning projects. Kaige also supported my doing open-source development to give back to the community, rather than only internal products. I remain ashamed, though, that during my time in the program I did nothing that actually helped the company.

Year 3: 9/14-8/15

Back at CMU, Alex saw how hot deep learning had become and said we should buy some GPUs to play with too. But we were relatively poor, so we could only scavenge cheap parts on Newegg. Thus began a grand saga of machine-wrangling; I feel like I spent the whole year buying, buying, buying and assembling machines. In the end we spent a few tens of thousands of dollars to piece together a cluster of 80 GPUs. Looking back, the time cost wasn't worth it, and buying all those assorted bargain hardware models drove maintenance costs way up. But it was fun while it lasted. Details are in an earlier blog post.

That year I wrote a great deal of Parameter Server code, and spent a lot of time helping users adopt it. It's hard to call it a success, for several reasons I can see now. When writing the code I prioritized performance and breadth of supported machine learning algorithms, but, repeating my earlier mistake, I neglected readability, so only a handful of people could understand the code well enough to build on it. For instance, I tried to get students in Alex's group to use it, but the layers of asynchrony and callbacks left them baffled. Second, I had no one to review the interfaces with me, so they carry a strong personal flavor and could never be simple and obvious to everyone.

Fortunately, I found a group of like-minded friends. It started when I noticed Tianqi writing a distributed launch script for XGBoost; I tried it, found it genuinely useful, and chatted with him. We realized that many basic components, such as launch scripts and file readers, should be shared across projects rather than each project reinventing the wheel. So Tianqi and I created an organization on GitHub called DMLC to improve cooperation and communication. Its first project was dmlc-core, which holds the launching and data-reading code.

DMLC's second new project was Wormhole. The idea was to offer a family of distributed machine learning algorithms sharing nearly identical configuration parameters, to unify the user experience. I ported the machine learning algorithms from Parameter Server, and Tianqi ported XGBoost. The original systems code of Parameter Server was slimmed down into ps-lite.

Along the way I heard from Baidu friends that factorization machines (FM) worked well on advertising data, so I implemented FM in Wormhole, added some optimizations for the distributed setting, and submitted to WSDM. Start to finish it took under a month, yet it miraculously earned a best-paper nomination.
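What makes FM attractive for sparse advertising data is that it scores all pairwise feature interactions through low-rank factors, and a well-known algebraic identity brings the cost from O(n^2) down to O(kn). A minimal sketch of that identity, with random values just to check it against the naive pairwise sum:

```python
import numpy as np

def fm_score(x, w0, w, V):
    """y = w0 + <w, x> + sum_{i<j} <V[i], V[j]> x_i x_j, in O(k*n)."""
    linear = w0 + w @ x
    s = V.T @ x  # per-factor sums: sum_i V[i, f] * x_i
    pair = 0.5 * (s @ s - ((V * V).T @ (x * x)).sum())
    return linear + pair

rng = np.random.default_rng(0)
n, k = 6, 3
x = rng.normal(size=n)
w0, w, V = 0.5, rng.normal(size=n), rng.normal(size=(n, k))

# naive O(n^2) sum over pairs, for comparison
naive = w0 + w @ x + sum(V[i] @ V[j] * x[i] * x[j]
                         for i in range(n) for j in range(i + 1, n))
print(np.isclose(fm_score(x, w0, w, V), naive))  # True
```

The distributed optimizations mentioned above sit on top of this: the factor matrix V is what the workers and servers exchange.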

While developing Wormhole we found that the algorithms differ considerably: they can share some code, but each has its own quirks and needs dedicated optimization to guarantee performance. That made maintenance difficult; a change to shared code meant checking every project. My conclusion was that a project is best off doing exactly one thing. So Tianqi moved the XGBoost code back into its own project, and I spun FM off into an independent project called difacto.

Through this series of projects I learned that, at our level of skill and manpower, building a general and efficient distributed machine learning framework is extremely hard. What is feasible is a targeted project for one family of similar algorithms. Its interface must match the structure of that algorithm family, so that algorithm developers can grasp it easily, rather than exposing too many underlying system details.

The project that really grew the DMLC community was the third, mxnet. The background: cxxnet had reached a certain maturity, but its flexibility was limited; users could only define models through a configuration file, not through interactive programming. The other project was Minerva, built by ZZ and Minjie: an interactive, numpy-like programming interface, whose flexibility in turn posed many challenges for stability and performance optimization. I was doing distributed extensions for both projects at the time, so I understood both reasonably well, and a natural thought followed: why not merge the two projects and take the best of each?

We pulled the developers of both projects together for several discussions and reached a broad consensus. The new project was named mxnet, for "mixed net", a blend of the two earlier names (Minerva and cxxnet). Abandoning a project you've developed for years is no easy decision, but fortunately everyone was willing to commit fully, so mxnet went smoothly, and a first runnable version soon appeared.
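The tension being merged can be shown in miniature. Below are pure-Python toys, not the real APIs of cxxnet, Minerva, or mxnet: the declarative style describes the computation as data first and runs it later (easy to optimize, hard to debug interactively), while the imperative style executes each line immediately (easy to inspect, hard to optimize globally).

```python
import numpy as np

# Declarative: the model is a description; execution happens later.
graph = [("mul", 2.0), ("add", 1.0)]  # y = 2x + 1, as data

def run(graph, x):
    for op, c in graph:
        x = x * c if op == "mul" else x + c
    return x

# Imperative: each line runs immediately, numpy-style.
x = np.array([1.0, 2.0])
y = x * 2.0 + 1.0

print(run(graph, np.array([1.0, 2.0])), y)
```

mxnet's bet was that one framework could offer both styles and let users mix them, hence the name.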

Year 4: 9/15-8/16

In the first half of the year I wrote a lot of code for difacto and mxnet. At first I actually considered difacto the more important of the two: a significant improvement to linear models at modest extra compute cost would matter a great deal for applications like ad prediction. But once, when I met Andrew Ng, I told him I was working on both projects at the same time, and he immediately said I should concentrate on mxnet, which had far more room to grow. I've always admired Andrew's vision, so I took his advice.

By November, mxnet was reasonably complete, and I wrote a short paper for a NIPS workshop. But then came the news that TensorFlow (TF) was being open-sourced. Built by a large team of full-time engineers led by Jeff Dean and backed by Google's enormous publicity machine, it unsurprisingly became the most popular deep learning platform in short order. TF put real pressure on us; some of our core developers switched to it. But TF's presence also made me realize that it's better to focus on doing our own thing well than to worry too much about the competition.

During NIPS the mxnet crew got together once; several of them I was actually meeting for the first time. Then NVIDIA's GTC invited us to give a talk. Between those two events we went into a sprint and shipped many improvements, while users grew steadily. We had always assumed that being a small team made us fast at shipping new things, and that this was our advantage. But as users multiplied, we got complaints that development moved too fast and broke compatibility across modules. For a while I reflected on the trade-off between velocity and stability when developing new technology.

By this point, "big data" had fallen out of fashion almost overnight; everyone was talking about deep learning instead.

I also spent a lot of time promoting mxnet and recruiting developers, including shouting about it on Weibo and Zhihu and giving talks everywhere. I was intoxicated by plenty of praise, but the many pointed criticisms made me realize that the important thing is to share sincerely rather than simply boast.

With heavy media involvement, deep learning as a whole drifted toward entertainment. Many entertaining posts carried nothing but quick takes and (biased) opinions, with little substance: no nourishment for readers, and mere vanity for the author. Rather than writing such shallow filler, it is better to share something with depth: technical details, design rationale, and lessons learned.

Such sharing easily falls into the trap of covering only what you did and how good the results are. Those do demonstrate ability, and they help anyone who wants to reproduce the work. But more people care about applicability: when does the effect weaken, why are the results so good, what is the insight? That demands deeper understanding and reflection, not a simple recital of results.

The same holds for papers. Merely saying how much better your results are than the baseline shows only that the work is solid; no matter how good the numbers, it doesn't mean the work has depth.

The deep learning boom kept driving huge acquisitions of startups, and Alex got the itch. So Dave, Ash (formerly Yahoo's CTO), and I started a company with him, on a few hundred thousand dollars of angel investment. Alex wrote crawlers, Dave wrote the framework, I ran models, and for a while we worked hard at it. Sadly, Dave left midway to join Jeff on TF. The company was later sold to a small public company; later still, we decided that company wasn't reliable and stopped considering working with them.

The first startup can't be called a success, but we learned several things. First, be careful starting a company with professors: there are too many ideas and no one commits to a single one. Second, a group of part-time Ph.D. students is not especially dependable, particularly while the product is unclear. Third, even if the plan is to sell the company, you must build a product; when we sold, many people felt the team was too strong and the product too weak, so buyers only wanted the people. Fourth, transforming a non-technical company through technology, especially new technology, is very hard.

Then we rushed into the next venture. Ash, already financially free, wanted to chase a big idea, but Alex had just bought a house in the Bay Area and, under mortgage pressure, chose to join Amazon. So the idea was stillborn.

Then I received an email from Jeff asking whether I'd be interested in joining Google. That was naturally an attractive opportunity, and I also thought a small startup with strong technology would be a fine choice. But for mxnet's development, going to Amazon was one of the best options. Always fill the holes you dig. So I joined Amazon part-time, leading a group of younger folks doing mxnet development and deep learning applications on AWS.

Year 5: 9/16-2/17

Alex had said I could graduate as early as the beginning of 2015, but as a terminal procrastinator I kept putting off the preparation. By now I couldn't wait any longer, so I wrote my thesis in the Bay Area. Alex thought the thesis deserved to be written well, but I had no interest in rehashing what I'd already done, especially in California, where the sun is so good that I spent most of my time basking in the backyard. By then Bilibili had been completely overrun by grade-schoolers, buying books was inconvenient, and out of boredom I binged piles of web novels on Qidian; the thesis came out reading like an alchemy manual.

CMU requires the thesis committee to include three CMU faculty and one external member. Besides my two advisors, I recruited Jeff Dean and Ruslan Salakhutdinov, who had just joined CMU. Russ then joined Apple, which put the entire committee in the Bay Area, and Jeff joked that I could come defend at Google. Unfortunately, despite arguing with CMU many times, I was still not allowed to defend off campus, and three committee members had to be physically present. These constraints delayed the defense and forced the last-minute addition of Barnabas Poczos. In the end it was Jeff's assistant who cut through the tangle and pinned down a time for everyone; without her I would have waited months more.

The defense found itself in a strange state: the committee included the AI heads of Google, Amazon, and Apple, while the other two members and I worked part-time at those same three companies. It reflects the current exodus of AI academia into industry.

The defense itself was straightforward, not much different from giving a talk, and entirely peaceful. Even when Russ asked which was better, mxnet or TensorFlow, no fight broke out.

After the defense I asked the committee for advice on finding an academic job. They shared plenty of experience, but all stressed that academia is busy and poor, while industry salaries (said practically to their own faces) outstrip the CMU president's in no time. Think it over carefully, they said.


The night before the defense I thought about two questions. One: what matters most in a Ph.D.? The other: what would I do if I could start over? On the first, I learned a great deal in these five years: distributed systems on the systems side, the last five years' development of machine learning, much better writing, slide-making, and presenting, and stronger coding. I feel I can do first-rate research, or go head-to-head on code with a big team; as long as I work hard, no opponent is frightening.

More important, the five years of a Ph.D. let you concentrate on pushing a few things to their technical limit and making genuine breakthroughs. No other place offers that atmosphere.

The second question: what if I had stayed in industry in China? Most of my old Baidu colleagues are doing very well now, leading this wave of AI, and several have built companies worth hundreds of millions. So measured in money or influence, staying in industry the whole time would not have been bad at all; I might be rich by now.

But I think I would still choose the Ph.D. There is plenty of time to make money later, while the chance to spend a few years going from novice to expert in a field, even pushing that field forward, is rare. Standing at the high point of a field, you discover that although the world is vast, other fields use much the same techniques and follow the same laws of development. The ways of learning acquired during a Ph.D. can yield great results in any direction.

And more important still are ideals and passion. A person works fifty years in a lifetime; why not spend five of them pursuing ideals and passion?