What are the hottest and most famous high-tech start-ups and technologies in Silicon Valley?

Posted by tetley at 2020-04-10

From the powerhouse companies of Silicon Valley to the bubble everyone is debating: how can big data be combined with artificial intelligence, and what are the prospects for technology in 2015? Dong Fei, a software engineer at Coursera in Silicon Valley, has compiled the material from his recent public lecture at Stanford and from Q&A sessions on various occasions to share here. The article draws on his first-hand experience, along with specific analysis of companies he has worked for or studied in depth, such as Amazon and LinkedIn, and technologies such as Hadoop. Dong Fei's Zhihu is , and his email address is [email protected]

1. What are the hottest and most famous high-tech start-ups in Silicon Valley?

In Silicon Valley, people are extremely enthusiastic about entrepreneurial opportunities, and through my own observation and accumulation I have seen many hot start-ups emerge in recent years. I'll give you a list: The Wall Street Journal's ranking of start-ups around the world by valuation, originally titled "The Billion Dollar Startup Club." I also shared it in my lectures in China last year, and in less than a year, as of January 17, 2015, both the ranking and the valuations have changed dramatically.

First, seven companies have reached valuations above $10 billion, while a year ago there were none. Second, in first place is Xiaomi, well known in China. Third, most of the top 20 (about 80%) are in the United States, and largely in California: in Silicon Valley and San Francisco, for example Uber, Airbnb, Dropbox, and Pinterest. Fourth, many similar business models succeed in different markets; Flipkart, for example, is the Taobao of the Indian market, and Uber and Airbnb both fall into the category of the sharing economy. So you can still look for the next big opportunities among Uber, Palantir, Snapchat, Square, and O2O apps. I have interviewed at many of these companies and experienced their environments in person.

2. Do so many highly valued companies mean there is a big bubble?

Seeing so many highly valued companies, many people feel things have gone crazy. Is this a big bubble, and is it about to burst? That is the question on many minds. I think that in Silicon Valley, a place filled with dreams, investors encourage entrepreneurs to act boldly, and in doing so they also encourage bubbles: many projects see their valuations double or triple within a matter of months, and the huge financing rounds of Uber and Snapchat surprised even me. This brings us to the chart of the hype cycle for emerging technologies, which classifies technologies by their maturity and the expectations around them.

The five phases are the Innovation Trigger, the Peak of Inflated Expectations, the Trough of Disillusionment, the Slope of Enlightenment, and the Plateau of Productivity. The further left a technology sits, the newer it is and the more it remains at the concept stage; the further right, the more mature it is, moving into commercial application and improving productivity. The vertical axis represents expectations: people's expectations of a new technology rise with deepening understanding and media hype until they reach a peak; then, due to technical bottlenecks or other reasons, expectations gradually cool to a low point; but once the technology matures, expectations rise again, users accumulate, and it settles onto a healthy track of sustainable growth.

Gartner releases a Hype Cycle chart of technology trends every year. Comparing this year's chart with last year's, concepts such as the Internet of Things, autonomous cars, consumer 3D printing, and natural-language question answering sit at the Peak of Inflated Expectations; big data has slipped down from the top, and NFC and cloud computing are close to the bottom.

3. What are the future trends in high-tech entrepreneurship?

I'll start with a film I saw recently, The Imitation Game. The father of computer science, Alan Turing (after whom the highest prize in computing, the Turing Award, is named), had a hard life. He built a machine to break the German military's codes, making an outstanding contribution to the Allied victory in World War II and saving tens of millions of lives. But in that era he was sentenced to chemical castration for his homosexuality, and his suicide ended his short life at 41. One of his great contributions was his pioneering work in artificial intelligence: he proposed the Turing test, to judge whether a machine can display intelligence equivalent to, or indistinguishable from, a human being's.

Today, artificial intelligence has made great progress. From expert systems to statistical learning, from support vector machines to deep neural networks, each step has pushed machine intelligence forward.

Dr. Wu Jun, a former senior scientist at Google and the author of The Beauty of Mathematics and On Top of Tides, has proposed three trends in current technology: first, cloud computing and the mobile Internet, which are already underway; second, machine intelligence, which is starting to happen now, though many people have not yet realized its impact on society; and third, the combination of big data and machine intelligence, which lies ahead: some companies are working on it, but it has not yet reached large scale. He believes that in the future machines will control 98% of people, and that we must choose now: how do we become the remaining 2%?

4. Why will the future of big data and machine intelligence come?

In fact, before the Industrial Revolution, the world's per-capita GDP had remained essentially unchanged for the two or three thousand years prior to 1800, whereas in the roughly 180 years from 1820 to 2001 it rose from $667 to $6,049. The income growth brought about by the Industrial Revolution was truly earth-shaking; it is worth pondering why. And human progress has not stopped or merely grown steadily: with the invention of electric power, the computer, the Internet, and the mobile Internet, world GDP growth has accelerated, and information is growing even faster. By one estimate, the amount of information produced in the last two years equals the sum of the previous 30 years, and the last 10 years far exceed all of the information humanity had accumulated before. The computer age has its famous Moore's law: the number of transistors obtainable at the same cost doubles every 18 months, or equivalently, the cost of the same number of transistors halves. This law has matched the development of the past 30 years well, and analogues can be found in many related areas: storage, power consumption, bandwidth, and pixels.
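To get a feel for what "doubling every 18 months" compounds to, here is a quick back-of-the-envelope calculation (the 30-year horizon matches the period mentioned above; the function name is my own):

```python
# Moore's law as stated above: transistor count at constant cost
# doubles every 18 months. A sanity check of what that compounds
# to over 30 years.

def moores_law_factor(years, doubling_months=18):
    """Growth factor after `years`, with one doubling every `doubling_months`."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings

factor = moores_law_factor(30)   # 30 years = 20 doublings
print(f"{30 * 12 / 18:.0f} doublings -> x{factor:,.0f}")
```

Twenty doublings is a factor of about a million, which is why exponential curves like this dominate any linear cost trend.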

Von Neumann, one of the most important mathematicians of the 20th century and one of the greatest scientific generalists, made outstanding contributions to the modern computer, game theory, nuclear weapons, and many other fields. He proposed that technology would approach some essential singularity in human history, beyond which human affairs as we know them could not continue: the famous singularity theory. Technology today is growing exponentially. The American futurist Ray Kurzweil, who founded Singularity University, has said that humans may achieve digital immortality by 2045; he believes that with exponential growth in information technology, wireless networks, biology, physics, and other fields, artificial intelligence will be realized by 2029, and human lifespans will be greatly extended within the next 15 years.

5. What are the big data companies worth paying attention to abroad? What about in China?

This is a list of big data companies compiled in 2014. We can roughly divide it into infrastructure and applications, with the underlying layers built on common technologies such as Hadoop, Mahout, HBase, and Cassandra, which I will also cover below. To give a few examples: in the analytics space, Cloudera, Hortonworks, and MapR are the three musketeers of Hadoop; in operations, MongoDB and Couchbase are representatives of NoSQL; in cloud services, AWS and Google BigQuery compete head to head. Among traditional databases, Oracle has acquired MySQL, DB2 serves the old banks, and Teradata has been doing data warehousing for many years. Higher up the stack there are many applications: Google, Amazon, Netflix, Twitter; in business intelligence, SAP and GoodData; in advertising and media, Turn and Rocket Fuel; in intelligent operations, Sumo Logic; and so on. Last year's new star, Databricks, shook the Hadoop ecosystem with the wave of Spark.

In the fast-growing Chinese market, big companies also mean big data, and the BAT trio (Baidu, Alibaba, Tencent) are sparing no effort in investing in it.

When I was at Baidu five years ago, the company put forward the idea of Box Computing. In the past two years it has set up a Silicon Valley research institute and recruited Andrew Ng as chief scientist; its Baidu Brain project has greatly improved the precision and recall of speech and image recognition, and it recently even built a self-riding bicycle, which is very interesting. Tencent, with the largest social applications, also embraces big data and has developed a mass storage system on a C++ platform. On the main battlefield of Singles' Day (November 11) last year, Taobao's turnover passed 1 billion yuan within two minutes and reached 57.1 billion yuan for the day; there are many stories behind that. Engineers who once aspired to build Pyramid at Baidu (a three-tier distributed system modeled on Google's troika of papers) went on to create new legends with OceanBase. Alibaba Cloud was controversial in its day, and even Jack Ma wondered whether Wang Jian had fooled him, but in the end it passed through the baptism of Singles' Day, which proved its reliability. Xiaomi's Lei Jun also places great hopes on big data: on the one hand, data is growing geometrically; on the other, storage and bandwidth are a huge cost, and if no value is extracted from the data, that cost alone could sink a company.

6. Hadoop is the most popular big data technology today. What made Hadoop popular in the first place, and what were its design advantages at the time?

To see where Hadoop started, we must mention Google's head start. More than 10 years ago, Google published three papers on its distributed systems: GFS, MapReduce, and BigTable. They described formidable systems, but nobody outside Google could see the code, and many people in industry itched to clone them from the papers' ideas. Doug Cutting, the author of Apache Lucene and Nutch, was one of them. He later joined Yahoo, which set up a team to build the system; that is where Hadoop started and grew into a major platform. As Yahoo's engineers moved on to Facebook, Google, and elsewhere, big data companies such as Cloudera and Hortonworks were founded, bringing Hadoop practice to companies all over Silicon Valley. Google has not stood still either: it has since produced a new troika, Pregel, Caffeine, and Dremel, which in turn set off new rounds of open-source cloning.

Why is Hadoop well suited to big data? First, it scales out very well: the system's capacity can be increased simply by adding nodes. A key idea is to move computation to the data rather than move the data, because moving data carries a huge cost in network bandwidth. Second, it is designed to run on cheap commodity machines (and disks); although individual parts fail with some probability, fault tolerance and redundancy at the system level achieve high reliability. It is also very flexible, handling many kinds of data: binary, document-style, record-style; supporting structured, semi-structured, and unstructured (so-called schema-less) data is also part of its appeal for on-demand computing.
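The "reliable system from unreliable parts" point can be made concrete with a small, hedged calculation (the 1% failure rate is an assumed illustrative number, and the independence assumption is idealized; real failures can be correlated):

```python
# Illustration of system-level reliability from redundancy: if a single
# commodity disk loses a block with probability p over some period, and
# the system keeps r independent replicas, the block is lost only if
# all r copies fail.

def loss_probability(p, replicas=3):
    """Probability of losing a block, assuming independent failures."""
    return p ** replicas

p_single = 0.01                      # assumed 1% per-disk loss rate
print(loss_probability(p_single))    # roughly 1e-06: four orders of
                                     # magnitude safer than a single copy
```

This is exactly why three replicas of cheap disks can beat one expensive disk: redundancy turns a linear cost into an exponential gain in reliability.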

7. What are the companies and products around Hadoop?

When people say Hadoop today, they rarely mean just Hadoop; they mean its ecosystem. It contains a great many interacting components covering I/O, processing, applications, configuration, and workflow, and in real work, once several components start interacting, the headache of maintenance is only beginning. Briefly: the Hadoop core has three parts, HDFS, MapReduce, and Common. Around it sit the NoSQL stores Cassandra and HBase; Hive, a data warehouse developed by Facebook; Pig, a dataflow language developed by Yahoo; Mahout, a machine learning library; Oozie, a workflow manager; and ZooKeeper, which plays an important role in leader election for many distributed systems.

8. Can you explain how Hadoop works in a way that ordinary people can understand?

Let's start with HDFS, Hadoop's distributed file system, which achieves genuinely strong fault tolerance. Following the principle of locality, it is optimized for contiguous storage: in short, data is allocated in large blocks and read sequentially. If you were designing your own distributed file system, how would you keep data accessible after a machine dies? First, you need a master to serve as the directory lookup (the NameNode); the data nodes then store the data in blocks. For backup, two copies of the same block must not sit on the same machine, otherwise when that machine dies your backup dies with it. HDFS uses rack awareness: it places the first replica on one node, then copies replicas to nodes on a different rack (possibly even a different data center), so that if one node fails the data can be fetched from the other rack. Within a rack the internal network is very fast, and only if that machine also fails must the data be fetched remotely. That is one approach. There is now also an approach based on erasure codes, originally used for fault tolerance in communications, which saves space while still achieving fault tolerance; look it up if you are interested.
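As a toy sketch (not HDFS's actual code; node and rack names are made up), the default placement policy can be written down in a few lines: first replica on the writer's node, second on a node in a different rack, third on another node in that second rack.

```python
# Toy sketch of rack-aware replica placement, in the spirit of the
# HDFS default policy described above.

def place_replicas(writer, nodes_by_rack):
    """Pick 3 replica nodes. `nodes_by_rack` maps rack -> list of nodes."""
    writer_rack = next(r for r, ns in nodes_by_rack.items() if writer in ns)
    first = writer                                   # local: no network cost
    # second replica: a node on a *different* rack, to survive rack failure
    other_rack = next(r for r in nodes_by_rack if r != writer_rack)
    second = nodes_by_rack[other_rack][0]
    # third replica: same remote rack, different node (fast intra-rack copy)
    third = next(n for n in nodes_by_rack[other_rack] if n != second)
    return [first, second, third]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", racks))   # ['n1', 'n3', 'n4']
```

The design trades one cross-rack transfer (expensive) for rack-failure tolerance, while keeping the third copy cheap to write within the remote rack.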

Next, MapReduce, which is a programming paradigm. Its idea is to split a batch-processing task into two phases. The map phase turns the data into (key, value) pairs. Then comes a step called shuffle, which sorts the pairs and delivers all pairs with the same key to the same reducer. On the reducer, since all values for a key are guaranteed to be together, you can aggregate directly, compute sums and so on, and finally write the result back to HDFS. As a developer, all you write are the map and reduce functions; the sorting and shuffle network transfer in the middle, and the fault-tolerance handling, are done for you by the framework.
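The flow above can be sketched in-process with the classic word-count example (in real Hadoop these phases run distributed across machines; this single-machine version only illustrates the data flow):

```python
from collections import defaultdict

# Minimal in-process sketch of map -> shuffle -> reduce, using word
# count, the canonical MapReduce example.

def map_phase(line):
    # emit (key, value) pairs: one (word, 1) per word
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # route every pair with the same key to the same "reducer"
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # all values for this key are guaranteed to be together: just aggregate
    return key, sum(values)

lines = ["hello big data", "hello hadoop"]
pairs = [p for line in lines for p in map_phase(line)]
result = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(result)   # {'hello': 2, 'big': 1, 'data': 1, 'hadoop': 1}
```

Note that the developer-visible surface is exactly the two small functions; the grouping in `shuffle` is what the framework's sort and network transfer accomplish at scale.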

9. Does the MapReduce model itself have problems?

First, writing so much low-level code is inefficient for developers. Second, everything must be forced into the two operations, map and reduce, which is awkward in itself and cannot express every situation.

10. Where did Spark come from? What are Spark's design advantages over Hadoop MapReduce?

Spark appeared precisely to solve the problems above. On its origin: it came out of Berkeley's AMPLab, with a paper published at HotCloud in 2010; it is a successful model of moving from academia to industry, and it attracted capital from the top VC Andreessen Horowitz. In 2013 its leaders (including Berkeley's department chair and one of MIT's youngest assistant professors) left AMPLab to found Databricks, and numerous Hadoop veterans came on board. Spark is written in the functional language Scala, and in short it is a framework for in-memory computing, covering iterative computation, DAG computation, and streaming. MapReduce was often mocked for its inefficiency, but Spark gave everyone a fresh look: Reynold Xin, a core Spark developer, has said that Spark can be up to 100 times faster than Hadoop, with algorithm implementations one tenth or even one hundredth the code. In last year's Sort Benchmark, Spark sorted 100 TB in 23 minutes, breaking the previous world record held by Hadoop.
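To show why the code can be so much shorter, here is a toy sketch of the chained-transformation style Spark popularized (an assumed mini-API, not real PySpark; real Spark also evaluates lazily and distributes the data, while this toy is eager and in-memory on one machine):

```python
from functools import reduce as _reduce

# Toy RDD-like class: transformations chain fluently and the working
# set stays in memory, instead of a separate map/reduce job per step.

class MiniRDD:
    def __init__(self, data):
        self.data = list(data)          # held in memory, reusable

    def map(self, fn):
        return MiniRDD(fn(x) for x in self.data)

    def filter(self, pred):
        return MiniRDD(x for x in self.data if pred(x))

    def reduce(self, fn):
        return _reduce(fn, self.data)

numbers = MiniRDD(range(1, 6))
total = (numbers.map(lambda x: x * x)       # 1, 4, 9, 16, 25
                .filter(lambda x: x % 2 == 1)   # 1, 9, 25
                .reduce(lambda a, b: a + b))
print(total)   # 35
```

In classic MapReduce, each of those three steps could mean another pass through HDFS; expressing them as one in-memory chain is where much of the speedup for iterative workloads comes from.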

11. If you want to work in big data, can you recommend some effective learning methods? What are the recommended books?

I do have some suggestions. First, build a solid foundation. Hadoop may be hot, but its underlying principles are the accumulation of many years of textbooks: Introduction to Algorithms, the UNIX design philosophy, database principles, computer architecture, Java design patterns, and other heavyweight books are all worth consulting. The classic Hadoop book is Hadoop: The Definitive Guide, which I have also shared on Zhihu.

Second, pick a target. If you want to be a data scientist, I recommend Coursera's data science courses, which are easy to follow, and learning basic tools like Hive and Pig. If you want to work at the application layer, you need to be familiar with Hadoop workflows, including basic tuning. If you want to do architecture, you need to be able to set up a cluster, understand the various underlying software services, understand computer bottlenecks and load management, and know some Linux performance tools. Finally, practice more; big data is ultimately about practice. You can start by following the API examples in a book, getting them to run successfully, and then accumulate: when you hit a similar problem, find the corresponding classic pattern, and then go further to real problems that perhaps nobody around you has met, which takes some inspiration and skill in asking questions online, and then make the best choice for your actual situation.

12. Cloud computing is closely related to big data technology. You have worked in Amazon's cloud computing division. Can you briefly introduce Amazon's Redshift framework?

Having worked in Amazon's cloud computing division, I know AWS fairly well. Overall its maturity is very high, and a great many startups are built on it, such as Netflix, Pinterest, and Coursera. Amazon keeps innovating: every year it holds the re:Invent conference to promote new cloud products and share success stories. To mention a few services: S3 is simple object storage; DynamoDB is a NoSQL complement to relational databases; Glacier archives cold data; Elastic MapReduce packages MapReduce directly as a computing service; EC2 is the basic virtual host; and Data Pipeline provides a graphical interface for chaining tasks together.

Redshift uses a massively parallel processing (MPP) architecture and is a very convenient data warehouse solution with a SQL interface, connecting seamlessly with the other cloud services. Its biggest feature is speed: performance is very good from the terabyte to the petabyte level, and I use it directly in my own work. It also supports different hardware platforms; if you want it even faster, you can use SSD nodes, at the cost of smaller capacity.

13. What big data open source technologies does LinkedIn use?

LinkedIn has many data products, such as People You May Know and Jobs You May Be Interested In; the sources of your visits and even your career path can be mined. So LinkedIn also uses a lot of open-source technology, and I'll talk about the most successful one, Kafka. Kafka is a distributed message queue used for tracking, internal metrics, and data transport. Data passes through different storage systems and platforms at the front end and back end, each with its own format; without a unified log, you get a catastrophic O(m*n) data-integration complexity, and whenever one format changes, everything connected to it must change too. Kafka is the bridge in the middle: everyone agrees on one format as the transport standard, the receiving end subscribes to the data sources (topics) it wants, and the complexity becomes linear, O(m+n). For the design details, see Kafka's design documentation. Its main authors, Jay Kreps and Jun Rao, have since left to found Confluent, a company built around Kafka.
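The O(m*n) versus O(m+n) point can be illustrated with a toy in-memory log (this is only a sketch of the idea, not Kafka's actual design; the topic name and messages are invented):

```python
from collections import defaultdict

# Toy central log: instead of m producers each integrating with n
# consumers (m*n pairwise connections), everyone talks to one log with
# named topics, so each side needs exactly one integration (m + n).

class MiniLog:
    def __init__(self):
        self.topics = defaultdict(list)     # topic -> ordered messages

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, topic, offset=0):
        # each consumer tracks its own offset into the ordered log
        return self.topics[topic][offset:]

log = MiniLog()
log.publish("page_views", {"user": "alice", "page": "/home"})
log.publish("page_views", {"user": "bob", "page": "/jobs"})
print(log.consume("page_views", offset=1))   # only bob's event
```

The offset-based consume is the key decoupling trick: producers never know who reads, and a new consumer just starts reading from whatever offset it wants.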

Within LinkedIn, Hadoop is the workhorse of batch processing and is used widely across product lines, for example in the ads group: on one side, flexible querying and analysis of advertiser matching, ad prediction, and actual performance; on the other, Hadoop also supports report generation. If you want to interview with a LinkedIn back-end group, I suggest looking at the design ideas behind Hive, Pig, Azkaban (a data workflow manager), the Avro data definition format, Kafka, and Voldemort. LinkedIn has a dedicated open-source community and has built its own technical brand around it.

14. What characterizes Coursera's big data architecture compared with other Silicon Valley startups, and what are the reasons and technical directions behind those characteristics?

Coursera is a mission-driven company: rather than pursuing technology for its own sake, it serves teachers and students, solving their pain points and sharing in their success. That is its biggest difference from other technology companies. It is also still in an early stage of accumulation, and truly large-scale computing has not yet arrived; only by actively learning and adapting to change can a startup maintain rapid growth.

As a start-up, Coursera is eager to stay agile and efficient. Technically, everything is developed on AWS; you can spin up cloud services at will and run experiments. We are roughly divided into a product group, an architecture group, and a data analytics group, and I have listed the development technologies we use. Because the company is relatively young, there is no legacy-migration problem, so we boldly use Scala as the main programming language and Python for scripting and control. The product group builds the course product, making heavy use of the Play Framework and Backbone.js; the architecture group mainly maintains the underlying storage, common services, performance, and stability.

My data group has more than 10 people. Part of the team monitors, mines, and improves commercial products and core growth metrics; another part builds the data warehouse and keeps data flowing seamlessly between departments. We use many technologies, such as Scalding for writing Hadoop MapReduce programs, and some people work on the A/B testing framework and the recommendation system, trying to do impactful things with the least manpower. Beyond the open-source world we also actively use third-party products: Sumo Logic for log and error analysis, Redshift as the big data analytics platform, and Slack for internal communication. All of this is to free up productivity and focus on user experience, product development, and iteration.