How to solve problems with machine learning

Posted by trammel at 2020-03-31

With the advent of the big-data era, machine learning has become an important and powerful tool for solving problems, and a hot topic in both academia and industry, though with different emphases: academia focuses on machine learning theory, while industry focuses on using machine learning to solve practical problems. Based on Meituan's practice in machine learning, this "machine learning in action" series introduces the basic techniques, experience, and skills required to solve practical industrial problems with machine learning. This article walks through the whole process: problem modeling, preparing training data, extracting features, training the model, and optimizing the model. Several follow-up articles will cover these key steps in more depth.

The following is divided into seven chapters: 1) overview of machine learning, 2) problem modeling, 3) preparing training data, 4) extracting features, 5) training the model, 6) optimizing the model, and 7) summary.

What is machine learning?

As machine learning has spread through industry, the term has been given various meanings. In the standard definition, machine learning is a scientific discipline that deals with the construction and study of algorithms that can learn from data.

Machine learning can be divided into unsupervised learning and supervised learning. In industry, supervised learning is the more common and valuable setting, and it is what the rest of this article focuses on. As shown in the figure below, supervised machine learning involves two processes: an offline training process (blue arrows), which includes data filtering and cleaning, feature extraction, model training, and model optimization; and an online application process (green arrows), which extracts features from the data to be predicted, applies the model obtained from offline training to produce an estimate, and uses that estimate in the actual product. Of the two, offline training is the more technically challenging (much of the online prediction process can reuse work from offline training), so the following mainly covers the offline training process.

What is a model?

The model is an important concept in machine learning. In short, it is the mapping from the feature space to the output space, generally defined by the model's hypothesis function and its parameter vector w (for example, the logistic regression model h_w(x) = 1 / (1 + e^(-w^T x)), explained in detail in the chapter on model training). A model's hypothesis space is the set of all mappings obtainable over all possible values of w under a given hypothesis function. Models commonly used in industry include logistic regression (LR), gradient boosting decision trees (GBDT), support vector machines (SVM), and deep neural networks (DNN). Training a model means finding, based on the training data, the set of parameters w that optimizes a specific objective, i.e., the optimal mapping from feature space to output space. See the chapter on model training for details.

Why use machine learning to solve problems?

At present, in the era of big data, terabytes and petabytes of data are everywhere, and simple rules can hardly extract their value;

Low-cost, high-performance computing reduces the time and cost of learning from large-scale data;

Cheap large-scale storage makes it possible to keep and process large-scale data faster and at lower cost;

There are many high-value problems, so the considerable effort of solving them with machine learning yields rich returns.

What problems should machine learning be used to solve?

The target problem must be of great value, because solving a problem with machine learning comes at a certain cost;

The target problem must have a large amount of data available, so that machine learning can solve it well (compared with simple rules or manual work);

The target problem should be determined by many factors (features), so that machine learning's advantages can show (compared with simple rules or manual work);

The target problem should need continuous optimization, because machine learning can keep learning and iterating from data, continuously delivering value.

This article takes deal (group-purchase order) volume prediction as an example (that is, estimating how many orders a given deal will sell in a period of time) to introduce how to use machine learning to solve a problem. First of all:

Collect information about the problem, understand the problem, and become an expert on the problem;

Decompose the problem, simplify it, and transform it into a predictable problem.

After in-depth understanding and analysis, deal volume prediction can be decomposed into several sub-problems as follows:

A single model or multiple models? How to choose?

After the decomposition shown above, there are two possible modeling approaches for estimating deal volume: one is to estimate deal volume directly; the other is to estimate each sub-problem, e.g., build a visiting-user-count model and a purchase-visit-rate model (the rate at which users who visit the deal place an order), and then compute deal volume from the sub-problem estimates.

Different methods have different advantages and disadvantages, as follows:

Which approach to choose? 1) If the problem is very difficult to estimate directly, consider multiple models; 2) if the problem itself is very important, consider multiple models; 3) if the relationship between the sub-models is clear, multiple models can be used.

If multiple models are used, how should they be combined? Linear fusion or more complex fusion can be used, depending on the characteristics and requirements of the problem. For example, the problem in this article admits at least two kinds of fusion:
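As a minimal illustration of fusing sub-models, the decomposition above implies a multiplicative combination: deal volume is roughly the predicted visiting-user count times the predicted purchase-visit rate. The function name and numbers below are hypothetical, not from the original article.

```python
# Minimal sketch (hypothetical names and numbers): fuse the two sub-model
# estimates multiplicatively to get a deal-volume estimate.

def fuse_deal_volume(predicted_users, predicted_purchase_visit_rate):
    """deal volume ~= visiting users * purchase-visit rate."""
    return predicted_users * predicted_purchase_visit_rate

# One sub-model predicts 1200 visiting users, the other a 5% purchase-visit
# rate, giving an estimated volume of 60 orders.
volume = fuse_deal_volume(1200, 0.05)
print(volume)  # 60.0
```

More complex fusion (e.g., feeding sub-model outputs into a second-level model) follows the same pattern: sub-model estimates in, final estimate out.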

Model selection

For deal volume, we think direct estimation is very difficult, so we prefer to split it into sub-problems, i.e., the multi-model approach. This requires building a visiting-user-count model and a purchase-visit-rate model; since the machine learning workflow is similar for both, the following takes only the purchase-visit-rate model as an example. To solve the purchase-visit-rate problem, the first step is model selection, with the following considerations:

The main considerations are: 1) choose a model consistent with the business objective; 2) choose a model consistent with the training data and features.

With little training data and many high-level features, use a "complex" nonlinear model (the popular GBDT, Random Forest, etc.); with a very large amount of training data and many low-level features, use a "simple" linear model (the popular LR, Linear-SVM, etc.).

Additional considerations include: 1) whether the model is widely used in industry; 2) whether it has a relatively mature open-source toolkit (inside or outside the company); 3) whether the toolkit's data-processing capacity meets the requirements; 4) whether you understand the model's theory and have used the model to solve problems before.

When selecting a model for a practical problem, transform the business objective into the model evaluation objective, and the evaluation objective into the model optimization objective; then choose the appropriate model for the business objective. The specific relationships are as follows:

Generally speaking, estimating the target's real value (regression) is hardest, its relative order (ranking) easier, and its correct range (classification) easiest; pick the least difficult target that still meets the application's needs. For the purchase-visit-rate application we need at least the relative order, or the real value, so we can choose area under the curve (AUC) or mean absolute error (MAE) as the evaluation objective, and maximum likelihood as the model's loss function (i.e., the optimization objective). In summary, we chose the Spark implementations of GBDT and LR, mainly for the following reasons: 1) they can solve ranking or regression problems; 2) we have implemented the algorithms ourselves, use them often, and get good results; 3) they support massive data; 4) they are widely used in industry.
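To make the two evaluation objectives concrete, here is a small sketch using scikit-learn's standard metric functions on made-up labels and predictions (the numbers are hypothetical, not from the article):

```python
# Sketch: evaluating a purchase-visit-rate model with AUC (for ordering) and
# MAE (for real-value error), using scikit-learn on made-up predictions.
from sklearn.metrics import roc_auc_score, mean_absolute_error

# Ranking view: binary outcome vs. model score.
y_true = [1, 0, 1, 0]
y_score = [0.9, 0.2, 0.7, 0.4]      # hypothetical model scores
auc = roc_auc_score(y_true, y_score)
print(auc)  # 1.0: every positive is scored above every negative

# Regression view: average absolute error of the predicted rates.
rates_true = [0.10, 0.05, 0.20]
rates_pred = [0.12, 0.04, 0.18]
mae = mean_absolute_error(rates_true, rates_pred)
print(mae)  # mean of |0.02|, |0.01|, |0.02|
```

AUC only cares whether the ordering is right, while MAE penalizes the distance from the true value, which matches the ranking-vs-regression distinction above.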

After understanding the problem in depth and selecting a model for it, the next step is preparing the data. Data is the root of solving problems with machine learning; if the data is chosen incorrectly, the problem cannot be solved, so preparing training data requires special care and attention:


The distribution of the training data should be as consistent as possible with the distribution of the data to be predicted;

The distribution of the training/test sets should be as consistent as possible with the data distribution of the online prediction environment. "Distribution" here means the distribution of (x, y), not only the distribution of y;

The noise in the y data should be as small as possible; eliminate samples with noisy y as far as possible;

Sampling is not always necessary, since it may change the actual data distribution. However, if the data is too large to train on, or the positive-to-negative ratio is severely imbalanced (e.g., worse than 100:1), sampling is needed.
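One common sampling remedy for severe imbalance is randomly downsampling the negative class while keeping all positives. A minimal sketch (hypothetical data and function name; a fixed seed keeps it repeatable):

```python
# Sketch: random downsampling of the negative class when positives:negatives
# are severely imbalanced (hypothetical data; seed fixed for repeatability).
import random

def downsample_negatives(samples, labels, ratio, seed=0):
    """Keep all positives; keep at most ratio negatives per positive."""
    rng = random.Random(seed)
    pos = [(x, y) for x, y in zip(samples, labels) if y == 1]
    neg = [(x, y) for x, y in zip(samples, labels) if y == 0]
    keep = min(len(neg), ratio * len(pos))
    return pos + rng.sample(neg, keep)

data = list(range(1010))
labels = [1] * 10 + [0] * 1000          # 100:1 imbalance
balanced = downsample_negatives(data, labels, ratio=5)
print(len(balanced))  # 10 positives + 50 negatives = 60
```

Note that downsampling shifts the base rate, so if calibrated probabilities are needed, the predictions must be corrected afterwards.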

Common problems and solutions

The data distribution is inconsistent across the problem to be solved: 1) deal data can differ greatly in purchase-visit rate; for example, the influencing factors and behavior of food deals and hotel deals differ considerably and need special treatment. Either normalize the data in advance, turn the inconsistently distributed factors into features, or train separate models for each kind of deal.

The data distribution has changed: 1) using a model trained on data from half a year ago to predict current data may perform poorly, because the distribution may drift over time. Try to train on recent data to predict current data; historical data can be down-weighted in the model, or handled with transfer learning.

The y data is noisy: 1) when building a CTR model, taking items the user never saw as negative examples introduces noise; those items were not clicked simply because the user did not see them, not necessarily because the user disliked them. Simple rules can remove such noisy negatives, e.g., the skip-above idea: assuming users browse items from top to bottom, take as negatives only the unclicked items shown above a clicked item (the user saw them and skipped them).
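The skip-above rule can be sketched in a few lines: only unclicked items ranked above the last clicked position are kept as negatives, while items below it are discarded as possibly unseen. The data and function name are hypothetical.

```python
# Sketch of the skip-above idea: from a top-to-bottom impression list, keep as
# negatives only the unclicked items shown ABOVE the last clicked position;
# items below it may simply never have been seen (hypothetical data).

def skip_above_negatives(impressions, clicked):
    """impressions: item ids top-to-bottom; clicked: set of clicked ids."""
    last_click = max((i for i, it in enumerate(impressions) if it in clicked),
                     default=-1)
    return [it for it in impressions[:last_click + 1] if it not in clicked]

shown = ["a", "b", "c", "d", "e"]
negatives = skip_above_negatives(shown, clicked={"c"})
print(negatives)  # ['a', 'b'] -- 'd' and 'e' are discarded as unseen
```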

The sampling is biased and does not cover the full population: 1) in the purchase-visit-rate problem, if only single-store deals are used for training, multi-store deals cannot be estimated well; make sure the data contains both single-store and multi-store deals; 2) in a binary classification problem with no objective labels, where rules are used to obtain positive/negative examples and those rules do not cover the cases comprehensively, randomly sampled data should be labeled manually to ensure the sampled data matches the actual distribution.

Training data for the purchase-visit-rate model

Collect N months of deal data (x) and the corresponding purchase-visit rates (y);

Use the most recent N months and exclude irregular periods such as holidays (keep the distribution consistent);

Collect only deals with online time > T and number of visiting users > U (to reduce noise in y);

Consider the deal's sales life cycle (keep the distribution consistent);

Consider differences across cities, business districts, and categories (keep the distribution consistent).

After data filtering and cleaning, features must be extracted from the data, i.e., the transformation from input space to feature space (see the figure below). Linear and nonlinear models need different feature extraction: linear models demand more feature-engineering work and skill, while nonlinear models have relatively low feature-extraction requirements.

In general, features can be divided into high-level and low-level. High-level features carry general meaning; low-level features carry specific meaning. For example:

Deal A1 belongs to POI A, per-capita spend under 50, high purchase-visit rate; Deal A2 belongs to POI A, per-capita spend over 50, high purchase-visit rate; Deal B1 belongs to POI B, per-capita spend under 50, high purchase-visit rate; Deal B2 belongs to POI B, per-capita spend over 50, low purchase-visit rate.

From the above data, two features can be selected: the POI (store) or the per-capita spend. POI is low-level; per-capita spend is high-level. Learning a model yields estimates such as:

If Deal X belongs to POI A (low-level feature), its purchase-visit rate is high; if Deal X's per-capita spend is under 50 (high-level feature), its purchase-visit rate is high.

On the whole, low-level features are more targeted: each covers few samples (little data contains the feature), and the number of features (dimensions) is large. High-level features are more general: each covers many samples, and the number of features (dimensions) is not large. Predictions for long-tail samples are mainly driven by high-level features; predictions for high-frequency samples are mainly driven by low-level features.

There are many candidate high-level and low-level features for the purchase-visit rate; some are shown in the following figure:

Features for nonlinear models: 1) mainly use high-level features; because the computational complexity is large, the feature dimension should not be too high; 2) nonlinear mappings of high-level features can fit the target well.

Features for linear models: 1) the feature set should be as comprehensive as possible, with both high-level and low-level features; 2) high-level features can be converted into low-level ones to improve the model's fitting ability.

Feature normalization

After feature extraction, if the value ranges of different features differ greatly, it is better to normalize the features for better results. Common normalization methods:

Rescaling: normalize to [0, 1] or [-1, 1], e.g. x' = (x - min(x)) / (max(x) - min(x));

Standardization: x' = (x - μ) / σ, where μ is the mean of the distribution of x and σ its standard deviation;

Scaling to unit length: normalize to a unit-length vector, x' = x / ‖x‖.
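The three methods above can be sketched in a few lines of NumPy on a toy feature vector (the values are hypothetical):

```python
# Sketch of the three normalization methods from the text, using NumPy on a
# toy feature vector (hypothetical values).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# Rescaling to [0, 1]: x' = (x - min) / (max - min)
rescaled = (x - x.min()) / (x.max() - x.min())

# Standardization: x' = (x - mean) / std
standardized = (x - x.mean()) / x.std()

# Scaling to unit length: x' = x / ||x||
unit_length = x / np.linalg.norm(x)

print(rescaled)      # spans exactly [0, 1]
print(standardized)  # zero mean, unit variance
print(unit_length)   # L2 norm is 1
```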

Feature selection

After feature extraction and normalization, if there are too many features, so that the model cannot be trained or easily overfits, feature selection is needed to pick out the valuable features.

Filter: assuming the feature subsets' effects on the model prediction are independent, select a feature subset and analyze its relationship with the data labels; if there is a correlation, the subset is considered effective. Many measures exist for the relationship between feature subsets and labels, such as chi-squared and information gain.

Wrapper: add a candidate feature subset to the original feature set, train the model, and compare the effect before and after adding the subset. If the effect improves, the subset is considered effective; otherwise, ineffective.

Embedded: combine feature selection with model training, e.g., add an L1 or L2 norm penalty to the loss function.
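As an illustration of the embedded approach, L1-regularized logistic regression in scikit-learn drives the weights of uninformative features toward zero. The data below is synthetic, and the exact sparsity pattern depends on the regularization strength C:

```python
# Sketch: embedded feature selection via L1-regularized logistic regression
# in scikit-learn -- weights of uninformative features are pushed toward zero
# (synthetic data; exact sparsity depends on C and the data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # only features 0 and 1 matter

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print(clf.coef_.round(2))  # coefficients of features 2-4 should be near zero
```

The features whose learned coefficients are (near) zero can simply be dropped, which is exactly feature selection happening inside training.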

After feature extraction and processing, model training can begin. The following takes the simple and commonly used logistic regression model (LR) as an example. There are m training samples (x, y), where x is the feature vector, y the label, and w the model's parameter vector, i.e., the object to be learned during training. So-called model training means choosing a hypothesis function and a loss function and, based on the training data (x, y), continuously adjusting w to optimize the loss function; the resulting w is the learning result and yields the final model.

Model function

1) Hypothesis function, i.e., assume a functional relationship between x and y: h_w(x) = 1 / (1 + e^(-w^T x)). 2) Loss function: based on the hypothesis function, build the model's loss function (optimization objective); in LR this is usually the maximum likelihood of (x, y), i.e., maximize L(w) = Π_i h_w(x_i)^(y_i) (1 - h_w(x_i))^(1 - y_i).

Optimization algorithm

Coordinate descent over w: each iteration fixes all other dimensions and searches along a single dimension j for the optimal descent direction (schematic below), i.e., w_j ← w_j − α ∂ℓ(w)/∂w_j, updating one coordinate at a time.
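The coordinate-wise update can be sketched on a tiny synthetic data set. This is a simplified illustration, not the article's implementation: each pass takes a plain gradient step per coordinate (with the negative log-likelihood as the loss) rather than an exact line search.

```python
# Toy sketch of coordinate descent for logistic regression: each pass updates
# one dimension of w at a time while holding the others fixed (synthetic data;
# a gradient step per coordinate stands in for an exact line search).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def coordinate_descent_lr(X, y, passes=200, lr=0.5):
    w = [0.0] * len(X[0])
    for _ in range(passes):
        for j in range(len(w)):              # optimize one coordinate at a time
            grad_j = sum(
                (sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - yi) * x[j]
                for x, yi in zip(X, y)
            ) / len(X)
            w[j] -= lr * grad_j
    return w

# Tiny separable data set: label is 1 when the first feature is positive.
X = [[1.0, 0.2], [2.0, -0.1], [-1.5, 0.3], [-0.5, -0.2]]
y = [1, 1, 0, 0]
w = coordinate_descent_lr(X, y)
print(w[0] > 0)  # True: the learned weight points the right way
```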

After the data filtering and cleaning, feature design and selection, and model training described above, we obtain a model. But what if its performance is poor? What should we do? First, reflect on whether the target is predictable at all and whether there are bugs in the data and features. Then analyze whether the model is underfitting or overfitting, and optimize the data, features, and model accordingly.

Underfitting & Overfitting

So-called underfitting means the model has not learned the internal relationships in the data; as shown on the left of the figure below, the resulting classification surface cannot separate the X and O data well. The underlying reason is that the model's hypothesis space is too small or biased. So-called overfitting means the model over-fits the internal relationships of the training data; as shown on the right of the figure below, the resulting classification surface separates the X and O data "too well", while the true classification surface may not look like that, so the model performs poorly on non-training data. The deeper reason is the tension between a huge model hypothesis space and sparse data.

In practice, you can judge whether the current model is underfitting or overfitting from its performance on the training and test sets, as follows:
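The rule of thumb is: poor on both sets means underfitting; good on training but much worse on testing means overfitting. A minimal sketch (the scores and thresholds are hypothetical; real cut-offs depend on the task):

```python
# Sketch: diagnosing under/overfitting from train vs. test scores
# (hypothetical thresholds; real cut-offs depend on the task).

def diagnose(train_score, test_score, good=0.85, gap=0.10):
    if train_score < good:
        return "underfitting"           # poor even on the training set
    if train_score - test_score > gap:
        return "overfitting"            # good on train, much worse on test
    return "ok"

print(diagnose(0.70, 0.68))  # underfitting
print(diagnose(0.99, 0.75))  # overfitting
print(diagnose(0.90, 0.88))  # ok
```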

How to solve the problem of underfitting and overfitting?

To sum up, solving problems with machine learning involves the key steps of problem modeling, preparing training data, extracting features, training the model, and optimizing the model:

Understand the business, decompose the business objectives, and plan a predictable roadmap for the model.

Data: the y data should be as real and objective as possible; the distribution of the training/test sets should be as consistent as possible with that of the online application environment.

Features: use domain knowledge to extract and select features; design different features for different types of models.

Model: select different models for different business objectives, data, and features; if the model does not meet expectations, check for bugs in the data, features, model, and other processing steps; consider model underfitting and overfitting, and optimize accordingly.