Kaggle data mining competition experience sharing

Posted by millikan at 2020-02-27

Brief introduction

Founded in 2010, Kaggle focuses on data science and hosts machine learning competitions. It is the largest data science community and data competition platform in the world. Since 2013, the author has taken part in several Kaggle competitions, winning first place in the CrowdFlower search relevance competition (1,326 teams) and third place in the Home Depot product search relevance competition (2,125 teams). He once ranked 10th in the world and first in China on the Kaggle data scientist leaderboard. He is currently a data mining engineer in Tencent's social and performance advertising department, responsible for lookalike audience expansion. This article shares his experience of taking part in data mining competitions.


Kaggle introduction

On Kaggle, companies and research institutions publish business and research problems and offer prize money to attract data scientists from around the world, solving modeling problems through crowdsourcing. Competitors in turn get access to rich real-world data, work on practical problems, compete for rankings and win prizes. Well-known technology companies such as Google, Facebook and Microsoft have all hosted data mining competitions on Kaggle. In March 2017, Google announced at its Google Cloud Next conference that it was acquiring Kaggle.

1.1 How to participate

You can enter a competition as an individual or as a team. There is generally no limit on team size, but teams must be formed before the team merger deadline. To take part, at least one valid submission must be made before the entry deadline; the simplest option is to submit the official sample submission as-is. As for teaming up, it is advisable to do data exploration and model building on your own first and to form a team late in the competition (for example, 2-3 weeks before the end), so that teaming up has the greatest effect. This is similar to model ensembling: the more the models differ, the more likely the combination is to beat any single model. Of course, a team can also be formed at the start, which makes it easier to divide the work, discuss problems and spark ideas.

Kaggle takes the fairness of its competitions very seriously. Each person may submit from only one account. Within 1-2 weeks after a competition ends, Kaggle removes cheaters who submitted from multiple accounts (cheating detection is generally run on the top 100 teams). The competition result is also deleted from the cheater's Kaggle profile page, as if the player had never taken part. In addition, teams must not share code or data with each other privately unless it is posted publicly on the forum.

In general, competitors submit only their predictions for the test set and do not need to submit code. Each person (or team) has a daily submission limit, usually 2 or 5, which is shown on the submission page.

1.2 Awards

Kaggle prize money can be substantial, and it generally goes to the top three teams. In the recently concluded second National Data Science Bowl, the total prize pool was as high as $1,000,000: first place received $500,000, and even tenth place received $25,000.

The winning teams must prepare runnable code, a README, a description of the algorithm and so on, and submit them to Kaggle within 1-2 weeks after the competition ends for eligibility review. Kaggle invites winning teams to give an interview on the Kaggle blog to share their competition story and experience. For some competitions, Kaggle or the host also invites the winning teams to a phone or video conference, where they present their solution and talk with the host's team.

1.3 Competition types

According to Kaggle's official classification, competitions fall into the following types (see Figure 1):

◆ Featured: commercial or scientific research problems, with relatively large prizes;

◆ Recruitment: the prize is a job interview opportunity;

◆ Research: scientific and academic competitions, which also carry some prize money and generally require strong domain and professional knowledge;

◆ Playground: public data sets for trying out models and algorithms;

◆ Getting Started: simple tasks for getting familiar with the platform and with competitions;

◆ InClass: used for course projects, assignments or exams.

Figure 1. Kaggle competition types

By domain: search relevance, advertising click-through rate prediction, sales forecasting, loan default prediction, cancer detection, etc.

By task objective: regression, classification (binary, multi-class, multi-label), ranking, hybrid (classification + regression), etc.

By data modality: text, speech, image, time series, etc.

By feature form: raw data, plaintext features, anonymized features (whose meaning is not disclosed), etc.

1.4 Competition process

The basic process of a data mining competition is shown in Figure 2 below, and the specific modules will be described in the next chapter.

Figure 2. Basic process of data mining competition

One point worth emphasizing is that Kaggle scores submissions on two leaderboards, the public leaderboard (public LB) and the private leaderboard (private LB). Competitors submit predictions for the whole test set. Kaggle uses one part of the test set to compute scores and rankings that are shown on the public LB in real time, giving timely feedback and a live view of the competition. The rest of the test set is used to compute the final scores and rankings, which make up the private LB and are announced only after the competition ends. How the data is split between public LB and private LB depends on the type of competition and data; it is generally split randomly, by time, or by some rule.

This process is summarized in Figure 3. Its purpose is to prevent overfitting, so that the resulting models generalize well. Without a private LB (that is, if all test data were used to compute the public LB), competitors would keep using feedback from the public LB (that is, from the test set) to tune or select models. The test set would then effectively take part in model building and tuning as a validation set, and the public LB score would no longer reflect performance on truly unseen data. The public/private split also reminds competitors that the goal of modeling is a model that performs well on unseen data, not just on known data.

Figure 3. Purpose of dividing public LB and private LB

(Figure adapted from Owen Zhang's talk [1])


Basic process of data mining competition

As Figure 2 shows, a data mining competition consists mainly of four modules: data analysis, data cleaning, feature engineering, and model training and validation. They are introduced one by one below.

2.1 Data analysis

Data analysis may involve the following aspects:

◆ Analyze the distribution of the feature variables

◇ Continuous feature: if the distribution is long-tailed and a linear model is being considered, apply a power or logarithmic transformation to the variable.

◇ Discrete feature: look at the frequency of each value. Consider merging low-frequency values into an "other" category.

◆ Analyze the distribution of the target variable

◇ Continuous target: check whether its range is large. If so, consider a logarithmic transformation and model the transformed value as the new target (the predictions then need to be transformed back). More generally, a Box-Cox transformation can be applied to continuous variables. The transformation makes the model easier to optimize and usually improves results as well.

◇ Discrete target: if the class distribution is imbalanced, consider whether over- or under-sampling is needed; if the target is imbalanced across some ID, consider stratified sampling when splitting the local training and validation sets.

◆ Analyze the pairwise distribution and correlation of variables

◇ This helps to find highly correlated and collinear features.

Exploratory analysis of the data (in some cases even just looking at samples by eye) can also inspire data cleaning and feature extraction, for example how to handle missing and abnormal values, or whether text data needs spelling correction.
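As a rough illustration of the target transformation mentioned above, the sketch below applies log(1 + y) to a long-tailed target and maps values back with the inverse. The data and the function names are made up for the example, and it assumes a non-negative target; it is not taken from any particular competition solution.

```python
import math

def log_transform_target(y):
    """Compress a long-tailed, non-negative target with log(1 + y)."""
    return [math.log1p(v) for v in y]

def inverse_log_transform(pred):
    """Map predictions made in log space back to the original scale."""
    return [math.expm1(p) for p in pred]

# Hypothetical long-tailed target, e.g. weekly sales.
y = [0.0, 3.0, 12.0, 150.0, 98000.0]
y_log = log_transform_target(y)

# A model would be trained on y_log. Here y_log itself stands in for the
# model's predictions, just to show that the inverse restores the scale.
y_back = inverse_log_transform(y_log)
```

The model is trained on `y_log`, and every prediction has to pass through `inverse_log_transform` before it is scored or submitted.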

2.2 Data cleaning

Data cleaning means processing the raw data to make subsequent feature extraction easier. Sometimes the boundary between data cleaning and feature extraction is not very clear. Common data cleaning steps include:

◆ Data joining

◇ The data provided is spread over multiple files and needs to be joined on the corresponding keys.

◆ Handling missing feature values

◇ Continuous feature: fill missing values according to the distribution. For a roughly normal distribution, use the mean, which preserves the mean of the data; for a long-tailed distribution, use the median, which avoids the influence of outliers;

◇ Discrete feature: use the mode.

◆ Text data cleaning

If a competition's data contains text, a lot of cleaning is usually needed, for example removing HTML tags, word segmentation, spelling correction, synonym replacement, stop-word removal, stemming, and unifying the format of numbers and units.
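The missing-value rules listed above can be sketched in a few lines. This is a minimal example using only the standard library; the column names and values are invented, and missing entries are assumed to be represented as `None`.

```python
import statistics

def impute_continuous(values, strategy="median"):
    """Fill None with the mean (roughly normal data) or median (long-tailed data)."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

def impute_categorical(values):
    """Fill None with the most frequent observed category (the mode)."""
    observed = [v for v in values if v is not None]
    fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]

income = [30.0, 32.0, None, 35.0, 900.0]   # long-tailed: use the median
height = [160.0, 170.0, None, 180.0]       # roughly symmetric: use the mean
city = ["A", "B", "A", None]               # discrete: use the mode

income_filled = impute_continuous(income, strategy="median")
height_filled = impute_continuous(height, strategy="mean")
city_filled = impute_categorical(city)
```

With the outlier 900.0 present, the median fill for `income` stays close to the typical values, whereas a mean fill would be pulled far upward.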

2.3 Feature engineering

There is a saying that features determine the upper limit of performance, and different models merely approach that limit in different ways or to different degrees. Good feature input is therefore very important to a model's performance: "garbage in, garbage out". Doing feature engineering well usually depends on domain knowledge and understanding of the problem, as well as on experience. In practice, feature engineering is handled case by case. Here are a few points and opinions.

2.3.1 Feature transformation

This applies mainly to long-tailed features, which need a power or logarithmic transformation so that the model (LR or DNN) can be optimized well. Note that random forest and GBDT models are insensitive to monotonic transformations of a feature, because a tree model only considers the rank order of the values when searching for split points.
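The insensitivity of tree models to monotonic transformations can be checked on a toy example. The sketch below implements a single regression stump (one split chosen by squared error, which is only a simplified stand-in for a real tree learner) and compares the partition it finds on a raw feature with the one it finds on the log of that feature. The data is invented.

```python
import math

def best_split_left_indices(x, y):
    """Return the sample indices sent left by the best single split on x.

    The split minimizes the summed squared error of the two sides,
    the criterion a regression tree applies at one node.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])

    def sse(indices):
        mean = sum(y[i] for i in indices) / len(indices)
        return sum((y[i] - mean) ** 2 for i in indices)

    best_cost, best_left = float("inf"), None
    for k in range(1, len(order)):
        if x[order[k - 1]] == x[order[k]]:
            continue  # cannot split between equal feature values
        left, right = order[:k], order[k:]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_cost, best_left = cost, set(left)
    return best_left

x = [1.0, 2.0, 5.0, 40.0, 300.0, 9000.0]   # long-tailed feature
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
log_x = [math.log(v) for v in x]

same_partition = best_split_left_indices(x, y) == best_split_left_indices(log_x, y)
```

Because the log keeps the samples in the same order, the candidate splits and their costs are identical, so `same_partition` is true. A linear model fitted on `x` and on `log_x` would, in contrast, generally give different predictions.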

2.3.2 Feature encoding

Discrete categorical features usually have to be transformed or encoded before they can be fed to a model. Common encoders are LabelEncoder and OneHotEncoder (both are interfaces in sklearn). For example, for a "gender" feature (male, female), the two methods produce the encodings {0, 1} and {[1, 0], [0, 1]} respectively.
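To make the two encodings concrete, here is a dependency-free sketch of what label encoding and one-hot encoding do. It is a hand-written illustration, not sklearn's implementation; assigning ids in sorted order of the categories is a choice made for this example.

```python
def label_encode(values):
    """Map each distinct category to an integer id, in sorted order."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Turn each category into a 0/1 vector with a single 1 at its label id."""
    codes, mapping = label_encode(values)
    width = len(mapping)
    return [[1 if col == code else 0 for col in range(width)] for code in codes]

gender = ["male", "female", "female", "male"]
labels, mapping = label_encode(gender)
one_hot = one_hot_encode(gender)
```

Label encoding yields one integer column, which imposes an arbitrary order on the categories; one-hot encoding avoids that at the cost of one column per category.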

For categorical features (ID features) with very many values (hundreds of thousands, say), direct one-hot encoding leads to a huge feature matrix and hurts model performance. They can be handled as follows:

◆ Count the frequency of each value in the samples, one-hot encode the top n values and merge the remaining ones into an "other" category, where n is tuned according to model performance;

◆ Replace the ID value with statistics of that ID (such as historical average click-through rate or historical average view rate) and use those as features. For details, see the second-place solution of the Avazu click-through rate prediction competition;

◆ Following word2vec, map each value of a categorical feature to a continuous vector, initialize the vector, and train it together with the model. After training, an embedding of each ID is obtained as a by-product. For concrete usage, see the third-place solution of the Rossmann sales forecasting competition.

For random forest and GBDT models, if a categorical feature has many values, the LabelEncoder output can be used directly as a feature.
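The first two options for high-cardinality ID features can be sketched as follows. The data is invented, and the smoothing toward a prior rate in `historical_rate` is an assumption added for this example (it keeps rarely seen IDs from getting extreme rates); it is not claimed to be what the cited solutions do.

```python
from collections import Counter, defaultdict

def keep_top_n(values, n, other="other"):
    """Keep the n most frequent categories and merge the rest into `other`."""
    top = {cat for cat, _ in Counter(values).most_common(n)}
    return [v if v in top else other for v in values]

def historical_rate(ids, clicks, prior_rate, prior_weight=1.0):
    """Smoothed click rate per id: (clicks + prior) / (impressions + weight)."""
    shown = Counter(ids)
    clicked = defaultdict(int)
    for i, c in zip(ids, clicks):
        clicked[i] += c
    return {
        i: (clicked[i] + prior_rate * prior_weight) / (shown[i] + prior_weight)
        for i in shown
    }

site_id = ["s1", "s1", "s1", "s2", "s2", "s3", "s4"]
click = [1, 0, 1, 0, 0, 1, 0]

bucketed = keep_top_n(site_id, n=2)
rates = historical_rate(site_id, click, prior_rate=sum(click) / len(click))
rate_feature = [rates[s] for s in site_id]
```

In a real competition, such statistics should be computed only from data that precedes the sample (or out of fold); computing them on the same rows they are used to predict leaks the target.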

2.4 Model training and validation

2.4.1 Model selection

Once the features are ready, the model can be trained and validated.

◆ For sparse features (such as text features and one-hot encoded ID features), linear models such as linear regression or logistic regression are the usual choice. Tree models such as random forest and GBDT are not well suited to sparse features, but can be tried after dimensionality reduction (PCA, SVD/LSA, etc.). Feeding sparse features directly into a DNN leads to a large number of network weights, which makes optimization harder; here too, consider reducing the dimensionality first, or using embeddings for ID features;

◆ For dense features, XGBoost is recommended; it is simple and easy to use;

◆ When the data contains both sparse and dense features, consider modeling the sparse features with a linear model and feeding its output, together with the dense features, into XGBoost or a DNN. See section 2.5.2 on stacking for details.

2.4.2 Model validation

For the chosen features and model, the hyperparameters usually need to be tuned to get better results. Tuning can be summarized in three steps:

1. Split the data into training and validation sets. Starting from the training and test sets provided by the competition, split the training set into a local training set and a local validation set in a way that mimics how the organizers split training and test data. The right way to split depends on the competition and the data. Common cases are:

a) Random split: for example, a random 70% of the samples form the training set and the remaining 30% the test set. In this case, KFold or StratifiedKFold can be used locally to build the training and validation sets.

b) Split by time: typical for time series data, for example the first 7 days form the training set and the last day forms the test set. In this case, the local training and validation sets must also be split in time order. A common mistake is to split randomly, which can overestimate model performance.

c) Split by some rule: in the Home Depot search relevance competition, the query sets of the training and test data only partly overlap, whereas in the similar CrowdFlower search relevance competition the training and test data share exactly the same queries. When splitting training and validation data for the Home Depot competition, this partial overlap of queries has to be reproduced. For one way of doing so, see the third-place solution of the Home Depot competition.

2. Specify the parameter space. This requires some understanding of the model's parameters and how they affect performance, so that the space is reasonable. For example, a learning rate of around 0.01 generally works for DNN or XGBoost (too large and the optimizer may overshoot the optimum, too small and convergence may be too slow). For random forest, 100-200 trees generally give good results, although some people fix the number of trees at 500 and tune only the other hyperparameters.

3. Search the parameters with some method. Commonly used methods are grid search, random search and automated methods such as hyperopt. Hyperopt uses the results of the parameter combinations evaluated so far to infer which combination is most likely to do better in the next evaluation. For an introduction to and comparison of these methods, see [2].
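The three steps above can be put together in a small, self-contained sketch: a time-ordered split (case b), a one-parameter search space, and random search. The model is a deliberately trivial one-feature ridge regression written for this example, and the data, the range of `lam` and the number of trials are all assumptions; a real competition would plug in its own model and metric.

```python
import random

def time_ordered_split(rows, valid_fraction=0.25):
    """Split rows already sorted by time: earliest part trains, latest validates."""
    cut = int(len(rows) * (1 - valid_fraction))
    return rows[:cut], rows[cut:]

def fit_ridge_slope(rows, lam):
    """One-feature ridge regression through the origin: w = sum(xy) / (sum(xx) + lam)."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, _ in rows)
    return sxy / (sxx + lam)

def mse(rows, w):
    """Mean squared error of the predictions w * x on the given rows."""
    return sum((y - w * x) ** 2 for x, y in rows) / len(rows)

def random_search(train, valid, n_trials=30, seed=0):
    """Sample lam log-uniformly from [1e-3, 1e3] and keep the best validation score."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_trials):
        lam = 10 ** rng.uniform(-3, 3)
        score = mse(valid, fit_ridge_slope(train, lam))
        history.append((score, lam))
    return min(history), history

# Toy data ordered by time: y is roughly 2 * x plus noise.
data_rng = random.Random(42)
rows = [(float(t), 2.0 * t + data_rng.gauss(0, 1)) for t in range(1, 41)]

train, valid = time_ordered_split(rows)
(best_score, best_lam), history = random_search(train, valid)
```

Keeping the full `history` rather than only the best trial also makes the intermediate models available later, for example as base models for an ensemble.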

2.4.3 Make proper use of public LB feedback

Section 2.4.2 dealt with local validation results. Submitting predictions to Kaggle also yields feedback from the public LB. If the two move in the same direction, for example the public LB score improves whenever local validation improves, then changes in local validation can be used to track the progress of the model without relying on many submissions. If they move in different directions, check whether the local training/validation split described in section 2.4.2 really matches the split between training and test sets.

In addition, public LB feedback sometimes carries useful information, and using it properly can give an advantage. In Figure 4, (a) and (b) show data with no obvious dependence on time (such as image classification), while (c) and (d) show data that changes over time (such as the time series in sales forecasting). The difference between (a) and (b) is the size of the training set relative to the public LB. In (a), the training set has far more samples than the public LB, so local validation based on the training set is more reliable. In (b), the training set is about the same size as the public LB, so public LB feedback can be combined with local validation to guide model selection. One way of combining them is to weight the two scores by sample count. Suppose the evaluation metric is accuracy, local validation has n_L samples with accuracy a_L, and the public LB has n_P samples with accuracy a_P. Then the fused index (n_L * a_L + n_P * a_P) / (n_L + n_P) can be used to select models. For (c) and (d), because the data distribution depends on time, public LB feedback is needed to select the model, especially in the case shown in (c).

Figure 4. Appropriate use of public LB feedback

(Figure adapted from Owen Zhang's talk [1])
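The sample-weighted fusion of local validation accuracy and public LB accuracy described above is a one-line formula. The sketch below applies it to two hypothetical candidate models; all numbers are invented.

```python
def fused_accuracy(n_local, acc_local, n_public, acc_public):
    """Sample-weighted average of local validation and public LB accuracy."""
    return (n_local * acc_local + n_public * acc_public) / (n_local + n_public)

# Hypothetical candidates in the situation of panel (b): local validation
# set and public LB are of comparable size.
n_local, n_public = 2000, 2000
candidates = {
    "model_a": {"acc_local": 0.810, "acc_public": 0.790},
    "model_b": {"acc_local": 0.805, "acc_public": 0.812},
}

scores = {
    name: fused_accuracy(n_local, m["acc_local"], n_public, m["acc_public"])
    for name, m in candidates.items()
}
best = max(scores, key=scores.get)
```

Here `model_a` looks better on local validation alone, but the fused index prefers `model_b`, because the public LB carries as much evidence as the local validation set.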

2.5 Model ensembling

Almost everyone who wants a top ranking in a competition needs model ensembling (teaming up is itself a kind of ensembling). Good blog posts introducing model ensembling already exist; see [3]. Here I briefly cover some common methods and personal experience.

2.5.1 Averaging and voting

Average or vote over the predictions of several models directly. Use averaging for tasks with a continuous target and voting for tasks with a discrete target.
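A minimal sketch of both operations, with invented predictions from three models. Each inner list holds one model's predictions for the same three samples.

```python
from collections import Counter

def average_predictions(model_preds):
    """Element-wise mean over models, for a continuous target."""
    return [sum(col) / len(col) for col in zip(*model_preds)]

def majority_vote(model_preds):
    """Most frequent label per sample, for a discrete target."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*model_preds)]

reg_preds = [
    [2.0, 4.0, 6.0],   # model 1
    [3.0, 4.0, 9.0],   # model 2
    [4.0, 4.0, 6.0],   # model 3
]
cls_preds = [
    ["cat", "dog", "dog"],
    ["cat", "cat", "dog"],
    ["dog", "cat", "dog"],
]

averaged = average_predictions(reg_preds)
voted = majority_vote(cls_preds)
```

With an even number of models, votes can tie; `Counter.most_common` then returns the label it encountered first, so an explicit tie-breaking rule (or averaging predicted probabilities instead) may be preferable.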

2.5.2 Stacking

Figure 5. 5-fold stacking

(Figure adapted from Jeong Yoon Lee's talk [4])

Figure 5 shows stacking with 5 folds (further stages such as stage 2 and stage 3 can of course be stacked on top). The main steps are:

1. Split the data set. Split the training data into 5 folds (if the data depends on time, split by time; for more general ways of splitting, see section 2.4.2, which is not repeated here);

2. Base model training I (left half of the first row of Figure 5). Following cross-validation, train the model on the training folds (gray in the figure) and predict on the validation fold (yellow in the figure). Finally, combine these into a prediction for the whole training set (the CV prediction, shown in yellow in the first row).

3. Base model training II (left half of the second and third rows of Figure 5). Train the model on the full training set (gray in the second row) and predict on the test set (the green, dashed part in the third row).

4. Stage 1 ensemble training I (right half of the first row of Figure 5). Use the CV prediction from step 2 as a new training set and proceed as in step 2 to obtain the CV prediction of the stage 1 ensemble.

5. Stage 1 ensemble training II (right half of the second and third rows of Figure 5). Use the CV prediction from step 2 as a new training set and the prediction from step 3 as a new test set, and proceed as in step 3 to obtain the test set prediction of the stage 1 ensemble. This is the output of stage 1 and can be submitted to Kaggle to check its performance.

Figure 5 shows only one base model, but in practice there are many, for example SVM, DNN and XGBoost. They can also be the same model with different parameters or different sample weights. Repeating steps 4 and 5 stacks stage 2, stage 3 and further models on top.
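Steps 1 to 3 and step 5 can be sketched without any library. The two base models below (a least-squares slope and a nearest-neighbour lookup) and the stage 1 model (a single blending weight chosen on the out-of-fold predictions) are toy stand-ins picked to keep the example short; the folds are contiguous and unshuffled for the same reason. Step 4, the CV prediction of the stage 1 model, is omitted because it is only needed when a further stage is stacked on top.

```python
def kfold_indices(n, k=5):
    """Step 1: yield (train_idx, valid_idx) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, valid
        start += size

# Two deliberately simple base models; each fit returns a predict function.
def fit_slope(xs, ys):
    w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: w * x

def fit_nearest(xs, ys):
    return lambda x: ys[min(range(len(xs)), key=lambda i: abs(xs[i] - x))]

def base_model_predictions(fit, x_train, y_train, x_test, k=5):
    """Steps 2 and 3: out-of-fold (CV) predictions on train, full-fit predictions on test."""
    oof = [None] * len(x_train)
    for tr, va in kfold_indices(len(x_train), k):
        model = fit([x_train[i] for i in tr], [y_train[i] for i in tr])
        for i in va:
            oof[i] = model(x_train[i])
    full_model = fit(x_train, y_train)
    return oof, [full_model(x) for x in x_test]

def fit_stage1_weight(oof_a, oof_b, y, steps=20):
    """Stage 1: pick w in [0, 1] so that w*a + (1-w)*b best fits y on the CV predictions."""
    def mse(w):
        return sum((w * a + (1 - w) * b - t) ** 2
                   for a, b, t in zip(oof_a, oof_b, y)) / len(y)
    return min((i / steps for i in range(steps + 1)), key=mse)

x_train = [float(i) for i in range(1, 21)]
y_train = [2.0 * x + (1.0 if i % 2 else -1.0) for i, x in enumerate(x_train)]
x_test = [2.5, 10.5, 19.5]

oof_a, test_a = base_model_predictions(fit_slope, x_train, y_train, x_test)
oof_b, test_b = base_model_predictions(fit_nearest, x_train, y_train, x_test)

# Step 5: train stage 1 on the CV predictions, apply it to the test predictions.
w = fit_stage1_weight(oof_a, oof_b, y_train)
stage1_test_pred = [w * a + (1 - w) * b for a, b in zip(test_a, test_b)]
```

The important property is that every training-set prediction fed to stage 1 comes from a model that never saw that sample, so the stage 1 model is trained on predictions of the same quality it will receive for the test set.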

2.5.3 Blending

Blending is similar to stacking, except that a separate part of the data (for example 20%) is held out for training the stage X model.

2.5.4 Bagging Ensemble Selection

Bagging Ensemble Selection [5] is the method I used in the CrowdFlower search relevance competition. Its main advantage is that it can build the ensemble by optimizing any metric, whether differentiable (such as logloss) or non-differentiable (such as accuracy, AUC or quadratic weighted kappa). It is a forward greedy algorithm and can therefore overfit. In [5], the authors propose a series of measures (such as bagging) to reduce this risk and stabilize the performance of the ensemble. The approach needs hundreds of base models. For this reason, in the CrowdFlower competition I kept every intermediate model produced during parameter tuning, together with its predictions, as a base model. The benefit is that tuning not only finds the best single model, but all intermediate models can also take part in the ensemble and improve the result further.
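The core of the method, greedy forward selection with replacement on a validation metric, can be sketched as below. This is a simplified illustration, not the procedure of [5] in full and not the code used in the competition: the bagging over subsets of the model library is left out, the stopping rule (stop when no candidate improves the score) is an assumption, and the three-model library is invented.

```python
def accuracy(pred_labels, y):
    """Fraction of predicted labels that match the true labels."""
    return sum(p == t for p, t in zip(pred_labels, y)) / len(y)

def ensemble_score(members, library, y):
    """Average the members' probabilities, threshold at 0.5, score with accuracy."""
    avg = [sum(library[m][i] for m in members) / len(members) for i in range(len(y))]
    return accuracy([1 if p >= 0.5 else 0 for p in avg], y)

def greedy_ensemble_selection(library, y, max_rounds=10):
    """Forward selection with replacement: each round adds the model that helps most."""
    members, best = [], float("-inf")
    for _ in range(max_rounds):
        score, name = max(
            (ensemble_score(members + [m], library, y), m) for m in sorted(library)
        )
        if score <= best:
            break  # no candidate improves the validation metric
        members.append(name)
        best = score
    return members, best

# Validation labels and each model's predicted probability of class 1.
y_valid = [1, 0, 1, 1, 0, 0]
library = {
    "m1": [0.9, 0.6, 0.7, 0.4, 0.2, 0.3],
    "m2": [0.7, 0.2, 0.4, 0.9, 0.6, 0.1],
    "m3": [0.4, 0.3, 0.8, 0.7, 0.1, 0.6],
}

members, best_score = greedy_ensemble_selection(library, y_valid)
```

Because the selection only ever calls `ensemble_score`, the metric can be swapped for any other function of predictions and labels, differentiable or not. Selecting on a small validation set like this one overfits easily, which is the risk the bagging step in [5] is meant to reduce.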

2.6 Automation framework

As the sections above show, a data mining competition involves many modules. A reasonably automated framework makes the whole process much more efficient. Early in the CrowdFlower competition, I refactored the code of the whole project into three modules: feature engineering, model tuning and validation, and model ensembling. This greatly improved the efficiency of trying new features and models, and was one of the factors that helped me achieve my final ranking. The code is open source on GitHub and is currently the most-starred Kaggle competition solution there.

It mainly includes the following parts:

1. Modular feature engineering

a) A unified interface, so that new features can be generated with only a small amount of code;

b) Individual features are combined into a feature matrix automatically.

2. Automated model tuning and validation

a) Customizable splitting of the training and validation sets;

b) Tuning of a given model within a specified parameter space by grid search, hyperopt and similar methods, recording the best parameters and the corresponding performance.

3. Automated model ensembling

a) For the specified base models, ensembles are generated by a chosen method (such as averaging, stacking or blending).
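To give a feel for what a unified feature interface might look like, here is a small hypothetical sketch. It is not the code of the GitHub project mentioned above; the registry, the decorator and the three features are all invented for illustration. Every feature is a function from one sample to one number, and the matrix builder needs no changes when a new feature is added.

```python
FEATURE_REGISTRY = {}

def feature(name):
    """Register a feature function; each takes one sample (a dict) and returns a number."""
    def register(func):
        FEATURE_REGISTRY[name] = func
        return func
    return register

@feature("query_len")
def query_len(sample):
    return len(sample["query"].split())

@feature("title_len")
def title_len(sample):
    return len(sample["title"].split())

@feature("word_match_ratio")
def word_match_ratio(sample):
    query_words = set(sample["query"].lower().split())
    title_words = set(sample["title"].lower().split())
    return len(query_words & title_words) / max(len(query_words), 1)

def build_feature_matrix(samples, names=None):
    """Combine the selected features (default: all registered) into one matrix."""
    names = list(names or FEATURE_REGISTRY)
    return names, [[FEATURE_REGISTRY[n](s) for n in names] for s in samples]

samples = [
    {"query": "led light bulb", "title": "Philips LED Bulb 60W"},
    {"query": "garden hose", "title": "Cordless Drill Kit"},
]
names, matrix = build_feature_matrix(samples)
```

Adding a feature then amounts to writing one decorated function, and experiments with feature subsets only need a different `names` list.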


Review of Kaggle competition solutions

By now, Kaggle has hosted a wide variety of competitions, covering application scenarios such as image classification, sales forecasting, search relevance and click-through rate prediction. In many of them, the winners open-source their solutions and are happy to share their experience and tips. These solutions and write-ups are very good reference material for beginners and veterans alike. Below, based on my own background and interests, I take a brief inventory of open-source competition solutions in different scenarios and summarize the commonly used methods and tools, in the hope of sparking ideas.

3.1 Image classification

3.1.1 Task name

National Data Science Bowl

3.1.2 Task details

Following the great success of deep learning in computer vision, Kaggle hosts more and more vision competitions. They have attracted many competitors to explore deep learning methods for image problems in specialized domains. NDSB is one of the earlier image classification competitions. Its goal is to build a model that automatically classifies the large set of binary images of marine plankton provided.

3.1.3 Winning solutions

● 1st place: Cyclic Pooling + Rolling Feature Maps + Unsupervised and Semi-Supervised Approaches. Notably, the lead member of this team also won first place in the Galaxy Zoo galaxy image classification competition and is the developer of the FFT-based fast convolution in Theano. He used Theano in both competitions, to great effect. Solution link:

● 2nd place: Deep CNN designing theory + VGG-like model + RReLU. This team was also very strong, including Xudong Cao, a former MSRA researcher, as well as Tianqi Chen, Naiyan Wang, Bing Xu and others. Tianqi and his teammates used cxxnet (the predecessor of MXNet), which was also promoted through this competition. Tianqi's other well-known work is XGBoost, which is now used by top-10 teams in almost every Kaggle competition. Solution link:

● 17th place: Realtime data augmentation + BN + PReLU. Solution link:

3.1.4 common tools

▲ Theano:

▲ Keras:

▲ Cuda-convnet2:

▲ Caffe:


▲ MXNet:


3.2 Sales forecasting

3.2.1 Task name

Walmart Recruiting - Store Sales Forecasting

3.2.2 Task details

Walmart provides weekly sales records from 2010-02-05 to 2012-11-01 as training data, and competitors build models to forecast sales from 2012-11-02 to 2013-07-26. The features provided include store ID, department ID, CPI, temperature, gasoline price, unemployment rate, holidays, etc.

3.2.3 Winning solutions

● 1st place: Time series forecasting method: stlf + arima + ets. Based mainly on statistical time series methods, making heavy use of Rob J Hyndman's forecast R package. Solution link:

● 2nd place: Time series forecasting + ML: arima + RF + LR + PCR. A hybrid of statistical time series methods and traditional machine learning methods. Solution link:

● 16th place: Feature engineering + GBM. Solution link:

3.2.4 common tools

▲ R forecast package:

▲ R GBM package:

3.3 Search relevance

3.3.1 Task name

CrowdFlower Search Results Relevance

3.3.2 Task details

The competition asks competitors to use tens of thousands of (query, title, description) tuples as training samples and build a model that predicts the relevance score {1, 2, 3, 4}. Raw text is provided for query, title and description. The evaluation metric is quadratic weighted kappa, which makes the task differ from ordinary regression and classification tasks.

3.3.3 Winning solutions

● 1st place: Data Cleaning + Feature Engineering + Base Model + Ensemble. After cleaning the raw text, a large number of features were extracted, such as attribute features, distance features and group-based statistical features; different models (regression, classification, ranking, etc.) were trained with different objective functions; and finally the predictions of the different models were combined by model ensembling. Solution link:

● 2nd place: A Similar Workflow

● 3rd place: A Similar Workflow

3.3.4 common tools


▲ Gensim:

▲ XGBoost:

▲ RGF:

3.4 Click-through rate prediction I

3.4.1 Task name

Criteo Display Advertising Challenge

3.4.2 Task details

A classic click-through rate prediction competition. It provides 7 days of training data and 1 day of test data, with 13 integer features and 26 categorical features, all anonymized, so the meaning of individual features is unknown.

3.4.3 Winning solutions

● 1st place: GBDT feature encoding + FFM. A team from National Taiwan University; drawing on Facebook's approach [6], they used GBDT to encode features and then fed the encoded features together with the other features into a Field-aware Factorization Machine (FFM). Solution link:

● 3rd place: Quadratic Feature Generation + FTRL. Traditional feature engineering combined with an FTRL linear model. Solution link:

● 4th place: Feature Engineering + Sparse DNN

3.4.4 common tools

▲ Vowpal Wabbit:

▲ XGBoost:


3.5 Click-through rate prediction II

3.5.1 Task name

Avazu Click-Through Rate Prediction

3.5.2 Task details

A click-through rate prediction competition. It provides 10 days of training data and 1 day of test data, with time, banner position, site, app and device features plus 8 anonymized categorical features.

3.5.3 Winning solutions

● 1st place: Feature Engineering + FFM + Ensemble. Again a team from National Taiwan University. In this competition they used FFM extensively and built their ensemble from FFM models only. Solution link:

● 2nd place: Feature Engineering + GBDT feature encoding + FFM + Blending. The solution of Owen Zhang (who long held first place on the Kaggle rankings). His feature engineering is well worth studying. Solution link:

3.5.4 common tools


▲ XGBoost:


References

[1] Tips for Data Science Competitions

[2] Algorithms for Hyper-Parameter Optimization

[3] MLWave blog: Kaggle Ensembling Guide

[4] Jeong Yoon Lee: Winning Data Science Competitions

[5] Ensemble Selection from Libraries of Models

[6] Practical Lessons from Predicting Clicks on Ads at Facebook



As a former student, I am very grateful for platforms like Kaggle, which offer highly challenging tasks in different fields and rich, varied data. They let a data mining novice like me, with only a smattering of theory, practice on real problems and business data and improve my data mining skills; with a bit of luck, you might even win a prize. If you are eager to have a go, pick a suitable task and start your data mining journey. By the way, this year our department is hosting the "Tencent Social Ads College Algorithm Contest", whose task is to predict the conversion rate of mobile app ads. It offers plenty of real data and generous prizes, and the top 20 teams also get fast-track access to campus recruitment. Would you like to give it a try?


About the author

Chen Chenglong holds a PhD from Sun Yat-sen University (2015), where he studied image tampering detection and published two papers in IEEE TIP, a top journal in the image processing field. He won first and third place respectively in the Kaggle CrowdFlower and Home Depot search relevance competitions, and once ranked 10th in the world and first in China on the Kaggle data scientist leaderboard. He is currently a data mining engineer in Tencent's social and performance advertising department, responsible for lookalike audience expansion.
