Cold Start of Intent Recognition Based on Few-Shot Learning

Posted by santillano at 2020-02-29

Authors: Li Binghua, Geng Ruiying, Li Yongbin, Sun Jian

Affiliation: AliMe Beijing team, Alibaba Intelligent Service Business Unit

With the development of deep learning and natural language processing, many companies are building human-computer dialogue systems in the hope that people and machines can interact through natural language. Alibaba AliMe's team in Beijing has created an intelligent dialogue development platform, Dialog Studio, which lets third-party developers build task-oriented dialogues for their own business scenarios; one of its key functions is intent classification.

When platform users create a new dialogue task, there is usually no large body of annotated data: each intent often has only a few, or at most a dozen, samples. How, then, can an intent classification model be built from so few samples? Facing this cold-start problem, we propose to use few-shot learning to solve intent recognition on the dialogue platform.

For systematic background and the latest developments in few-shot learning, please refer to our previous review. This paper describes our work in detail: we first summarize previous work and propose a three-level Encoder-Induction-Relation framework for few-shot learning; we then integrate capsule networks and dynamic routing to propose the Induction Network, which achieves state-of-the-art results on two few-shot text classification datasets.

Humans are very good at recognizing a new kind of object from very few samples; for example, a child needs only a few pictures in a book to know what a "zebra" or a "rhinoceros" is. Inspired by this fast learning ability, we want a model trained on a large number of categories, each with only a small amount of data, to classify correctly, and to learn new categories quickly from only a few samples. This is the problem few-shot learning aims to solve.

Few-shot learning is an application of meta-learning to supervised learning. In the training stage, the dataset is decomposed by category into different meta-tasks, so that the model learns to generalize as the categories change. In the testing stage, facing new categories with only a small amount of data each, classification can be completed without changing the trained model.

Formally, a few-shot training set contains a large number of categories, each with a small number of samples. During training, C categories are randomly selected from the training set, and K samples of each category (C × K samples in total) form a meta-task used as the model's support set; then a batch of further samples from these C categories is taken as the model's prediction targets (the query set, or batch set). This task is called the C-way K-shot problem.

During training, each episode samples a different meta-task, i.e., a different combination of categories. This mechanism makes the model learn what is common across meta-tasks, such as how to extract important features and compare sample similarity, and forget the domain-specific parts of each meta-task. A model trained this way can also classify well on new, previously unseen meta-tasks. See Algorithm 1 for details.
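The episodic sampling described above can be sketched in a few lines of Python (a minimal illustration; the function and variable names are ours, not from the paper):

```python
import random


def sample_episode(dataset, c_way, k_shot, n_query):
    """Build one C-way K-shot meta-task (episode).

    `dataset` maps each class label to its list of samples.
    Returns a support set of C*K samples and a query set of
    C*n_query samples drawn from the same C classes.
    """
    classes = random.sample(list(dataset), c_way)
    support, query = [], []
    for label in classes:
        picked = random.sample(dataset[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query
```

Each call produces a fresh category combination, which is what forces the model to learn task-general comparison skills rather than category-specific features.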

Most popular few-shot learning methods are metric-based: a small number of samples is used to compute a representation for each category, and then some metric is used to compute the final classification result. The relevant existing methods are briefly introduced below.

As shown in Figure 1, the Siamese network [1] trains a twin network in a supervised way and then reuses the features extracted by the network for one-/few-shot learning. The network has two branches. During training, sample pairs are constructed by combination and fed into the network; at the top layer, the distance between the pair determines whether the two samples belong to the same class, producing the corresponding probability distribution. In the prediction stage, the twin network processes every pair formed by the test sample and a support-set sample, and the final prediction is the support-set category with the highest probability.

▲ Figure 1. Siamese network

Compared with the twin network, the Matching network [2], shown in Figure 2, builds separate encoders for the support set and the batch set. The final classifier's output is a weighted sum of predictions over the support-set samples for each query. The network can produce labels for unseen categories without changing the model; the paper also proposes matching nets based on memory and attention, which make rapid learning possible.

In addition, the paper makes the whole task follow a principle of traditional machine learning: training and testing should be carried out under the same conditions. It proposes that during training the network should repeatedly see only a small number of samples of each class, keeping the training process consistent with testing. Subsequent papers train and test in this episodic way as well.

▲ Figure 2. Matching network

The Prototypical network [3] is based on the idea that each category has a prototype representation: the prototype of a category is the mean of its support set in the embedding space. The classification problem then becomes nearest-neighbor search in the embedding space.

As shown in Figure 3, c1, c2 and c3 are the mean centers (prototypes) of three categories. After embedding a test sample x, its category is obtained by computing its distance to the three centers.
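The prototype-and-nearest-neighbor idea can be sketched as follows (an illustrative NumPy sketch, not the authors' code; embeddings are assumed to be precomputed by some encoder):

```python
import numpy as np


def prototype_classify(support_emb, support_labels, query_emb):
    """Prototypical-network style classification: each class prototype is
    the mean of its support embeddings; a query is assigned to the class
    whose prototype is nearest in squared Euclidean distance."""
    labels = np.array(support_labels)
    classes = sorted(set(support_labels))
    protos = np.stack([support_emb[labels == c].mean(axis=0) for c in classes])
    # squared Euclidean distance from every query to every prototype
    dists = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return [classes[i] for i in dists.argmin(axis=1)]
```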

▲ Figure 3. Prototype network

The paper interprets this approach as mixture density estimation for exponential-family distributions under a Bregman divergence, and uses more categories in training than in testing: training with 20-way episodes and testing 5-way improves accuracy by 2.5 percentage points compared with training 5-way.

The network structures introduced above all use a fixed metric for the final distance computation, such as cosine similarity or Euclidean distance; under such a model structure, all learning takes place in the sample embedding stage.

The Relation network [4] argues that the metric itself is an important part of the network and needs to be modeled. Instead of a single fixed distance metric, it trains a network (such as a CNN) to learn the metric. The loss also changes: since the relation score is closer to a regression target than to a 0/1 classification, MSE is used instead of cross-entropy.

▲ Figure 4. Relation networks

Looking back at the above methods, Table 1 shows that when representing a new category, existing methods simply sum (Relation Net) or average (Prototypical Net) the sample vectors. Because of the diversity of natural language, only part of each expression of a class relates to the class itself, while the rest varies greatly with each speaker's language habits, so much of the key information is lost in the noise generated by the different expressions of the same class.

For example, in the telecom-operator domain, the intent "change plan" can be expressed simply and clearly: "I want to change my plan", or in a very complicated way: "I want to change my plan next month, that is, cancel the plan I didn't need and change it to a cheaper one...".

If we simply sum up the different utterances, information irrelevant to classification accumulates and degrades the classification result.

▲ Table 1. Comparison of metric-based methods

Unlike supervised learning with abundant samples, the noise problem is more pronounced in few-shot learning. With abundant supervised data, the proportions of key information and noise within a category differ greatly, so the model easily distinguishes what is noise (e.g., certain words or n-grams) from what is effective information (e.g., business keywords or sentence patterns). Few-shot learning, by contrast, has only a handful of samples, so a simple mechanism struggles to capture this distinction. The step of explicitly modeling the class representation is therefore very meaningful; the implementation details are described below.

A better learning method should therefore be able to model and induce category features: ignore details irrelevant to the business, and summarize the semantic representation of a category from the diverse sample-level expressions. We need to take a higher-level perspective, reconstruct the hierarchical semantic representations of the different samples in the support set, and dynamically induce category features from sample information.

In this work, we propose the Induction Network. By combining the dynamic routing algorithm with a meta-learning mechanism, we explicitly model the ability to induce a class representation from a small number of samples.

First, our team summarized the common features of metric-based methods and proposed the three-level Encoder-Induction-Relation framework shown in Figure 5. The Encoder module obtains the semantic representation of each sample and can use typical structures such as a CNN, LSTM, or Transformer; the Induction module summarizes category features from the sample semantics of the support set; and the Relation module measures the semantic relationship between a query and a category to complete the classification.

As shown in Table 1, previous work has often been devoted to learning different distance metrics while ignoring the modeling from sample representation to category representation. In natural language, because each person's language habits differ, the same category has many different expressions; if we simply sum or average them as the category representation, interference information irrelevant to classification accumulates and hurts the final result. Our work therefore explicitly models the mapping from sample representations to the category representation.

▲ Figure 5. The Encoder-Induction-Relation three-level framework

As shown in Figure 6, our model follows the three-level Encoder-Induction-Relation framework: the Encoder module uses a BiLSTM with self-attention, the Induction module uses the dynamic routing algorithm, and the Relation module uses a neural tensor network.

▲ Figure 6. Induction network framework

Encoder module

In this work, a BiLSTM with self-attention is used to model sentence-level semantics: the input is the sentence's word-vector matrix, and the output is the sentence-level semantic representation e.
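The self-attentive pooling over the BiLSTM hidden states can be sketched as below (an illustrative NumPy sketch assuming a structured self-attention formulation where an attention weight per time step pools the hidden states; the BiLSTM itself is omitted and all names are ours):

```python
import numpy as np


def self_attentive_pool(H, W1, w2):
    """Pool BiLSTM hidden states H (T x 2u) into one sentence vector:
    attention scores a = softmax(w2 . tanh(W1 H^T)), output e = a . H."""
    scores = w2 @ np.tanh(W1 @ H.T)      # one scalar score per time step, (T,)
    a = np.exp(scores - scores.max())    # numerically stable softmax
    a = a / a.sum()                      # attention weights sum to 1
    return a @ H                         # weighted sum of hidden states, (2u,)
```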

Induction module

After each sample in the support set is encoded as a sample vector, the induction module induces a class vector from them.

In this process, we treat the sample vectors in the support set as input capsules, and the output capsule, obtained after one layer of dynamic routing, as the semantic feature representation of the class.

First, a matrix transformation is applied to all samples, mapping the sample-level semantic space to the class-level semantic space. We share the same transformation matrix across all sample vectors in the support set, so the model can handle a support set of any size, i.e., any-way any-shot scenarios.

Then dynamic routing filters out irrelevant information and extracts the category features. In each iteration of dynamic routing, the coupling coefficients between the lower and upper layers are dynamically adjusted and normalized (with a softmax) so that they sum to 1.

The routing logits b_i behind the coupling coefficients are initialized to 0 before the first iteration. Given the sample prediction vectors, each candidate class vector is computed as their weighted sum.

A non-linear squash function is then applied to ensure that the norm of each class vector does not exceed 1.

The last step of each iteration adjusts the coupling strengths by "routing by agreement": if the dot product between the generated candidate class vector and a sample prediction vector is large, the coupling between them is increased; otherwise it is decreased.

This dynamic routing models the mapping from the sample vectors to the class vector, effectively filtering out interference information irrelevant to classification and extracting the category features. See Algorithm 2 for details.
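The induction step above can be sketched as follows (an illustrative NumPy sketch based on the description above and on [5]; variable names and the default iteration count are ours):

```python
import numpy as np


def squash(v, eps=1e-9):
    """Non-linear squashing: keeps the direction of v, bounds its norm below 1."""
    n2 = (v ** 2).sum()
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + eps)

def induce_class_vector(samples, W, n_iters=3):
    """Dynamic-routing induction for one class.

    samples: (K, d) support-set sample vectors of the class.
    W:       (d_out, d) transformation matrix shared across all samples.
    Returns the induced class vector of shape (d_out,)."""
    e_hat = np.stack([squash(W @ s) for s in samples])  # sample prediction vectors
    b = np.zeros(len(samples))                          # routing logits, init 0
    c = squash(e_hat.mean(axis=0))
    for _ in range(n_iters):
        d = np.exp(b - b.max())
        d = d / d.sum()                # coupling coefficients (softmax, sum to 1)
        c = squash(d @ e_hat)          # weighted sum, then squash
        b = b + e_hat @ c              # routing by agreement: dot-product update
    return c
```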

Relation module

Through the induction module we obtain the class vector of each category in the support set, and through the encoder module we obtain the vector of each query in the batch set. The next step is to measure the correlation between the two. The relation module is a typical neural tensor layer: a three-dimensional tensor first models the interaction between each (class vector, query vector) pair, and a fully connected layer then produces the relation score.
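The tensor interaction can be sketched as below (an illustrative NumPy sketch; the choice of ReLU activation and a sigmoid output is our assumption, not a confirmed detail of this write-up):

```python
import numpy as np


def relation_score(c, q, M, w, b):
    """Neural tensor relation module: each of the h slices of M models one
    bilinear interaction between class vector c and query vector q; a final
    dense layer plus sigmoid turns the h-dim interaction into a score in (0, 1)."""
    # v[k] = relu(c^T M[k] q), one bilinear form per tensor slice
    v = np.array([max(c @ M[k] @ q, 0.0) for k in range(M.shape[0])])
    return 1.0 / (1.0 + np.exp(-(w @ v + b)))  # sigmoid of the dense layer
```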

Objective function

We train the model with a mean-squared-error loss, regressing the relation score toward the ground-truth label: the score of a matched (class, query) pair should tend to 1, and that of a mismatched pair to 0. In each episode, given the support set S and the query set Q, the loss is the squared error summed over all (class, query) pairs.
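Written out (a reconstruction consistent with the description above; the symbol r_{iq} for the relation score between class i and query q, and the indicator notation, are our choices):

```latex
\mathcal{L}(S, Q) = \sum_{i=1}^{C} \sum_{q \in Q} \left( r_{iq} - \mathbf{1}\!\left[\, y_q = i \,\right] \right)^2
```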

We use gradient descent to update the parameters of the Encoder, Induction, and Relation modules. Once trained, our model needs no fine-tuning to recognize new categories, because the meta-training phase has already given it sufficient generalization ability, which continues to accumulate as training iterates.

We verify the model's effectiveness on two few-shot text classification datasets. All experiments are implemented in TensorFlow.

Data set

1. The ARSC dataset was proposed by Yu et al. [6] at NAACL 2018 and is drawn from Amazon multi-domain sentiment classification data. It contains review data for 23 Amazon products; for each product, three binary classification tasks are constructed by binarizing the reviews at rating thresholds of 5, 4, and 2 stars, each threshold giving one binary task. This yields 23 × 3 = 69 tasks, of which 12 tasks (4 × 3) are used as the test set and the remaining 57 as the training set.

2. The ODIC dataset comes from the online logs of Alibaba's Dialog Studio platform. Users submit many different dialogue tasks and intents to the platform, but each intent has only a few annotated samples, forming a typical few-shot learning task. The dataset contains 216 intents, of which 159 are used for training and 57 for testing.

Parameter setting

The pretrained word vectors are 300-dimensional GloVe vectors; the LSTM hidden size is 128; the number of dynamic routing iterations is set to 3; and the number of tensor slices in the relation module is h = 100. We build a 2-way 5-shot model on the ARSC dataset, and on the ODIC dataset choose C and K from {5, 10}, giving four groups of experiments.

In each episode, in addition to the K samples selected per class for the support set, we select another 20 samples per class as the query set. That is, in a 5-way 5-shot scenario, each training iteration involves 5 × 5 + 5 × 20 = 125 samples.

Experimental results

The experimental results on the ARSC and ODIC datasets are shown in Table 2 and Table 3.

▲ Table 2. Experimental results on the ARSC dataset

▲ Table 3. Experimental results on the ODIC dataset

As shown in Table 1, we place the metric-based methods into the Encoder-Induction-Relation framework. We find that previous work has often been devoted to learning different distance metrics while ignoring the modeling from sample representation to category representation.

In natural language, because each person's language habits differ, the same category has many different expressions. If we simply sum or average them as the category representation, interference information irrelevant to classification accumulates and affects the final result. Our work explicitly models the ability to summarize sample representations into a category representation, and it surpasses the previous state-of-the-art models.

Experimental analysis

We further analyze the effect of the transformation matrix and the model's influence on the encoder module.

The effect of the transformation matrix

In the 5-way 10-shot scenario, we use t-SNE to reduce dimensionality and visualize the support-set samples before and after the transformation matrix. As shown in the figure, the sample vectors of the support set are significantly more separable after the transformation, which confirms the effectiveness of the matrix transformation step in converting sample features into class features.

Query visualization

We find that the Induction Network not only produces higher-quality class vectors but also helps the encoder module learn better sample representations. Randomly selecting five test-set categories and visualizing the encoder outputs of all their samples, we find that the sample vectors learned by the Induction Network are significantly more separable than those of the Relation network. This shows that our induction and relation modules back-propagate more effective information to the encoder module, making it learn sample representations that are easier to classify.

In this work, we propose the Induction Network to solve few-shot text classification. Our model reconstructs the hierarchical semantic representations of support-set samples and dynamically induces the feature representation of each category. We combine the dynamic routing algorithm with a meta-learning framework to simulate human-like induction ability. Experimental results show that our model surpasses the current state-of-the-art models on different few-shot classification datasets.

[1] Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." ICML Deep Learning Workshop. Vol. 2. 2015. 

[2] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016. 

[3] Snell, Jake, Kevin Swersky, and Richard Zemel. "Prototypical networks for few-shot learning." Advances in Neural Information Processing Systems. 2017. 

[4] Sung, Flood, et al. "Learning to compare: Relation network for few-shot learning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. 

[5] Geng R, Li B, Li Y, et al. Few-Shot Text Classification with Induction Network[J]. arXiv preprint arXiv:1902.10482, 2019. 

[6] Yu, Mo, et al. "Diverse few-shot text classification with multiple metrics." arXiv preprint arXiv:1805.07513, 2018.