Professor Zhang Min, Tsinghua University: Research Progress in Personalized Recommendation (Interpretability, Robustness and Fairness)

Posted by santillano at 2020-03-18

[Introduction] This is a record of the talk given by Zhang Min, associate professor at Tsinghua University, at the ByteTech 2019 Machine Intelligence Frontier Forum. ByteTech 2019 was jointly sponsored by the Chinese Association for Artificial Intelligence, ByteDance and Tsinghua University, and co-sponsored by the Institute for Data Science of Tsinghua University.

Hello everyone. Today I would like to share with you the research progress of personalized recommendation. This talk mainly covers three keywords: interpretability, robustness and fairness. We started working on explainable recommendation around 2013, and then gradually began to study robustness and fairness. Why are these three words important?

Figure: interpretability, robustness and fairness are three important challenges for AI at present.

You are probably familiar with the development of artificial intelligence. Since the AI boom began, people have felt that AI is becoming more and more powerful. But many scholars engaged in AI research are now thinking more about where AI faces its biggest bottlenecks. At present, we have basically reached a consensus: the two core challenges in the field of artificial intelligence are interpretability and robustness.

In addition to interpretability and robustness, over the past two or three years international research has paid more and more attention to a third issue: fairness. In the course of our research, we have found that interpretability, robustness and fairness are not completely separate. So today's talk will discuss these three points one by one, but also try to show the relationships between them. Because these three topics are very broad, we will discuss them in one specific field: personalized recommendation, which our research group has been studying for years.

The first is interpretability. What is interpretability? Put simply, in addition to knowing how to do a task, we also want to know why. This "why" can be viewed from two angles. The first is the user's perspective: we not only want to show users the recommendation results, such as the recommended products on an online shopping website, but also want to tell them why a product is recommended. Another example is news recommendation: why does the system push these particular items to a user out of the hundreds of news stories published today? We need a reason, and we need to explain it to users. This is the interpretability of the results.

The second is the system's perspective, that is, the explanation needed by system developers. In our laboratory, students sometimes tell me that a result is very good or very bad, and they may dread my follow-up question: why is the result like this? Why does our method work better than others? If it does not, what is the problem? In particular, which factors, features or data are causing the problem, and can it be improved? This is the interpretability of the system. In current AI research (especially deep learning), there is much discussion of interpretable machine learning. Many people say that the weakness of deep learning is that we don't know how it arrives at its results, that is, it lacks system-level interpretability.

Let's first talk about interpretability for users. Later, when we discuss robustness, we will come back to the interpretability of the system.

Recommendation systems are now widely used. You have surely used one, whether in a news feed or in online shopping. The reasons that recommendation systems give today are very simple. One of the most common is that users who bought a certain product also bought something else, followed by "you may also be interested in...". In fact, the reason recommendation systems don't give more convincing reasons is not that they don't want to, but that they can't. Why? Let's start with the recommendation algorithm. Here I will briefly introduce the basic concepts so that those without a background in recommendation systems can follow.

A brief introduction to how recommendation systems work

In recommendation systems, collaborative filtering is a very common and effective method. In collaborative filtering, we often see a matrix like the one in the figure below, which records whether each user has bought each product. The system generates candidate recommendations based on what else was bought by people who bought the same products. But the system does not look up the matrix directly to produce results. Instead, the matrix is factorized into two parts: one for users and one for products. The latent variables of the two parts share the same dimensions, which connects users and products by mapping them into the same space. This is the commonly used latent factor (matrix factorization) model. In fact, the real reason a product is recommended to you may be that your preference vector and the product's vector match very well on, say, the third, tenth and twelfth dimensions. But if the system tells the user, "I recommend this product to you because your 12th dimension matches the product's 12th dimension", the user will be puzzled.

Figure: a factorization model can be used to implement a recommendation system based on collaborative filtering.
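To make the factorization idea concrete, here is a minimal sketch in plain Python. It is not the system described in the talk: the toy purchase data, the number of latent dimensions and all hyperparameters are assumptions chosen only for illustration. It learns one latent vector per user and per product by stochastic gradient descent, then shows why the per-dimension "reason" for a recommendation is just a list of numbers with no human-readable meaning.

```python
import random

def score(user_vec, item_vec):
    """Predicted preference: dot product of the two latent vectors."""
    return sum(p * q for p, q in zip(user_vec, item_vec))

def factorize(observations, n_users, n_items, k=3,
              epochs=2000, lr=0.05, reg=0.01, seed=0):
    """Learn k-dimensional latent vectors for users and items with plain SGD."""
    rng = random.Random(seed)
    users = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    items = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, bought in observations:
            err = bought - score(users[u], items[i])
            for d in range(k):
                pu, qi = users[u][d], items[i][d]
                # Gradient step on squared error with L2 regularization.
                users[u][d] += lr * (err * qi - reg * pu)
                items[i][d] += lr * (err * pu - reg * qi)
    return users, items

def per_dimension_match(user_vec, item_vec):
    """Contribution of each latent dimension to the predicted score."""
    return [p * q for p, q in zip(user_vec, item_vec)]

# Hypothetical toy data: (user, item, 1.0 if bought else 0.0).
observations = [
    (0, 0, 1.0), (0, 1, 1.0), (0, 3, 0.0),
    (1, 0, 1.0), (1, 1, 1.0), (1, 2, 1.0),
    (2, 2, 1.0), (2, 3, 1.0), (2, 0, 0.0),
    (3, 2, 1.0), (3, 3, 1.0), (3, 1, 0.0),
]
users, items = factorize(observations, n_users=4, n_items=4)

# Score for a pair that was never observed (user 0, item 2).
candidate = score(users[0], items[2])
# The only "explanation" the model itself can offer: one number per dimension.
contributions = per_dimension_match(users[0], items[2])
```

The `contributions` list is exactly the kind of "your 12th dimension matches the product's 12th dimension" explanation mentioned above: numerically correct, but meaningless to a user.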

So we wanted to know whether there is a way to give both accurate recommendations and reliable explanations, and people began to work in this direction. We proposed the concept of explainable recommendation around 2014 (as shown below). Since then, many people have done related research, and our EFM model has become a baseline that is often used for comparison in explainable recommendation. The idea at the time was that, although the latent variables in the middle cannot be explained, if we can find a bridge in the middle, namely explicit features such as product features, then the recommendation results can be explained. For example, when the system recommends a mobile phone, it can explain that the phone takes good photos and looks attractive, which may suit a fashion-conscious young woman. If the system finds that another user cares about other features, it can recommend other suitable phones. For example, if you are buying a senior-friendly smartphone for your parents and the system recommends one with a large screen, large fonts, simple operation and long standby time, you are likely to be convinced. With this method, we were able to raise users' click-through rate from 3% to 4%, which is a large improvement.
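The "feature as a bridge" idea can be sketched as follows. This is a heavily simplified illustration, not the EFM model itself: EFM learns its user-feature and item-feature values from data, whereas here they are hand-written, hypothetical numbers, and the function names are invented for this example. The sketch scores an item only on the features a user cares about most, and phrases the explanation in terms of the feature on which the item does best.

```python
def top_cared_features(user_attention, top_k=2):
    """The top_k features this user pays most attention to."""
    return sorted(user_attention, key=user_attention.get, reverse=True)[:top_k]

def match_score(user_attention, item_quality, top_k=2):
    """Average of attention * quality over the user's most cared-about features."""
    cared = top_cared_features(user_attention, top_k)
    return sum(user_attention[f] * item_quality.get(f, 0.0) for f in cared) / len(cared)

def explain(user_attention, item_quality, top_k=2):
    """Explain a recommendation through an explicit, human-readable feature."""
    cared = top_cared_features(user_attention, top_k)
    best = max(cared, key=lambda f: item_quality.get(f, 0.0))
    return f"Recommended because it does well on '{best}', which you seem to care about."

# Hypothetical values in [0, 1]; a real system would estimate them from reviews.
parent_shopper = {"screen size": 0.9, "standby time": 0.8, "camera": 0.2}
senior_phone = {"screen size": 0.85, "standby time": 0.95, "camera": 0.3}
camera_phone = {"screen size": 0.5, "standby time": 0.4, "camera": 0.95}

ranked = sorted(
    [("senior_phone", senior_phone), ("camera_phone", camera_phone)],
    key=lambda pair: match_score(parent_shopper, pair[1]),
    reverse=True,
)
message = explain(parent_shopper, ranked[0][1])
```

Because the bridge is an explicit feature rather than a latent dimension, the same quantity that drives the ranking can also be shown to the user as the reason.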

People may ask, "Maybe we don't need a reason at all?" So we ran experiments on real data from an online shopping website to analyze whether explanations have any effect. The first group was shown the recommendation results directly, without explanation. The second group was shown the same results together with a simple explanation of the form "users who viewed this product also viewed...", which raised the click-through rate from 3.20% to 3.22%. The third group was given our new explanation, which provides more specific information such as a large screen or long standby time, and we found that the click-through rate rose further to 4.34%. So experiments with real users tell us that as long as we give a reasonable explanation, the accuracy of recommendation (measured here by click-through rate) improves greatly. Sometimes people need someone else to give them a reason to do something.
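As a quick worked check of the three click-through rates quoted above, the relative lift over the no-explanation group can be computed as below. The group labels are my own shorthand; only the three percentages come from the talk.

```python
# Click-through rates (in percent) for the three experimental groups.
ctr_percent = {
    "no explanation": 3.20,
    "also-viewed explanation": 3.22,
    "feature-level explanation": 4.34,
}

baseline = ctr_percent["no explanation"]

# Relative lift over the baseline, in percent of the baseline CTR.
relative_lift = {
    name: (value - baseline) / baseline * 100.0
    for name, value in ctr_percent.items()
}
```

The "also-viewed" explanation corresponds to a relative lift of well under 1%, while the feature-level explanation corresponds to a relative lift of roughly 36%, which is why the second kind of explanation is described as making a large difference.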


Figure: the principle of the explainable recommendation algorithm EFM.

However, the above method also has problems. First, features are not easy to identify for everything. For example, it is hard to describe the attributes of news, and therefore hard to process news in a similar way. In addition, because people express themselves very freely, natural-language expressions are extremely diverse. For example, someone may write in a review (comment), "This thing has no obvious shortcomings, but it just doesn't feel easy to use." In such cases it is difficult to find a complete and accurate feature description quickly. So we thought that perhaps we could make the granularity a little coarser, rather than pursuing interpretability at the fine-grained feature level. This gave us more ideas. Here is a review of a purchase on Amazon. You may notice that, in addition to a user's review and rating of the product, other users also rate that review: the score indicates whether other users find the review useful. If we can find such useful reviews for all products, we can present the most useful ones to users while they browse and shop. The recommendation system then not only affects the final purchase, but also helps users in the early and middle stages of decision-making when they are choosing products.

Figure: a user's review can itself be rated by other users.

Therefore, we did some work from this angle. First, can we discover the usefulness of reviews automatically? There is an important principle on the Internet called the "lazy user": don't expect users to take the initiative to do too many things. Very few users are willing to rate other people's reviews, so the data is sparse. Can our system learn usefulness by itself? Second, can the study of usefulness be combined with the final recommendation algorithm, rather than only judging whether reviews are useful without the recommendation system benefiting from it?

So we designed the model in the figure below, a deep learning model based on an attention network. While producing the final rating predictions and recommendations, the model uses the attention mechanism in the middle to pick out the more useful and reliable reviews. This work was published at the WWW 2018 conference. The model works very well: compared with state-of-the-art methods, including classic recommendation algorithms and deep-learning-based algorithms, it achieves statistically significant improvements. In addition, whether or not the model uses attention makes a large difference, as shown in the figures below.

Figure: a review-level explainable recommendation algorithm based on a neural attention network.

Figure: adding the attention-based explainable recommendation method significantly improves the performance of the model.
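The role attention plays here can be sketched in a few lines. This is only an illustration of attention-weighted pooling, under the assumption that each review has already been turned into a vector and given a usefulness score; in the actual model those scores are produced by a learned network, while the numbers below are made up.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_reviews(review_vectors, usefulness_scores):
    """Pool review vectors into one item representation, weighted by usefulness."""
    weights = softmax(usefulness_scores)
    dim = len(review_vectors[0])
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, review_vectors))
        for d in range(dim)
    ]
    return pooled, weights

# Hypothetical review embeddings and usefulness scores for one product.
review_vectors = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.4]]
usefulness_scores = [0.2, 2.0, -1.0]

item_vector, weights = aggregate_reviews(review_vectors, usefulness_scores)

# The same weights serve as a review-level explanation:
# show the user the review that the model relied on most.
most_useful = max(range(len(weights)), key=weights.__getitem__)
```

The design point is that one set of weights serves two purposes: it shapes the item representation used for prediction, and it ranks reviews for display, so usefulness is learned jointly with the recommendation task rather than judged separately.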

How can we tell whether this model is effective for users? We compared it with several common approaches. For example, most shopping websites currently rank reviews in one of the following ways:

Chronological, with the most recent reviews first;

Random order;

By content length after removing spam reviews (because longer reviews are generally believed to be more useful).

However, sorting by time or by length often performs worse than random sorting, while our proposed method performs better. It is worth noting that the large-scale usefulness votes from users, which we use as the ground truth, are themselves biased. Because of the Matthew effect, a review that has already been rated useful is more likely to be considered useful by others, while reviews that are actually useful but never get a chance to be shown remain buried forever. This bias is one of the situations we call "unfairness". So we carried out a more objective third-party evaluation and found that this bias does exist, and that the useful reviews found by the algorithm are more reliable and effective than those selected by users' votes in the system.

On interpretability, there are still more issues to discuss, such as whether explanations should be produced with a generative or a discriminative approach. Our view is that either is fine. How should the effectiveness of an explanation be evaluated? We think a feasible idea is to tie it to users' behavior. In addition, how should we deal with the bias that the recommendation algorithm may introduce? In particular, does the explanation itself bring about unfairness? This question can easily turn into a philosophical one.

Robustness Issues

The second problem to discuss is robustness, which has many aspects. In personalized recommendation, one specific manifestation of the robustness problem is the challenge of severely missing data. We all know that we can make recommendations based on a user's history, but how do you make recommendations for a new user with no history? This is called the cold-start problem.

In recommendation systems there are two families of methods: those based on collaborative filtering and those based on content matching. We can combine them and use historical data to learn the weights assigned to the two, for example 0.8 and 0.2. At cold start, the collaborative filtering part contributes 0, but the content-based part with weight 0.2 can still be used. Obviously, though, the fusion weights should differ for different users and different products. So we proposed an idea (shown in the figure below): instead of choosing the same weights for everyone, we use a unified framework that learns different weights for different situations automatically with an attention network. If you are interested, please take a look at our CIKM 2018 paper, "Attention-based Adaptive Model to Unify Warm and Cold Starts Recommendation". The results are very good: it handles the cold-start problem effectively and also helps overall performance.

Figure: a unified framework can solve the cold start recommendation problem.
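The difference between fixed and adaptive fusion weights can be illustrated with a small sketch. The fixed weights 0.8 and 0.2 come from the example in the talk. The adaptive rule below is a hand-written stand-in for the learned attention network, not the method in the paper: the function names and the `half_point` constant are assumptions made up for this illustration.

```python
def fixed_fusion(cf_score, content_score, w_cf=0.8, w_content=0.2):
    """Same weights for everyone, as in the 0.8 / 0.2 example."""
    return w_cf * cf_score + w_content * content_score

def adaptive_weights(n_interactions, half_point=5.0):
    """Hypothetical rule: trust collaborative filtering more as history grows.

    In the real model these weights come from an attention network;
    this closed-form rule only mimics the intended behavior.
    """
    w_cf = n_interactions / (n_interactions + half_point)
    return w_cf, 1.0 - w_cf

def adaptive_fusion(cf_score, content_score, n_interactions):
    """Fuse the two scores with weights that depend on available history."""
    w_cf, w_content = adaptive_weights(n_interactions)
    return w_cf * cf_score + w_content * content_score

# Cold start: no history, so collaborative filtering contributes a score of 0.
cold_fixed = fixed_fusion(cf_score=0.0, content_score=0.9)
cold_adaptive = adaptive_fusion(cf_score=0.0, content_score=0.9, n_interactions=0)

# Warm user: plenty of history, so collaborative filtering dominates.
warm_weights = adaptive_weights(n_interactions=45)
```

With fixed weights, a strong content-based signal is scaled down to 0.2 of its value for a brand-new user even though it is the only signal available. With adaptive weights the content-based part takes over entirely at cold start and hands control back to collaborative filtering as history accumulates.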

What is more interesting is that when my students showed me the figure below, I thought it was a beautiful piece of work, because it also reflects the interpretability of the system. Why did the model just mentioned produce good results? By looking at the attention weights that were learned, you can see that the upper left corner corresponds to new items (such as a new product or a new message) and the lower right corner to new users. The figure covers situations ranging from sufficient information to a severe shortage of information (new item plus new user). So when we solve the robustness problem, we can also improve interpretability at the system level.

Figure: the approach improves both the robustness of the recommendation system and the interpretability of the system.

Fairness Issues

Finally, we will briefly discuss fairness, an issue that deserves attention. For example, a 2018 study found that on two public datasets, MovieLens and Last.fm, recommendations work better for men than for women, and better for the elderly and for young people under 18 than for people between 18 and 50. This is not a bias built into the system on purpose, and it may be related to data volume and user habits, but the unfairness does exist. There is also unfairness in the recommended items and related information. Examples are the unfairness in reviews that we discussed earlier, and the tendency to recommend popular items, which is unfair to unpopular ones. Sometimes fairness to users and fairness to items conflict. For example, we may want to increase the diversity of recommendations, but some research shows that increasing diversity improves fairness to the recommended products while reducing fairness to users.

Figure: recommendation quality differs across groups of people, which reduces fairness to users and to recommended items.
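A simple way to surface this kind of user-side unfairness is to break an evaluation metric down by user group instead of reporting a single average. The sketch below assumes a test log of (group, hit) pairs, where "hit" means a held-out item appeared in the user's recommendation list. The group labels and data are hypothetical and are not taken from the study mentioned above.

```python
from collections import defaultdict

def hit_rate_by_group(records):
    """Fraction of test cases with a hit, computed separately per user group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, hit in records:
        totals[group] += 1
        hits[group] += int(hit)
    return {group: hits[group] / totals[group] for group in totals}

def largest_gap(rates):
    """Difference between the best-served and worst-served group."""
    return max(rates.values()) - min(rates.values())

# Hypothetical evaluation log: (user group, whether the recommendation hit).
records = [
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

rates = hit_rate_by_group(records)
gap = largest_gap(rates)
overall = sum(int(hit) for _, hit in records) / len(records)
```

In this toy log the overall hit rate looks moderate, yet one group is served three times as well as the other, which is exactly the kind of difference a single aggregate number hides.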

In the last minute, I will share an interesting phenomenon we found regarding unfairness in user behavior. When browsing a news feed, people often complain that an article is of very poor quality and ask why it was recommended to them. Yet when we look at click-through rates, we are surprised to find that the overall click-through rate of low-quality news (the blue line in the left figure below) is always higher than that of high-quality news (the red line). We even find that some users know before clicking that the news is of poor quality, but out of curiosity they think, "I know it's not good, but I'll click anyway." After clicking, they find that the quality really is poor. From the recommendation system's side this is very puzzling: the user clicked, so why doesn't the user like it? This kind of massive click bias is therefore also a form of unfairness, one that is unfair to high-quality news.

Figure: the click through rate of low-quality news is always higher than that of high-quality news.

How can this be solved? To some extent it can be addressed at the algorithm level. Our idea is not to look only at clicks, and not to use click-through rate alone as the evaluation metric, but to look at user satisfaction. Although users do not state their satisfaction explicitly, it can be inferred automatically from clues in their behavior. We published the related work at SIGIR 2018 (see the figure below for the paper and main methods).

Figure: the related SIGIR 2018 paper and its main methods.
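To illustrate why the choice of metric matters, the sketch below compares plain click-through rate with a satisfaction-aware variant. The talk does not say which behavioral clues are used, so dwell time after a click and the 30-second threshold are purely assumptions for this example, as are the function names and the data.

```python
def click_through_rate(impressions):
    """Share of impressions that were clicked."""
    clicks = sum(1 for clicked, _ in impressions if clicked)
    return clicks / len(impressions)

def satisfied_click_rate(impressions, min_dwell_seconds=30.0):
    """Share of impressions with a click followed by a long enough read.

    Dwell time is a hypothetical proxy for satisfaction, chosen for illustration.
    """
    satisfied = sum(
        1 for clicked, dwell in impressions
        if clicked and dwell >= min_dwell_seconds
    )
    return satisfied / len(impressions)

# Hypothetical logs: (clicked, dwell time in seconds) per impression.
clickbait = [(True, 5.0)] * 6 + [(False, 0.0)] * 4
in_depth = [(True, 90.0)] * 4 + [(False, 0.0)] * 6

ctr = {
    "clickbait": click_through_rate(clickbait),
    "in_depth": click_through_rate(in_depth),
}
satisfaction = {
    "clickbait": satisfied_click_rate(clickbait),
    "in_depth": satisfied_click_rate(in_depth),
}
```

Under plain click-through rate the low-quality article wins, while under the satisfaction-aware metric the ordering flips. A system optimized only for clicks would therefore keep promoting the article that users regret opening.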

That is what I wanted to share briefly with you today. I hope you will pay attention to three very important factors: interpretability, robustness and fairness. These three factors do not exist independently but interact with each other. If we want better AI systems, we must do further work on all three. Truly intelligent AI technology still has a long way to go, and there are many challenges and opportunities for us to discover and face.

Editor: Wen Jing

Checked by: Hong Shuyue