cloud computing, distributed architecture, k8s, big data, machine learning, search, recommendation, advertising

Posted by fierce at 2020-04-14

A search engine's result-ranking model can be abstracted as computing P(d|q) for each doc, that is, the probability of doc d given query q.

By Bayes' theorem, P(d|q) = P(q|d) * P(d) / P(q). P(q) is the same for every doc, so the final ranking score reduces to P(q|d) * P(d). Here P(q|d) is the degree of match between the query and the doc, and P(d) is the score of the doc itself, which can be defined from many angles and depends on many factors.
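To make the ranking rule concrete, here is a minimal sketch that orders docs by P(q|d) * P(d), computed in log space. The doc names and probability values are made up purely for illustration; in a real system they would come from a relevance model and a doc-quality model.

```python
import math

def bayes_rank_score(p_q_given_d: float, p_d: float) -> float:
    """Log of P(q|d) * P(d). P(q) is dropped because it is the same for every doc."""
    return math.log(p_q_given_d) + math.log(p_d)

# Hypothetical probabilities for three docs under one query.
docs = {
    "doc_a": {"p_q_given_d": 0.30, "p_d": 0.10},
    "doc_b": {"p_q_given_d": 0.20, "p_d": 0.40},
    "doc_c": {"p_q_given_d": 0.05, "p_d": 0.50},
}

# Sort doc names from highest to lowest score.
ranked = sorted(docs, key=lambda name: bayes_rank_score(**docs[name]), reverse=True)
print(ranked)
```

Note that doc_a matches the query best, yet doc_b ranks first because its own score P(d) is much higher.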

Lucene's classic ranking algorithm, which was the default until BM25 replaced it in Lucene 6.0, is the TF-IDF model in vector space:

 score(q,d) = coord(q,d) · queryNorm(q) · ∑_{t in q} ( tf(t in d) · idf(t)^2 · t.getBoost() · norm(t,d) )
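The formula above can be sketched in a few lines of Python. This is not Lucene's code, only an illustration under stated assumptions: tf is taken as the square root of the term frequency, idf as 1 + ln(N / (df + 1)), and norm as a pure length norm 1 / sqrt(doc length), ignoring index-time boosts. The function name `classic_score` and the sample data are hypothetical.

```python
import math
from collections import Counter

def classic_score(query_terms, doc_terms, doc_freq, num_docs, term_boost=None):
    """Sketch of coord * queryNorm * sum(tf * idf^2 * boost * norm)."""
    if not query_terms or not doc_terms:
        return 0.0
    term_boost = term_boost or {}
    counts = Counter(doc_terms)
    # idf(t): rarer terms weigh more (assumed form).
    idf = {t: 1.0 + math.log(num_docs / (doc_freq.get(t, 0) + 1)) for t in query_terms}
    # queryNorm(q): keeps scores comparable across queries; it does not change the order.
    sum_sq = sum((idf[t] * term_boost.get(t, 1.0)) ** 2 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_sq) if sum_sq > 0 else 0.0
    # coord(q,d): fraction of query terms that the doc contains.
    matched = [t for t in query_terms if counts[t] > 0]
    coord = len(matched) / len(query_terms)
    # norm(t,d): shorter docs get a higher weight (length norm only, an assumption).
    norm = 1.0 / math.sqrt(len(doc_terms))
    total = sum(
        math.sqrt(counts[t]) * idf[t] ** 2 * term_boost.get(t, 1.0) * norm
        for t in matched
    )
    return coord * query_norm * total

# Hypothetical three-doc corpus.
corpus = {
    "d1": ["red", "shoes", "running"],
    "d2": ["red", "car"],
    "d3": ["blue", "sky"],
}
doc_freq = {"red": 2, "shoes": 1}
query = ["red", "shoes"]
scores = {name: classic_score(query, terms, doc_freq, len(corpus)) for name, terms in corpus.items()}
print(sorted(scores, key=scores.get, reverse=True))
```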

In fact, this long formula boils down to two things: the relevance match between the query and the doc, and boost (boost can also be applied to the doc). The weakness of the Lucene model is its low accuracy. For e-commerce search ranking rules, you can combine the weighted parameters of all of a doc's dimensions (popularity, cheating, seller factors, item factors) into the doc's boost, which then affects the doc's final score.
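As a rough illustration of folding several doc dimensions into one boost, the sketch below uses a weighted sum plus a cheating penalty. The weights, the factor names, and the multiplicative way the boost is applied are all assumptions chosen for the example, not a rule from Lucene or from any particular e-commerce system.

```python
# Hypothetical weights for each doc-level dimension; real values come from tuning.
FACTOR_WEIGHTS = {"popularity": 0.5, "seller": 0.3, "item": 0.2}

def doc_boost(factors, cheating_penalty=0.0):
    """Combine per-dimension scores in [0, 1] into one multiplicative boost."""
    weighted = sum(w * factors.get(name, 0.0) for name, w in FACTOR_WEIGHTS.items())
    # A doc flagged for cheating has its boost scaled down.
    return (1.0 + weighted) * (1.0 - cheating_penalty)

def final_score(relevance, factors, cheating_penalty=0.0):
    """Relevance from the text match, multiplied by the doc's own boost."""
    return relevance * doc_boost(factors, cheating_penalty)

factors = {"popularity": 0.8, "seller": 0.9, "item": 0.7}
honest = final_score(2.0, factors)
cheater = final_score(2.0, factors, cheating_penalty=0.5)
print(honest, cheater)
```

With identical relevance and factor scores, the doc carrying a cheating penalty ends up with a lower final score.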

There is also the related BM25 model, which is based on a probabilistic model and has been available since Lucene 4. The BM25 formula is not discussed in detail here; you can look it up if you are interested.
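Without going through the derivation, the commonly cited form of BM25 can be sketched as follows. The function name and the sample data are hypothetical, and the idf variant and the values k1 = 1.2, b = 0.75 are the usual textbook choices; an actual engine may differ in details.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Sum over matching query terms of idf * saturated, length-normalized tf."""
    counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        freq = counts[term]
        if freq == 0:
            continue  # terms missing from the doc contribute nothing
        df = doc_freq.get(term, 0)
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        # Docs longer than average are penalized; b controls how strongly.
        length_ratio = doc_len / avg_doc_len
        score += idf * freq * (k1 + 1.0) / (freq + k1 * (1.0 - b + b * length_ratio))
    return score

# Hypothetical data: the same two matches in a short doc and in a long doc.
bm25_doc_freq = {"red": 2, "shoes": 2}
short_doc = ["red", "shoes"]
long_doc = ["red", "shoes"] + ["filler"] * 8
short_score = bm25_score(["red", "shoes"], short_doc, bm25_doc_freq, 10, 6.0)
long_score = bm25_score(["red", "shoes"], long_doc, bm25_doc_freq, 10, 6.0)
print(short_score, long_score)
```

Unlike the raw TF-IDF sum, the term-frequency part saturates: repeating a term many times raises the score less and less.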

In web page ranking, besides the relevance between the query and the page, the PageRank of the page itself also matters.
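For readers who have not seen it, PageRank can be approximated with a simple power iteration. The sketch below is a textbook-style version on a made-up three-page link graph; the damping value 0.85 is the conventional choice, and the function name and graph are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {p for outs in links.values() for p in outs})
    if not pages:
        return {}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            outs = links.get(page, [])
            if outs:
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: a and b both link to c, and c links back to a.
page_ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(page_ranks, key=page_ranks.get, reverse=True))
```

The resulting value plays the role of P(d) for a web page: it does not depend on the query and can be combined with the query relevance at ranking time.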

In Solr, we can configure the eDisMax query parser's pf, qf, and bf parameters to influence boost ranking.
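As a sketch of what such a request might look like, the snippet below only builds the query URL and does not contact any server. The host, the collection name `products`, the field names `title`, `description`, `popularity`, and the boost values are all hypothetical; `qf` weights the fields that are searched, `pf` boosts docs where the whole phrase appears in a field, and `bf` adds the value of a function to the score.

```python
from urllib.parse import urlencode

def build_edismax_url(base_url, user_query):
    """Build a Solr select URL that uses the eDisMax parser (no request is sent)."""
    params = {
        "defType": "edismax",
        "q": user_query,
        "qf": "title^2.0 description",   # query fields with per-field boosts
        "pf": "title^5.0",               # phrase fields: reward exact phrase matches
        "bf": "log(sum(popularity,1))",  # boost function added to the score
    }
    return base_url + "?" + urlencode(params)

# Hypothetical local Solr instance and collection.
url = build_edismax_url("http://localhost:8983/solr/products/select", "red shoes")
print(url)
```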

Given the relevance between the query and the doc, plus the scores of the doc's own factors, how do we integrate all these ranking factors in practice? One way is to evaluate the factors manually and add them up according to hand-set weights, refining the combination strategy over time. Once the number of factors grows large enough, tuning these weights by hand becomes difficult. The other approach is to use a machine learning method, learning to rank (LTR): train a model on users' search and click behavior, and let the model adjust the factor weights automatically (LTR will be covered in detail in later chapters).
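To give a feel for how weights can be learned from clicks, here is a deliberately tiny pairwise sketch: a linear model trained with a logistic loss so that the clicked doc scores higher than the skipped one. It is a simplified stand-in for real LTR algorithms, and the two features (relevance, popularity) and the click pairs are invented for the example.

```python
import math

def score(weights, features):
    """Linear ranking score: weighted sum of the factor values."""
    return sum(w * x for w, x in zip(weights, features))

def train_pairwise(pairs, num_features, lr=0.1, epochs=200):
    """Each pair is (features of clicked doc, features of skipped doc)."""
    weights = [0.0] * num_features
    for _ in range(epochs):
        for clicked, skipped in pairs:
            diff = [c - s for c, s in zip(clicked, skipped)]
            margin = score(weights, diff)
            # Gradient step on log(1 + exp(-margin)): push the margin upward,
            # more strongly when the pair is currently ordered badly.
            grad_scale = 1.0 / (1.0 + math.exp(margin))
            weights = [w + lr * grad_scale * x for w, x in zip(weights, diff)]
    return weights

# Hypothetical click log: features are [relevance, popularity].
click_pairs = [
    ([0.9, 0.2], [0.4, 0.8]),
    ([0.8, 0.5], [0.3, 0.6]),
    ([0.7, 0.9], [0.6, 0.1]),
]
learned = train_pairwise(click_pairs, num_features=2)
print(learned)
```

The learned weights replace the hand-tuned ones: adding a new factor means adding a feature and retraining, rather than re-balancing every weight by hand.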

As you can see, search, recommendation, and advertising all involve ranking. Machine learning methods are increasingly used for the modeling, and these fields are converging.
