Whither NLP: What Is Natural Language Processing? - zmonster's Blog

Posted by santillano at 2020-03-02


This is the second article in this series on NLP. The articles so far are:

What is natural language processing

The image above is a parody sketch based on the "what is a machine" scene in the film Three Idiots, where the professor calls on Rancho. The scene mocks a rigid education system by contrasting the inflexibility of the professor and the "top student" with Rancho's quick wit. But a film is a film. We can grant that its criticism has merit, yet Rancho's behavior is one extreme staged against another in order to heighten the conflict. When learning and acquiring knowledge, rote mechanical memorization is not advisable, but neither is abandoning all rigor in favor of purely personal, intuitive impressions.

So what is natural language processing?

Natural language processing (NLP) is the field that aims to understand and use natural language by means of computer technology. Depending on the context, it is sometimes also called computational linguistics (CL) or natural language understanding (NLU).

To get a feel for the field of NLP, it is enough to grasp the following points:

NLP is mainly carried out by computer technology

At the theoretical level, NLP draws on mathematics, linguistics, cognitive science and other disciplines, but in the end it puts that theoretical knowledge to work through computer technology.

This may sound like a truism, but it is worth emphasizing: unlike the human brain, computers have their own strengths and weaknesses, and NLP inherits both from current computer technology. So do not judge NLP by how effortlessly our own brains handle language. For most of us, using language feels natural and simple; do not assume that processing it with a computer is equally simple.

NLP should understand and use natural language

"Natural language" refers to languages that evolved naturally in our world, such as English, Chinese and French. The term distinguishes them from artificial languages such as programming languages (C, Java, Python, and so on).

A programming language has a precise, fixed grammar, and every statement written in it has a unique, definite meaning. The computer therefore only needs to parse and execute it according to the grammar rules.

Natural language is different. It has relatively stable grammar rules, but those rules have exceptions and keep evolving, and together with other characteristics of language this frequently produces ambiguity. Dealing with ambiguity is a core part of NLP.

NLP tries to understand natural language, but there is no definite standard for "understanding"

Ideally, "understanding" natural language would mean understanding it the way a human brain does. But brain science has no systematic, comprehensive account of how our brains work when we use language, so that cannot serve as a standard. In fact, under the current technical framework, fully understanding natural language with computers is impossible.

Instead, we generally say a machine understands natural language if, in a specific scenario, it responds correctly to requirements we express in natural language.

Note that this definition rests on several premises:

Of course, this is only the standard that practical NLP systems follow. Some researchers do try to characterize the process and criteria of "understanding" from other perspectives, such as linguistics and brain science; that work is worth keeping an eye on.

The rest of this article discusses the following points:

Difficulties and limitations of natural language processing

We use natural language every day, so we rarely notice how hard "understanding natural language" actually is. But several characteristics of natural language itself make it very difficult for computers to understand.

The first characteristic is that natural language is ambiguous, and the ambiguity shows up at many levels:

At the smallest unit of language, the word level, there are polysemy, synonymy and related phenomena.

The classic example was given by an early machine-translation researcher in the 1960s:

Even if we do not know that another sense of "pen" is "enclosure", we can still tell that "pen" in this sentence is not a writing instrument. This bad case was raised more than fifty years ago, yet no translation system solves it without special treatment (in other words, the case can be "solved" only by special-purpose, ugly means).

(Screenshots: translations of this sentence by Google Translate, Baidu Translate and Sogou Translate.)

A comparable example in Chinese is the many senses of the word "意思" ("meaning").

Of course, both examples above were ingeniously constructed; most language in daily use is not this tricky.

Besides polysemy, synonymy is also a very common linguistic phenomenon: abbreviations of longer names, aliases of proper nouns (such as "cold" for "upper respiratory tract infection"), internet slang (such as "蓝瘦" for "难受", "feeling awful"), dialect words (such as "baogu" for "corn"), and spoken versus written forms. A lot of work in NLP applications goes into recognizing and handling synonyms. For example, search engines use query rewriting to restate a user's query as a sentence with the same meaning but a more precise form; entity linking in knowledge graphs essentially maps the many surface forms of an entity name (a person or place, say) onto a canonical form; and in intelligent dialogue, synonyms help the system understand users' questions better.

There is no once-and-for-all solution to this in NLP. On one side are small, high-quality, human-built knowledge bases such as WordNet, HowNet and the Chinese synonym forest; on the other are vector representations of words, learned by machine-learning methods from large corpora, which reflect word meaning implicitly and cover far more words but less precisely. The former are usually of good quality but cover only a small part of actual language use; the latter depend on the quantity, quality and domain of the corpus, and for words that rarely appear in the corpus they often give baffling results.

At the level of sentences, and even higher-level units such as paragraphs and documents, there is also structural ambiguity.

A sentence is made up of words; some of those words are related to each other and some are not, so the sentence as a whole forms a structure: the "grammar" we study when learning a language.

Ideally, if every sentence had a unique, definite grammatical structure, we would only need to compute that structure and then resolve the synonymy and polysemy discussed above to pin down its meaning. In reality, natural-language grammar is not a strict rule system, and as sentences grow more complex they often admit several different yet equally reasonable analyses.

For example

Consider the classic English sentence "put the block in the box on the table". It has two readings:

Add the prepositional phrase "in the kitchen" and there are five readings; add one more prepositional phrase and there are fourteen. In this kind of prepositional-phrase attachment ambiguity, the number of possible structures grows exponentially with the number of phrases. What does "exponential growth" mean here? That resolving the ambiguity by enumerating every possibility is inefficient or outright infeasible.
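The counts 2, 5 and 14 in this progression match the Catalan numbers, which count distinct binary bracketings. A small sketch (my own illustration, not from the original article) makes the growth concrete:

```python
from math import comb

def catalan(n: int) -> int:
    """n-th Catalan number: the count of distinct binary bracketings,
    which matches the 2 / 5 / 14 progression of attachment
    ambiguities described above."""
    return comb(2 * n, n) // (n + 1)

# Each added prepositional phrase multiplies the possible readings:
print([catalan(n) for n in range(2, 9)])
# the sequence starts 2, 5, 14, 42, 132, ...
```

Seven phrases already yield over a thousand candidate structures, which is why enumeration does not scale.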

When the two kinds of ambiguity above combine, the result is more complex still.

As noted earlier, handling and eliminating ambiguity is a very important part of NLP, and a very hard one, because ambiguity has many sources.

Metaphor, personification, homophony and other devices also give existing words new senses, or make them synonyms of previously unrelated words. Context-dependent phenomena such as ellipsis and reference easily cause ambiguity as well. In short, word-level ambiguity arises in many ways.

Beyond the sheer variety of its causes, word sense disambiguation has another difficulty: it often depends on a large amount of "common sense" outside the text. In the machine-translation example above, "the box is in the pen", we can rule out the writing-instrument sense of "pen" because we have knowledge like this:

From that we infer that a box inside a writing pen is absurd. Unfortunately, the computer has none of this knowledge that we consider common sense, and it cannot make that inference. Machine learning and deep learning are, in essence, data-driven: a model can handle what resembles data it has already seen, but it has not learned the knowledge itself.

Over the history of NLP, people have spent enormous effort and money encoding our knowledge and rules as computer-readable data, producing resources such as Cyc, DBpedia and Freebase, now usually called knowledge graphs (previously known as "knowledge bases"; "Knowledge Graph" was originally the name of one of Google's knowledge bases). But how to actually use this knowledge in NLP tasks such as word sense disambiguation remains a hard open problem.

Structural (grammatical) ambiguity, for its part, has several possible causes:

Moreover, language keeps evolving and new grammar rules keep emerging: usages once considered errors become accepted rules through sheer popularity, and existing rules get simplified into easier ones. As a result, many rules have exceptions, and many exceptions gradually become new rules, producing the phenomenon that "every rule has exceptions, and every exception becomes a rule". Any approach that relies purely on grammar rules therefore ends up endlessly adding new rules (or new exceptions).

So besides ambiguity, natural language has another crucial characteristic: the dynamic evolution just mentioned. Beyond the ambiguity it creates, evolution constantly produces new knowledge and new linguistic phenomena, and capturing and learning them is a major challenge for NLP. An NLP model built on old data often cannot even perform basic analysis on new language, let alone resolve its ambiguity or understand it.

In addition, languages such as Chinese and Japanese have a further peculiarity: words are not naturally separated in text. For reasons of both effectiveness and efficiency, existing NLP methods mostly operate on words, so for these languages an extra step is needed to determine the boundaries between words, commonly called "word segmentation". Segmentation itself is uncertain: the same sentence does not necessarily have a unique segmentation.

Let's take a look at these examples

As these examples show, getting just one or two word boundaries wrong can change the meaning of the whole sentence. The difficulty of Chinese word segmentation is that segmenting correctly requires a correct understanding of the sentence's semantics, while understanding the semantics correctly requires a correct segmentation: a chicken-and-egg problem.

Despite all these difficulties, the NLP field has developed ways of coping with them, though "coping" is the right word rather than "solving": these methods aim to handle the common cases of the problems above in practical applications, so that a system deals correctly with, say, 80% or 90% of inputs in a limited scenario. The remaining cases can be surfaced through system and product design, for example from user behavior, and then addressed by further research and technology updates, gradually bringing the NLP system to a satisfactory level.

Main applications and key technologies of natural language processing

Some of the more complete NLP applications we are all familiar with are:

Of course, NLP has many more applications; for reasons of space, only the main ones above are discussed here.

The six applications above fall into two broad categories. One must understand and respond to the complete semantics of a text: machine translation, text summarization and intelligent dialogue. The other only needs to understand specific information in the text and cares less about its complete semantics: information retrieval, spam filtering and sentiment analysis. The division is not strict. Some search-engine users describe their problem in full natural language and expect correct results, and spam filtering and sentiment analysis sometimes need to extract "specific information" on top of full semantic understanding. But in most cases the distinction holds.

In the first category, machine translation and text summarization have clear goals and evaluation criteria, such as the BLEU metric for translation and the ROUGE metric for summarization. Intelligent dialogue still has no general evaluation standard: one can borrow BLEU from machine translation, or define task-specific metrics for particular dialogue forms, such as slot filling rate for task-oriented dialogue or precision and recall for retrieval-based question answering. A genuinely usable dialogue system is bound to be a hybrid, combining task-oriented dialogue and retrieval-based question answering, and perhaps open-domain chat as well; in such a hybrid, the components cannot each just mind their own patch. So for now, intelligent dialogue is somewhat harder than machine translation and text summarization: not because it is technically harder, but because goals and evaluation criteria are hard to set.

In the second category, sentiment analysis must identify not only the emotion or attitude expressed but also its target, which makes it harder than information retrieval and spam filtering.

Generally speaking, the difficulty of an NLP application can in most cases be judged by the following principles:

In terms of technology, the key NLP technologies involved in each application are as follows:

The list above is meant to give an overall impression. I have not actually worked on machine translation or text summarization, so I cannot say much about their key techniques; I have relatively more experience with information retrieval and intelligent dialogue, so I write more about those.

Although the goals of each application differ, many of the methods they use are shared or nearly identical.

Most basically, every application preprocesses its text before anything else. Preprocessing is a broad notion with no single fixed recipe. In my own experience, working with Chinese, it basically means: converting between simplified and traditional characters, turning full-width characters into half-width ones, normalizing punctuation (did you know the Greek question mark looks exactly like an English semicolon?), and removing invalid characters such as invisible zero-width spaces. For English it also includes stemming and lemmatization. Many NLP articles equate preprocessing with stop-word removal. Stop words are words that occur very frequently in a given domain yet usually carry no semantics, such as the Chinese particles "了" and "的". But not every application can or should remove stop words: authorship identification, for instance, relies heavily on function words and prepositions, exactly the words NLP usually treats as stop words.

The purpose of preprocessing is to make the text "cleaner" and more standardized, which is why it is sometimes called "cleaning". Good preprocessing removes interfering information while preserving the important information, and the normalization steps (simplified/traditional conversion, punctuation normalization, stemming and lemmatization) also reduce the amount of variation the subsequent steps must handle.
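As a concrete illustration, here is a minimal cleaning function along the lines described above: full-width to half-width conversion, zero-width character removal, and normalizing the Greek question mark that masquerades as a semicolon. The exact rule set is my own assumption; real pipelines are task-specific.

```python
# Invisible zero-width characters to delete outright
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

def clean_text(text: str) -> str:
    """Sketch of text cleaning: strip zero-width characters, convert
    full-width ASCII variants (U+FF01..U+FF5E) to half-width, map the
    ideographic space to a plain space, and normalize the Greek
    question mark (U+037E), which looks like an English semicolon."""
    text = text.translate(ZERO_WIDTH)
    out = []
    for ch in text:
        code = ord(ch)
        if 0xFF01 <= code <= 0xFF5E:      # full-width -> half-width
            ch = chr(code - 0xFEE0)
        elif code == 0x3000:              # ideographic space
            ch = " "
        elif code == 0x037E:              # Greek question mark
            ch = "?"
        out.append(ch)
    return "".join(out)

print(clean_text("Ｈｅｌｌｏ\u200b，ｗｏｒｌｄ！"))   # Hello,world!
```

Each rule here is cheap and reversible-in-spirit: it collapses variant code points onto a canonical form without touching actual content.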

After preprocessing, the usual next step is to extract information from the text; from a machine-learning point of view, this is feature extraction. Depending on the application, this step can be very simple or very complex.

The most basic processing is to extract the words from the text. For English and similar languages this is trivial: split on spaces and punctuation. For Chinese, dedicated word segmentation is needed, as discussed earlier. From the viewpoint of computational linguistics, segmentation is part of "lexical analysis", which besides finding word boundaries also determines each word's part of speech (is a word a verb, a noun, or something else?) and word sense (is "apple" a fruit or an electronics brand?). In short, lexical analysis answers: what are the words, what kind of words are they, and what does each word mean?

There are many ways to segment. One is to prepare a large vocabulary, find all the vocabulary words that could tile the current sentence, and then pick the most likely of the possible combinations. It looks crude but works reasonably well. The mainstream approach today, however, treats segmentation as a sequence labeling problem: given a sequence, the model assigns each element a label. For segmentation, the labels mark whether a character begins a word, ends a word, or sits in the middle of one.
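Here is a minimal sketch of the dictionary-based idea, using forward maximum matching with a toy vocabulary of my own invention (real segmenters score alternative splits or use sequence labeling). Note how the greedy strategy stumbles on exactly the kind of segmentation ambiguity discussed earlier:

```python
def fmm_segment(sentence: str, vocab: set, max_len: int = 4) -> list:
    """Forward maximum matching: at each position, greedily take the
    longest dictionary word; fall back to a single character."""
    words, i = [], 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in vocab or j == i + 1:
                words.append(sentence[i:j])
                i = j
                break
    return words

vocab = {"研究", "研究生", "生命", "起源"}
# Greedy matching picks 研究生 ("graduate student") and mangles the
# intended reading 研究 / 生命 / 起源 ("study / life's / origin"):
print(fmm_segment("研究生命起源", vocab))   # ['研究生', '命', '起源']
```

A real system would compare this split against the alternative tilings and score them, for example with a language model, rather than committing to the first greedy match.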

Sequence labeling is not unique to NLP, but many NLP tasks and techniques are formulated as sequence labeling: the word segmentation and part-of-speech tagging just mentioned, and also entity extraction and more complex information extraction. An NLP practitioner therefore needs to master sequence labeling methods. My suggestion: thoroughly understand the principles of HMMs, CRFs and RNNs, get fluent with one or two good sequence labeling tools, and keep some sequence-labeling datasets around to sharpen your intuition.
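To make the labeling scheme concrete, here is how per-character tags map back into words, using the common four-tag BMES variant (B = begin, M = middle, E = end, plus S for single-character words, a tag usually added to the three mentioned above). The example sentence is my own:

```python
def decode_bmes(chars: str, tags: list) -> list:
    """Convert per-character B/M/E/S labels back into words."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("E", "S"):   # a word ends at this character
            words.append(buf)
            buf = ""
    if buf:                     # tolerate a truncated tag sequence
        words.append(buf)
    return words

# 我(S) 爱(S) 自然语言(B M M E) 处理(B E)
print(decode_bmes("我爱自然语言处理", list("SSBMMEBE")))
# ['我', '爱', '自然语言', '处理']
```

The model's entire job, then, is to predict one of four labels per character; the decoding step above is deterministic.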

Although the mainstream treats segmentation as sequence labeling, in practice hoarding dictionaries does no harm and much good: new-word mining, synonym mining, accumulating domain entity terms, and so on. How to do that mining and accumulation differs per application. Search engines, for example, mine synonyms from query click logs, on the principle that two users who search with different words but end up clicking the same link probably meant the same thing. Without that kind of convenient user feedback, one can fall back on unsupervised or semi-supervised methods; harder, but there are always methods.

After lexical analysis, several different kinds of follow-up processing come into play:

As mentioned before, entity extraction can be done as sequence labeling. Stop-word removal needs little comment: usually a general stop-word list plus the high-frequency words of your own data. Keyword extraction I consider very important, and in most cases TF-IDF already extracts good keywords; combined with the earlier lexical analysis results, such as part of speech, we can then filter them. Roughly speaking, nouns, verbs, adjectives and entity words make good keywords.

TF-IDF deserves a mention of its own as a classic, important NLP technique. The idea is simple: a word's importance in a document or sentence is reflected by two values. The first is term frequency (TF), the number of times the word occurs in the current document or sentence. The second is inverse document frequency (IDF), based on the reciprocal of the number of documents or sentences containing the word. Roughly: the fewer other documents a word appears in (the higher its IDF), and the more often it appears in the current one, the more important it is here. A very simple idea, ultimately justifiable in terms of information entropy, and extremely practical.
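The two values combine multiplicatively. A bare-bones sketch over pre-tokenized documents, using the common count times log(N/df) variant (one of several weighting schemes):

```python
import math
from collections import Counter

def tfidf(docs: list) -> list:
    """Per-document TF-IDF scores. TF is the raw count in the
    document; IDF is log(N / document frequency). A word occurring
    in every document gets IDF zero, i.e. no weight at all."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return [
        {w: c * math.log(n / df[w]) for w, c in Counter(doc).items()}
        for doc in docs
    ]

docs = [["nlp", "is", "fun"], ["nlp", "is", "hard"], ["nlp", "fun", "games"]]
scores = tfidf(docs)
# "nlp" occurs everywhere -> weight 0; "hard" is distinctive in doc 2
```

This is why TF-IDF keywords tend to be the words that characterize a document against the rest of the collection, rather than merely its most frequent words.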

Phrase extraction, relation extraction and syntactic parsing all aim to recover the hierarchical structure of the text for better semantic understanding. Phrase extraction is comparatively simple, since it only asks whether two adjacent words are related; relation extraction and parsing must also find relations between non-adjacent words. I am not very familiar with this area and have only used existing tools, but in my view syntactic parsing matters a great deal for understanding semantics, and a good parser or parsing tool can substantially improve a whole system.

Then there are sentiment analysis, intent recognition and topic classification, most of which are really text classification under different application names. Text classification is relatively simple but extremely widely used, and well worth the effort to master: logistic regression, support vector machines, GBDT, then the common deep learning classification architectures, plus the practical craft of preprocessing, feature extraction and feature selection on real data. As for the claim that deep learning makes features unnecessary: in practice you can add hand-crafted features to a deep model just as you would to a classical one, and doing so often helps.
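Before reaching for the heavier models listed above, it is worth remembering how far a very simple classifier can go. Here is a toy multinomial Naive Bayes over bag-of-words features, with invented example data (a sketch, not a production classifier):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing, meant to
    show how little code a workable text classifier needs."""
    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)   # label -> word counts
        self.label_counts = Counter(labels)
        for words, y in zip(texts, labels):
            self.word_counts[y].update(words)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, words):
        def logp(y):
            counts = self.word_counts[y]
            total = sum(counts.values())
            prior = math.log(self.label_counts[y] / sum(self.label_counts.values()))
            return prior + sum(
                math.log((counts[w] + 1) / (total + len(self.vocab)))
                for w in words
            )
        return max(self.label_counts, key=logp)

clf = NaiveBayes().fit(
    [["great", "movie"], ["awful", "plot"], ["great", "fun"], ["boring", "awful"]],
    ["pos", "neg", "pos", "neg"],
)
print(clf.predict(["great", "plot"]))   # pos
```

Swapping the raw counts for TF-IDF weights, or the whole model for logistic regression, follows the same fit/predict shape.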

The techniques above get the task-relevant "understanding" more or less done. After understanding come more specific, application-dependent responses: machine translation, text summarization and intelligent dialogue involve natural language generation, while information retrieval and retrieval-based question answering involve text similarity measurement.

Among these, natural language generation (NLG) remains a very hard problem. Technically there are methods based on statistical language models and neural language models, and in recent years variational autoencoders (VAE) and reinforcement-learning-based generation have seen growing use. But the NLG in these applications is mostly used to generate short text from prior analysis results within a specific task. As for generating novels, strip away the media and corporate hype and it is plain how far there still is to go.

The language models just mentioned are themselves a topic well worth mastering.

Then there is text similarity measurement, another essential NLP skill; in a sense it is equivalent to the ultimate goal of NLP, since a system that could accurately judge whether any two given sentences mean the same thing would be a system that fully understands natural language. Take machine translation: its BLEU metric judges whether a translated sentence is similar to a given reference translation, but its computation is very simple and only measures surface similarity, so BLEU is not very accurate, and in practice human evaluation is still needed. Yet in all these years neither academia nor industry has found a clearly better metric, so BLEU remains in use. Alternatives such as AMBER and METEOR go beyond surface similarity but are too complex to win wide acceptance. If one day an evaluation metric emerges that tracks human judgment much more closely, that in itself will be a major breakthrough for machine translation.
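To see why BLEU's surface matching falls short, consider clipped unigram precision, the simplest ingredient of BLEU (the full metric combines precisions up to 4-grams with a brevity penalty). This tiny sketch, with example sentences of my own, shows a perfectly good paraphrase being penalized:

```python
from collections import Counter

def unigram_precision(candidate: list, reference: list) -> float:
    """Clipped unigram precision: the fraction of candidate words that
    also occur in the reference, counting each reference word at most
    as often as it appears there."""
    ref_counts = Counter(reference)
    overlap = sum(min(c, ref_counts[w]) for w, c in Counter(candidate).items())
    return overlap / max(len(candidate), 1)

# "kids" vs "children": same meaning, but surface matching docks the score
cand = "the kids are playing outside".split()
ref = "the children are playing outside".split()
print(unigram_precision(cand, ref))   # 0.8
```

No amount of n-gram bookkeeping can credit "kids" for meaning "children"; that gap is exactly what a better metric would have to close.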

Non-technical people, and even technical people outside NLP, often find the difficulty of text similarity hard to appreciate; "these two sentences mean the same thing, why can't it see that?" is a complaint you hear all the time. It really is hard.

Simple, traditional similarity measures such as longest-common-subsequence similarity, the Jaccard coefficient and cosine similarity are time-tested, very practical methods, and are enough for many simple tasks. These methods essentially measure surface similarity, so they generalize poorly, though adding features keeps improving them. With enough data, deep-learning-based matching models can work very well, not only on a fixed test set but also on data outside it; the premise, of course, is having the data and the resources.
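For reference, the two set/vector measures mentioned are only a few lines each (the example sentences are my own):

```python
import math
from collections import Counter

def jaccard(a: list, b: list) -> float:
    """Jaccard coefficient: overlap of the two token sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cosine(a: list, b: list) -> float:
    """Cosine similarity between bag-of-words count vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

s1 = "how do I reset my password".split()
s2 = "how can I change my password".split()
print(jaccard(s1, s2), cosine(s1, s2))   # 0.5 and ~0.667
```

Both scores depend only on shared surface tokens, which is exactly the poor-generalization weakness noted above: replace "password" with a synonym and the scores collapse even though the meaning is unchanged.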

Text similarity measurement has many applications. It can evaluate system output in machine translation and text summarization; it can match the right documents or questions in information retrieval and intelligent dialogue; and even some classification tasks can be solved with it (given 1,000 class-A examples and 1,000 class-B examples, label a new item by whether it is more similar to the A data or the B data).

Conversely, besides judging the similarity of two given sentences, mining and generating similar sentences is also very interesting; the technical term for generating same-meaning sentences is "paraphrase". Paraphrasing and similarity measurement reinforce each other: a good similarity model can mine more similar sentence pairs to train a paraphrase generator, and a good paraphrase generator can produce training data for the similarity model.

As for deep learning in NLP, the core techniques are:

Deep learning techniques generally need a lot of data, but in real tasks there may not be enough for deep learning, or there may be plenty but with too much noise. The deep-pocketed approach is to throw money at the problem, funding annotation work and benefiting society in the process; I support that wholeheartedly. The more economical approach is to start with the traditional, classic methods described above, design good feedback channels into the product and system, iterate after a cold start, deliberately accumulate data from logs and user feedback, and only bring in deep learning once there is enough data, to fix the limited generalization of the traditional models.