2018-06-25
This is the second article in the "Where to Run NLP" series, which consists of:
- Where to Run NLP: Opening Remarks and Some Scattered Thoughts
- Where to Run NLP: What Is Natural Language Processing?
- Where to Run NLP: Some Unicode-Related Knowledge and Tools
- Where to Run NLP: A List of Text Classification Tools
What is natural language processing
- Professor: "what are you laughing at?"
- Z: "Teacher, learning natural language processing is my dream since I was a child. I am very happy to be here!"
- Professor: "don't be happy. Define natural language processing."
- Z: "The technology that helps us understand natural language is natural language processing."
- Professor: "can you elaborate?"
- Z: "Natural language processing can help us understand human language. You write an article, and the editor tells you that some words are wrong, which is natural language processing; an email comes, which tells you that this is spam, which is natural language processing; enter "Shannon" in the input box, and the search engine displays the encyclopedia and academic achievements of Shannon in front of you, which is natural language processing; you send a micro blog, saying "I If you want to be your boyfriend for another term, it will be automatically deleted by Weibo. This is natural language processing. If you write a passage from the English literature and use Google translation to translate it into Chinese and put it in your own paper, this is natural language processing. If you say "Hey Siri sets an alarm clock at 7 o'clock" to the iPhone, at 7 o'clock Siri reminds you that it's time to get up. This is natural language processing... "
- Professor (angry): "just talk nonsense! What's the definition! "
- Z: "I just said that, teacher."
- Professor: "so are you in the exam? Natural language processing is, re-election of your boyfriend? Idiot! Change the answer! "
- M: "Teacher, natural language processing can be defined as the study of language problems in human to human communication and human to computer communication. Natural language processing needs to develop a model to express language ability and language application, establish a computing framework to realize such a language model, propose corresponding methods to continuously improve such a language model, design various practical systems based on such a language model, and discuss the evaluation technology of these practical systems. "
- Professor: "great! Great! "
The skit above is made up; it borrows the scene in 3 Idiots where the professor asks Rancho to define "a machine". Through the contrast in that scene between the rigidity of the professor and the "top student" on one side and Rancho's flexible wit on the other, the film attacks a rigid education system. But a film is still a film: we should grant that its criticism is fair, while recognizing that Rancho's behavior is itself one extreme, set against the opposite extreme to present and heighten the conflict. When we learn and come to know things, rote memorization is not advisable, but abandoning all rigor and relying purely on personal intuition is not to be encouraged either.
So what is natural language processing?
Natural language processing, or NLP for short, is a field that aims to use computer technology to understand and use natural language. In different contexts it is also called computational linguistics (CL) or natural language understanding (NLU).
To get a handle on NLP as a field, you only need to grasp the following points.
NLP is mainly carried out by means of computer technology
At the theoretical level, NLP draws on mathematics, linguistics, cognitive science and other disciplines, but in the end it is computer technology that carries this theoretical knowledge and puts it to work.
This may sound like a truism, but it is worth emphasizing: unlike the human brain, computers have their own strengths and weaknesses, and NLP is constrained by the strengths and weaknesses of current computer technology. So do not judge NLP by how effortlessly our own brains process natural language; for most of us, using language feels natural and easy, but that does not mean processing it with a computer is equally easy.
NLP aims to understand and use natural language
"Natural language" refers to languages that evolved naturally in our world, such as English, Chinese and French. They are called natural languages to distinguish them from artificial languages such as programming languages (C, Java, Python and so on).
A programming language has a precise, fixed grammar, and every statement written in it has a unique, definite meaning, so a computer only needs to parse and execute it according to the grammar rules.
Natural language is different: it has relatively stable grammar rules, but those rules have exceptions and keep evolving, and together with other characteristics of language this often produces ambiguity. Dealing with ambiguity is a core part of NLP.
NLP tries to understand natural language, but there is no definite standard for "understanding"
In the ideal sense, "understanding" natural language would mean understanding it the way the human brain does. But brain science has no systematic, comprehensive account of how our brains work when we use language, so this cannot serve as a standard, and under the existing technical framework it is in any case impossible for computers to fully understand natural language in that sense.
So we settle for less: we generally say a machine understands natural language as long as it can respond correctly, in a specific scenario, to a request that we express in natural language.
Note the premises here:
- "In a specific scenario": we assume that once the scenario is restricted, people's goals and ways of expressing them are restricted too, which reduces the diversity of language and makes understanding feasible
- "Respond correctly": we take behavior that meets expectations as understanding, and do not care whether the intermediate process matches how the human brain works, or whether the machine really grasps the meaning of the language
Of course, this is only the standard that practical NLP systems follow today. Some researchers are trying to pin down the process and criteria of "understanding" from other perspectives, such as linguistics and brain science; let's keep watching and look forward to what comes.
NLP is about using natural language as well as understanding it
Whenever we use computers to process and analyze natural language in an application, we can call it an NLP process, although of course NLP is not the only thing involved.
Next, I will discuss the following:
- What are the difficulties and limitations of NLP
- What are the main applications of NLP
- What are the main technologies of NLP
Difficulties and limitations of natural language processing
We use natural language every day, so we rarely appreciate how hard "understanding natural language" really is. In fact, several characteristics of natural language itself make it very difficult for computers to understand.
The first characteristic is ambiguity, which shows up at several levels:
At the level of words, the smallest units of language, there are polysemy, synonymy and other such phenomena
A classic example is "The box is in the pen", given by an early machine translation researcher in the 1960s.
Even if we do not know that "pen" can also mean an enclosure (a pen for animals, a playpen), we can still tell that in this sentence it does not mean a writing pen. This bad case was raised more than fifty years ago, yet there is still no translation system that handles it correctly without special treatment (in other words, it can only be patched by ad hoc, ugly means).
(Screenshots: the sentence as translated by Google Translate, Baidu Translate and Sogou Translate.)
A comparable example in Chinese plays on the many senses of the word "意思": He said, "She is really 有意思 (interesting)." She said, "He is quite 有意思 (interesting)." So people assumed the two of them 有了意思 (had feelings for each other) and asked him to 意思意思 (make a gesture) to her. He got angry: "I never had that 意思 (intention)!" She got angry too: "Then what do you 意思 (mean) by saying that?" Afterwards some people said "真有意思 (how amusing)" and others said "真没意思 (how dull)". (Quoted from a newspaper of 13 November 1994, page 6.)
Of course, both of these examples are deliberately crafted; most language used in everyday life is not this tricky.
Besides polysemy, synonymy is also a very common phenomenon: abbreviations of longer names, aliases of proper nouns (such as "cold" for "upper respiratory tract infection"), internet slang (such as "蓝瘦" for "难受", roughly "feeling awful"), dialect words (such as "苞谷" for "玉米", both meaning "corn"), spoken versus written forms (such as "脑袋" versus "头部", both meaning "head"), and so on. In NLP applications a lot of work goes into recognizing and handling synonyms. For example, search engines perform query rewriting, turning a user's query into an equivalent but more precise expression; entity linking in knowledge graphs essentially maps the different surface forms of an entity name (a person or place, say) to a canonical form; and in intelligent dialogue, synonyms help the system understand users' questions better.
There is as yet no once-and-for-all solution to any of this in NLP. On one side are small, high-quality, manually built knowledge bases such as WordNet, HowNet and the Chinese synonym thesaurus Cilin; on the other are word vector representations learned from large corpora by machine learning, which capture word meaning implicitly and cover far more words but are less precise. The former are usually of good quality yet cover only a small part of the language people actually use; the latter depend heavily on the size, quality and domain of the corpus, and for words that rarely appear in it they often give baffling results.
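To make the two flavors concrete, here is a minimal sketch (Python, using NLTK's WordNet interface; the choice of NLTK and of the word "pen" is just for illustration) of the hand-built-resource side. The vector side would instead load pretrained embeddings, for example with gensim, which requires a downloaded model file.

```python
import nltk
nltk.download("wordnet", quiet=True)  # fetch the WordNet data on first use
from nltk.corpus import wordnet as wn

# WordNet is a manually built resource: each synset is one curated sense of "pen",
# including both the writing instrument and the enclosure from the example above.
for synset in wn.synsets("pen"):
    print(synset.name(), "-", synset.definition())
```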
At the level of sentences, and even higher-level units such as paragraphs and full texts, there is also structural ambiguity
A sentence is made up of words; some of them are related to each other and some are not, so the sentence as a whole forms a structure, the "grammar" we learn when we learn a language.
Ideally, if the grammatical structure of a given sentence were unique and definite, we would only need to compute that structure and then resolve the polysemy and synonymy problems described above, and the meaning of the sentence would be settled. In reality, the grammar of a natural language is not a strict set of rules, and once a sentence gets a little complex it often admits several different yet reasonable readings.
For example:
- "喜欢乡下的孩子" ("like / countryside / de / children") has two readings:
- [喜欢乡下] 的孩子: "likes the countryside" is an attributive modifying "children", giving "children who like the countryside"
- 喜欢 [乡下的孩子]: "country children" is the object of "like", giving "(someone) likes the children from the countryside"
- "He secretly deposits money in the bank behind the general manager and vice general manager's back" can be explained in the following two ways: "he" deposits money alone he [carries / general manager] and / Vice General Manager / secretly / puts / money / deposits / Bank: "he" deposits money with "vice general manager"
- He [on his back / General Manager / and / deputy general manager] stealthily / put / money / deposit / Bank: "he" deposits money alone
- He [on his back / general manager] and / Deputy General Manager / secretly / put / money / deposit / Bank: "he" and "deputy general manager" deposit money together
- "Giving up beautiful women makes people heartbroken" can be explained in the following two ways: giving up / beautiful / women / letting / people / heartbroken: "women" makes people heartbroken [giving up / beautiful / women] letting / people / heartbroken: "giving up" makes people heartbroken
- [give up / beautiful / women / let / people / heartbreak: "women" Heartbreak
- [give up / beautiful / of / woman] let / person / heartbreak: "give up" Heartbreak
Now look at a classic English sentence, "Put the block in the box on the table", which has two readings:
- Put the block [in the box on the table]: "on the table" modifies "box"
- Put [the block in the box] on the table: "in the box" picks out which block, and "on the table" says where to put it
Add another prepositional phrase, "in the kitchen", and the sentence has five different readings; add yet another and there are fourteen. For this kind of prepositional-phrase attachment ambiguity, the number of possible structures grows exponentially with the number of prepositional phrases. What does that mean in practice? It means that resolving the ambiguity by enumerating all the possibilities is inefficient at best and infeasible at worst.
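The counts 2, 5, 14 are not arbitrary: attachment ambiguities of this kind are usually described in the parsing literature as growing with the Catalan numbers, and a quick check in Python (a sketch under that assumption) reproduces the figures above.

```python
from math import comb

def catalan(n):
    # n-th Catalan number: C(2n, n) / (n + 1)
    return comb(2 * n, n) // (n + 1)

# number of trailing prepositional phrases -> number of attachment readings
for pps in range(2, 7):
    print(pps, catalan(pps))
# 2 2, 3 5, 4 14, 5 42, 6 132
```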
When the two kinds of ambiguity above combine, the resulting ambiguity becomes more complex still.
As noted earlier, handling and resolving ambiguity is a very important part of NLP, and a very difficult one, because ambiguity has many sources.
Beyond the above, devices such as metaphor, personification and homophony give existing words new meanings or make previously unrelated words synonymous, and context-dependent phenomena such as ellipsis and reference are another easy source of ambiguity... In short, word-level ambiguity arises in a great many ways.
For word sense disambiguation, beyond the sheer variety of situations that cause ambiguity, there is a further difficulty: disambiguation often depends on a large amount of "common sense" outside the text. In the earlier machine translation example, "The box is in the pen", how can we tell that "pen" does not mean a writing pen? Because we have knowledge like this:
- A box has volume
- A writing pen is not a container
- A writing pen is usually smaller than a box
From this we can infer that "the box is inside a writing pen" makes no sense. Unfortunately, this knowledge, which we regard as common sense, is exactly what the computer lacks, so it cannot make that inference. Machine learning and deep learning are, at bottom, data-driven: a model can handle something only when it has seen the same or similar data, and it has not learned the knowledge itself.
Over the history of NLP, people have spent enormous effort and money organizing our knowledge and rules into machine-readable data, such as Cyc, DBpedia and Freebase, producing what we now call knowledge graphs (previously just called "knowledge bases"; "Knowledge Graph" was originally the name of one of Google's knowledge bases). But how to actually use this knowledge in NLP tasks, word sense disambiguation for instance, is still a big open problem.
As for structural (grammatical) ambiguity, there are several causes:
- When a word is ambiguous, its different senses may have different grammatical functions and thus lead to different overall structures. In the two readings of "放弃美丽的女人让人心碎" above, for example, "美丽" ("beauty/beautiful") is treated as a noun in one and as an adjective in the other, which yields different parses
- The grammar of natural language is itself non-deterministic: even when the part of speech and sense of every word are fixed, there can still be multiple reasonable parses. That is the case in the "general manager and deputy general manager" example and in the English example above
On top of that, language keeps evolving and new grammar rules keep emerging: usages once considered errors become accepted as new rules because they are widely used, and existing rules get simplified into easier ones. As a result many rules have exceptions, and many exceptions gradually become rules, so that "every rule has exceptions, and every exception is a rule". Any system that relies purely on grammar rules therefore ends up endlessly adding new rules (or new exceptions).
Ambiguity, then, is one defining characteristic of natural language; another, just mentioned, is that language evolves dynamically. Beyond the ambiguity this causes, evolution constantly produces new knowledge and new linguistic phenomena, and capturing and learning them is another great challenge for NLP: a model built only on old data often cannot even perform basic analysis on new language, let alone resolve ambiguity or understand it.
Languages such as Chinese and Japanese add one more twist: in written text, words are not naturally separated from one another. For both effectiveness and efficiency, existing NLP methods mostly operate on words, so these languages need an extra processing step to determine word boundaries, commonly called "word segmentation". Segmentation itself is uncertain: for the same sentence, the possible segmentation is not necessarily unique.
Consider these examples:
- "Liang Qichao lived here before death" may have two participle results
- Liang Qichao / living / living / here
- Liang Qi / Chaosheng / Qian / Zhu / Zai / here
- "Wuhan Yangtze River Bridge" may have two participle results: Wuhan / mayor / jiangdaqiao Wuhan / Changjiang / Daqiao
- Wuhan / mayor / jiangdaqiao
- Wuhan / Yangtze River / Bridge
- "阿拉斯加遭强暴风雪袭击致多人死亡" ("Alaska was hit by a severe blizzard, leaving many people dead") has two possible segmentations:
- 阿拉斯加 / 遭 / 强 / 暴风雪 / 袭击 / 致 / 多人 / 死亡: severe / blizzard, the intended reading
- 阿拉斯加 / 遭 / 强暴 / 风雪 / 袭击 / 致 / 多人 / 死亡: a nonsensical reading that groups "强暴" ("violent assault")
- "已取得学位的和尚未取得学位的干部" ("cadres who have obtained a degree and those who have not yet obtained one") has two possible segmentations:
- 已 / 取得 / 学位 / 的 / 和 / 尚未 / 取得 / 学位 / 的 / 干部: "和 / 尚未" read as "and ... not yet", the intended reading
- 已 / 取得 / 学位 / 的 / 和尚 / 未 / 取得 / 学位 / 的 / 干部: "和尚" read as "monk", giving "monks who have obtained a degree, and cadres who have not"
- "Will quadruple in the next three years" may have two participle results
- Next / three years / mid term / double / double
- Next / three years / medium / will / double / double
As you can see, getting just one or two words wrong changes the meaning of the whole sentence. The difficulty of Chinese word segmentation is that segmenting correctly requires a correct understanding of the sentence's meaning, while understanding the meaning correctly requires a correct segmentation: a chicken-and-egg problem.
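For a feel of what an off-the-shelf segmenter does with one of these sentences, here is a minimal sketch assuming the jieba package is installed; the exact output depends on jieba's dictionary and version.

```python
import jieba

sentence = "武汉市长江大桥"

# Default (precise) mode returns the single segmentation the tool considers best.
print(jieba.lcut(sentence))

# Full mode lists every dictionary word found in the string, which makes the
# competing readings ("市长", "长江", ...) visible.
print(jieba.lcut(sentence, cut_all=True))
```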
Despite all these difficulties, the NLP field has developed ways of coping with them. "Coping" rather than "solving", because these methods aim to handle the common cases of the problems above in practical applications, so that an NLP system can deal with most inputs correctly (say 80% or 90%) within a limited scenario. The remaining cases can be surfaced through system and product design that observes user behavior, and then addressed with further research and technical updates, so that the NLP system gradually reaches a satisfactory level.
Main applications and key technologies of natural language processing
The more complete NLP applications we are all familiar with include:
- Machine translation: translating one natural language into another
- Information retrieval: retrieving the information we need from a large collection of documents; search engines such as Google, Bing and Baidu are the most typical information retrieval systems
- Text summarization: generating a shorter but complete summary from a longer document or article
- Intelligent dialogue: automatically answering users' questions or performing specific actions through conversation
- Spam filtering: filtering and flagging spam
- Sentiment analysis: detecting public opinion, attitudes, emotions and other tendencies towards public events
Of course, NLP has many more applications than these; for reasons of space, only the main ones above are discussed here.
The six applications above fall roughly into two categories. The first needs to understand and respond to the complete meaning of a text: machine translation, text summarization and intelligent dialogue. The second only needs to understand specific pieces of information and does not care much about the full meaning: information retrieval, spam filtering and sentiment analysis. The division is not strict, of course. Some people type complete natural-language questions into search engines and expect correct answers, and spam filtering and sentiment analysis sometimes need to extract the "specific information" on top of understanding the full meaning, but in most cases the division holds up well enough.
Within the first category, machine translation and text summarization have clear goals and evaluation criteria, namely the BLEU metric for machine translation and the ROUGE metric for summarization, whereas intelligent dialogue still has no general evaluation standard: it either borrows BLEU from machine translation or defines special metrics for particular dialogue forms and tasks, such as slot filling rate for task-oriented dialogue, or hit rate and recall for retrieval-based question answering. A genuinely usable dialogue system is bound to be a hybrid, combining task-oriented dialogue with retrieval-based question answering and perhaps open-domain chit-chat, and in such a hybrid the components cannot each just mind their own patch. So for now intelligent dialogue is somewhat harder than machine translation and text summarization: not because the techniques are harder, but because the goals and evaluation criteria are hard to pin down.
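For reference, BLEU is essentially n-gram overlap between the system output and one or more reference translations, and NLTK ships an implementation. A minimal sketch (the sentences are made up, and smoothing is applied because such short sentences miss higher-order n-grams):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = ["the box is in the enclosure".split()]  # human reference translation(s)
hypothesis = "the box is in the pen".split()          # system output

score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```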
Within the second category (information retrieval, spam filtering and sentiment analysis), sentiment analysis has to determine not only the emotion or attitude but also the object that the attitude is directed at, so it is harder than information retrieval and spam filtering.
In general, the difficulty of an NLP application can be judged by the following rules of thumb:
- Tasks that need to understand the complete meaning are harder than tasks that only need part of the information
- Tasks with clear goals and evaluation criteria are easier than tasks without them
- Tasks that require complex analysis are harder than tasks that require only simple analysis
In terms of technology, the key NLP techniques involved in each application are roughly as follows:
- Early rule-based machine translation: syntactic analysis
- Statistical machine translation: language models; hidden Markov models (HMMs), which were also widely used for sequence labeling in the early days
- Neural machine translation: seq2seq, the attention mechanism
- Information retrieval: large-scale text preprocessing, entity extraction, topic models, text similarity measurement, knowledge graphs
- Retrieval-based question answering: the same as information retrieval, plus reference resolution and ellipsis resolution
- Task-oriented dialogue systems: intent recognition, entity extraction, reference resolution, ellipsis resolution, dialogue state management, natural language generation
- Spam filtering: text classification
- Sentiment analysis: opinion word extraction, sentiment classification, relation extraction, opinion mining
- Text summarization: (I am not familiar enough with this one to say)
The list above is just to give an overall impression. I have not actually worked on machine translation or text summarization, so I cannot say much about their key technologies; I have more hands-on experience with information retrieval and intelligent dialogue, so I write more about those.
Although the goals of these applications differ, many of the methods they use turn out to be shared or even identical.
Most basic of all, every application needs to preprocess the text before doing anything else. Preprocessing is a very broad notion, though, and there is no single fixed set of steps. In my own experience (my work is all on Chinese), it usually includes converting between simplified and traditional Chinese, turning full-width characters into half-width ones, and normalizing punctuation (I won't even mention that the Greek question mark looks exactly like the English semicolon), as well as removing invalid characters such as invisible zero-width spaces; for English there is also stemming and lemmatization. Many NLP articles equate preprocessing with stop-word removal, stop words being words that occur very frequently in a given domain yet usually carry little meaning, such as the Chinese particles "了" and "的". In fact, though, not every application should, or even can, remove stop words: authorship attribution, for example, relies heavily on function words and prepositions, which NLP usually treats as stop words.
The purpose of preprocessing is to make the text cleaner and more standardized, which is why it is sometimes simply called "cleaning". Good preprocessing removes noise while keeping the important information, and the normalization steps (simplified/traditional conversion, punctuation normalization, English stemming and lemmatization) also reduce the amount of information that later steps have to handle.
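As an illustration, here is a minimal cleaning sketch in Python for the steps just mentioned (full-width to half-width conversion, punctuation normalization including that Greek question mark, and zero-width character removal). The exact rules are my own simplification, and simplified/traditional conversion would need an extra library such as OpenCC.

```python
import re
import unicodedata

ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\ufeff]")   # zero-width spaces/joiners, BOM

# Punctuation that NFKC leaves untouched; real projects use a much longer map.
PUNCT_MAP = {"。": ".", "、": ",", "「": '"', "」": '"'}

def clean(text: str) -> str:
    text = ZERO_WIDTH.sub("", text)            # drop invisible characters
    # NFKC folds full-width ASCII (Ｈｅｌｌｏ，１２３) to half-width and maps the
    # Greek question mark (U+037E) to an ordinary semicolon.
    text = unicodedata.normalize("NFKC", text)
    for src, dst in PUNCT_MAP.items():
        text = text.replace(src, dst)
    return text

print(clean("Ｈｅｌｌｏ，１２３。\u200b"))      # -> 'Hello,123.'
```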
After preprocessing, the usual next step is to extract information from the text, which from the machine learning point of view is feature extraction. Depending on the application, this step can be very simple or very complex.
The most elementary processing is to extract all the words from the text. For English and similar languages this is trivial: split on spaces and punctuation. For Chinese, dedicated word segmentation is needed, as discussed earlier. From the point of view of computational linguistics, word segmentation is one part of "lexical analysis", which besides finding word boundaries also determines each word's part of speech (is it a verb, a noun, something else?) and its sense (is "apple" a fruit or an electronics brand?). In short, lexical analysis answers: what are the words, what kind of word is each one, and what does each one mean?
There are many ways to segment text. One is to prepare a large vocabulary, find all the vocabulary words that could make up the current sentence, and pick the most likely combination among the candidates; it looks crude but it works. The current mainstream, however, is to treat segmentation as a sequence labeling problem. Sequence labeling means that, given a sequence, the model assigns a label to every element in it. For segmentation, each character receives one of (at least) three labels: beginning of a word, end of a word, or middle of a word.
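To make the labeling scheme concrete, here is a minimal sketch in Python that turns a hand-segmented sentence into per-character labels; it adds the commonly used fourth tag "S" for single-character words on top of the three just mentioned.

```python
def words_to_tags(words):
    """Convert a segmented sentence (a list of words) into per-character
    B/M/E/S labels: Begin / Middle / End of a multi-character word, or
    Single for a one-character word."""
    chars, tags = [], []
    for word in words:
        chars.extend(word)
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return chars, tags

# One of the example sentences, segmented by hand:
chars, tags = words_to_tags(["武汉市", "长江", "大桥"])
print(list(zip(chars, tags)))
# [('武', 'B'), ('汉', 'M'), ('市', 'E'), ('长', 'B'), ('江', 'E'), ('大', 'B'), ('桥', 'E')]
```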
Sequence labeling is not unique to segmentation; many NLP tasks and techniques involve it. Word segmentation and part-of-speech tagging (determining parts of speech), just mentioned, are mostly handled as sequence labeling, and entity extraction and more complex information extraction are also treated as sequence labeling problems. As an NLP practitioner, then, it pays to master the relevant methods: understand the principles of HMMs, CRFs and RNNs thoroughly, get comfortable with one or two good sequence labeling tools, and keep a stash of sequence labeling datasets to practice on and sharpen your intuition.
Although the mainstream approach treats segmentation as sequence labeling, in practice hoarding dictionaries is all benefit and no harm: mining new words, mining synonyms, accumulating domain entity words, and so on. How to do this mining and accumulation differs by application. Search engines, for example, mine synonyms and near-synonyms from query click logs, the idea being that two people may search with different words yet end up clicking the same link. Without that kind of convenient user feedback, unsupervised or semi-supervised methods can be used instead; it is harder, but doable.
After lexical analysis, several different kinds of follow-up processing are possible:
- Distinguishing which words matter and which do not, which may involve keyword extraction, entity extraction, stop-word removal, and so on
- Analyzing the structure the words form together, which may involve phrase extraction, relation extraction, syntactic analysis, and so on
- Classifying the whole text on the basis of its words to get a highly abstract result, which may involve sentiment analysis, intent recognition, topic classification, and so on
As mentioned before, entity extraction can be done as sequence labeling. Stop-word removal needs little comment: usually you take a general stop-word list and add the high-frequency words of your own data. Keyword extraction I consider very important, and in most cases TF-IDF already extracts decent keywords; combined with results of the earlier lexical analysis, such as part of speech, you can basically sort the keywords out, since nouns, verbs, adjectives and entity words are all usable as keywords.
TF-IDF itself is a classic and important NLP technique, and the idea is simple: a word's importance in a document (or sentence) is reflected by two values. The first is term frequency (TF), the number of times the word occurs in the current document; the second is inverse document frequency (IDF), essentially the reciprocal of the number of documents in which the word occurs. Roughly speaking, the less a word appears in other documents (the higher its IDF) and the more often it appears in the current one, the more important it is to the current document. Simple as it is, the idea can ultimately be justified in terms of information entropy; in any case, it is extremely practical.
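A minimal TF-IDF sketch in Python; note that it uses the common log-scaled IDF rather than the bare reciprocal described above, and the three-"document" corpus is made up purely for illustration.

```python
import math
from collections import Counter

docs = [
    "the box is in the pen".split(),
    "the pen is on the table".split(),
    "put the block in the box".split(),
]

def tf_idf(doc, docs):
    n_docs = len(docs)
    scores = {}
    for word, count in Counter(doc).items():
        tf = count / len(doc)                        # term frequency in this document
        df = sum(1 for d in docs if word in d)       # number of documents containing the word
        idf = math.log(n_docs / df)                  # log-scaled inverse document frequency
        scores[word] = tf * idf
    return scores

# Words unique to the first document float to the top; "the" scores zero.
print(sorted(tf_idf(docs[0], docs).items(), key=lambda kv: -kv[1]))
```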
Phrase extraction, relation extraction and syntactic analysis all serve to recover the hierarchical structure of the text and so understand its meaning better. Phrase extraction is relatively simple, since it only has to decide whether two adjacent words are related, whereas relation extraction and syntactic analysis must also decide whether non-adjacent words are related. I am not very familiar with this area and have only used existing tools, but in my view syntactic analysis matters a great deal for understanding meaning, and a good parsing system or tool can greatly improve a whole system.
Then come sentiment analysis, intent recognition and topic classification, most of which are really just text classification under different names in different application scenarios. Text classification is relatively simple but extremely widely used, and it is well worth the effort to master: logistic regression, support vector machines, GBDT, and then the common deep learning classification architectures, plus the practical workflow and tricks of preprocessing, feature extraction and feature selection on real data. Master these and you will get endless mileage out of them. As for the claim that with deep learning you no longer need feature work: when I do classification I can feed features into a deep model just the same, and it only works better for it.
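A minimal text classification sketch with scikit-learn (the tiny spam/ham dataset is invented for illustration; a real system would use far more data and careful evaluation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now", "limited offer click here",
    "meeting moved to 3pm", "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# TF-IDF features + logistic regression, the classic starting point.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["free prize click now", "see the report before the meeting"]))
```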
The techniques above roughly take care of the task-relevant "understanding". After understanding comes application-specific response processing: machine translation, text summarization and intelligent dialogue need natural language generation, while information retrieval and retrieval-based question answering need text similarity measurement.
Natural language generation (NLG) remains a very hard problem. Technically there are methods based on statistical language models and on neural network language models, and in recent years generation based on variational autoencoders (VAE) and on reinforcement learning has been used more and more. In the applications discussed here, though, NLG is mostly used to generate short texts from the preceding analysis results within a specific task; as for generating novels and the like, it is far harder than media and corporate hype would suggest.
The language model just mentioned is another topic well worth mastering.
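To give a flavor of the statistical kind of language model, here is a minimal bigram model with add-one smoothing (Python; the two-sentence corpus is made up, and real systems use far larger corpora and better smoothing):

```python
import math
from collections import Counter

# Toy corpus; a real language model is trained on millions of sentences.
corpus = [
    "the box is in the pen".split(),
    "put the block in the box on the table".split(),
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

V = len(unigrams)  # vocabulary size, used for add-one smoothing

def log_prob(sentence):
    """Log-probability of a sentence under an add-one-smoothed bigram model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + V))
        for prev, word in zip(tokens, tokens[1:])
    )

print(log_prob("the box is on the table"))
print(log_prob("table the on box the"))   # scrambled word order scores lower
```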
Then there is text similarity measurement, another very important NLP skill; in a sense it is equivalent to the ultimate goal of NLP, since a system that could accurately judge, for any two sentences, whether they mean the same thing would be a system that fully understands natural language. Take machine translation: the BLEU metric judges whether a translated sentence is similar to a given reference translation, but its calculation is very simple and captures only surface similarity, so BLEU is not very accurate, and in practice the real quality still has to be judged by humans. Yet after all these years neither academia nor industry has found a metric clearly better than BLEU, so BLEU it is. There are other metrics, such as AMBER and METEOR, which go beyond surface similarity but are too complex to have gained wide acceptance. If one day an evaluation metric appears that tracks human judgment much more closely, that in itself will be a great breakthrough for machine translation.
How hard text similarity is can be difficult for people outside NLP, or non-technical people, to appreciate; "these two sentences mean the same thing, why can't it tell?" is the kind of remark you hear all the time. It really is hard.
Simple, traditional similarity measures such as longest-common-subsequence (LCS) similarity, the Jaccard coefficient and cosine similarity are time-tested, very practical methods, and for simple tasks they are more than enough. They look mostly at surface similarity, so they generalize poorly, though adding features can keep improving the results. With enough data, deep-learning-based matching models can work very well, not only on a fixed test set but also in generalizing to data outside it, provided, of course, that you actually have the data.
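Minimal sketches of two of the traditional measures just named, the Jaccard coefficient and cosine similarity over bags of words (toy sentences, for illustration only):

```python
import math
from collections import Counter

def jaccard(a, b):
    """Jaccard coefficient over word sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def norm(counts):
    return math.sqrt(sum(v * v for v in counts.values()))

def cosine(a, b):
    """Cosine similarity over bag-of-words count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    return dot / (norm(ca) * norm(cb))

s1 = "the box is in the pen".split()
s2 = "the box is in the enclosure".split()
print(jaccard(s1, s2), cosine(s1, s2))
```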
Text similarity measurement has many uses: evaluating system quality in machine translation and text summarization, matching the right documents or questions in information retrieval and intelligent dialogue, and even solving some classification tasks (given 1000 examples of class A and 1000 of class B, is the item under consideration more similar to the class A data or to the class B data?).
Conversely, besides judging the similarity of two given sentences, mining and generating similar sentences is also very interesting; the technical term for generating them is "paraphrase". Paraphrasing and similarity measurement reinforce each other: good similarity measurement lets you mine more similar sentence pairs to train a paraphrase generation model, and good paraphrase generation provides training data for the similarity model.
As for deep learning in NLP, the core techniques are:
- Word embeddings, also known as word vectors, express the meaning of a word as a vector learned by unsupervised training (no labeling needed) on large amounts of data. Ideally the vector captures part of speech, word sense, sentiment, topic and all sorts of other information, though that is only an intuition; in practice we rarely worry about what the vector "really" means and simply use it. Because word embeddings reflect word meaning reasonably well, they can also be used for keyword and synonym mining, and they are standard equipment in every deep-learning NLP task (see the sketch after this list)
- For sequence labeling, BiLSTM + CRF has become the mainstream approach
- Seq2seq and attention, both originally proposed in neural machine translation research, are now also widely used in other important NLP applications such as intelligent question answering and text summarization. The two generally appear together and have evolved into a general computing framework, although the concrete methods and details have changed a great deal
- For natural language generation, besides the VAE and reinforcement learning approaches mentioned earlier, generative adversarial networks (GAN) have also been applied in this area over the past year or so
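And the word-embedding sketch referred to above, using gensim's Word2Vec (assuming gensim >= 4.0; the three-sentence corpus is obviously too small to yield meaningful vectors and is only there to show the API shape):

```python
from gensim.models import Word2Vec

# Toy corpus of pre-tokenized sentences; real embeddings need millions of them.
sentences = [
    ["the", "box", "is", "in", "the", "pen"],
    ["the", "pen", "is", "on", "the", "table"],
    ["put", "the", "block", "in", "the", "box"],
]

# `vector_size` is the gensim >= 4.0 parameter name (older versions call it `size`).
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

print(model.wv["box"][:5])            # first few dimensions of the vector for "box"
print(model.wv.most_similar("box"))   # nearest neighbours (only meaningful with real data)
```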
Deep learning techniques generally need a lot of data, but in real tasks you may find there is not enough data for deep learning, or plenty of data that contains far too much noise. The deep-pocketed approach is to throw money at the problem and create some annotation jobs along the way, which benefits society; I am all for it. The more economical approach is to start with the traditional, classic methods described above, design good feedback channels into the product and system, iterate after a cold start, consciously accumulate data from logs and user feedback, and switch to deep learning once there is enough data, to remedy the limited generalization of the traditional models.