Are analysts and journalists out of work? The current state, development, and future of automated financial reporting - Wenyin Internet

Posted by santillano at 2020-03-04

Last Saturday (August 20, 2016), Zhang Qiang, CTO and co-founder of Wenyin Internet, gave a talk at the 10th session of the Intelligent Finance Salon: "The current state, development, and future of automated financial reporting".

Report automation is the application of natural language generation to finance. It involves knowledge extraction, automatic text summarization, automatic visual summarization, visualization, knowledge graphs, and more. Can machines really replace people in generating reports? Are analysts and journalists out of work? Here is the content of the salon talk. Enjoy.

In recent years, several robot writing products have appeared in China and have attracted discussion and attention. Questions such as "Will robot writing replace human writing?" and "Will analysts and journalists lose their jobs?" are widely debated, and I believe everyone has their own opinion. In this talk I mainly want to lay out the logic and some firm beliefs for you, and I hope that you will find a satisfactory answer in a presentation of a little over 20 minutes.

What are analysts and robots doing?

First of all, since we want to discuss whether analysts and journalists will lose their jobs, let's first look at what they actually do.

Through recent contact with analysts covering the New Third Board (NEEQ) market, I have gained a preliminary understanding of how analysts work day to day. Generally, once the enterprise or target to be analyzed is chosen, the first stage is to gather information and open data from various channels in order to form a basic understanding of the enterprise. The second stage is to carry out due diligence on the target enterprise. The last stage is to write an analysis report from all the data collected, including company highlights and investment risk warnings.

Such a report is written in fluent language. It also contains public data, internal data obtained through the analyst's communication with the enterprise, and the analyst's own reasoning and background knowledge, so it is rich in content. These are the characteristics of analyst reports. Now let's look at the current state of robot writing products on the market.

Sohu Intelligent Offer

The first is Sohu's recently launched Intelligent Offer. It matches transaction data against templates to write its text, and then lists some publicly available announcements. Intelligent Offer is said to be 5 minutes faster than manual editing, so its main advantage is speed.

Toutiao's robot

The second is Zhang Xiaoming, an artificial intelligence robot that made headlines at the recent Olympic Games. Zhang Xiaoming was developed by Toutiao Lab; its "writing" module was developed jointly by Toutiao Lab and the Peking University Computing Institute (Wan Xiaojun's team). It is the first artificial intelligence robot in China able to report on the Olympic Games. Combining recent techniques from natural language processing, machine learning, and image processing, it generates news through syntactic synthesis and learning to rank. Compared with Tencent's "Dreamwriter" and First Finance's "DT King", Zhang Xiaoming is presented as a second-generation writing robot. Compared with first-generation robots, it is fast, writes in many styles, and adaptively matches images to its articles.

Although the articles written by the Zhang Xiaoming robot still show traces of templates, the robot can generate nearly 200 reports in six days. That is a workload no journalist can currently complete, and it shows the huge speed advantage of robot writing.

Wordsmith by Automated Insights

The main product of Automated Insights is Wordsmith, an automated report generation platform. Its main users include the Associated Press, Yahoo, and other companies, for which it generates a large volume of news and reports.

In the following example, the user has entered a table of financial data. Wordsmith generates a description of the financial data and also links in Zacks Investment Research's analysis of the company's financial statements. Wordsmith can therefore find associated data based on the user's input and use it to enrich the report. Its defining characteristic is data association and aggregation based on a knowledge base.

Human writing vs robot writing

The advantages of human writing are fluent language, rich content, and rich insight. The advantages of robot writing are fast generation, relatively rich content, and simple analysis and enumeration. Comparing existing products, people can write articles with high-quality viewpoints, while robot writing offers little beyond its speed. Let's now look from a technical point of view at whether robot writing is likely to make great progress in the short term.

Technology behind robot writing

There are many technologies behind robot writing, such as natural language processing, machine learning, lexical analysis, and syntactic analysis. I will not describe them one by one. Instead, I will focus on two: natural language understanding and natural language generation.

Looking at the data processing pipeline, the main role of natural language understanding is to transform the various kinds of raw input data into structured data, and the role of natural language generation is to turn the analyzed data into descriptive articles. For robot writing, different input data leads to a slightly different processing flow. If the input is already structured data, natural language understanding can be skipped.
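The branching described above can be sketched in a few lines. This is only a minimal illustration: `understand`, `generate`, and `write_report` are hypothetical stand-ins invented for this example, and the "NLU" here is just a parser for `key: value` lines, not a real language understanding component.

```python
from typing import Union


def understand(raw_text: str) -> dict:
    """Hypothetical NLU stand-in: parse 'key: value' lines into a record."""
    record = {}
    for line in raw_text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            record[key.strip()] = value.strip()
    return record


def generate(record: dict) -> str:
    """Hypothetical NLG stand-in: render a structured record as one sentence."""
    return f"{record['company']} reported revenue of {record['revenue']}."


def write_report(data: Union[str, dict]) -> str:
    # Already-structured input (a dict) skips the NLU stage entirely.
    record = data if isinstance(data, dict) else understand(data)
    return generate(record)


print(write_report("company: Acme\nrevenue: 10 million yuan"))
print(write_report({"company": "Acme", "revenue": "10 million yuan"}))
```

Both calls produce the same sentence; only the first one passes through the understanding stage.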

Natural language understanding (NLU)

Natural language understanding (NLU) is the process of transforming raw data of various kinds into structured data with some internal logic.

First, the raw data in its various formats is cleaned. A series of operations, such as stripping the original file format, removing duplicate data, and sorting the data, produces an intermediate version: clean data.
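As a rough sketch of this cleaning step, assuming the raw inputs are small HTML snippets (the function name `clean_documents` and the sample strings are made up for illustration), the three operations can be written as:

```python
import html
import re

# Matches anything that looks like an HTML tag, e.g. <p> or </div>.
TAG_RE = re.compile(r"<[^>]+>")


def clean_documents(raw_docs):
    """Strip markup, remove duplicates, and sort the remaining texts."""
    cleaned = set()  # a set drops duplicate texts automatically
    for doc in raw_docs:
        text = TAG_RE.sub(" ", doc)    # remove the original file format
        text = html.unescape(text)     # turn entities like &amp; back into characters
        text = " ".join(text.split())  # normalize whitespace
        if text:
            cleaned.add(text)
    return sorted(cleaned)             # deterministic order


raw = [
    "<p>Acme  raises funds</p>",
    "Acme raises funds",
    "<div>Beta &amp; Co lists</div>",
]
print(clean_documents(raw))
```

A regular expression is enough for a sketch like this; real announcement files (PDF, scanned images, CSV) would need format-specific parsers.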

Next, the clean data goes through a series of processing steps, including named entity recognition (for example, identifying companies in the text), relationship discovery between enterprises (for example, if company A invests in company B, we need to establish a directed investment relationship between the two company entities), and entity linking.

(Note: dirty data refers to raw data such as HTML, images, and CSV files. Clean data refers to the processed text, paragraph data, and necessary metadata left after the external structure has been removed. Structured data refers to the data produced by NER and lexical, syntactic, and semantic analysis, and is usually represented as a JSON file.)
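To make the "company A invests in company B" example concrete, here is a toy sketch that turns one clean sentence into JSON-style structured data. It assumes a hypothetical list of known company names and a single hand-written pattern; production NER and relation extraction rely on trained models rather than rules like these.

```python
import json
import re

# Hypothetical gazetteer of company names (an assumption for this example).
KNOWN_COMPANIES = ["Company A", "Company B"]

# Hand-written pattern for one relation type: "<X> invests in <Y>".
INVEST_RE = re.compile(r"(?P<src>Company \w+) invests in (?P<dst>Company \w+)")


def extract(text):
    """Return entities and directed investment relations found in text."""
    entities = [name for name in KNOWN_COMPANIES if name in text]
    relations = [
        {"type": "invests_in", "source": m.group("src"), "target": m.group("dst")}
        for m in INVEST_RE.finditer(text)
    ]
    return {"text": text, "entities": entities, "relations": relations}


structured = extract("Company A invests in Company B.")
print(json.dumps(structured, indent=2))
```

The relation is directed: `source` is the investor and `target` is the company receiving the investment.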

NLG: template-based

Template-based natural language generation is relatively straightforward: the syntax and structure of the whole narrative document are defined by the template, and only local adjustments are made according to the specific data content.

Let's take an example. The figure above is a report generation diagram from Automated Insights' Wordsmith product. Four parts of the generated phrase can change according to the specific data values. For instance, there are three interchangeable words that express the meaning "have", and different adjectives can be chosen according to the size of the screen.
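A template of this kind can be imitated in a few lines. The sketch below is not Wordsmith's actual template language; the verb list, the size thresholds, and the function names are all assumptions chosen to mirror the description (a fixed sentence frame, interchangeable verbs for "have", and an adjective picked from the screen size).

```python
# Interchangeable verbs expressing "have" (assumed for this example).
HAVE_VERBS = ["has", "comes with", "features"]


def size_adjective(inches: float) -> str:
    """Pick an adjective from the screen size; thresholds are illustrative."""
    if inches >= 6.0:
        return "large"
    if inches >= 5.0:
        return "mid-sized"
    return "compact"


def describe(product: str, inches: float, variant: int = 0) -> str:
    """Fill the fixed sentence frame with data-dependent parts."""
    verb = HAVE_VERBS[variant % len(HAVE_VERBS)]  # rotate synonyms for variety
    adjective = size_adjective(inches)
    return f"The {product} {verb} a {adjective} {inches}-inch screen."


print(describe("Phone X", 6.2, variant=0))
print(describe("Phone Y", 4.7, variant=2))
```

The sentence structure never changes; only the product name, the verb, the adjective, and the number vary with the data, which is why template output tends to show recognizable traces.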

The next step beyond the basic template method is to bring in more external resources to assist document generation. The approach then evolves into natural language generation based on a knowledge base or knowledge graph.

NLG: based on knowledge graph

Natural language generation based on a knowledge graph is mainly divided into two stages: a data analysis stage and a language expression stage.

In the data analysis stage, we match and compare the structured data against the domain knowledge graph, establish associations, supplement the structured data, and screen out the information that is genuinely valuable and noteworthy.

In the language expression stage, the information must be expressed naturally and fluently. This is a relatively complex process, because it includes: document planning (deciding what information to express and in what order); choosing which data can be combined in one expression; choosing pronouns and other referring expressions to simplify the wording; and using the domain graph and reasoning rules to discover data points in the structured data that stand out, for example points far from the industry average.
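The last item, finding points far from the industry average, can be illustrated with a simple standard-score check. The company names, margin figures, and the 1.5 threshold below are invented for the example, and a real system would likely combine such statistics with domain rules.

```python
from statistics import mean, pstdev


def find_outliers(values: dict, threshold: float = 1.5) -> dict:
    """Return items whose distance from the average exceeds threshold
    standard deviations, mapped to their standard score."""
    numbers = list(values.values())
    average = mean(numbers)
    spread = pstdev(numbers)  # population standard deviation
    if spread == 0:
        return {}             # all values identical: nothing stands out
    return {
        name: round((value - average) / spread, 2)
        for name, value in values.items()
        if abs(value - average) / spread >= threshold
    }


# Hypothetical gross margins for five companies in one industry.
margins = {"A": 0.30, "B": 0.32, "C": 0.31, "D": 0.29, "E": 0.60}
print(find_outliers(margins))
```

Here only company E is flagged, so a generated report could single out its margin as noteworthy instead of listing all five values.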

Here is an example from the automobile industry. From a PDF we can determine that a company's main business is the production of automotive electrophoretic coatings. Combined with data from across the web, we can conclude that sales in the automobile industry are declining, and from that we can infer that the company's main business income will decline. The problem, however, is that building a domain knowledge base and reasoning rules is a very long and costly process, and artificial intelligence can contribute only a little to that construction. Machines can generate views on their own, but for a long time to come those views will not surpass human ones.
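The inference in this example can be sketched as one hand-written rule over a handful of knowledge-graph triples. Everything below (the company name, the predicate names, the rule itself) is a hypothetical illustration, and it also hints at why building such knowledge is expensive: each fact and each rule has to be supplied by someone.

```python
# Hypothetical knowledge-graph facts as (subject, predicate, object) triples.
FACTS = {
    ("Company X", "main_business", "automotive electrophoretic coating"),
    ("automotive electrophoretic coating", "serves_industry", "automobile industry"),
    ("automobile industry", "sales_trend", "declining"),
}


def objects(facts, subject, predicate):
    """All objects o such that (subject, predicate, o) is a known fact."""
    return [o for s, p, o in facts if s == subject and p == predicate]


def infer_revenue_outlook(facts, company):
    """Rule: if the industry a company's main business serves has declining
    sales, the company's main business income may decline too."""
    conclusions = []
    for business in objects(facts, company, "main_business"):
        for industry in objects(facts, business, "serves_industry"):
            if "declining" in objects(facts, industry, "sales_trend"):
                conclusions.append(
                    f"{company}'s main business income may decline because "
                    f"{industry} sales are declining."
                )
    return conclusions


print(infer_revenue_outlook(FACTS, "Company X"))
```

A company with no matching facts simply yields no conclusion, which reflects how dependent this approach is on the coverage of the knowledge base.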

So let's go back to the question, "Are analysts and journalists out of work?" The answer is no, because the strength of analysts and journalists lies in exploring and discovering insights, while the strength of machines lies in collecting and organizing data. At present, machines cannot replace analysts and journalists. In the distant future, with further development of artificial intelligence and breakthroughs in new technology, the answer may change.

Wenyin Internet's practice in automated reporting

Next, I will introduce our practice of automated reporting at different stages of investment research and tracking, which follows a round of interviews with practitioners on the New Third Board.

In our interviews with investors, we found that their needs mainly concern pre-investment industry research, research on enterprises newly under review and newly listed enterprises, continuous tracking of enterprises, and post-investment risk alerts. Behind all of these needs is an urgent wish to free people from the tedious work of collecting and organizing data and from data overload, so that investors can focus on building business logic and domain models, obtain the data they need in less time, and work more efficiently.

Therefore, given the characteristics of the New Third Board market (many enterprises, many announcements, few data fields, and little research coverage), we have launched three products: the Industry Dynamic Express, H5 visual annual and semi-annual reports, and analysis reports on listed enterprises.

Industry Dynamic Express

The Industry Dynamic Express aggregates changes within each industry segment. It mainly surfaces investment opportunities among enterprises newly under review, newly listed enterprises, and enterprises with new private placements. Because the New Third Board resembles an early-stage market, it also covers cross-market changes, for example by supplementing early-stage market data. In addition, it presents enterprise highlights.

H5 visual annual and semi-annual reports

There are more than 8,000 enterprises on the New Third Board. Many of them are noticed only on the day they list and then disappear completely from view. In addition, the investment research institutions covering the New Third Board do not have the manpower to cover every enterprise. It is therefore necessary to generate visual reports for all enterprises by machine, so that each enterprise's investment highlights are displayed more intuitively and information flows more smoothly between enterprises and investors.

Analysis report on listed enterprises

Here, an analysis report is generated for a company listed on the New Third Board. As we all know, a prospectus usually runs to more than 200 pages, and investors do not have time to read each one carefully. So we use artificial intelligence techniques such as information extraction from natural language to pull out key information such as the company's core technology and changes in its major customers. Combined with knowledge graph reasoning, for example about investment risk, a prospectus of more than 200 pages is turned into a more concise and intuitive enterprise analysis report, which saves investors time and improves their efficiency.
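As a very rough stand-in for the extraction step, the sketch below pulls out sentences that mention key topics by keyword. The topic names, keywords, and sample text are invented for illustration, and this is not Wenyin Internet's actual method, which the talk describes only at a high level.

```python
import re

# Hypothetical topics and trigger keywords (assumptions for this example).
KEY_TOPICS = {
    "core_technology": ["core technology", "patent"],
    "major_customers": ["major customer", "largest customer"],
}


def extract_key_sentences(document: str) -> dict:
    """Group the sentences of a document by the key topics they mention."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    hits = {topic: [] for topic in KEY_TOPICS}
    for sentence in sentences:
        lowered = sentence.lower()
        for topic, keywords in KEY_TOPICS.items():
            if any(keyword in lowered for keyword in keywords):
                hits[topic].append(sentence)
    return hits


sample = (
    "The company was founded in 2010. "
    "Its core technology is a coating process. "
    "The largest customer changed in 2015."
)
print(extract_key_sentences(sample))
```

Keyword matching shows the idea of reducing a long document to a few relevant passages, but it misses paraphrases, which is where trained extraction models and a domain knowledge graph come in.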

Editor in charge: Yan Zexu