A Self-Developed Log Collection System: Learning from Open-Source Frameworks

Posted by millikan at 2020-03-27

Stepping on the waves without trace

More than ten years of data R&D experience, specializing in data processing: crawlers, search engines, and high-concurrency big data applications. Has served as architect, R&D manager, and in other roles; has led the development of large-scale crawler, search-engine, and big data advertising DMP systems, and is currently responsible for building the company's data platform.

Project background

The company's projects need to collect and manage logs distributed across many machines. After using open-source projects such as Logstash and Flume, we eventually developed our own Java-based log collection system. Below is an analysis of the open-source systems versus self-development, from the perspective of our project's concerns.

Logstash and Flume are both very mature log collection platforms, with clear architectures, rich plug-in ecosystems, concise documentation, and plenty of sample code. Logstash focuses on the preprocessing of fields, while Flume focuses on delivering logs across different network topologies, linking the network nodes through agents.

Our development team works mainly in Java and Python. Logstash plug-ins are written in Ruby, which makes them hard for the team to extend; adding a Logstash plug-in was painful. After several months of use, we also found its performance low and its startup slow.

➦ Flume's performance is relatively low, mainly for the following reasons:

① Single-threaded plug-ins

Each Flume agent is divided into source, channel, sink, and other plug-ins, and each plug-in runs only a single processing thread. If the task involves IO operations such as database writes, performance inevitably suffers.

② The source's timer mechanism

When the source thread detects new updates, it reads them all into the channel; once every update has been processed, the thread exits, and a timer thread restarts it every 3 seconds, over and over. This fails to take advantage of Java's multithreaded notification mechanism, and every restart incurs scheduling, queuing, detection, and task initialization overhead that hurts performance.
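The notification mechanism the text refers to can be sketched with `wait()`/`notify()`: a long-lived consumer thread blocks until a producer signals it, instead of being torn down and restarted by a timer. This is a minimal illustration with hypothetical names, not Flume's or Bloodhound's actual code.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Minimal sketch: instead of restarting a source thread on a 3-second timer,
// a long-lived thread blocks in wait() and is woken by notify() as soon as a
// new event arrives, avoiding repeated startup costs.
public class NotifyingSource {
    private final Queue<String> pending = new ArrayDeque<>();

    public synchronized void publish(String event) {
        pending.add(event);
        notify();                 // wake the waiting consumer immediately
    }

    public synchronized String take() throws InterruptedException {
        while (pending.isEmpty()) {
            wait();               // sleep until publish() signals; no polling
        }
        return pending.poll();
    }

    public static void main(String[] args) throws Exception {
        NotifyingSource src = new NotifyingSource();
        new Thread(() -> src.publish("log line")).start();
        System.out.println(src.take());
    }
}
```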

③ Flume's transaction mechanism

Flume does optimize its transactions by allowing events to be committed in batches. In essence, however, it must still check the sink's processing result before committing or rolling back.
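The commit/rollback cycle can be shown schematically. This is an illustrative sketch of the pattern, with made-up types; it is not Flume's real Transaction API.

```java
import java.util.ArrayList;
import java.util.List;

// Schematic of the batch commit/rollback cycle: events leave the channel only
// after the sink's IO succeeds; on failure they stay for the next attempt.
// Hypothetical types -- not Flume's actual Transaction API.
public class BatchSink {
    static List<String> channel = new ArrayList<>(List.of("e1", "e2", "e3"));
    static List<String> delivered = new ArrayList<>();

    static void drainBatch(int batchSize) {
        List<String> batch =
            new ArrayList<>(channel.subList(0, Math.min(batchSize, channel.size())));
        try {
            delivered.addAll(batch);   // the sink's IO work (e.g. a DB write)
            channel.removeAll(batch);  // commit: events leave the channel
        } catch (RuntimeException e) {
            // rollback: events remain in the channel for the next attempt
        }
    }

    public static void main(String[] args) {
        drainBatch(2);
        System.out.println(delivered + " / remaining " + channel);
    }
}
```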

If we treat an agent's processing chain, source -> channel -> sink, as a "task" (an abstract concept that does not exist in Flume), then from a business perspective Flume is a single-task collection system. To process two tasks at the same time, you must start two Flume agent processes, and as the number of collection tasks grows, the management cost grows with it.

(Flume: multiple processes for multiple tasks)

(Bloodhound: a single process handling multiple tasks)

In addition, we have requirements for monitoring, statistics, task management, and so on, all of which need to connect to our Grafana platform. Taking all of this into account, we chose to develop our own log collection system.

The Bloodhound system

From wikipedia:

The Bloodhound is a large scent hound, originally bred for hunting deer, wild boar, and since the Middle Ages for tracking people. Believed to be descended from hounds once kept at the Abbey of Saint-Hubert, Belgium, it is known to French speakers as the Chien de Saint-Hubert.

This breed is famed for its ability to discern human scent over great distances, even days later. Its extraordinarily keen sense of smell is combined with a strong and tenacious tracking instinct, producing the ideal scent hound, and it is used by police and law enforcement all over the world to track escaped prisoners, missing people, lost children and lost pets.

We chose the hound with the keenest nose as the name because the system's job is just that: extracting preliminary, valuable information from all kinds of rough raw data, including streams.

Its design goals:

Multi-task management

Strong extensibility

Task monitoring

High performance

Project framework

System layering

To take full advantage of Flume's functional features, Bloodhound is likewise split into three layers: source -> channel -> sink. This design lets us reuse Flume's rich plug-in resources. Please refer to the following configuration file.
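The original article's configuration file was not reproduced here; the following is a hypothetical sketch of what a per-task definition in the three-layer style might look like, with all names invented for illustration.

```properties
# Hypothetical Bloodhound task definition (illustrative names only)
job.name = nginx-access
job.source.type = redis
job.source.queue = log:nginx-access
job.channel.type = memory          # or "file" for tasks that must not lose events
job.channel.capacity = 10000
job.sink.type = elasticsearch      # hypothetical sink plug-in
job.sink.threads = 4
job.qps.limit = 2000
```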

The source is the data input, usually a file, a message system, and so on. In the example the source is Redis: the source runs as a separate thread, fetches input from the specified Redis queue, and pushes what it reads to the channel. When the channel's queue is full, the source thread enters a wait state.
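The source thread's loop can be sketched with a bounded `BlockingQueue` standing in for the channel: `put()` blocks automatically when the channel is full, which is the "source enters wait" behaviour just described. The Redis client is faked with a second queue here (a real build would use a client such as Jedis); all names are illustrative.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the source thread: pull from the (fake) Redis queue, push into a
// bounded channel, and block when the channel is full.
public class RedisSourceSketch {
    public static void main(String[] args) throws Exception {
        BlockingQueue<String> fakeRedis = new ArrayBlockingQueue<>(16);
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(4); // bounded

        fakeRedis.put("event-1");
        fakeRedis.put("event-2");

        Thread source = new Thread(() -> {
            try {
                while (true) {
                    String event = fakeRedis.take(); // blocks until data arrives
                    channel.put(event);              // blocks while channel is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        source.setDaemon(true);
        source.start();

        System.out.println(channel.take());
        System.out.println(channel.take());
    }
}
```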

The channel is the hub connecting source and sink. Its main functions are:

Maintain a queue, accept events put by the source, and hand them to sinks for processing.

Manage a thread pool and schedule sink tasks. Because sinks are usually slow, the sink is the only multithreaded part of the whole core module; the rest is single-threaded.

Control QPS, using a token bucket or leaky-bucket (funnel) algorithm. The main purpose is to protect the database or Redis behind the sink under highly concurrent writes, so it is not put under too much pressure and normal business requests are unaffected.

Send metrics, providing a real-time data source for monitoring.
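The QPS cap above can be implemented with a token bucket. This is a minimal sketch of the technique, not Bloodhound's actual limiter code.

```java
// Minimal token-bucket sketch for the QPS cap described above: each event
// sent to the sink spends one token; tokens refill at the configured rate.
public class TokenBucket {
    private final long capacity;
    private final double refillPerNano;
    private double tokens;
    private long lastRefill;

    public TokenBucket(long qps) {
        this.capacity = qps;
        this.refillPerNano = qps / 1_000_000_000.0;
        this.tokens = qps;
        this.lastRefill = System.nanoTime();
    }

    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1) {
            tokens -= 1;          // spend one token per event
            return true;
        }
        return false;             // caller should wait, protecting the sink's DB
    }

    public static void main(String[] args) {
        TokenBucket bucket = new TokenBucket(5); // allow 5 events per second
        int granted = 0;
        for (int i = 0; i < 10; i++) {
            if (bucket.tryAcquire()) granted++;
        }
        System.out.println(granted); // only the initial burst of tokens is granted
    }
}
```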

The main methods of the channel layer are popEvents, addEvents, notifyEvents, sendMetrics, and so on.

The sink layer is a Runnable that accepts events and is scheduled by the channel to execute the final landing (persistence) logic.
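The channel-to-sink hand-off can be sketched with a fixed thread pool: the channel submits sink Runnables to the pool, making the sink the only multithreaded stage, as noted above. Names and sizes here are illustrative assumptions.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: the channel dispatches each sink task (a Runnable doing the slow
// landing IO) onto a shared thread pool.
public class ChannelDispatchSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService sinkPool = Executors.newFixedThreadPool(4);
        AtomicInteger landed = new AtomicInteger();

        List<String> batch = List.of("e1", "e2", "e3", "e4", "e5", "e6");
        for (String event : batch) {
            sinkPool.submit(() -> landed.incrementAndGet()); // stand-in for IO
        }
        sinkPool.shutdown();
        sinkPool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(landed.get());
    }
}
```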

Of the three layers, the channel layer includes a MemoryChannel and a FileChannel. For important tasks, the FileChannel should be chosen to guarantee that events are not lost if the process is interrupted; the MemoryChannel manages an in-memory queue and offers relatively high performance. The source and sink layers can reuse a lot of Flume's plug-in code.

The task manager is, as its name suggests, the management module for the whole log collection system.

A task can be submitted to the whole pipeline through the task registration interface. As shown in the configuration, the registration interface registers and starts a new task via an HTTP POST.

By default the source works in pull mode, pulling logs from files and queues; HTTP push is also supported. The data submission interface takes two parameters: jobname and events.
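A push-mode submission might look like the following. The article only states the two parameter names (jobname, events); the JSON shape and the endpoint URL below are assumptions for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch of a push-mode submission. The endpoint URL and JSON layout are
// hypothetical; only the jobname/events parameter names come from the article.
public class PushSubmitSketch {
    public static void main(String[] args) {
        String body = "{\"jobname\":\"nginx-access\","
                    + "\"events\":[\"line one\",\"line two\"]}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://bloodhound.internal/api/task/submit")) // assumed URL
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        System.out.println(request.method() + " " + request.uri());
    }
}
```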

The execution status of each task can be checked in Grafana; the metrics are provided by the core framework layer.

It provides a task list, task status viewing, and task start/stop.

Supervisor is used to manage the processes.

Depending on each business's situation, scheduled jobs are used to manage tasks, calling task start, stop, and so on through the task manager. This part is not closely related to the log collection core, so we will not go into it.

I have worked on many projects that required log collection and have used open-source systems such as Logstash and Flume. In general, the open-source systems are mature, with large plug-in ecosystems and transaction management, but they do not integrate closely with our own business systems. A self-developed framework takes a lot of work and comes with many pitfalls, but its advantage is better integration with the business.
