Graph Data Construction for Intelligent Threat Analysis

Posted by millikan at 2020-02-29

AI based on deep neural networks has achieved breakthroughs in many fields, but its application in network security remains limited. At the current stage, it is unrealistic to expect a hierarchical deep neural network to identify, correlate, and respond to threat events end to end. Zhou Tao, an algorithm expert, summarizes why AI is hard to put into practice for threat detection:

1. Machine learning is good at discovering normal patterns, but intrusion is abnormal behavior.

2. Having big data does not mean having plenty of labeled data, and unsupervised learning methods have limited accuracy.

3. Threat detection is an open-ended problem, so it is difficult to define a loss function.

4. Results are expected to be interpretable.

From the perspectives of model, data, and application scenario, these points explain why machine learning, and deep learning in particular, fits security modeling poorly. However, deep learning and machine learning are not the whole of AI. In building an intelligent threat analysis platform for cyberspace with anomaly perception, event reasoning, and threat response capabilities, deep learning and machine learning can serve as conventional weapons for data processing rather than as the core capability.

Data has always been the foundation of usable AI, and the typical "perception, cognition, action" mode of intelligent applications also operates on data. So, to build more automated and intelligent threat analysis, what data should we collect and analyze, and how should we organize it?

APT identification and tracking, attack tracing, threat hunting and response, gang analysis, situation awareness, and other security defense goals have far exceeded the scope of traditional isolated detection systems. Application logs, host logs, network logs, detection event logs, asset information, assessment results, and business-level information such as employee records have gradually been integrated into SIEM and UEBA solutions, while threat intelligence has become a standard part of detection capabilities. Ingesting and correlating multi-source heterogeneous data provides comprehensive support for event visualization, detection, reasoning, response, and governance. As data becomes richer, detection and correlation improve, and response becomes more automated, major vendors have begun to consider how to build intelligent security capabilities that reason over security data in a more general and automated way.

To build security intelligence, the first problem is how to organize data. We begin with a brief review of DIKW, the pyramid hierarchy of data, information, knowledge, and wisdom [1].

Figure 1 DIKW pyramid

The road from "data" to "wisdom" is long, but through continuous attack and defense confrontation, the security industry has laid a solid foundation that lets us touch the edge of security wisdom. Here, we roughly classify common security data according to the DIKW model:

Data layer (raw, unprocessed): application logs, host logs, network traffic logs, honeypot logs, network architecture data, business-layer data, etc.

Information layer (data processed by rule and behavior matching, with clear meaning and timeliness): detection logs of all kinds, including single-source detection and correlated detection; assessment logs of all kinds, including actively collected asset data, vulnerability data, etc.; threat intelligence.

Knowledge layer: various specifications and knowledge bases, such as CWE, CNNVD, CAPEC, ATT&CK, etc.

The classification above broadly describes the resources and levels available to current security data analysis. In this context, we use the word "data" for all digital resources used, covering the data, information, and knowledge layers together. The DIKW model describes the hierarchical structure of data and is also the most direct processing pattern in network security.

The goal of intelligent threat analysis technology is not to solve every network security problem or to replace people. It is to build the ultimate expert system by automating the analysis of security data and the triage of and response to threat events, thereby extending people's perception of security data, lowering the cost of digging into information and knowledge, improving people's ability to act on threat events, and genuinely moving network security protection from passive to active. As Professor Li Kang of the University of Georgia has said, many international vendors are accelerating their investment in intelligent security technology, hoping to absorb larger-scale, higher-dimensional data through ecosystem building to empower security. However, data acquisition is not the focus of intelligent threat analysis technology itself; how to organize and use data is the core issue.

The network environment itself has a typical graph structure, so network security problems combine naturally with graph data structures and graph algorithms. Since Google put forward the concept of the knowledge graph, intelligent applications based on knowledge graph technology have been widely used in recommendation systems, question answering systems, search engines, social networks, risk control, and other fields. In security, the most common graphs are the asset relationship graphs and attack vector graphs shown in the visualization interfaces of major security products. Vendors at home and abroad are also continuing to explore association and reasoning over graph data in depth. Microsoft Intelligent Security Graph almost completely dominates Google search results for the keywords "graph" + "security".

By comprehensively integrating its cloud ecosystem and platform and linking multi-party, multi-dimensional data, Microsoft Intelligent Security Graph provides comprehensive threat correlation information and uses cloud-based analysis to support real-time threat detection. It also provides an API that can be integrated quickly. At RSAC 2019, the Microsoft security team introduced the concept of data gravity, along with threat analysis algorithms for cloud environments based on detection and behavior graphs and machine learning, which can effectively assess event risk. Sqrrl (acquired by Amazon in January 2018) provides an online threat hunting platform. Drawing on the "behavior graph" concept from UEBA, Sqrrl uses behavior assessment and correlated data to support in-depth investigation of threat events. MITRE, which initiated and built multiple threat modeling knowledge bases (CAPEC, CWE, ATT&CK, etc.) and related languages and specifications (STIX 1.0, TAXII 1.0, etc.), has done in-depth research on graph models for security data. CyGraph [2] is MITRE's prototype system in this research. CyGraph uses a layered graph structure with four levels of graph data: network infrastructure, security posture, cyber threats, and mission dependencies, which support attack surface identification and attack situation understanding for key asset protection.

Other overseas projects that use multi-source security data to build a unified analysis graph include Cauldron [3]. Cauldron normalizes vulnerability scan results and can parse firewall rules in various formats. By analyzing these jointly with the network topology, it can effectively track dynamic changes in the network attack surface. In China, many products and research efforts also apply graph analysis to security data. For example, Green Alliance Technology has designed multiple ontologies to model and analyze threats across the whole network; these are compatible with MITRE's CAPEC, MAEC, and ATT&CK models, and can extract key information from multi-party threat intelligence to extend the knowledge graph. Alibaba uses aggregated raw alert data to generate a directed attack graph and, through the mapping of attack stages, the network distribution of assets, and the weights of related edges, evaluates alert priority and discovers attack scenarios.

Figure 2 CyGraph architecture

Today, the dimensions and scale of collectable network security data keep growing, so it is urgent to organize the data systematically and combine all available information into an organic whole as far as possible. Traditional data organization based on relational databases struggles with complex graph relationship operations. Organizing data as a graph makes the most of the graph properties inherent in security data and improves the efficiency of storage, mining, and retrieval. The graph nature embedded in network security data is the basis not only for data visualization but also for building the security intelligence needed to resist threats in cyberspace. So, which data graphs are needed to build intelligent threat analysis capability?

Figure 3 Construction of key data graphs

At present, access to large-scale, multi-dimensional network security data has created a new opportunity for dealing with network threat events. At the same time, with limited resources available, the selection and unified processing of security data sources is particularly important. Unlike DIKW's data hierarchy and CyGraph's security/mission knowledge stack, our approach starts from the nature of network attack and defense: it treats a given cyberspace as the battlefield and aims to protect assets (both physical and virtual) and counter threat agents. From this view, intelligent threat analysis should collect and build key data graphs along the following dimensions:

Environment data graph: assets, asset vulnerabilities, file information, user information, IT system architecture information, etc.

Behavior data graph: network-side detection alerts, endpoint-side detection alerts, file analysis logs, application logs, honeypot logs, sandbox logs, etc.

Intelligence data graph: external threat intelligence of all kinds.

Knowledge data graph: various knowledge bases (such as ATT&CK, CAPEC, CWE), etc.

Security-related data of all kinds (not limited to the four categories above) is already used in many big data analysis scenarios, but usually in isolation or only in part, and no unified system describes how this data is classified and used. The four types listed here start from the practice of analyzing and responding to network threat events. Organized as graphs, they allow correlation both within each graph and between different graphs, and can cover the basic tactical requirements of cyberspace confrontation: command of the environment, understanding of the threat actor's actions, fusion of external intelligence, and a basic reserve of knowledge. Keeping the four graphs separate while associating entities of specified types preserves the expressive power of each type of graph and enables global linking. Next, this article discusses why each of the four data graphs is necessary for building security intelligence.
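As a rough illustration of keeping the four graphs separate while linking entities across them, the sketch below stores each graph on its own and records cross-graph associations in a separate index. It is a minimal sketch, not the platform's actual design: the class names, node identifiers, entity types, and the sample links are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field

GRAPH_KINDS = ("environment", "behavior", "intelligence", "knowledge")


@dataclass
class TypedGraph:
    """One of the four graphs: nodes carry a type, edges carry a relation."""
    kind: str
    nodes: dict = field(default_factory=dict)  # node id -> entity type
    edges: list = field(default_factory=list)  # (source, relation, target)

    def add_node(self, node_id, entity_type):
        self.nodes[node_id] = entity_type

    def add_edge(self, source, relation, target):
        self.edges.append((source, relation, target))


class LinkedGraphs:
    """Four separate graphs plus an index of cross-graph associations."""

    def __init__(self):
        self.graphs = {kind: TypedGraph(kind) for kind in GRAPH_KINDS}
        self.links = defaultdict(set)  # (kind, id) -> {(kind, id), ...}

    def link(self, a, b):
        """Record an undirected association between nodes of two graphs."""
        for kind, node_id in (a, b):
            if node_id not in self.graphs[kind].nodes:
                raise KeyError(f"{node_id!r} is not in the {kind} graph")
        self.links[a].add(b)
        self.links[b].add(a)

    def context(self, start):
        """Return every node reachable from `start` via cross-graph links."""
        seen, stack = {start}, [start]
        while stack:
            for neighbor in self.links[stack.pop()]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        seen.discard(start)
        return seen


# Hypothetical sample data: one alert ties an asset to an indicator
# and to a knowledge-base node.
g = LinkedGraphs()
g.graphs["environment"].add_node("host-01", "asset")
g.graphs["behavior"].add_node("alert-1001", "alert")
g.graphs["intelligence"].add_node("198.51.100.7", "ip_indicator")
g.graphs["knowledge"].add_node("T1110", "attack_technique")
g.link(("behavior", "alert-1001"), ("environment", "host-01"))
g.link(("behavior", "alert-1001"), ("intelligence", "198.51.100.7"))
g.link(("behavior", "alert-1001"), ("knowledge", "T1110"))

related = g.context(("environment", "host-01"))
```

Starting from the asset, `context` pulls in the alert, the indicator, and the knowledge node, which is the kind of global linking the separation is meant to preserve.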

1 Environment data graph

"Environment" can be defined as the entities in the protected cyberspace, their attributes (basic information, vulnerabilities, compliance information, etc.), and the relationships between entities. Building the environment data graph requires the support of asset management, vulnerability management, risk assessment, and other tools and services, as well as business data such as enterprise organization information, IT system architecture information, and human resources information to enrich environment entities and establish their relationships.

Figure 4 Graph-based vulnerability analysis in Cauldron

Security protection is not only about building thicker firewalls and allocating more budget to withstand DDoS attacks that may occur at any time. Command of assets, asset vulnerabilities, user information, and IT architecture information often determines the upper limit of cyberspace defense capability. Axonius, a provider of asset management platform solutions, became the new RSAC Innovation Sandbox champion, which seems to remind us that asset management solutions are far less mature than they should be. Especially with the rapid development of cloud, the Internet of Things, and the mobile Internet, the number of assets has grown dramatically, asset types have multiplied, and vulnerability exposure has become more severe. "Knowing yourself" is more critical than "knowing the enemy". Both assets exposed to the public network and unmanaged "shadow assets" inside the boundary greatly increase security risk. To deal with pervasive threats, it is necessary to identify the key entities and relationships for security protection. Before and after a threat event, the potential scope and depth of its impact should be comprehensively evaluated to ensure accurate identification of the attack surface.
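One way to picture "scope and depth of impact" on an environment graph is a breadth-first search from a compromised asset over reachability edges. The sketch below is only an illustration under that assumption; the asset names and the `reachable` adjacency map are hypothetical, and a real environment graph would also carry vulnerabilities, users, and access rules.

```python
from collections import deque


def impact_scope(reachable, start, max_hops=None):
    """Breadth-first search returning {asset: hop distance} from `start`."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        asset = queue.popleft()
        # Stop expanding once the hop limit is reached.
        if max_hops is not None and distance[asset] >= max_hops:
            continue
        for neighbor in reachable.get(asset, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[asset] + 1
                queue.append(neighbor)
    return distance


# Hypothetical environment graph: "A can reach B" edges between assets.
reachable = {
    "web-server": ["app-server"],
    "app-server": ["db-server", "file-server"],
    "file-server": ["hr-laptop"],
    "unmanaged-nas": ["db-server"],
}

scope = impact_scope(reachable, "web-server")
one_hop = impact_scope(reachable, "web-server", max_hops=1)
```

The set of returned assets approximates the scope of impact, and the hop distance approximates its depth. Assets that never appear in the inventory, such as the hypothetical `unmanaged-nas`, are exactly the ones such an analysis cannot account for.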

2 Behavior data graph

"Behavior" can be defined as the actions of entities in the protected cyberspace that can be collected and detected. It covers the raw logs of the DIKW data layer, the detection alert logs of the information layer, and aggregated, inferred alert logs. An integrated UEBA and SIEM solution can meet the needs of behavior data collection.

Figure 5 Sqrrl: behavior graph

The importance of the behavior data graph is self-evident. Multi-dimensional behavior collection (from endpoint to network, from active to passive, from boundary to interior, from rules to statistical machine learning) can comprehensively depict the traces of entities' actions in cyberspace, which is the basic premise for identification, classification, response, and tracing tasks. Reasoning that generates new alert events by applying aggregation rules to multiple behavior sequences is already used in many scenarios. However, behavior correlation should not be limited to aggregating the behavior of a single entity; long-term correlation of behavior across multiple entities is the goal of behavior data analysis. From the standpoint of processing and storage efficiency, organizing multi-entity behavior vectors into a graph model is the necessary path. The granularity of behavior collection is largely determined by existing collection and detection capabilities. Here, provided that normalization and systematization are ensured, behavior collection should accept whatever is available. The main difference between the behavior graph and the environment, knowledge, and intelligence graphs is that behavior data stays relevant for a shorter time and is updated and added to more frequently. The keys to maximizing the value of the behavior graph are to construct the ontology model and entity relationships for behavior data, to design how behavior interacts with environment and knowledge, and to manage the life cycle of behavior graph data.
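The short-lived nature of behavior data and the need for life cycle management can be sketched with timestamped edges and a time-to-live. This is a minimal sketch under assumed names: `BehaviorEdge`, `BehaviorGraph`, the one-hour TTL, and the sample events are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BehaviorEdge:
    source: str       # acting entity
    action: str       # observed behavior
    target: str       # entity acted upon
    timestamp: float  # seconds since an arbitrary epoch


class BehaviorGraph:
    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.edges = []

    def add(self, edge):
        self.edges.append(edge)

    def expire(self, now):
        """Drop edges older than the TTL; return how many were removed."""
        kept = [e for e in self.edges if now - e.timestamp <= self.ttl_seconds]
        removed = len(self.edges) - len(kept)
        self.edges = kept
        return removed

    def entities_linked_to(self, entity):
        """Entities that share at least one behavior edge with `entity`."""
        linked = set()
        for e in self.edges:
            if e.source == entity:
                linked.add(e.target)
            elif e.target == entity:
                linked.add(e.source)
        return linked


# Hypothetical multi-entity behavior sequence.
graph = BehaviorGraph(ttl_seconds=3600)
graph.add(BehaviorEdge("user-a", "login", "host-01", timestamp=0))
graph.add(BehaviorEdge("host-01", "connect", "host-02", timestamp=3000))
graph.add(BehaviorEdge("host-02", "download", "file-x", timestamp=5000))

linked_before = graph.entities_linked_to("host-01")
removed = graph.expire(now=5400)
linked_after = graph.entities_linked_to("host-01")
```

After expiry, the old login edge is gone and `host-01` is linked only to `host-02`. Choosing the TTL is a trade-off: long-term multi-entity correlation argues for keeping edges longer, while storage and update frequency argue for pruning.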

3 Intelligence data graph

Different types of threat intelligence may lead to different interpretations of the concept of intelligence. Here, for the definition, we refer to Gartner's 2014 market guide for security threat intelligence services: "Threat intelligence is evidence-based knowledge, including context, mechanisms, indicators, impact, and actionable recommendations. Threat intelligence describes an existing or imminent threat or hazard to assets, and can be used to inform the subject's response to that threat or hazard." Based on this definition, threat intelligence and the various knowledge bases have different emphases but also overlap.

Threat intelligence can broaden the threat view available to the security team and improve the triage of security events by supplying more threat context. Threat intelligence has become an important strategic and commercial resource, widely used in security operations, situation awareness, threat analysis, risk assessment, attack attribution, and other fields. It is worth noting that different threat intelligence providers understand threat intelligence along different dimensions and to different depths. For a usable intelligence data graph, accuracy and timeliness matter more than sheer volume. Selecting threat intelligence sources that fit the specific business scenario to build a targeted intelligence graph is the key to improving efficiency and usability.

4 Knowledge data graph

Figure 6 ATT&CK element diagram [4]

The concepts of knowledge and intelligence often overlap. Here, we use "knowledge data" for security data that has been generalized, can be used for reasoning, and is only weakly time-dependent, including knowledge bases such as ATT&CK and CAPEC, and enumeration bases such as CWE, CVE, and CNNVD. Building a knowledge base usually depends on expert experience and on collecting, verifying, and refining threat intelligence; the resulting abstract concepts and relationships form a general basis for modeling. Building and sharing knowledge bases has become a consensus in the security industry. Analyzing threat events with the support of a knowledge graph extends the conceptual and data context of related entities in the behavior, environment, and intelligence graphs, making automated, intelligent analysis genuinely interpretable, inferable, actionable, and reusable. Compared with the more commercial threat intelligence, knowledge bases can draw on open data or open source projects, and many institutions at home and abroad are committed to building broader and more specialized threat-related knowledge bases, such as CAPEC, CVE, CNNVD, and ATT&CK. In addition, knowledge graphs can be extracted and built automatically from multi-source data using knowledge graph techniques, and expanded through methods such as relational reasoning.
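Expansion through relational reasoning can be as simple as composing two known relations into a new one. The sketch below derives a new edge whenever a chain of two relations exists. It is an assumption-laden illustration: the relation names are invented, and the identifiers are placeholders, not real CVE, CWE, or CAPEC entries.

```python
def infer_composed(triples, first, second, inferred):
    """For (a, first, b) and (b, second, c), derive (a, inferred, c)."""
    # Index the objects of the second relation by their subject.
    by_subject = {}
    for subject, relation, obj in triples:
        if relation == second:
            by_subject.setdefault(subject, set()).add(obj)

    derived = set()
    for subject, relation, obj in triples:
        if relation == first:
            for target in by_subject.get(obj, ()):
                derived.add((subject, inferred, target))
    # Report only triples that are not already in the graph.
    return derived - set(triples)


# Placeholder identifiers and hypothetical relation names.
triples = {
    ("CVE-EXAMPLE-1", "has_weakness", "CWE-EXAMPLE-A"),
    ("CVE-EXAMPLE-2", "has_weakness", "CWE-EXAMPLE-A"),
    ("CWE-EXAMPLE-A", "targeted_by", "CAPEC-EXAMPLE-X"),
}

new_triples = infer_composed(triples, "has_weakness", "targeted_by", "may_enable")
```

Each derived triple is a candidate rather than a fact, so in practice it would be stored with its provenance and reviewed before being treated as knowledge.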

The purpose of researching intelligent threat analysis technology for cyberspace is neither to design a dazzling concept nor to deliver an AI security model that works everywhere. Returning to the battlefield of attack and defense, what we hope for is a unified, highly automated platform and tool chain that can handle massive, heterogeneous, multi-source data; quickly detect, reason about, respond to, and track threat events; and assist people in security operations, research, and countermeasures. Based on practical experience and a reclassification of the data sources common in network security data analysis, this article proposes the four key data graphs (environment, behavior, intelligence, and knowledge) needed to build the graph model of an intelligent security platform, in support of further research on "intelligent" security.

Of course, a usable and extensible graph data architecture needs not only infrastructure such as data processing and storage frameworks, but also data association and interaction between the different kinds of data graphs. On the one hand, we need to design and optimize the ontology library, which serves as the structured concept template for the schema layer of the graph. On the other hand, we need to use unified and extensible specifications and languages (such as STIX, MAEC, and the "Information security technology: Network security threat information format specification") to describe the instances in the data layer of the graph and to exchange data through unified interfaces. In addition, association between different data layers needs a standard naming and classification system. For example, an enterprise's custom IOC detection alerts need to be mapped to designated knowledge nodes in the knowledge base. These tasks challenge traditional network security architecture and implementation. Finally, from the perspective of building an intelligent security ecosystem, industry standards for intelligent security technology need to be established across more dimensions, such as data, technology, architecture, and laws and regulations, to enable in-depth sharing and interaction of security big data and achieve real industry-wide intelligence.
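The mapping of enterprise-specific alerts to knowledge nodes can be pictured as a lookup table that attaches a knowledge-base reference to each alert and reports rules that still lack one. The sketch below assumes hypothetical rule names and a hypothetical record layout; it does not follow STIX or any other specification, and the technique IDs are used only as example targets.

```python
# Hypothetical mapping from enterprise rule names to knowledge-base nodes.
ALERT_TO_KNOWLEDGE = {
    "corp_ssh_bruteforce": ("ATT&CK", "T1110"),
    "corp_mail_phish_link": ("ATT&CK", "T1566"),
}


def normalize_alert(raw):
    """Attach a knowledge-graph reference to a raw alert, if one is mapped."""
    mapped = ALERT_TO_KNOWLEDGE.get(raw["rule"])
    reference = None
    if mapped is not None:
        reference = {"base": mapped[0], "id": mapped[1]}
    return {"rule": raw["rule"], "asset": raw["asset"], "knowledge_ref": reference}


def unmapped_rules(alerts):
    """List rule names that still need a knowledge node assigned."""
    return sorted({a["rule"] for a in alerts if a["rule"] not in ALERT_TO_KNOWLEDGE})


# Hypothetical raw alerts from an enterprise detection system.
alerts = [
    {"rule": "corp_ssh_bruteforce", "asset": "host-01"},
    {"rule": "corp_custom_dns", "asset": "host-02"},
]

normalized = [normalize_alert(a) for a in alerts]
missing = unmapped_rules(alerts)
```

Keeping the list of unmapped rules visible matters as much as the mapping itself, because an alert without a knowledge reference cannot take part in cross-graph reasoning.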

References:

[1] Rowley, J. The wisdom hierarchy: representations of the DIKW hierarchy[J]. Journal of information science, 2007, 33(2): 163-180.

[2] Noel S, Harley E, Tam K H, et al. CyGraph: Graph-Based Analytics and Visualization for Cybersecurity. Cognitive Computing: Theory and Applications. Elsevier, 2016.

[3] Jajodia S, Noel S, Kalapa P, et al. Cauldron mission-centric cyber situational awareness with defense in depth[C]. MILCOM 2011 Military Communications Conference, 2011. 1339-1344.

[4] MITRE ATT&CK: Design and Philosophy (

Content editor: Zhang Runzi, Tianshu Laboratory. Editor in charge: Xiao Qing
