Preface
SIEM (Security Information and Event Management), as the name implies, is a management system for security information and events. For most enterprises it is not a cheap security system. Based on the author's experience, this article shows how to use open-source software to analyze data offline and how to use algorithms to mine unknown attacks.
Review system architecture
Taking web server logs as an example: Logstash collects the web server's access logs and backs them up to an HDFS cluster in near real time, and Hadoop scripts then analyze them offline for attack behavior.
Custom log format
Configure httpd with a custom log format that records the User-Agent and Referer:
<IfModule logio_module>
# You need to enable mod_logio.c to use %I and %O
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %I %O" combinedio
</IfModule>
CustomLog "logs/access_log" combined
Examples of logs
180.76.152.166 - - [26/Feb/2017:13:12:37 +0800] "GET /wordpress/ HTTP/1.1" 200 17443 "http://180.76.190.79:80/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.21 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.21"
180.76.152.166 - - [26/Feb/2017:13:12:37 +0800] "GET /wordpress/wp-json/ HTTP/1.1" 200 51789 "-" "print `env`"
180.76.152.166 - - [26/Feb/2017:13:12:38 +0800] "GET /wordpress/wp-admin/load-styles.php?c=0&dir=ltr&load[]=dashicons,buttons,forms,l10n,login&ver=Li4vLi4vLi4vLi4vLi4vLi4vLi4vLi4vLi4vLi4vZXRjL3Bhc3N3ZAAucG5n HTTP/1.1" 200 35841 "http://180.76.190.79:80/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.21 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.21"
180.76.152.166 - - [26/Feb/2017:13:12:38 +0800] "GET /wordpress/ HTTP/1.1" 200 17442 "http://180.76.190.79:80/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.21 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.21"
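To make the fields in these log lines explicit, here is a minimal Python sketch that splits one line of the combined format into named fields. The regular expression and the function name `parse_line` are illustrative assumptions, not part of the original pipeline, and the pattern does not handle escaped quotes inside the User-Agent.

```python
import re

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = COMBINED_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Split 'GET /path HTTP/1.1' into method, path and protocol when possible.
    parts = fields["request"].split()
    if len(parts) == 3:
        fields["method"], fields["path"], fields["protocol"] = parts
    return fields


# One of the sample lines above (the one with the suspicious User-Agent).
sample = ('180.76.152.166 - - [26/Feb/2017:13:12:37 +0800] '
          '"GET /wordpress/wp-json/ HTTP/1.1" 200 51789 "-" "print `env`"')
record = parse_line(sample)
print(record["path"], record["referer"], record["agent"])
```

The path and the referer are the two fields the later steps rely on.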
Test environment
Add a test file, 1.php, under the WordPress directory (wp-admin/1.php in the logs below); its content is a phpinfo() call.
Access log entries for 1.php:
[root@instance-8lp4smgv logs]# cat access_log | grep 'wp-admin/1.php'
125.33.206.140 - - [26/Feb/2017:13:09:47 +0800] "GET /wordpress/wp-admin/1.php HTTP/1.1" 200 17 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
125.33.206.140 - - [26/Feb/2017:13:11:19 +0800] "GET /wordpress/wp-admin/1.php HTTP/1.1" 200 17 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
125.33.206.140 - - [26/Feb/2017:13:13:44 +0800] "GET /wordpress/wp-admin/1.php HTTP/1.1" 200 17 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
127.0.0.1 - - [26/Feb/2017:13:14:19 +0800] "GET /wordpress/wp-admin/1.php HTTP/1.1" 200 17 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
127.0.0.1 - - [26/Feb/2017:13:16:04 +0800] "GET /wordpress/wp-admin/1.php HTTP/1.1" 200 107519 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
125.33.206.140 - - [26/Feb/2017:13:16:12 +0800] "GET /wordpress/wp-admin/1.php HTTP/1.1" 200 27499 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
[root@instance-8lp4smgv logs]#
Hadoop offline processing
Hadoop is based on the MapReduce model: a map step extracts records from each input line, and a reduce step aggregates them.
Map script
localhost:work maidou$ cat mapper-graph.pl
#!/usr/bin/perl -w
# Sample input line:
#180.76.152.166 - - [26/Feb/2017:13:12:37 +0800] "GET /wordpress/ HTTP/1.1" 200 17443 "http://180.76.190.79:80/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.21 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.21"
my $line = "";
while ($line = <>)
{
    # Successful (2xx) GET requests only: capture the request path and the referer
    if ( $line =~ /"GET (\S+) HTTP\/1\.[01]" 2\d+ \d+ "(\S+)"/ )
    {
        my $path = $1;
        my $ref  = $2;
        # Strip the query string from the path and from the referer
        if ( $path =~ /(\S+)\?(\S+)/ )
        {
            $path = $1;
        }
        if ( $ref =~ /(\S+)\?(\S+)/ )
        {
            $ref = $1;
        }
        # Keep requests referred from this site, or with an empty referer
        if ( ($ref =~ /^http:\/\/180/) || ("-" eq $ref) )
        {
            my $output = $ref."::".$path."\n";
            print($output);
        }
    }
}
Reducer script
localhost:work maidou$ cat reducer-graph.pl
#!/usr/bin/perl -w
my %result;
my $line = "";
# De-duplicate the "referer::path" pairs emitted by the mapper
while ($line = <>)
{
    if ( $line =~ /(\S+)\:\:(\S+)/ )
    {
        unless ( exists($result{$line}) )
        {
            $result{$line} = 1;
        }
    }
}
foreach my $key (sort keys %result)
{
    if ( $key =~ /(\S+)\:\:(\S+)/ )
    {
        my $ref  = $1;
        my $path = $2;
        # Example filter: keep only the file suffixes that webshells commonly use,
        # typically PHP and JSP. A whitelist of suffixes risks missing some files;
        # alternatively, use a blacklist to exclude the file types you want to ignore.
        if ( $path =~ /(\.php)$/ )
        {
            my $output = $ref." -> ".$path."\n";
            print($output);
        }
    }
}
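The map and reduce steps can also be tried locally, without a cluster. The following Python sketch loosely mirrors the two Perl scripts. The function names, the `site_prefix` parameter, and the shortened log lines (User-Agent replaced by "UA") are assumptions made for illustration. Unlike the Perl regular expression, `strip_query` cuts at the first `?`.

```python
import re

# Same idea as the Perl mapper: successful GET requests, capture path and referer.
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/1\.[01]" 2\d+ \d+ "(\S+)"')


def strip_query(url):
    """Drop everything from the first '?' onward."""
    return url.split("?", 1)[0]


def map_line(line, site_prefix="http://180"):
    """Map step: return 'referer::path' for a matching line, else None."""
    m = REQUEST_RE.search(line)
    if m is None:
        return None
    path, ref = strip_query(m.group(1)), strip_query(m.group(2))
    if ref == "-" or ref.startswith(site_prefix):
        return ref + "::" + path
    return None


def reduce_pairs(pairs, suffix=".php"):
    """Reduce step: de-duplicate, sort, and keep paths ending in suffix."""
    edges = []
    for pair in sorted(set(pairs)):
        ref, path = pair.rsplit("::", 1)
        if path.endswith(suffix):
            edges.append(ref + " -> " + path)
    return edges


# Shortened versions of the log lines shown earlier.
lines = [
    '180.76.152.166 - - [26/Feb/2017:13:12:37 +0800] "GET /wordpress/ HTTP/1.1" 200 17443 "http://180.76.190.79:80/" "UA"',
    '125.33.206.140 - - [26/Feb/2017:13:09:47 +0800] "GET /wordpress/wp-admin/1.php HTTP/1.1" 200 17 "-" "UA"',
    '125.33.206.140 - - [26/Feb/2017:13:11:19 +0800] "GET /wordpress/wp-admin/1.php HTTP/1.1" 200 17 "-" "UA"',
    '180.76.152.166 - - [26/Feb/2017:13:12:38 +0800] "GET /wordpress/wp-admin/load-styles.php?c=0&dir=ltr HTTP/1.1" 200 35841 "http://180.76.190.79:80/" "UA"',
]
pairs = [p for p in map(map_line, lines) if p is not None]
edges = reduce_pairs(pairs)
for edge in edges:
    print(edge)
```

The repeated request for 1.php collapses into a single edge, and the request for /wordpress/ is dropped by the suffix filter.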
An example of the generated result is:
localhost:work maidou$ cat r-graph.txt
- -> http://180.76.190.79/wordpress/wp-admin/1.php
- -> http://180.76.190.79/wordpress/wp-admin/admin-ajax.php
- -> http://180.76.190.79/wordpress/wp-admin/customize.php
- -> http://180.76.190.79/wordpress/wp-admin/load-styles.php
- -> http://180.76.190.79/wordpress/wp-admin/post-new.php
- -> http://180.76.190.79/wordpress/wp-login.php
http://180.76.190.79/wordpress/ -> http://180.76.190.79/wordpress/wp-admin/edit-comments.php
http://180.76.190.79/wordpress/ -> http://180.76.190.79/wordpress/wp-admin/profile.php
http://180.76.190.79/wordpress/ -> http://180.76.190.79/wordpress/wp-login.php
http://180.76.190.79/wordpress/ -> http://180.76.190.79/wordpress/xmlrpc.php
http://180.76.190.79/wordpress/wp-admin/ -> http://180.76.190.79/wordpress/wp-admin/edit.php
http://180.76.190.79/wordpress/wp-admin/ -> http://180.76.190.79/wordpress/wp-login.php
http://180.76.190.79/wordpress/wp-admin/customize.php -> http://180.76.190.79/wordpress/wp-admin/load-scripts.php
http://180.76.190.79/wordpress/wp-admin/customize.php -> http://180.76.190.79/wordpress/wp-admin/load-styles.php
http://180.76.190.79/wordpress/wp-admin/edit-comments.php -> http://180.76.190.79/wordpress/wp-admin/load-scripts.php
http://180.76.190.79/wordpress/wp-admin/edit-comments.php -> http://180.76.190.79/wordpress/wp-admin/post-new.php
http://180.76.190.79/wordpress/wp-admin/edit.php -> http://180.76.190.79/wordpress/wp-admin/index.php
http://180.76.190.79/wordpress/wp-admin/edit.php -> http://180.76.190.79/wordpress/wp-admin/post-new.php
http://180.76.190.79/wordpress/wp-admin/index.php -> http://180.76.190.79/wordpress/wp-admin/customize.php
http://180.76.190.79/wordpress/wp-admin/post-new.php -> http://180.76.190.79/wordpress/wp-admin/load-scripts.php
http://180.76.190.79/wordpress/wp-admin/post-new.php -> http://180.76.190.79/wordpress/wp-admin/post.php
http://180.76.190.79/wordpress/wp-admin/post.php -> http://180.76.190.79/wordpress/wp-admin/admin-ajax.php
http://180.76.190.79/wordpress/wp-admin/post.php -> http://180.76.190.79/wordpress/wp-admin/edit.php
http://180.76.190.79/wordpress/wp-admin/post.php -> http://180.76.190.79/wordpress/wp-admin/load-scripts.php
http://180.76.190.79/wordpress/wp-admin/post.php -> http://180.76.190.79/wordpress/wp-admin/post.php
http://180.76.190.79/wordpress/wp-admin/profile.php -> http://180.76.190.79/wordpress/wp-admin/load-scripts.php
http://180.76.190.79/wordpress/wp-login.php -> http://180.76.190.79/wordpress/wp-admin/load-styles.php
http://180.76.190.79/wordpress/wp-login.php -> http://180.76.190.79/wordpress/wp-login.php
http://180.76.190.79:80/ -> http://180.76.190.79/wordpress/wp-admin/load-styles.php
http://180.76.190.79:80/ -> http://180.76.190.79/wordpress/wp-login.php
http://180.76.190.79:80/ -> http://180.76.190.79/wordpress/xmlrpc.php
Graph algorithm
Import the generated data into the graph database Neo4j. A webshell typically shows one of the following characteristics:
- In-degree and out-degree are both 0
- In-degree and out-degree are both 1, and the node points to itself
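The two characteristics can be checked on a small edge list before any graph database is involved. This is a minimal sketch under stated assumptions: the function name `suspicious_pages` and the sample edges (including the hypothetical `/shell.php`) are made up for illustration, and edges whose source is the empty referer "-" are not counted toward a page's degree.

```python
from collections import defaultdict


def suspicious_pages(edges, empty_ref="-"):
    """Return pages that are isolated or that only link to themselves.

    edges is an iterable of (source, target) pairs. Edges whose source is the
    empty referer are ignored for degree counting, but their target is still
    registered as a page.
    """
    nodes = set()
    preds = defaultdict(set)  # page -> pages that link to it
    succs = defaultdict(set)  # page -> pages it links to
    for src, dst in edges:
        nodes.add(dst)
        if src == empty_ref:
            continue
        nodes.add(src)
        succs[src].add(dst)
        preds[dst].add(src)

    flagged = []
    for node in sorted(nodes):
        isolated = not preds[node] and not succs[node]
        self_only = preds[node] == {node} and succs[node] == {node}
        if isolated or self_only:
            flagged.append(node)
    return flagged


sample_edges = [
    ("-", "/wp-admin/1.php"),            # only ever requested directly
    ("-", "/wp-login.php"),
    ("/", "/wp-login.php"),
    ("/wp-login.php", "/wp-login.php"),
    ("/shell.php", "/shell.php"),        # hypothetical page that only posts to itself
]
print(suspicious_pages(sample_edges))
```

A normal page such as the login form is reached from other pages, so it is not flagged even though it also links to itself.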
Neo4j
Neo4j is a high-performance NoSQL graph database that stores structured data as a network (graph) rather than in tables. It has attracted growing attention because it is embeddable, fast, and lightweight.
Neo4j installation
Download the installation package from https://neo4j.com/ and install it. The default configuration is fine.
Starting Neo4j
Taking my Mac as an example, you can start it from the GUI. The default username/password is neo4j / neo4j. The first time you log in, you will be asked to change the password.
GUI management interface
Python API library installation
sudo pip install neo4j-driver
Download JPype
https://pypi.python.org/pypi/JPype1
Install JPype
tar -zxvf JPype1-0.6.2.tar.gz
cd JPype1-0.6.2
sudo python setup.py install
The code of importing data into the graph database is as follows:
B0000000B60544:freebuf liu.yan$ cat load-graph.py
import re
# neo4j-driver 1.x API; newer driver releases import GraphDatabase from the top-level neo4j package
from neo4j.v1 import GraphDatabase, basic_auth

nodes = {}
index = 1

driver = GraphDatabase.driver("bolt://localhost:7687", auth=basic_auth("neo4j", "maidou"))
session = driver.session()
file_object = open('r-graph.txt', 'r')
try:
    for line in file_object:
        # Each line has the form "<referring page> -> <requested page>"
        matchObj = re.match(r'(\S+) -> (\S+)', line, re.M | re.I)
        if matchObj:
            src = matchObj.group(1)
            dst = matchObj.group(2)
            # Create each Page node only once
            if src in nodes:
                src_node = nodes[src]
            else:
                src_node = "Page%d" % index
                nodes[src] = src_node
                sql = "create (%s:Page {url:\"%s\", id:\"%d\", in:0, out:0})" % (src_node, src, index)
                index = index + 1
                session.run(sql)
            if dst in nodes:
                dst_node = nodes[dst]
            else:
                dst_node = "Page%d" % index
                nodes[dst] = dst_node
                sql = "create (%s:Page {url:\"%s\", id:\"%d\", in:0, out:0})" % (dst_node, dst, index)
                index = index + 1
                session.run(sql)
            # Look both nodes up by url: Cypher variables do not persist across separate statements
            sql = "match (a:Page {url:\"%s\"}), (b:Page {url:\"%s\"}) create (a)-[:IN]->(b)" % (src, dst)
            session.run(sql)
            # Update the out-degree of the source and the in-degree of the target
            sql = "match (n:Page {url:\"%s\"}) SET n.out=n.out+1" % src
            session.run(sql)
            sql = "match (n:Page {url:\"%s\"}) SET n.in=n.in+1" % dst
            session.run(sql)
finally:
    file_object.close()
    session.close()
The generated digraph is as follows
Query for nodes with an in-degree of 1 and an out-degree of 0, or for nodes with an in-degree of 1 and an out-degree of 1 that point to themselves. Because an empty ref is also represented as a "-" node, a node whose in-degree and out-degree would otherwise both be 0 appears with an in-degree of 1 and an out-degree of 0.
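One way to express this lookup is a single Cypher query run through the Python driver. This is a sketch, not the author's query: it assumes the `in` and `out` properties set by the import script, and it assumes the `:IN` relationships connect the `Page` nodes themselves. The name `find_suspicious` is hypothetical, and the backticks around `in` are only a precaution because IN is also a Cypher keyword.

```python
# Nodes with in-degree 1 and out-degree 0, or in-degree 1 and out-degree 1
# with a relationship pointing back to themselves.
SUSPICIOUS_QUERY = (
    "MATCH (n:Page) "
    "WHERE n.`in` = 1 "
    "AND (n.out = 0 OR (n.out = 1 AND (n)-[:IN]->(n))) "
    "RETURN n.url AS url ORDER BY url"
)


def find_suspicious(session):
    """Run the query on an open driver session and return the matching URLs."""
    return [record["url"] for record in session.run(SUSPICIOUS_QUERY)]
```

With the `session` object from the import script, `find_suspicious(session)` would return the candidate URLs for manual review.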
Optimization point
In production use, we ran into the following kinds of false positives:
- Home pages and various index pages
- Operations and maintenance consoles such as phpMyAdmin and Zabbix
- Consoles of open-source software such as Hadoop and ELK
- API endpoints
These can be resolved quickly by adding whitelist entries. More troublesome is the effect of scanners on the results. Removing that interference requires scanner fingerprints or more advanced human-versus-machine detection algorithms.
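A whitelist and a simple scanner fingerprint check might look like the sketch below. The patterns and User-Agent markers are illustrative assumptions; a real deployment would maintain its own lists.

```python
import re

# Hypothetical whitelist of paths known to produce false positives.
WHITELIST_PATTERNS = [
    re.compile(r"/index\.\w+$"),                 # home and index pages
    re.compile(r"/phpmyadmin/", re.IGNORECASE),  # phpMyAdmin console
    re.compile(r"/zabbix/", re.IGNORECASE),      # Zabbix console
    re.compile(r"^/api/"),                       # API endpoints
]

# Substrings that some common scanners put in their default User-Agent.
SCANNER_AGENT_MARKERS = ("sqlmap", "nikto", "nmap")


def is_whitelisted(path):
    """True if the path matches any whitelist pattern."""
    return any(p.search(path) for p in WHITELIST_PATTERNS)


def is_scanner(user_agent):
    """True if the User-Agent carries a known scanner marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in SCANNER_AGENT_MARKERS)


def filter_alerts(paths):
    """Drop whitelisted paths from a list of suspicious paths."""
    return [p for p in paths if not is_whitelisted(p)]


alerts = [
    "/wordpress/wp-admin/1.php",
    "/index.php",
    "/phpMyAdmin/sql.php",
    "/api/v1/users",
]
print(filter_alerts(alerts))
```

`is_scanner` would be applied to log records before the map step, so that scanner traffic never reaches the graph, while `filter_alerts` is applied to the final list of suspicious pages.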
Epilogue
Using algorithms to mine unknown attack behavior is currently a very popular research direction. This article introduces only one algorithm that is easy to understand and implement. I am not the first to come up with it, and many security companies already use it to some extent. Space is limited, so I will introduce other algorithms in later articles on enterprise security construction. Algorithms and machine learning essentially capture statistical regularities in large data sets, so precise alerting is hard to achieve with them alone. For now they need support from various rules and models, but they are a powerful tool for mining unknown attacks.