How to use OnionScan to build a custom dark web crawler

Posted by punzalan at 2020-03-26


OnionScan 0.2 has been released! This article introduces one of its newest and best features: custom crawlers.

Enough talk, let's get started!

OnionScan ships with many kinds of identifier detection for dark web services, such as Bitcoin addresses, PGP keys, and email addresses.

However, many services publish their data in non-standard formats, which makes automated processing difficult.

OnionScan solves this problem by letting you define custom relationships for each site. These relationships are then imported into its correlation engine for later use.


As an example, let's take a look at Hansa Market. For our crawler, we want to know the names of the goods on sale, which categories they belong to, and who is selling them. Browsing the site shows that all of this information is available on the /listing product pages.

Previously, we would have had to build a custom web crawler to extract the information, process it, and put it into a table before we could analyze it. In OnionScan 0.2, we only need to define a configuration file.

{
  "onion": "hansamkt2rr6nfg3.onion",
  "base": "/",
  "exclude": ["/forums", "/support", "/login", "/register", "?showFilters=true", "/img", "/inc", "/css", "/link", "/dashboard", "/feedback", "/terms", "/message"],
  "relationships": [
    {
      "name": "Listing",
      "triggeridentifierregex": "/listing/([0-9]*)/",
      "extrarelationships": [
        {
          "name": "Title",
          "type": "listing-title",
          "regex": "<h2>(.*)</h2>"
        },
        {
          "name": "Vendor",
          "type": "username",
          "regex": "<a href=\"/vendor/([^/]*)/\">"
        },
        {
          "name": "Price",
          "type": "price",
          "regex": "<strong>(USD [^<]*)</strong>"
        },
        {
          "name": "Category",
          "type": "category",
          "regex": "<li><a href=\"/category/[0-9]*/\">([^<]*)</a></li>",
          "rollup": true
        }
      ]
    }
  ]
}

Let's explain it step by step:

The first two lines specify the dark web service to scan. Here our target is "onion": "hansamkt2rr6nfg3.onion". As for the base URL, we want to start scanning from the root directory ("base": "/"). Some dark web services only have useful data under a subdirectory, such as /lists. In that case we can use the base parameter to tell OnionScan to skip the rest of the site.

Next, exclude tells OnionScan to skip links such as "/forums", "/support", "/login", and "/register", because the content behind them is of no interest to us.
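To make the effect of exclude concrete, here is a minimal Python sketch. It assumes that an entry excludes a link whenever it appears anywhere in the link as a substring (which would explain entries like "?showFilters=true"); OnionScan's actual matching logic may differ, and the function name and sample paths are made up for illustration.

```python
# Hypothetical illustration only: assumes substring matching,
# which may not be exactly how OnionScan applies "exclude".
EXCLUDE = ["/forums", "/support", "/login", "/register", "?showFilters=true"]


def is_excluded(url_path, exclude=EXCLUDE):
    """Return True if any exclude fragment occurs anywhere in the path."""
    return any(fragment in url_path for fragment in exclude)


# Made-up sample links, not taken from a real crawl.
candidates = [
    "/listing/1234/",
    "/forums/thread/5",
    "/category/7/?showFilters=true",
    "/login",
]
to_crawl = [path for path in candidates if not is_excluded(path)]
print(to_crawl)
```

Under that assumption, only the /listing/ link survives the filter.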

Finally, the relationships parameter is the key part of our crawler.

Relationships are defined by a name parameter and a triggeridentifierregex parameter. The regular expression is applied to the URLs of the site, and the relationship is triggered whenever it matches. In this example, we tell OnionScan to trigger the Listing relationship whenever a URL matches "/listing/([0-9]*)/". OnionScan also uses the number captured by ([0-9]*) as the unique identifier of the relationship.

Each relationship can also have an extrarelationships parameter. These are additional pieces of information that OnionScan looks for on the page and attaches to the unique identifier extracted above.

For example, our configuration file defines four extra relationships: Title, Vendor, Price, and Category. Each one has a name and a type, which OnionScan uses in its correlation engine, as well as a regular expression (regex). The regular expression extracts the information from the page that triggered the relationship.

In the Hansa Market example, we can find the vendor name on a /listing/ product page by looking for the hyperlink structure <a href="/vendor/([^/]*)/">. Similar patterns give us the title, price, and categories of the listing.

The rollup parameter under Category tells OnionScan to compute statistics on the categories it finds, which it can then chart for us.

At this point, we have told OnionScan how to read a listing from Hansa Market. Next, let's see how OnionScan performs.
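The trigger step can be tried out in a few lines of Python. This is only a sketch of the idea: OnionScan itself is written in Go and uses Go's regular expression engine, though this particular pattern behaves the same in Python. The function name and sample URLs are hypothetical.

```python
import re
from typing import Optional

# The trigger pattern from the configuration file.
TRIGGER = re.compile(r"/listing/([0-9]*)/")


def listing_id(url: str) -> Optional[str]:
    """Return the captured identifier if the URL triggers the relationship."""
    match = TRIGGER.search(url)
    return match.group(1) if match else None


# Made-up URLs for illustration.
print(listing_id("http://hansamkt2rr6nfg3.onion/listing/4521/"))
print(listing_id("http://hansamkt2rr6nfg3.onion/category/12/"))
```

The first call returns the string "4521" and the second returns None. Note that because the pattern uses `*` rather than `+`, a URL containing "/listing//" would also match, with an empty identifier.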
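As a rough illustration of how these patterns pull values out of a page, the sketch below applies the four regular expressions from the configuration file to a small, invented HTML fragment. The fragment and the `extract` helper are assumptions made for this example; real Hansa pages looked different, and OnionScan does this work internally in Go.

```python
import re

# The four extra-relationship patterns from the configuration file.
EXTRA = {
    "Title": r"<h2>(.*)</h2>",
    "Vendor": r'<a href="/vendor/([^/]*)/">',
    "Price": r"<strong>(USD [^<]*)</strong>",
    "Category": r'<li><a href="/category/[0-9]*/">([^<]*)</a></li>',
}

# Invented page fragment, used only to exercise the patterns.
SAMPLE_PAGE = """
<h2>Example item</h2>
<a href="/vendor/example_vendor/">example_vendor</a>
<strong>USD 10.00</strong>
<ul>
<li><a href="/category/1/">Example category</a></li>
<li><a href="/category/2/">Example subcategory</a></li>
</ul>
"""


def extract(page):
    """Return every captured value for each extra relationship."""
    return {name: re.findall(pattern, page) for name, pattern in EXTRA.items()}


found = extract(SAMPLE_PAGE)
for name, values in found.items():
    print(name, values)
```

A single page can yield several values for the same relationship, as with the two category links here.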
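Conceptually, a rollup amounts to counting how often each extracted value occurs. The sketch below shows that idea with Python's `collections.Counter`; the listing identifiers and category names are invented, and this is an assumption about what the statistic represents rather than OnionScan's actual implementation.

```python
from collections import Counter

# Invented (listing id, category) pairs standing in for crawl results.
extracted = [
    ("1001", "Category A"),
    ("1002", "Category B"),
    ("1003", "Category A"),
    ("1004", "Category C"),
    ("1005", "Category A"),
]

# Tally how many listings fall into each category.
category_counts = Counter(category for _, category in extracted)

for category, count in category_counts.most_common():
    print(f"{category}: {count}")
```

Counts like these are what a bar or pie chart of categories would be drawn from.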

Put the configuration file above in a folder named service-configs, then run OnionScan against the market with the following command:

./onionscan -scans web --depth 1 --crawlconfigdir ./service-configs/ --webport 8080 --verbose hansamkt2rr6nfg3.onion
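Hand-written configuration files are easy to get wrong, for example by forgetting to escape a quote inside a regex. Before starting a scan, a quick sanity check along the following lines can help. This is a hypothetical helper, not part of OnionScan. It checks the patterns with Python's `re` module, whose syntax is close to but not identical to Go's, so treat a pass as a hint rather than a guarantee. The embedded configuration is a shortened version of the one in this article.

```python
import json
import re

# Shortened copy of the article's configuration, used as test input.
CONFIG_TEXT = r'''
{
  "onion": "hansamkt2rr6nfg3.onion",
  "base": "/",
  "relationships": [
    {
      "name": "Listing",
      "triggeridentifierregex": "/listing/([0-9]*)/",
      "extrarelationships": [
        {"name": "Vendor", "type": "username",
         "regex": "<a href=\"/vendor/([^/]*)/\">"}
      ]
    }
  ]
}
'''


def check_config(text):
    """Return a list of problems; an empty list means none were detected."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err}"]
    problems = []
    for rel in config.get("relationships", []):
        # Collect the trigger pattern and every extra-relationship pattern.
        patterns = [("triggeridentifierregex", rel.get("triggeridentifierregex", ""))]
        patterns += [(extra.get("name", "?"), extra.get("regex", ""))
                     for extra in rel.get("extrarelationships", [])]
        for label, pattern in patterns:
            try:
                re.compile(pattern)
            except re.error as err:
                problems.append(f"{rel.get('name', '?')}/{label}: {err}")
    return problems


print(check_config(CONFIG_TEXT) or "no problems detected")
```

In practice you would read the text from the file in service-configs instead of embedding it in the script.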

After OnionScan has been running for a while, open localhost:8080 in your browser and type hansamkt2rr6nfg3.onion into the search box. Scroll down the results and you should eventually see the relationships extracted by the crawler: listing titles, vendors, prices, and categories.


As you can see, OnionScan does a great deal with a simple configuration file. The crawler relationships we defined earlier can now be searched and correlated with everything else OnionScan discovers.

Because we set the rollup parameter on Category, OnionScan also generates a chart of the categories. We hope that, having seen how powerful this feature is, you will maintain and share configurations for different kinds of services with us.

This is just the beginning! We still have many ideas to implement in OnionScan. Come and find us on GitHub.