chineking / weibocrawler / wiki / Home

Update 2013/6/17: As I have said before, weibocrawler is no longer maintained. Please follow my new project, cola, instead. Cola is a distributed crawler framework whose supported targets include Sina Weibo (the web version), and in experiments so far, long-term crawling has not been blocked. Project address: https://github.com/chineking/cola

weibocrawler is a distributed crawler, mainly used to fetch data from weibo.cn.

First of all, Sina Weibo does provide an API for fetching a user's data, but the number of calls per application is limited. In addition, Sina Weibo's OAuth 2.0 authorization expires, so re-authorization becomes necessary after a while (only one day for the application I tested). I wanted a crawler that keeps running without human intervention.

When running in a distributed environment, the captured user data is stored in MongoDB, so MongoDB must be installed first. Because the crawler is written in Python, pymongo is also required (in stand-alone mode data is stored to files, so it is not needed). If setuptools is installed, you can:
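Presumably the usual setuptools install, since the page only says "if setuptools is installed":

    easy_install pymongo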

In addition, pyquery is used to parse web pages. Because pyquery depends on lxml, installing lxml can be problematic; Windows users can download a binary installer here.

The monitor has a web interface (see the screenshot), and tornado must be installed to run it.

The whole program is divided into three parts: the crawler, the scheduler, and the monitor.

For a crawler, the class relationships are as follows:

A crawler uses a fetcher to grab pages from Sina Weibo, then hands each resulting page to a specific parser (CnWeiboParser parses a user's Weibo page, CnInfoParser parses the personal-information page, and CnRelationshipParser parses the follows and fans pages); the parser passes the extracted data to a storage backend. Two storages are currently implemented, FileStorage and MongoStorage, which store data in files and in MongoDB respectively.
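As a rough illustration of the fetch/parse/store flow just described, here is a minimal Python sketch; every name in it (Fetcher, CnWeiboParser, MongoStorage, crawl_user) is illustrative only, not the project's actual code:

    import urllib2  # Python 2, matching the era of the project

    class Fetcher(object):
        """Downloads a weibo.cn page (stands in for the fetcher above)."""
        def fetch(self, url):
            return urllib2.urlopen(url).read()

    class CnWeiboParser(object):
        """Parses a user's Weibo list page into plain dict records."""
        def parse(self, html):
            # The real parser would use pyquery; a dummy record suffices here.
            return [{'html_length': len(html)}]

    class MongoStorage(object):
        """Writes parsed records into a MongoDB collection via pymongo."""
        def __init__(self, db):
            self.collection = db['weibo']

        def save(self, records):
            for record in records:
                self.collection.insert(record)

    def crawl_user(uid, fetcher, parser, storage):
        """One fetch -> parse -> store round trip for a single user."""
        html = fetcher.fetch('http://weibo.cn/u/%s' % uid)
        storage.save(parser.parse(html))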

In stand-alone mode the monitor and scheduler are not needed; just run __init__.py under the weibocrawler directory. It supports a number of command-line options.

Example:
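A guessed form, assuming the UIDs are passed as positional arguments to __init__.py (the exact flags are not documented here):

    python __init__.py UID1 UID2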

where UID1 and UID2 are the UIDs of the Weibo users to crawl.

Note: if an ImportError occurs at runtime (e.g. ImportError: No module named weibocrawler), it is because weibocrawler is not on sys.path. The solution:

Add a .pth file to site-packages. The site-packages location varies by system: on Windows, if Python is installed at C:\Python27, it is C:\Python27\Lib\site-packages; on Linux it should be /usr/local/lib/pythonX.X/dist-packages.

Suppose the project root is /my/path/weibocrawler, which contains intro.py, the weibocrawler folder, and so on. Then create a weibocrawler.pth file under site-packages and write the path /my/path/weibocrawler into it.
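The .pth file itself is just a one-line text file; Python appends each line to sys.path at startup (use your actual checkout path):

    /my/path/weibocrawler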

Remember that only one monitor and one scheduler may be running, and the monitor must be started after the scheduler. You can start as many crawlers as you like.

To configure the monitor, scheduler, or weibocrawler, create a local_settings.py in the corresponding directory to override the defaults in settings.py. In what follows, the default configuration is assumed.
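A sketch of the override mechanism only; the actual setting names live in each settings.py, and USERNAME/PASSWORD here are placeholders (the troubleshooting notes below do mention configuring a username and password):

    # local_settings.py -- values defined here shadow those in settings.py
    USERNAME = 'my_weibo_login'   # placeholder name
    PASSWORD = 'my_password'      # placeholder name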

Start the scheduler first, then start the monitor, and finally run one or more crawlers; a sketch of the commands follows.
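Assuming each component is launched the same way as the stand-alone crawler, i.e. via the __init__.py in its own directory (these exact paths are a guess):

    python scheduler/__init__.py
    python monitor/__init__.py
    python weibocrawler/__init__.py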

If you run more than one crawler instance on the same machine, refer to the instructions for setting instance_index.

At present, because the crawler grew out of the requirements of one experiment, some data are not collected. If you want such data, just modify the relevant part of storage.
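What such a modification might look like, reusing the hypothetical MongoStorage from the pipeline sketch above (not the project's real class):

    import time

    class ExtendedMongoStorage(MongoStorage):
        # Example only: stamp each record before handing off to the base class.
        def save(self, records):
            for record in records:
                record['fetched_at'] = time.time()
            MongoStorage.save(self, records)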

Also note that in distributed mode, once a crawler starts running, it cannot be stopped until its process is killed or every user has been crawled.

If no data is being grabbed: first, make sure the username and password are correct. Configure them in local_settings.py or settings.py, and remember that they are login credentials, not a UID.

Second, check crawler.log under the weibocrawler directory. If an error from pyquery appears, the problem is pyquery's dependency on the cssselect package: a cssselect version that is too new makes pyquery fail.

Solution: use easy_install to uninstall cssselect and pyquery (the command may be easy_install -m ***, or easy_install -mxN ***, depending on the version), then reinstall cssselect as version 0.7.1 and pyquery as version 1.2.1: easy_install cssselect==0.7.1, and likewise for pyquery.
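Put together (with the uninstall flags depending on your setuptools version, as noted above):

    easy_install -m cssselect pyquery
    easy_install cssselect==0.7.1
    easy_install pyquery==1.2.1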

Finally, after completing the steps above, delete cookies.txt in the weibocrawler folder and everything in the target folder, then rerun to see whether data can be grabbed.

If my open-source code or the articles on my blog have helped you and you would like to give me some encouragement, you can donate via Alipay.