Update 2013/6/17: As mentioned before, weibocrawler is no longer maintained. Please check out my new project instead: cola, a distributed crawler framework that supports, among other things, crawling Sina Weibo (the web version). In my experiments so far, long-running crawls have not been blocked. Project address: https://github.com/chinaking/cola
Weibocrawler is a distributed crawler, mainly used to capture data from weibo.cn.
First of all, Sina Weibo does provide an API for fetching user data, but each application is limited in the number of calls it can make; in addition, a Sina Weibo OAuth 2.0 authorization expires after a while (only one day for the application I tested), after which re-authorizing is a hassle. I wanted a crawler that keeps running without human intervention.
When running in a distributed environment, the captured user data is stored in MongoDB, so MongoDB needs to be installed first. Because the crawler is written in Python, pymongo is also required (it is not needed in stand-alone mode with file storage). If setuptools is installed, you can:
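For example (pip install pymongo works just as well):

```
easy_install pymongo
```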
In addition, pyquery is used to parse web pages. Because pyquery depends on lxml, installing lxml can be troublesome; Windows users can download a binary installation package here.
The monitor has a web interface (see the screenshot), and tornado needs to be installed to run it.
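pyquery and tornado can be installed the same way, for example:

```
easy_install pyquery
easy_install tornado
```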
The whole crawler program is divided into three parts.
- Scheduler: a simple scheduler that allocates UIDs to each worker and responds to commands from the monitor.
- Monitor: the monitoring program; it collects heartbeats from each worker and provides a web interface through which users can change settings.
- Crawler: the crawler worker program. Each worker gets UIDs from the scheduler and grabs the corresponding Weibo users' data.
For a crawler, its class diagram is as follows:
A crawler uses a fetcher to grab Sina Weibo web pages and then hands each resulting page to a specific parser (cnweiboparser parses a user's Weibo pages, cninfoparser parses a user's personal information page, and cnrelationshipparser parses a user's follow and fan pages); the parser passes the extracted data to a storage backend. At present two storages are implemented, filestorage and mongostorage, which store data in files and in MongoDB respectively.
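As a rough illustration of that flow, here is a minimal sketch; the class and method names below are my own assumptions for illustration, not the project's actual API:

```python
class Fetcher(object):
    """Downloads a weibo.cn page; the real fetcher reuses a logged-in session (sketch)."""
    def fetch(self, url):
        raise NotImplementedError


class MongoStorage(object):
    """Persists parsed results into MongoDB (sketch)."""
    def __init__(self, collection):
        self.collection = collection

    def save(self, data):
        self.collection.insert(data)


class WeiboParser(object):
    """Parses a user's Weibo page and hands the extracted data to a storage backend."""
    def __init__(self, storage):
        self.storage = storage

    def parse(self, html):
        data = {'statuses': []}  # extract the user's posts from html here
        self.storage.save(data)
```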
In stand-alone mode, the monitor and scheduler are not needed; just run __init__.py under the weibocrawler directory. The following options are supported:
- -m (--mode): the run mode; supported values are DC and SG. The default is distributed (DC); to run in stand-alone mode this must be set to SG.
- -t (--type): the storage type, either file or mongo. The default is mongo. This setting only takes effect in stand-alone mode.
- -l (--loc): the absolute path of the destination folder when the storage type is file. This setting only takes effect in stand-alone mode.
- --uids: a series of user UIDs to grab, separated by spaces. This setting only takes effect in stand-alone mode.
Example:
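A stand-alone run that grabs two users and stores the results as files might look like this (a sketch only: the __init__.py entry script, the output path and the --uids spelling are assumptions based on the option list above):

```
python __init__.py -m SG -t file -l /path/to/output --uids uid1 uid2
```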
Here uid1 and uid2 are the UIDs of the Weibo users to be crawled.
Note: if an ImportError occurs at runtime, it is because weibocrawler is not on sys.path. The fix:
Add a .pth file to site-packages. The location of site-packages varies by system: on Windows, if Python is installed in C:\Python27, it is C:\Python27\Lib\site-packages; on Linux it is typically /usr/local/lib/pythonX.X/dist-packages.
Suppose the root directory is /my/path/weibocrawler, which contains the weibocrawler folder and the other project files. Then create a weibocrawler.pth file under site-packages and write the path /my/path/weibocrawler into it.
Remember that only one monitor and one scheduler may be started, and the monitor must be started after the scheduler. You can start as many crawlers as you like.
If you want to configure the monitor, scheduler or crawler, create a local_settings.py in the corresponding directory to override the defaults in settings.py. Their default configuration items are listed below, starting with the scheduler:
- data_port: the port used for communication with the crawlers. The default is 1123.
- control_port: the port on which the monitor sends commands. The default is 1124.
- start_uids: the user UIDs to seed the crawl with. The default is an empty list [].
- fetch_size: the target number of users to fetch. The default is 500.
- mongo_host: the IP address of the machine running MongoDB (same below).
- mongo_port: the port MongoDB listens on (same below).
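For example, a scheduler local_settings.py could override a few of these. The values below are placeholders, and the upper-case spelling of the names is my assumption (the usual settings.py convention); check the project's settings.py for the exact names:

```python
# scheduler/local_settings.py -- overrides settings.py (placeholder values)
START_UIDS = ['1234567890']   # uid(s) to seed the crawl with
FETCH_SIZE = 1000             # target number of users to fetch
MONGO_HOST = '192.168.1.10'   # machine running MongoDB
MONGO_PORT = 27017            # default MongoDB port
```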
Then start the scheduler.
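A plausible invocation, assuming the scheduler, like the stand-alone crawler, is launched via the __init__.py in its own directory (adjust to the project's actual entry script):

```
cd scheduler
python __init__.py
```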
The monitor's default configuration:
- scheduler_host: the IP address of the scheduler; local by default (same below).
- scheduler_control_port: the control port of the machine running the scheduler. The default is 1124 (same below).
- mongo_host
- mongo_port
Then start the monitor.
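Again assuming an __init__.py entry script in the monitor's own directory (adjust to the actual script):

```
cd monitor
python __init__.py
```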
The crawler's default configuration:
- account: the Weibo account used by this crawler.
- pwd: the Weibo password used by this crawler.
- instance_index: if you want to start multiple crawlers on one machine, place copies of the crawler folder in different locations; instance_index is the instance number on that machine. The default is 0.
- mongo_host
- mongo_port
- scheduler_host
- scheduler_port
- monitor_enable: whether to communicate with the monitor. False by default; it must be set to True in a distributed deployment.
- monitor_host: the IP address of the monitor.
- monitor_port: the port used by the monitor. The default is 8888.
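As with the scheduler, these can be overridden in a local_settings.py next to the crawler's settings.py. Placeholder values below, and again the upper-case names are my assumption; check settings.py for the exact spelling:

```python
# weibocrawler/local_settings.py -- overrides settings.py (placeholder values)
ACCOUNT = 'your_weibo_login'    # Weibo login name, not a uid
PWD = 'your_password'
MONITOR_ENABLE = True           # required in a distributed deployment
MONITOR_HOST = '192.168.1.20'   # machine running the monitor
SCHEDULER_HOST = '192.168.1.10' # machine running the scheduler
```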
Then run the crawler.
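Assuming the same __init__.py entry script described in the stand-alone instructions (in distributed mode the default -m DC applies, so no flags are needed):

```
cd weibocrawler
python __init__.py
```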
If there is more than one crawler instance on a machine, refer to the notes above on setting instance_index.
At present, because the crawler was written to meet the needs of a particular experiment, some data is not collected, including:
- For every user, only their own Weibo posts are collected, not the posts they have reposted.
- For the initial users, the list of fans is fetched, but not the list of users they follow.
- For non-initial users, neither their fans nor the users they follow are fetched.
So if you want this data, just modify the relevant part of the storage code.
In addition, in distributed mode, once the crawler starts running it cannot be stopped until the process is killed or all target users have been crawled.
If the crawler fails to grab any data: first, make sure the username and password are correct. Configure them in local_settings.py or settings.py, and remember that the account is the Weibo login name, not a UID.
Second, check crawler.log under the weibocrawler folder. If it contains an error raised from pyquery, the cause is likely the following:
pyquery depends on the cssselect package, and if the installed cssselect is too new it makes pyquery fail.
Solution: use easy_install to uninstall cssselect and pyquery (the command may be easy_install -m xxx or easy_install -mxN xxx, depending on the version), then reinstall cssselect 0.7.1 and pyquery 1.2.1: easy_install cssselect==0.7.1, and likewise for pyquery.
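That is, after removing the old versions:

```
easy_install cssselect==0.7.1
easy_install pyquery==1.2.1
```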
Finally, after completing the above steps, delete cookies.txt in the weibocrawler folder and all the contents of the target folder, then rerun and see whether data can be grabbed.
If my open-source code or the articles on my blog have helped you and you would like to offer some encouragement, you can donate via Alipay.