Tinyspider is a web data capture (crawler) framework built on the tiny HtmlParser.
Maven coordinates:
<dependency>
<groupId>org.tinygroup</groupId>
<artifactId>tinyspider</artifactId>
<version>0.0.12</version>
</dependency>
Web crawlers are generally used for full-text retrieval or content acquisition. The Tiny framework also provides limited support for this: although its feature set is small, it makes full-text search and extracting data from web pages very convenient.
Framework features
- Powerful node filtering capability
- Supports both POST and GET requests
- Avoids processing the same page twice
- Supports capturing content from multiple sites
- Robust handling of malformed HTML
Framework design
Crawler
public interface Spinder {
    /**
     * Adds a site visitor.
     *
     * @param siteVisitor
     */
    void addSiteVisitor(SiteVisitor siteVisitor);

    /**
     * Adds a watcher.
     *
     * @param watcher
     */
    void addWatcher(Watcher watcher);

    /**
     * Processes a URL.
     *
     * @param url
     */
    void processUrl(String url);

    /**
     * Processes a URL with request parameters.
     *
     * @param url
     * @param parameter
     */
    void processUrl(String url, Map<String, Object> parameter);

    /**
     * Sets the URL repository.
     *
     * @param urlRepository
     */
    void setUrlRepository(UrlRepository urlRepository);
}
A crawler must contain at least one site visitor, which is responsible for fetching URLs. If no site visitor matches a URL, that URL is ignored and never processed.
A crawler must also contain at least one watcher, which filters the content fetched from a URL and processes the matching nodes. Without a watcher, whatever the crawler fetches has no value.
A crawler also needs a URL repository, which records whether a URL has already been fetched and processed. Without one the crawler cannot tell which URLs it has handled, which in many cases leads to an infinite loop that never terminates.
And, of course, a crawler must be able to process URLs.
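Putting these pieces together, a crawler is wired up roughly as in the sketch below. SpinderImpl, WatcherImpl and QuickNameFilter come from the example later in this document; MySiteVisitor, MyProcessor and InMemoryUrlRepository are hypothetical names used only for illustration.
// Rough wiring sketch: one site visitor, one watcher, one URL repository.
// MySiteVisitor, MyProcessor and InMemoryUrlRepository are hypothetical classes.
Spinder spinder = new SpinderImpl();
spinder.addSiteVisitor(new MySiteVisitor());            // decides which URLs it handles and fetches them
spinder.setUrlRepository(new InMemoryUrlRepository());  // remembers which URLs were already processed

Watcher watcher = new WatcherImpl();
watcher.setNodeFilter(new QuickNameFilter<HtmlNode>()); // selects the nodes of interest
watcher.addProcessor(new MyProcessor());                // handles every matched node
spinder.addWatcher(watcher);

spinder.processUrl("http://www.oschina.net/question?catalog=1");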
Site visitor
Since a crawler can have multiple site visitors, each visitor needs an isMatch method that tells the crawler whether it should handle a given URL, plus an access method for fetching it; you can configure whether the data is retrieved via GET or POST.
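The SiteVisitor interface itself is not listed here; based on the description above it can be pictured roughly as follows. The method names and signatures are assumptions for illustration, not the framework's actual API.
// Hypothetical sketch of a site visitor, inferred from the description above;
// the real interface in tinyspider may differ.
public interface SiteVisitor {
    /**
     * Returns whether this visitor should handle the given URL.
     */
    boolean isMatch(String url);

    /**
     * Fetches the URL (via GET or POST, depending on configuration)
     * and returns the page content.
     */
    String access(String url, Map<String, Object> parameter);
}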
URL repository
public interface UrlRepository {
    /**
     * Returns whether the URL already exists in the repository.
     *
     * @param url
     * @return
     */
    boolean isExist(String url);

    /**
     * Returns whether the URL, together with its parameters, already exists in the repository.
     *
     * @param url
     * @param parameter
     * @return
     */
    boolean isExist(String url, Map<String, Object> parameter);

    /**
     * Stores the URL with its content if it does not exist; replaces the content if it does.
     *
     * @param url
     * @param content
     */
    void putUrlWithContent(String url, String content);

    /**
     * Stores the URL (with parameters) and its content if it does not exist; replaces the content if it does.
     *
     * @param url
     * @param parameter
     * @param content
     */
    void putUrlWithContent(String url, Map<String, Object> parameter,
            String content);

    /**
     * Returns the content if the URL exists; throws a runtime exception if it does not.
     *
     * @param url
     * @return
     */
    String getContent(String url);

    /**
     * Returns the content if the URL (with parameters) exists; throws a runtime exception if it does not.
     *
     * @param url
     * @param parameter
     * @return
     */
    String getContent(String url, Map<String, Object> parameter);
}
The URL repository manages URLs and their content. Since the methods are simple and self-explanatory, they are not described further here.
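As an illustration, a naive in-memory implementation of this contract might look like the sketch below. This class is not part of the framework; it only shows one way the interface could be fulfilled.
import java.util.HashMap;
import java.util.Map;

// Naive in-memory UrlRepository sketch, for illustration only.
public class InMemoryUrlRepository implements UrlRepository {
    private final Map<String, String> store = new HashMap<String, String>();

    // Combines a URL and its parameters into a single lookup key.
    private String key(String url, Map<String, Object> parameter) {
        return parameter == null ? url : url + "?" + parameter.toString();
    }

    public boolean isExist(String url) {
        return store.containsKey(key(url, null));
    }

    public boolean isExist(String url, Map<String, Object> parameter) {
        return store.containsKey(key(url, parameter));
    }

    public void putUrlWithContent(String url, String content) {
        store.put(key(url, null), content);
    }

    public void putUrlWithContent(String url, Map<String, Object> parameter, String content) {
        store.put(key(url, parameter), content);
    }

    public String getContent(String url) {
        return getContent(url, null);
    }

    public String getContent(String url, Map<String, Object> parameter) {
        String content = store.get(key(url, parameter));
        if (content == null) {
            throw new RuntimeException("No content stored for url: " + url);
        }
        return content;
    }
}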
Watcher
public interface Watcher {
    /**
     * Sets the node filter.
     *
     * @param filter
     */
    void setNodeFilter(NodeFilter<HtmlNode> filter);

    /**
     * Returns the node filter.
     *
     * @return
     */
    NodeFilter<HtmlNode> getNodeFilter();

    /**
     * Adds a processor.
     *
     * @param processor
     */
    void addProcessor(Processor processor);

    /**
     * Returns the list of processors.
     *
     * @return
     */
    List<Processor> getProcessorList();
}
A watcher has exactly one node filter, but it can have any number of processors.
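The example below uses the framework's own WatcherImpl, but conceptually a watcher is just a holder for one filter and a list of processors, as this illustrative sketch (not part of the framework) shows.
import java.util.ArrayList;
import java.util.List;

// Illustrative Watcher sketch: one node filter, any number of processors.
public class SimpleWatcher implements Watcher {
    private NodeFilter<HtmlNode> filter;
    private final List<Processor> processors = new ArrayList<Processor>();

    public void setNodeFilter(NodeFilter<HtmlNode> filter) {
        this.filter = filter;
    }

    public NodeFilter<HtmlNode> getNodeFilter() {
        return filter;
    }

    public void addProcessor(Processor processor) {
        processors.add(processor);
    }

    public List<Processor> getProcessorList() {
        return processors;
    }
}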
Processor
public interface Processor {
    /**
     * Processes a matched node.
     *
     * @param node
     */
    void process(HtmlNode node);
}
A processor is very simple: it handles the nodes matched by the node filter.
Example
Visiting http://www.oschina.net/question?catalog=1 shows a long list of technical questions and answers. Let's write a program that prints out their titles.
Writing the crawler
public static void main(String[] args) {
    Spinder spinder = new SpinderImpl();
    Watcher watcher = new WatcherImpl();
    watcher.addProcessor(new PrintOsChinaProcessor());
    // Match <div class="qbody"> nodes and hand them to the processors.
    QuickNameFilter<HtmlNode> nodeFilter = new QuickNameFilter<HtmlNode>();
    nodeFilter.setNodeName("div");
    nodeFilter.setIncludeAttribute("class", "qbody");
    watcher.setNodeFilter(nodeFilter);
    spinder.addWatcher(watcher);
    spinder.processUrl("http://www.oschina.net/question?catalog=1");
}
Writing the processor
public class PrintOsChinaProcessor implements Processor {
    public void process(HtmlNode node) {
        // Inside each matched <div>, look for an <h2> that contains an <a> tag
        // and print the link text, i.e. the question title.
        FastNameFilter<HtmlNode> filter = new FastNameFilter<HtmlNode>(node);
        filter.setNodeName("h2");
        filter.setIncludeNode("a");
        HtmlNode h2 = filter.findNode();
        if (h2 != null) {
            System.out.println(h2.getSubNode("a").getContent());
        }
    }
}
Result
Because the site's content changes constantly, your output may differ from the run below:
约瑟夫环问题,一段代码求讲解
求推荐一款分享,回复的前端开源js
MySQL什么情况使用MyISAM,什么时候使用InnoDB?
phpstorm中使用搜狗输入中文出现乱行问题怎样解决?
Android中如何实现快播中娱乐风向标的效果
使用java做手机后台开发!
Chrome 29的alert对话框好漂亮,有木有啊有木有
Eclipse+ADT+Android环境配置问题
关于android holderview的疑惑
蛋疼 从一个公司到另外一个公司都是一个人开发 有木有
wsunit 官方访问不了
android求大神给我看看什么问题
关于Hibernate search 查询结果与数据库不相符的问题
求推荐Oracle好的书籍或PDF
关于"记事本"的 "自动换行" 的实现
swing在线html文本编辑器
android下网络阻塞问题
文件上线系统该如何做(代码上线)
ztree节点设置成check多选框的时候如何只获取叶节点,不要其他节点
怎么设置上传的图片不自动压缩
js 正则表达式问题
eclipse 经常loading descriptor for XXX ,然后卡死
关于android开发xml显示问题
RMI远程对象是共享的吧?
参与开源项目如何进行文档编写
php如何以文件图标的形式列出服务器上的所有文件?
php中一个简单的问题?请帮助解决一下,菜鸟
请教 solr query分词查询,结果为空的问题
这段代码有问题吗,怎么我运行报错?
jquery mobile 页面中切换闪屏问题
你帮我改好,我给你讲个笑话可好TUT
asp.net问题:Js如何获取cookie中的值?
android 电话拦截并处理
iis7 下 php 如何显示报错?
安装virtualbox的时候提示要安装通用串行总线控制器,这个要安装吗?
API获取新浪微博消息
工厂该不该有默认行为
如何处理开发过程中遗留无用的代码
ireport 设计时报表模板时,无法使用sybase驱动com.sybase.jdbc3.jdbc.SybDriver?
关于 使用druid后的一些问题.
Summary
As the example shows, it is really easy to extract data from a web page: only about 20 lines of code are needed to collect the data we want. To capture more data, we just refine the analysis layer by layer.