IMCAFS

a leisurely personal page

Posted by punzalan at 2020-03-10
Tinyspider is a web data-capture framework built on Tiny HtmlParser.

Maven reference coordinates:

<dependency>
    <groupId>org.tinygroup</groupId>
    <artifactId>tinyspider</artifactId>
    <version>0.0.12</version>
</dependency>

Web crawlers are generally used for full-text retrieval or content acquisition.

The Tiny framework offers limited support for this as well. Although it has few features, it is convenient for full-text search or for extracting data from web pages.

Framework features

Framework design

Crawler

public interface Spinder {
    /**
     * Adds a site visitor.
     *
     * @param siteVisitor
     */
    void addSiteVisitor(SiteVisitor siteVisitor);

    /**
     * Adds a watcher.
     *
     * @param watcher
     */
    void addWatcher(Watcher watcher);

    /**
     * Processes a URL.
     *
     * @param url
     */
    void processUrl(String url);

    /**
     * Processes a URL with parameters.
     *
     * @param url
     * @param parameter
     */
    void processUrl(String url, Map<String, Object> parameter);

    /**
     * Sets the URL repository.
     *
     * @param urlRepository
     */
    void setUrlRepository(UrlRepository urlRepository);
}

A crawler must contain at least one site visitor (SiteVisitor), which is used to access URLs. If no site visitor matches a URL, that URL is ignored and not processed.

A crawler needs at least one watcher (Watcher), which filters the content fetched from a URL and processes the nodes that match. Without a watcher, the content a crawler brings back has no value.

A crawler needs at least one URL repository (UrlRepository), which is used to determine whether a URL has already been fetched and processed. Without one, the crawler cannot tell whether a URL has been handled, which in many cases leads to an infinite loop that never exits.

Of course, a crawler must also be able to process URLs.
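To make the infinite-loop risk concrete, here is a minimal stand-alone sketch. It does not use Tinyspider at all: the class VisitedSetCrawl, the in-memory link map, and the page names are all made up for illustration, and a plain Set stands in for the URL repository.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-alone sketch; not part of Tinyspider.
class VisitedSetCrawl {

    /**
     * Walks a link graph starting at startUrl and returns the URLs in the
     * order they were processed. The "seen" set plays the role of the URL
     * repository: a URL that is already in it is never queued again.
     */
    static List<String> crawl(String startUrl, Map<String, List<String>> links) {
        Set<String> seen = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        List<String> processed = new ArrayList<>();

        seen.add(startUrl);
        pending.add(startUrl);
        while (!pending.isEmpty()) {
            String url = pending.poll();
            processed.add(url);
            for (String next : links.getOrDefault(url, List.of())) {
                // Set.add returns false when the URL was already recorded.
                if (seen.add(next)) {
                    pending.add(next);
                }
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        // Pages a, b and c link to each other in a cycle.
        Map<String, List<String>> links = Map.of(
                "a", List.of("b"),
                "b", List.of("c"),
                "c", List.of("a"));
        // Each page is processed once even though the links form a cycle.
        System.out.println(crawl("a", links));
    }
}
```

Without the check on the seen set, page c would queue page a again and the loop would never end.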

Site visitor

Since a crawler can have multiple site visitors, each one needs an isMatch method that tells the crawler whether it should handle a given URL. The access method can also be configured, so you can choose whether data is fetched through GET or POST.
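The matching idea can be sketched roughly as follows. Everything here is hypothetical: SketchVisitor, PrefixVisitor, and VisitorDispatch are simplified stand-ins invented for this illustration, not the real Tinyspider types, and picking the first visitor that matches is an assumption about how dispatch could work.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a site visitor; the real Tinyspider
// SiteVisitor interface is not reproduced in this article.
interface SketchVisitor {
    boolean isMatch(String url);  // does this visitor handle the URL?
    String describe(String url);  // stands in for the real page access
}

// Matches URLs by prefix and remembers which HTTP method it would use.
class PrefixVisitor implements SketchVisitor {
    private final String prefix;
    private final String method; // "GET" or "POST"

    PrefixVisitor(String prefix, String method) {
        this.prefix = prefix;
        this.method = method;
    }

    public boolean isMatch(String url) {
        return url.startsWith(prefix);
    }

    public String describe(String url) {
        return method + " " + url;
    }
}

class VisitorDispatch {
    private final List<SketchVisitor> visitors = new ArrayList<>();

    void addVisitor(SketchVisitor visitor) {
        visitors.add(visitor);
    }

    /** Returns the first matching visitor's result, or null if no visitor matches. */
    String dispatch(String url) {
        for (SketchVisitor visitor : visitors) {
            if (visitor.isMatch(url)) {
                return visitor.describe(url);
            }
        }
        return null; // no matching visitor: the URL is ignored
    }

    public static void main(String[] args) {
        VisitorDispatch dispatch = new VisitorDispatch();
        dispatch.addVisitor(new PrefixVisitor("http://www.oschina.net/", "GET"));
        System.out.println(dispatch.dispatch("http://www.oschina.net/question?catalog=1"));
        System.out.println(dispatch.dispatch("http://example.com/"));
    }
}
```

The second call returns null because no registered visitor matches, which mirrors the rule that a URL without a matching site visitor is simply skipped.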

URL repository

public interface UrlRepository {
    /**
     * Returns whether the URL already exists in the repository.
     *
     * @param url
     * @return
     */
    boolean isExist(String url);

    /**
     * Returns whether the URL, with the given parameters, already exists in the repository.
     *
     * @param url
     * @param parameter
     * @return
     */
    boolean isExist(String url, Map<String, Object> parameter);

    /**
     * Stores the content if the URL is absent; replaces it if the URL already exists.
     *
     * @param url
     * @param content
     */
    void putUrlWithContent(String url, String content);

    /**
     * Stores the content if the URL is absent; replaces it if the URL already exists.
     *
     * @param url
     * @param parameter
     * @param content
     */
    void putUrlWithContent(String url, Map<String, Object> parameter, String content);

    /**
     * Returns the content if the URL exists; throws a runtime exception otherwise.
     *
     * @param url
     * @return
     */
    String getContent(String url);

    /**
     * Returns the content if the URL exists; throws a runtime exception otherwise.
     *
     * @param url
     * @param parameter
     * @return
     */
    String getContent(String url, Map<String, Object> parameter);
}

The URL repository (UrlRepository) manages URLs and their content. The methods are simple and self-explanatory, so they need no further introduction.
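For readers who want something concrete, here is one possible in-memory implementation. This is a sketch under assumptions: InMemoryUrlRepository is a name invented here, not a class shipped with Tinyspider, and building the storage key from the URL plus its sorted parameters is just one plausible choice. The interface is repeated in compact form so that the block compiles on its own.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Interface as listed in the article, repeated so the sketch is self-contained.
interface UrlRepository {
    boolean isExist(String url);
    boolean isExist(String url, Map<String, Object> parameter);
    void putUrlWithContent(String url, String content);
    void putUrlWithContent(String url, Map<String, Object> parameter, String content);
    String getContent(String url);
    String getContent(String url, Map<String, Object> parameter);
}

// Hypothetical in-memory implementation, not the one shipped with Tinyspider.
class InMemoryUrlRepository implements UrlRepository {
    private final Map<String, String> contents = new HashMap<>();

    // Assumption: a URL plus its parameters (sorted by name) forms the key.
    private static String key(String url, Map<String, Object> parameter) {
        if (parameter == null || parameter.isEmpty()) {
            return url;
        }
        return url + "#" + new TreeMap<>(parameter);
    }

    public boolean isExist(String url) {
        return isExist(url, null);
    }

    public boolean isExist(String url, Map<String, Object> parameter) {
        return contents.containsKey(key(url, parameter));
    }

    public void putUrlWithContent(String url, String content) {
        putUrlWithContent(url, null, content);
    }

    public void putUrlWithContent(String url, Map<String, Object> parameter, String content) {
        // Map.put inserts a new entry or replaces the existing one.
        contents.put(key(url, parameter), content);
    }

    public String getContent(String url) {
        return getContent(url, null);
    }

    public String getContent(String url, Map<String, Object> parameter) {
        String k = key(url, parameter);
        if (!contents.containsKey(k)) {
            throw new IllegalStateException("URL not in repository: " + k);
        }
        return contents.get(k);
    }
}
```

A repository like this only lives as long as the process does; a long-running crawl would more likely keep the data in a file or a database.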

Watcher

public interface Watcher {
    /**
     * Sets the node filter.
     *
     * @param filter
     */
    void setNodeFilter(NodeFilter<HtmlNode> filter);

    /**
     * Gets the node filter.
     *
     * @return
     */
    NodeFilter<HtmlNode> getNodeFilter();

    /**
     * Adds a processor.
     *
     * @param processor
     */
    void addProcessor(Processor processor);

    /**
     * Gets the processor list.
     *
     * @return
     */
    List<Processor> getProcessorList();
}

A watcher (Watcher) must have exactly one node filter, but it can have multiple processors.
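The relationship between one filter and several processors can be modeled in a few lines. The sketch below is a deliberately simplified, hypothetical model: SketchWatcher is not a Tinyspider class, plain strings stand in for HTML nodes, a Predicate stands in for the NodeFilter, and Consumers stand in for the processors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical simplified model: strings stand in for HTML nodes, a Predicate
// for the node filter, and Consumers for the processors.
class SketchWatcher {
    private Predicate<String> filter = node -> false; // matches nothing until set
    private final List<Consumer<String>> processors = new ArrayList<>();

    void setFilter(Predicate<String> filter) {
        this.filter = filter;
    }

    void addProcessor(Consumer<String> processor) {
        processors.add(processor);
    }

    /** Passes every node accepted by the filter to each processor in turn. */
    void watch(List<String> nodes) {
        for (String node : nodes) {
            if (filter.test(node)) {
                for (Consumer<String> processor : processors) {
                    processor.accept(node);
                }
            }
        }
    }

    public static void main(String[] args) {
        SketchWatcher watcher = new SketchWatcher();
        watcher.setFilter(node -> node.startsWith("div.qbody"));
        watcher.addProcessor(node -> System.out.println("print: " + node));
        watcher.addProcessor(node -> System.out.println("store: " + node));
        // Only the first entry passes the filter; both processors receive it.
        watcher.watch(List.of("div.qbody question 1", "div.footer"));
    }
}
```

Here the single filter decides which nodes count as hits, and every hit is handed to all registered processors.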

Processor

public interface Processor {
    /**
     * Processes a node.
     *
     * @param node
     */
    void process(HtmlNode node);
}

The processor is very simple: it processes the nodes that the filter has matched.

Example

By visiting http://www.oschina.net/question?catalog=1, we can see many technical questions and answers. Let's write a program that prints out their titles:

Writing the crawler

public static void main(String[] args) {
    Spinder spinder = new SpinderImpl();
    Watcher watcher = new WatcherImpl();
    watcher.addProcessor(new PrintOsChinaProcessor());
    QuickNameFilter<HtmlNode> nodeFilter = new QuickNameFilter<HtmlNode>();
    nodeFilter.setNodeName("div");
    nodeFilter.setIncludeAttribute("class", "qbody");
    watcher.setNodeFilter(nodeFilter);
    spinder.addWatcher(watcher);
    spinder.processUrl("http://www.oschina.net/question?catalog=1");
}

Writing the processor

public class PrintOsChinaProcessor implements Processor {
    public void process(HtmlNode node) {
        // Look inside the matched node for an h2 element that contains a link.
        FastNameFilter<HtmlNode> filter = new FastNameFilter<HtmlNode>(node);
        filter.setNodeName("h2");
        filter.setIncludeNode("a");
        HtmlNode h2 = filter.findNode();
        if (h2 != null) {
            // Print the link text, which is the question title.
            System.out.println(h2.getSubNode("a").getContent());
        }
    }
}

Output

Because the site's data keeps changing, your output will probably differ from the result shown here.

约瑟夫环问题,一段代码求讲解 求推荐一款分享,回复的前端开源js MySQL什么情况使用MyISAM,什么时候使用InnoDB? phpstorm中使用搜狗输入中文出现乱行问题怎样解决? Android中如何实现快播中娱乐风向标的效果 使用java做手机后台开发! Chrome 29的alert对话框好漂亮,有木有啊有木有 Eclipse+ADT+Android环境配置问题 关于android holderview的疑惑 蛋疼 从一个公司到另外一个公司都是一个人开发 有木有 wsunit 官方访问不了 android求大神给我看看什么问题 关于Hibernate search 查询结果与数据库不相符的问题 求推荐Oracle好的书籍或PDF 关于"记事本"的 "自动换行" 的实现 swing在线html文本编辑器 android下网络阻塞问题 文件上线系统该如何做(代码上线) ztree节点设置成check多选框的时候如何只获取叶节点,不要其他节点 怎么设置上传的图片不自动压缩 js 正则表达式问题 eclipse 经常loading descriptor for XXX ,然后卡死 关于android开发xml显示问题 RMI远程对象是共享的吧? 参与开源项目如何进行文档编写 php如何以文件图标的形式列出服务器上的所有文件? php中一个简单的问题?请帮助解决一下,菜鸟 请教 solr query分词查询,结果为空的问题 这段代码有问题吗,怎么我运行报错? jquery mobile 页面中切换闪屏问题 你帮我改好,我给你讲个笑话可好TUT asp.net问题:Js如何获取cookie中的值? android 电话拦截并处理 iis7 下 php 如何显示报错? 安装virtualbox的时候提示要安装通用串行总线控制器,这个要安装吗? API获取新浪微博消息 工厂该不该有默认行为 如何处理开发过程中遗留无用的代码 ireport 设计时报表模板时,无法使用sybase驱动com.sybase.jdbc3.jdbc.SybDriver? 关于 使用druid后的一些问题.

Summary

As the example shows, getting data from a web page is really easy: about 20 lines of code are enough to collect the data we want. To capture more data, we only need to refine the analysis layer by layer.