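pyrailgun: a web scraping tool
pyrailgun is a simple, easy-to-use scraping tool.
How do you use it? First, create a rule file for the target site, for example test.json: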
    {
      "name": "bing searcher",
      "action": "main",
      "subaction": [
        {
          "action": "fetcher",
          "url": "http://www.bing.com/search?q=${@q}",
          "timeout": 1,
          "subaction": [
            {
              "action": "parser",
              "subaction": [
                {
                  "action": "shell",
                  "subaction": [
                    {
                      "action": "parser",
                      "setField": "title",
                      "pos": 0,
                      "rule": "a",
                      "strip": "true"
                    },
                    {
                      "action": "parser",
                      "setField": "description",
                      "pos": 0,
                      "rule": "p"
                    }
                  ],
                  "group": "default"
                }
              ],
              "rule": "#results .sa_wr"
            }
          ]
        }
      ]
    }
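To make the nesting of the rule file easier to see, here is a minimal sketch that prints an outline of its action tree. It uses only Python's standard json module and does not call pyrailgun at all; the helper name `walk` and the choice of which key to show next to each action are assumptions made for illustration.

```python
import json

# The rule file from above, embedded so the sketch is self-contained.
RULE_TEXT = """
{"name": "bing searcher", "action": "main", "subaction": [
  {"action": "fetcher", "url": "http://www.bing.com/search?q=${@q}", "timeout": 1,
   "subaction": [
    {"action": "parser", "rule": "#results .sa_wr", "subaction": [
      {"action": "shell", "group": "default", "subaction": [
        {"action": "parser", "setField": "title", "pos": 0, "rule": "a", "strip": "true"},
        {"action": "parser", "setField": "description", "pos": 0, "rule": "p"}
      ]}
    ]}
  ]}
]}
"""


def walk(node, depth=0):
    """Yield (depth, action, detail) for every node in the rule tree.

    `walk` is a hypothetical helper for this sketch, not part of pyrailgun.
    """
    # Show whichever of url / rule / group the node carries, if any.
    detail = node.get("url") or node.get("rule") or node.get("group") or ""
    yield depth, node["action"], detail
    for child in node.get("subaction", []):
        for item in walk(child, depth + 1):
            yield item


rule = json.loads(RULE_TEXT)
outline = list(walk(rule))
for depth, action, detail in outline:
    print("  " * depth + action + (" " + detail if detail else ""))
```

The outline reads top down: a main action contains a fetcher, whose parser selects result blocks, and a shell under it groups the two field parsers.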
Then, in your code, add the rule file to railgun as a task:
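    from railgun import RailGun

    railgun = RailGun()
    # Load the rule file created above as the task definition
    railgun.setTask(open("test.json"))
    # Run the fetch and parse actions described by the rule file
    railgun.fire()
    # Collect the nodes gathered by the shell in group "default"
    nodes = railgun.getShells('default')
    print(nodes)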
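You then get a list of nodes holding all the parsed data. The fields each node carries are those named by setField in the rule file (title and description in test.json above); the keys in the sample below are only illustrative:

    [{img:xxx,src:xxx,score:xxx,dest:xxx,description:xxx},{img:xxx,src:xxx,score:xxx,dest:xxx,description:xxx}]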
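Once the node list is available, it can be processed like any list of records. The sketch below assumes each node behaves like a dict keyed by the setField names; the sample data and the helper `summarize` are hypothetical stand-ins, since the real list would come from railgun.getShells('default').

```python
# Hypothetical sample standing in for railgun.getShells('default').
nodes = [
    {"title": "Example result", "description": "First snippet"},
    {"title": "Another result"},  # a node whose description was not found
]


def summarize(nodes):
    """Return (title, description) pairs, using "" for missing fields."""
    rows = []
    for node in nodes:
        rows.append((node.get("title", ""), node.get("description", "")))
    return rows


rows = summarize(nodes)
for title, description in rows:
    print("%s: %s" % (title, description))
```

Using `get` with a default keeps the loop from failing when a selector matched nothing for a given result.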
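It can also fetch pages that need JavaScript by running them in a WebKit engine, and it selects DOM elements with CSS-style selectors.
It is cross-platform and supports Windows.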