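pyrailgun web scraping tool
pyrailgun is a simple, easy-to-use web scraping tool.
How do you use it? First, create a rule file for the target site, for example test.json: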
{
  "name": "bing searcher",
  "action": "main",
  "subaction": [
    {
      "action": "fetcher",
      "url": "http://www.bing.com/search?q=${@q}",
      "timeout": 1,
      "subaction": [
        {
          "action": "parser",
          "subaction": [
            {
              "action": "shell",
              "subaction": [
                {
                  "action": "parser",
                  "setField": "title",
                  "pos": 0,
                  "rule": "a",
                  "strip": "true"
                },
                {
                  "action": "parser",
                  "setField": "description",
                  "pos": 0,
                  "rule": "p"
                }
              ],
              "group": "default"
            }
          ],
          "rule": "#results .sa_wr"
        }
      ]
    }
  ]
}
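The nesting in the rule file can be hard to follow at a glance. The sketch below is not part of pyrailgun: it uses only the standard json module to walk a trimmed copy of the rule above and print one line per action, which makes the main, fetcher, parser, shell, parser chain visible. The helper name describe is hypothetical, and the way each line is labelled is just an illustration.

```python
import json

# Trimmed copy of the rule above, embedded so the sketch runs on its own.
RULE_TEXT = """
{
  "name": "bing searcher",
  "action": "main",
  "subaction": [{
    "action": "fetcher",
    "url": "http://www.bing.com/search?q=${@q}",
    "timeout": 1,
    "subaction": [{
      "action": "parser",
      "rule": "#results .sa_wr",
      "subaction": [{
        "action": "shell",
        "group": "default",
        "subaction": [
          {"action": "parser", "setField": "title", "pos": 0, "rule": "a", "strip": "true"},
          {"action": "parser", "setField": "description", "pos": 0, "rule": "p"}
        ]
      }]
    }]
  }]
}
"""


def describe(node, depth=0):
    """Return one indented line per action in the tree (hypothetical helper)."""
    # Show whichever identifying detail the node carries, if any.
    detail = node.get("rule") or node.get("url") or node.get("group") or ""
    label = node["action"]
    if "setField" in node:
        label += " -> " + node["setField"]
    lines = ["  " * depth + (label + " " + detail).strip()]
    # Recurse into nested actions; leaf nodes have no "subaction" key.
    for child in node.get("subaction", []):
        lines.extend(describe(child, depth + 1))
    return lines


outline = describe(json.loads(RULE_TEXT))
print("\n".join(outline))
```

Running it against your own rule file (replace RULE_TEXT with the file's contents) is a quick way to check that every setField sits inside the shell you expect before starting a crawl.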
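Then, in your code, add the rule file to railgun as a task:
from railgun import RailGun

railgun = RailGun()
# Load the rule file created above (open() replaces the Python 2-only file() builtin)
railgun.setTask(open("test.json"))
# Run the task: fetch the page and apply the parser rules
railgun.fire()
# Collect the shells stored under the "default" group
nodes = railgun.getShells('default')
print(nodes)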
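You then get a list of nodes holding all the parsed data, one node per matched shell. The keys are the names given in setField, so the rule above yields title and description, while a rule that sets img, src, score, dest and description would return: [{img:xxx,src:xxx,score:xxx,dest:xxx,description:xxx},{img:xxx,src:xxx,score:xxx,dest:xxx,description:xxx}]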
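As a rough illustration of consuming that list, the sketch below flattens the nodes into rows. It assumes each node behaves like a dict keyed by the setField names from the rule file (title and description); the sample data stands in for the result of railgun.getShells('default') and is made up, and to_rows is a hypothetical helper, not a pyrailgun function.

```python
# Stand-in for railgun.getShells('default'); values are invented for illustration.
nodes = [
    {"title": "First result", "description": "Summary of the first hit"},
    # Assumption: a field may be missing when its rule matched nothing.
    {"title": "Second result"},
]


def to_rows(nodes, fields=("title", "description")):
    """Turn dict-like nodes into tuples, using "" for absent fields."""
    return [tuple(node.get(field, "") for field in fields) for node in nodes]


rows = to_rows(nodes)
for title, description in rows:
    print("%s | %s" % (title, description))
```

Using node.get with a default keeps the loop from failing on a partially filled node, which is worth having when the page layout differs between results.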
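It can also fetch pages by running their JavaScript in a WebKit engine, and it selects DOM elements with CSS-style selectors.
Cross-platform, including Windows.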
