jparser: a Python library for web page transcoding

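jparser is a Python library for web page transcoding, that is, extracting the main content from HTML source as structured data: text paragraphs and images. It is currently optimized mainly for news and article pages.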

Usage:

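# Python 2 example (urllib2 and the print statement are Python 2 only)
import urllib2
from jparser import PageModel

# Decode the raw bytes to unicode before parsing (this page is decoded as gb18030)
html = urllib2.urlopen("http://news.sohu.com/20170512/n492734045.shtml").read().decode('gb18030')
pm = PageModel(html)
result = pm.extract()

print "==title=="
print result['title']
print "==content=="
# Each content item is a dict with a 'type' ('text' or 'image') and its 'data'
for x in result['content']:
    if x['type'] == 'text':
        print x['data']
    if x['type'] == 'image':
        print "[IMAGE]", x['data']['src']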
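The result structure used above (a 'title' plus a list of 'content' items, each with a 'type' and 'data') can also be flattened into a single string. The following is only a sketch: render_plain_text is a hypothetical helper, not part of jparser, and the sample dict is hand-written to mimic the shape shown in the usage example rather than real extractor output.

```python
def render_plain_text(result):
    """Flatten an extraction result into one plain-text string.

    Assumes the shape shown in the usage example: result['title'] is a
    string and result['content'] is a list of dicts with 'type' and 'data'.
    """
    lines = [result.get('title', '')]
    for block in result.get('content', []):
        if block['type'] == 'text':
            # Text items carry the paragraph string directly in 'data'
            lines.append(block['data'])
        elif block['type'] == 'image':
            # Image items carry a dict; only 'src' is used here
            lines.append('[IMAGE] ' + block['data']['src'])
    return '\n'.join(lines)


# Hand-written sample (an assumption, not real jparser output)
sample = {
    'title': 'Example headline',
    'content': [
        {'type': 'text', 'data': 'First paragraph.'},
        {'type': 'image', 'data': {'src': 'http://example.com/a.jpg'}},
        {'type': 'text', 'data': 'Second paragraph.'},
    ],
}

print(render_plain_text(sample))
```

Returning a string instead of printing inside the loop keeps the helper usable under both Python 2 and Python 3, and makes it easy to write the text to a file or pass it on to another step.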

Demo:

http://jparser.duapp.com/

Dependency: lxml