数据竞赛必备的NLP库总结!
NLP必备的库
上周在给大家介绍了OpenMMlab一系列的CV库后,有很多同学问有没有推荐的NLP库。因此本周我们给大家整理了机器学习和竞赛相关的NLP库,方便大家进行使用,又一篇收藏即学习系列。
import jiebaseg_list = jieba.cut("我来到北京清华大学", cut_all=True)print("Full Mode: " + "/ ".join(seg_list)) # 全模式# 【全模式】: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学seg_list = jieba.cut("我来到北京清华大学", cut_all=False)print("Default Mode: " + "/ ".join(seg_list)) # 精确模式# 【精确模式】: 我/ 来到/ 北京/ 清华大学seg_list = jieba.cut("他来到了网易杭研大厦") # 默认是精确模式print(", ".join(seg_list))# 【新词识别】:他, 来到, 了, 网易, 杭研, 大厦
spaCy是功能强化的NLP库,可与深度学习框架一起运行。spaCy提供了大多数NLP任务的标准功能(标记化,PoS标记,解析,命名实体识别)。spaCy与现有的深度学习框架接口可以一起使用,并预装了常见的语言模型。
import spacy# Load English tokenizer, tagger, parser, NER and word vectorsnlp = spacy.load("en_core_web_sm")# Process whole documentstext = ("When Sebastian Thrun started working on self-driving cars at ""Google in 2007, few people outside of the company took him ""seriously. “I can tell you very senior CEOs of major American ""car companies would shake my hand and turn away because I wasn’t ""worth talking to,” said Thrun, in an interview with Recode earlier ""this week.")doc = nlp(text)# Analyze syntaxprint("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])# Find named entities, phrases and conceptsfor entity in doc.ents:print(entity.text, entity.label_)
spaCy项目主页:https://spacy.io/
from gensim.test.utils import common_texts, get_tmpfilefrom gensim.models import Word2Vecpath = get_tmpfile("word2vec.model")model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)model.save("word2vec.model")
import nltksentence = """At eight o'clock on Thursday morningArthur didn't feel very good."""tokens = nltk.word_tokenize(sentence)tokens['At', 'eight', "o'clock", 'on', 'Thursday', 'morning','Arthur', 'did', "n't", 'feel', 'very', 'good', '.']tagged = nltk.pos_tag(tokens)tagged[0:6][('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),('Thursday', 'NNP'), ('morning', 'NN')]
from textblob import TextBlobtext = '''The titular threat of The Blob has always struck me as the ultimate moviemonster: an insatiably hungry, amoeba-like mass able to penetratevirtually any safeguard, capable of--as a doomed doctor chillinglydescribes it--"assimilating flesh on contact.Snide comparisons to gelatin be damned, it's a concept with the mostdevastating of potential consequences, not unlike the grey goo scenarioproposed by technological theorists fearful ofartificial intelligence run rampant.'''blob = TextBlob(text)blob.tags # [('The', 'DT'), ('titular', 'JJ'),# ('threat', 'NN'), ('of', 'IN'), ...]blob.noun_phrases # WordList(['titular threat', 'blob',# 'ultimate movie monster',# 'amoeba-like mass', ...])for sentence in blob.sentences:print(sentence.sentiment.polarity)# 0.060# -0.341
TextBlob官网:https://textblob.readthedocs.io/en/dev/



Transformers是现如今最流行的库,它实现了从 BERT 和 GPT-2 到 BART 和 Reformer 的各种转换。huggingface 的代码可读性强和文档也是清晰易读。在官方github的存储库中,甚至通过不同的任务来组织 python 脚本,例如语言建模、文本生成、问题回答、多项选择等。

huggingface官网:https://huggingface.co/
OpenNMT 是用于机器翻译和序列学习任务的便捷而强大的工具。其包含的高度可配置的模型和培训过程,让它成为了一个非常简单的框架。因其开源且简单的特性,建议大家使用 OpenNMT 进行各种类型的序列学习任务。

OpenNMT官网:https://opennmt.net/
若进群失败,可在后台回复【竞赛群】
即可得到最新的二维码!
评论
