数据竞赛必备的NLP库总结!
Datawhale
共 4653字,需浏览 10分钟
·
2020-10-11 19:05
NLP必备的库
上周在给大家介绍了OpenMMlab一系列的CV库后,有很多同学问有没有推荐的NLP库。因此本周我们给大家整理了机器学习和竞赛相关的NLP库,方便大家进行使用,又一篇收藏即学习系列。
import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list)) # 全模式
# 【全模式】: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list)) # 精确模式
# 【精确模式】: 我/ 来到/ 北京/ 清华大学
seg_list = jieba.cut("他来到了网易杭研大厦") # 默认是精确模式
print(", ".join(seg_list))
# 【新词识别】:他, 来到, 了, 网易, 杭研, 大厦
spaCy是功能强化的NLP库,可与深度学习框架一起运行。spaCy提供了大多数NLP任务的标准功能(标记化,PoS标记,解析,命名实体识别)。spaCy与现有的深度学习框架接口可以一起使用,并预装了常见的语言模型。
import spacy
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")
# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
"Google in 2007, few people outside of the company took him "
"seriously. “I can tell you very senior CEOs of major American "
"car companies would shake my hand and turn away because I wasn’t "
"worth talking to,” said Thrun, in an interview with Recode earlier "
"this week.")
doc = nlp(text)
# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])
# Find named entities, phrases and concepts
for entity in doc.ents:
print(entity.text, entity.label_)
spaCy项目主页:https://spacy.io/
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec
path = get_tmpfile("word2vec.model")
model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
import nltk
"""At eight o'clock on Thursday morning sentence =
Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)
tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
tagged = nltk.pos_tag(tokens)
0:6] tagged[
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]
from textblob import TextBlob
text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''
blob = TextBlob(text)
blob.tags # [('The', 'DT'), ('titular', 'JJ'),
# ('threat', 'NN'), ('of', 'IN'), ...]
blob.noun_phrases # WordList(['titular threat', 'blob',
# 'ultimate movie monster',
# 'amoeba-like mass', ...])
for sentence in blob.sentences:
print(sentence.sentiment.polarity)
# 0.060
# -0.341
TextBlob官网:https://textblob.readthedocs.io/en/dev/
Transformers是现如今最流行的库,它实现了从 BERT 和 GPT-2 到 BART 和 Reformer 的各种转换。huggingface 的代码可读性强和文档也是清晰易读。在官方github的存储库中,甚至通过不同的任务来组织 python 脚本,例如语言建模、文本生成、问题回答、多项选择等。
huggingface官网:https://huggingface.co/
OpenNMT 是用于机器翻译和序列学习任务的便捷而强大的工具。其包含的高度可配置的模型和培训过程,让它成为了一个非常简单的框架。因其开源且简单的特性,建议大家使用 OpenNMT 进行各种类型的序列学习任务。
OpenNMT官网:https://opennmt.net/
若进群失败,可在后台回复【竞赛群】
即可得到最新的二维码!
评论