【NLP】Essential NLP Libraries for Competitions
Essential NLP libraries
This week we have compiled the NLP libraries most relevant to machine learning and competitions for easy reference; we recommend bookmarking this article.
jieba is the most widely used Chinese word segmentation library, supporting full mode, accurate mode, and search-engine mode.

import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode
# Full Mode: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # accurate mode
# Default Mode: 我/ 来到/ 北京/ 清华大学

seg_list = jieba.cut("他来到了网易杭研大厦")  # accurate mode is the default
print(", ".join(seg_list))
# New word recognition: 他, 来到, 了, 网易, 杭研, 大厦
spaCy is an industrial-strength NLP library that runs alongside deep learning frameworks. It provides the standard components for most NLP tasks (tokenization, PoS tagging, parsing, named entity recognition), offers interfaces to existing deep learning frameworks, and comes with pretrained models for common languages.
import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)
spaCy project page: https://spacy.io/
Gensim is the go-to library for topic modeling and word embeddings; the snippet below trains and saves a Word2Vec model on Gensim's built-in toy corpus.

from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec

path = get_tmpfile("word2vec.model")
# Note: in Gensim 4.x the parameter `size` has been renamed to `vector_size`.
model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
NLTK is the classic Python NLP toolkit, with tokenizers, PoS taggers, parsers, and a large collection of corpora.

import nltk
# First run may require: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""

tokens = nltk.word_tokenize(sentence)
print(tokens)
# ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
#  'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

tagged = nltk.pos_tag(tokens)
print(tagged[0:6])
# [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
#  ('Thursday', 'NNP'), ('morning', 'NN')]
TextBlob is a lightweight NLP library built on top of NLTK, offering a simple API for PoS tagging, noun phrase extraction, and sentiment analysis.

from textblob import TextBlob

text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

blob = TextBlob(text)
blob.tags          # [('The', 'DT'), ('titular', 'JJ'),
                   #  ('threat', 'NN'), ('of', 'IN'), ...]
blob.noun_phrases  # WordList(['titular threat', 'blob',
                   #  'ultimate movie monster',
                   #  'amoeba-like mass', ...])

for sentence in blob.sentences:
    print(sentence.sentiment.polarity)
# 0.060
# -0.341
TextBlob website: https://textblob.readthedocs.io/en/dev/



Transformers is the most popular NLP library today, implementing a wide range of Transformer models, from BERT and GPT-2 to BART and Reformer. Hugging Face's code is highly readable and its documentation is clear. In the official GitHub repository, the example Python scripts are even organized by task, such as language modeling, text generation, question answering, and multiple choice.
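As an illustration (not from the original article), here is a minimal sketch of the library's pipeline API, assuming transformers and a PyTorch or TensorFlow backend are installed; the "sentiment-analysis" task and the sample sentence are just examples.

from transformers import pipeline

# Minimal sketch: pipeline() downloads a default pretrained model for the
# requested task on first use (internet access required).
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers makes state-of-the-art NLP easy to use."))
# Result has the shape: [{'label': 'POSITIVE', 'score': 0.99...}]

The same pipeline() entry point covers other tasks as well, such as question answering and text generation.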

Hugging Face website: https://huggingface.co/
OpenNMT is a convenient and powerful tool for machine translation and sequence learning tasks. Its highly configurable models and training pipelines make it a very approachable framework. Because it is open source and simple to use, OpenNMT is recommended for all kinds of sequence learning tasks. A rough sketch of a typical workflow is shown below.
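As a rough illustration (not from the original article), the sketch below drives the OpenNMT-py command-line tools from Python; config.yaml, the checkpoint name, and the data paths are hypothetical placeholders to replace with your own.

import subprocess

# Hedged sketch of the usual OpenNMT-py workflow: build vocabularies, train,
# then translate. All file names below are placeholders.
subprocess.run(["onmt_build_vocab", "-config", "config.yaml", "-n_sample", "10000"], check=True)
subprocess.run(["onmt_train", "-config", "config.yaml"], check=True)
subprocess.run(["onmt_translate",
                "-model", "model_step_1000.pt",  # checkpoint produced by onmt_train
                "-src", "src-test.txt",
                "-output", "pred.txt"], check=True)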

OpenNMT website: https://opennmt.net/