(7) RASA NLU Entity Extractors

About the author

Original article: https://zhuanlan.zhihu.com/p/333641672

Reposted by: 杨夕 (Yang Xi)

Interview notes: https://github.com/km1994/NLP-Interview-Notes

Personal notes: https://github.com/km1994/nlp_paper_study

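Beyond understanding what the user means, a dialogue bot also has to collect the information it needs from the user. The variables used for information retrieval are called slots. Most slot values come from named entities in the user's utterances; in a few cases the user's intent itself serves as a slot. For example, if the user's intent is to book a train ticket, the bot must know the departure point and the destination, and it gets them by extracting place-name entities from the conversation. RASA's entity extractors do this job. RASA currently supports the following entity extractors: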

MitieEntityExtractor

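Extracts named entities with MitieNLP. It requires the MitieNLP language model. Although the pipeline must also configure MitieTokenizer and MitieFeaturizer, MitieEntityExtractor actually regenerates its own features when it runs.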

As mentioned earlier, MITIE performs entity extraction with a multi-class linear SVM, and its output does not include a confidence score.

SpacyEntityExtractor

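Extracts named entities with SpacyNLP. It requires the SpacyNLP language model, SpacyTokenizer, and SpacyFeaturizer.
spaCy uses a statistical BILOU transition model. So far, SpacyEntityExtractor can only use the built-in NER model; it cannot train a new model, and its output carries no confidence score either.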

When configuring SpacyEntityExtractor, the dimensions parameter specifies which entity types to extract. The available types are:

PERSON       People, including fictional.
NORP         Nationalities or religious or political groups.
FAC          Buildings, airports, highways, bridges, etc.
ORG          Companies, agencies, institutions, etc.
GPE          Countries, cities, states.
LOC          Non-GPE locations, mountain ranges, bodies of water.
PRODUCT      Objects, vehicles, foods, etc. (Not services.)
EVENT        Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART  Titles of books, songs, etc.
LAW          Named documents made into laws.
LANGUAGE     Any named language.
DATE         Absolute or relative dates or periods.
TIME         Times smaller than a day.
PERCENT      Percentage, including "%".
MONEY        Monetary values, including unit.
QUANTITY     Measurements, as of weight or distance.
ORDINAL      "first", "second", etc.
CARDINAL     Numerals that do not fall under another type.

If dimensions is not specified, all entity types are returned by default. It is configured as follows:

pipeline:
- name: "SpacyEntityExtractor"
  # dimensions to extract
  dimensions: ["PERSON", "LOC", "ORG", "PRODUCT"]
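The effect of dimensions can be pictured as a simple label filter. The following is a minimal sketch in plain Python, not Rasa's actual implementation; the entity list and the helper filter_by_dimensions are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical extractor output; labels follow the spaCy scheme listed above.
entities = [
    {"entity": "PERSON", "value": "Alice"},
    {"entity": "GPE", "value": "Berlin"},
    {"entity": "ORG", "value": "Rasa"},
]


def filter_by_dimensions(
    found: List[Dict[str, str]], dimensions: Optional[List[str]] = None
) -> List[Dict[str, str]]:
    """Keep only entities whose label is listed; None keeps everything."""
    if dimensions is None:
        return list(found)
    allowed = set(dimensions)
    return [e for e in found if e["entity"] in allowed]


kept = filter_by_dimensions(entities, ["PERSON", "LOC", "ORG", "PRODUCT"])
print([e["value"] for e in kept])
```

With the dimensions from the configuration above, the GPE entity is dropped because GPE is not in the list, while calling the helper without dimensions returns every entity.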

CRFEntityExtractor

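A conditional random field (CRF) entity extractor. CRFs are among the most commonly used NER tools today, and combining a CRF with an LSTM or with BERT can give very good results.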

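To pass custom features (for example, pretrained word embeddings) to CRFEntityExtractor, add any featurizer that outputs dense features before CRFEntityExtractor in the pipeline. CRFEntityExtractor automatically looks for dense feature vectors and checks that the dense features form an iterable of length len(tokens) in which every entry is a vector. If the check fails, it shows a warning and continues training without the custom feature vectors. If the custom features meet the requirement, CRFEntityExtractor passes the dense feature vectors to sklearn_crfsuite for training.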

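Because the CRF has to estimate, for every token, the probability that it is part of a named entity, the dense features of a sentence should form a matrix of shape [number of tokens × feature dimension per token].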
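The shape requirement can be made concrete with a small check. This is a sketch of the idea only, written in plain Python; the function name usable_dense_features is hypothetical and the real check inside CRFEntityExtractor may differ in detail.

```python
from typing import Sequence


def usable_dense_features(
    tokens: Sequence[str], features: Sequence[Sequence[float]]
) -> bool:
    """Return True if there is exactly one equal-length vector per token."""
    if len(features) != len(tokens):
        return False
    dims = {len(vec) for vec in features}
    # All vectors must share one non-zero dimension.
    return len(dims) == 1 and dims != {0}


tokens = ["book", "a", "ticket"]
good = [[0.1, 0.2], [0.0, 0.5], [0.3, 0.3]]  # 3 tokens x 2 dimensions
bad = good[:2]                               # one vector is missing

print(usable_dense_features(tokens, good))
print(usable_dense_features(tokens, bad))
```

In the failing case the extractor described above would warn and train without the custom features rather than stop.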

CRFEntityExtractor has a default feature list. The default configuration can be replaced using the following options:

==============  ==========================================================================================
Feature Name    Description
==============  ==========================================================================================
low             Checks if the token is lower case.
upper           Checks if the token is upper case.
title           Checks if the token starts with an uppercase character and all remaining characters are
                lowercased.
digit           Checks if the token contains just digits.
prefix5         Take the first five characters of the token.
prefix2         Take the first two characters of the token.
suffix5         Take the last five characters of the token.
suffix3         Take the last three characters of the token.
suffix2         Take the last two characters of the token.
suffix1         Take the last character of the token.
pos             Take the Part-of-Speech tag of the token (``SpacyTokenizer`` required).
pos2            Take the first two characters of the Part-of-Speech tag of the token
                (``SpacyTokenizer`` required).
pattern         Take the patterns defined by ``RegexFeaturizer``.
bias            Add an additional "bias" feature to the list of features.
==============  ==========================================================================================

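As the featurizer's sliding window moves over the tokens of a user message, feature templates can be defined for the previous token, the current token, and the next token in the window. The templates are given as an array in [before, token, after] order. In addition, the BILOU_flag option decides whether to use the BILOU tagging scheme, an encoding that marks the beginning, inside, and last tokens of an entity, as well as single-token entities and tokens outside any entity.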

pipeline:
- name: "CRFEntityExtractor"
  # BILOU_flag determines whether to use BILOU tagging or not.
  "BILOU_flag": True
  # features to extract in the sliding window
  "features": [
    ["low", "title", "upper"],
    [
      "bias",
      "low",
      "prefix5",
      "prefix2",
      "suffix5",
      "suffix3",
      "suffix2",
      "upper",
      "title",
      "digit",
      "pattern",
    ],
    ["low", "title", "upper"],
  ]
  # The maximum number of iterations for optimization algorithms.
  "max_iterations": 50
  # weight of the L1 regularization
  "L1_c": 0.1
  # weight of the L2 regularization
  "L2_c": 0.1
  # Name of dense featurizers to use.
  # If list is empty all available dense features are used.
  "featurizers": []
  # Indicates whether a list of extracted entities should be split into individual entities for a given entity type
  "split_entities_by_comma":
      address: False
      email: True
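To show how a [before, token, after] template turns into per-token features, here is a minimal sketch in plain Python. It is not Rasa's code: FEATURE_FUNCS covers only a few of the feature names from the table, and the "-1:", "0:", "+1:" key prefixes are an assumption chosen for readability.

```python
from typing import Callable, Dict, List

# A few feature functions from the table above, keyed by feature name.
FEATURE_FUNCS: Dict[str, Callable[[str], object]] = {
    "low": lambda t: t.islower(),
    "upper": lambda t: t.isupper(),
    "title": lambda t: t.istitle(),
    "digit": lambda t: t.isdigit(),
    "suffix3": lambda t: t[-3:],
    "bias": lambda t: "bias",
}

TEMPLATE = [
    ["low", "title"],                              # before
    ["bias", "low", "title", "digit", "suffix3"],  # token
    ["low", "title"],                              # after
]


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build the feature dict for tokens[i] from its 3-token window."""
    feats: Dict[str, object] = {}
    for prefix, offset, names in zip(("-1", "0", "+1"), (-1, 0, 1), TEMPLATE):
        j = i + offset
        if not 0 <= j < len(tokens):
            continue  # the window hangs over the edge of the message
        for name in names:
            feats[f"{prefix}:{name}"] = FEATURE_FUNCS[name](tokens[j])
    return feats


tokens = ["to", "Berlin", "tomorrow"]
print(token_features(tokens, 1))
```

A list of such dicts, one per token, is the kind of input a CRF library such as sklearn_crfsuite trains on.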

If you use the POS features (pos or pos2), SpacyTokenizer must be in the pipeline.

If you use the pattern feature, RegexFeaturizer must be in the pipeline.
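The BILOU scheme enabled by BILOU_flag can be illustrated with a short sketch. The helper bilou_tags below is hypothetical and takes entity spans as inclusive token indices, which is an assumption made to keep the example small.

```python
from typing import List, Tuple


def bilou_tags(num_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Tag tokens with B/I/L/O/U; spans are (first, last, label), inclusive."""
    tags = ["O"] * num_tokens              # O: outside any entity
    for first, last, label in spans:
        if first == last:
            tags[first] = f"U-{label}"     # U: single-token (unit) entity
            continue
        tags[first] = f"B-{label}"         # B: beginning
        for k in range(first + 1, last):
            tags[k] = f"I-{label}"         # I: inside
        tags[last] = f"L-{label}"          # L: last
    return tags


tokens = ["I", "moved", "to", "New", "York", "City", "from", "Paris"]
tags = bilou_tags(len(tokens), [(3, 5, "city"), (7, 7, "city")])
for token, tag in zip(tokens, tags):
    print(f"{token:6} {tag}")
```

Compared with tagging every entity token with the same label, this encoding tells the model where an entity starts and ends, which helps with multi-token entities.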

DucklingHTTPExtractor

This component lets Rasa call a remote HTTP service, known as the Duckling server, to extract named entities.

The Duckling service can be started as a Docker container:

docker run -p 8000:8000 rasa/duckling

Alternatively, install Duckling directly and then start the server.

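Duckling recognizes dates, numbers, distances, and other structured entities and normalizes them. It tries to extract as many entity types as possible without ranking them. For example, for the sentence "I will be there in 10 minutes", if both number and time are specified as Duckling dimensions, Duckling extracts two entities: 10 as a number and "in 10 minutes" as a time. In that case the application has to decide which entity type is correct. Because Duckling is a rule-based system, the extracted entities always come back with a confidence of 1.0.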
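How the application resolves such overlaps is up to the developer. One possible heuristic, sketched here in plain Python, is to keep the longest span and drop anything that overlaps it. The candidate dicts and the helper keep_longest are hypothetical and do not reproduce Duckling's response format.

```python
from typing import Any, Dict, List

text = "I will be there in 10 minutes"


def span(entity: str, value: str) -> Dict[str, Any]:
    """Build a candidate entity with character offsets into text."""
    start = text.index(value)
    return {"entity": entity, "text": value, "start": start, "end": start + len(value)}


candidates = [span("number", "10"), span("time", "in 10 minutes")]


def keep_longest(entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop any entity that overlaps a longer one that was already kept."""
    kept: List[Dict[str, Any]] = []
    by_length = sorted(entities, key=lambda e: e["end"] - e["start"], reverse=True)
    for e in by_length:
        overlaps = any(e["start"] < k["end"] and k["start"] < e["end"] for k in kept)
        if not overlaps:
            kept.append(e)
    return sorted(kept, key=lambda e: e["start"])


print([e["entity"] for e in keep_longest(candidates)])
```

Here the time reading wins because its span contains the number. Other applications may prefer a different rule, for example choosing the type that the current slot expects.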

The list of supported languages is available in the Duckling GitHub repository.

Configuration:

pipeline:
- name: "DucklingHTTPExtractor"
  # url of the running duckling server
  url: "http://localhost:8000"
  # dimensions to extract
  dimensions: ["time", "number", "amount-of-money", "distance"]
  # allows you to configure the locale, by default the language is
  # used
  locale: "de_DE"
  # if not set the default timezone of Duckling is going to be used
  # needed to calculate dates from relative expressions like "tomorrow"
  timezone: "Europe/Berlin"
  # Timeout for receiving response from http url of the running duckling server
  # if not set the default timeout of duckling http url is set to 3 seconds.
  timeout : 3

DIETClassifier

As described earlier, DIET performs entity recognition together with intent classification.

RegexEntityExtractor

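This component extracts entities using the lookup tables and regular expressions defined in the training data. It checks whether the user message contains an entry from one of the lookup tables or matches one of the regular expressions. If a match is found, the matched value is extracted as an entity.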

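The component only uses regex patterns whose name equals one of the entities defined in the training data, so make sure every entity is annotated in at least one training example.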

case_sensitive: specifies whether matching is case-sensitive.

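use_word_boundaries: specifies whether matches must fall on word boundaries. It is of no use for Chinese, which does not separate words with whitespace; it is meant for text handled by WhitespaceTokenizer.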
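The matching behaviour and the two options above can be sketched with Python's re module. This is an illustration under assumptions, not the component's source: LOOKUP_TABLES, REGEXES, and extract are hypothetical names, and word boundaries are applied only to lookup-table entries here.

```python
import re
from typing import Any, Dict, List

# Hypothetical training-data definitions, named after the entity they yield.
LOOKUP_TABLES = {"city": ["Berlin", "New York City"]}
REGEXES = {"zipcode": r"\d{5}"}


def extract(
    text: str, case_sensitive: bool = False, use_word_boundaries: bool = True
) -> List[Dict[str, Any]]:
    """Return every lookup-table or regex match in text as an entity."""
    flags = 0 if case_sensitive else re.IGNORECASE
    patterns = dict(REGEXES)
    for name, entries in LOOKUP_TABLES.items():
        alternation = "|".join(re.escape(entry) for entry in entries)
        if use_word_boundaries:
            alternation = rf"\b(?:{alternation})\b"
        patterns[name] = alternation
    found = []
    for name, pattern in patterns.items():
        for m in re.finditer(pattern, text, flags):
            found.append(
                {"entity": name, "value": m.group(0), "start": m.start(), "end": m.end()}
            )
    return sorted(found, key=lambda e: e["start"])


print(extract("I live in berlin 10115"))
```

With word boundaries on, "Berliner" does not match the lookup entry "Berlin". Chinese text has no whitespace between words, so such boundaries would rarely line up with the entries, which is why the option is not helpful there.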

    pipeline:
    - name: RegexEntityExtractor
      # text will be processed with case insensitive as default
      case_sensitive: False
      # use lookup tables to extract entities
      use_lookup_tables: True
      # use regexes to extract entities
      use_regexes: True
      # use match word boundaries for lookup table
      "use_word_boundaries": True

EntitySynonymMapper

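Entity synonym mapping. This component takes the entities extracted by other extractors and uses a synonym table to normalize them to a single canonical form, which makes downstream processing easier. For example: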

[
    {
      "text": "I moved to New York City",
      "intent": "inform_relocation",
      "entities": [{
        "value": "nyc",
        "start": 11,
        "end": 24,
        "entity": "city"
      }]
    },
    {
      "text": "I got a new flat in NYC.",
      "intent": "inform_relocation",
      "entities": [{
        "value": "nyc",
        "start": 20,
        "end": 23,
        "entity": "city"
      }]
    }
]

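Whether the user message says New York City or NYC, the value is mapped to nyc. Note that EntitySynonymMapper does not extract entities itself; it only maps the entities that other extractors have extracted.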
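The mapping step itself amounts to a dictionary lookup applied after extraction. A minimal sketch follows, assuming a hand-written SYNONYMS table and a case-insensitive lookup; both are illustrative assumptions, since in Rasa the table is built from the training data.

```python
from typing import Any, Dict, List

# Hypothetical synonym table: surface form (lowercased) -> canonical value.
SYNONYMS = {"new york city": "nyc", "nyc": "nyc"}


def map_synonyms(entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Replace each entity value with its canonical form when one is known."""
    mapped = []
    for e in entities:
        canonical = SYNONYMS.get(str(e["value"]).lower(), e["value"])
        mapped.append({**e, "value": canonical})
    return mapped


# Output of some upstream extractor for "I moved to New York City".
extracted = [{"entity": "city", "value": "New York City", "start": 11, "end": 24}]
print(map_synonyms(extracted))
```

The start and end offsets are left untouched, so the entity still points at the original text while its value is normalized. Values without a table entry pass through unchanged.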

If these extractors do not meet your needs, RASA can be extended with custom components, which a separate chapter covers in detail.