Corpus Data Processing with Python (Part 6)


"Playing with Corpus Data in Python" series · Part 6

By 段洵


Let's learn how to process corpus data with Python!

Today's topic is matching zero or more characters.

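I. Common Quantifier Symbols

We often need to match zero, one, or several characters at once, which calls for symbols that express quantity. The table below lists the most common ones.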

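Symbol	Meaning
*	Matches the preceding item zero or more times
+	Matches the preceding item one or more times
?	Matches the preceding item zero or one time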

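These quantifiers cannot stand alone; each must follow an ordinary character or a metacharacter. For example, b+ matches a single b or a run of consecutive b's; \w+ matches one or more letters, digits, or underscores; \d* matches zero or more digits; and \s? matches zero or one whitespace character.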
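The four patterns just mentioned can be tried out in a few lines. This is only a minimal sketch: the sample strings are made up for illustration and do not come from the FROWN corpus.

```python
import re

# Hypothetical sample text, invented for this illustration.
sample = "bb 42 a_b7  x"

b_runs = re.findall(r'b+', sample)        # runs of one or more "b"
words = re.findall(r'\w+', sample)        # runs of letters, digits, underscores
digits_plus = re.findall(r'\d+', sample)  # runs of one or more digits
digits_star = re.findall(r'\d*', "a12")   # \d* also succeeds on zero digits
optional_space = re.findall(r'a\s?b', "ab a b a  b")  # at most one whitespace

print(b_runs)          # ['bb', 'b']
print(words)           # ['bb', '42', 'a_b7', 'x']
print(digits_plus)     # ['42', '7']
print(digits_star)     # ['', '12', '']
print(optional_space)  # ['ab', 'a b']
```

Note how \d* returns empty strings wherever it finds no digit, which is why "*" is usually combined with other required characters.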

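Let's look at an example. The sample text is an excerpt from the FROWN corpus. Complete the following search tasks: ① How do we find every word in the text that ends in -ing? ② How do we find every word that begins with th-? ③ How do we find all numbers, or all strings that contain digits? ④ How do we find hyphenated words such as co-author? ⑤ How do we find all strings made up of exactly two characters? ⑥ Every line of the text begins with a string such as "A01 17". How do we find all strings of this kind?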

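For task ①, use \w*ing\b or \w+ing\b. Without the trailing \b, \w*ing and \w+ing still find every word ending in ing in the text above, but they would also match parts of words such as Washington, Salinger, or hearings. The two differ in that \w+ing only matches "one or more characters + ing", whereas \w*ing matches either a bare "ing" or "one or more characters + ing".
For task ②, use \bth\w+.
For task ③, \d+ finds all numbers, and \w*\d+\w* finds all numbers as well as strings that mix letters and digits, such as A01, 17, 308, and 114. Note that \w*\d+\w* cannot return "308-114" as a single match. To find strings such as "308-114" or "2-kilo", which join letters or digits with a hyphen "-", use \w+-\w+.
For task ④, use \w+-\w+.
For task ⑤, use \b\w\w\b.
For task ⑥, use A\d+\s+\d+\s.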

See the code below.

import re

# Excerpt from the FROWN corpus; every line starts with a code such as "A01 17"
string = '''
A01 17 <p_>The bill was immediately sent to the House, which voted 308-114
A01 18 for the override, 26 more than needed. A cheer went up as the House
A01 19 vote was tallied, ending Bush's string of successful vetoes at
A01 20 35.
A01 21 <p_>Among those voting to override in the Senate was Democratic
A01 22 vice presidential nominee Al Gore, a co-author of the bill. He then
A01 23 left the chamber to join Democratic presidential nominee Bill
A01 24 Clinton on 'Larry King Live' on CNN.

'''

# Task 1: words ending in -ing
print(re.findall(r'\w*ing\b', string))
# ['ending', 'string', 'voting', 'King']

# Task 2: words beginning with th-
print(re.findall(r'\bth\w+', string))
# ['the', 'the', 'than', 'the', 'those', 'the', 'the', 'then', 'the']

# Task 3: numbers and strings containing digits
print(re.findall(r'\w*\d+\w*', string))
# ['A01', '17', '308', '114', 'A01', '18', '26', 'A01', '19', 'A01', '20', '35', 'A01', '21', 'A01', '22', 'A01', '23', 'A01', '24']

# Task 4: hyphenated strings
print(re.findall(r'\w+-\w+', string))
# ['308-114', 'co-author']

# Task 5: strings of exactly two characters
print(re.findall(r'\b\w\w\b', string))
# ['17', 'p_', 'to', '18', '26', 'up', 'as', '19', 'of', 'at', '20', '35', '21', 'p_', 'to', 'in', '22', 'Al', 'co', 'of', 'He', '23', 'to', '24', 'on', 'on']

# Task 6: the line codes at the start of each line
print(re.findall(r'A\d+\s+\d+\s', string))
# ['A01 17 ', 'A01 18 ', 'A01 19 ', 'A01 20 ', 'A01 21 ', 'A01 22 ', 'A01 23 ', 'A01 24 ']
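The FROWN excerpt happens not to contain words like Washington or hearings, so the effect of the trailing \b in task ① is not visible there. The sketch below uses a made-up sentence (an assumption for illustration, not corpus data) to show what changes when \b is dropped and how \w* differs from \w+.

```python
import re

# Hypothetical sentence in which "ing" occurs both inside and at the end of words.
text = "Washington held hearings, ending the ing debate"

no_boundary = re.findall(r'\w*ing', text)      # "ing" may sit inside a word
star_boundary = re.findall(r'\w*ing\b', text)  # "ing" must end the word
plus_boundary = re.findall(r'\w+ing\b', text)  # needs a character before "ing"

print(no_boundary)    # ['Washing', 'hearing', 'ending', 'ing']
print(star_boundary)  # ['ending', 'ing']
print(plus_boundary)  # ['ending']
```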

II. Using {}, [], and ()

All letters, digits, and symbols with no special meaning (such as the underscore) are ordinary characters.

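1. Using {}

A number placed inside {}, following an ordinary character or a metacharacter, also expresses quantity. For example, r{2} matches "rr"; r{2,} matches the letter r occurring two or more times in a row, such as "rr" or "rrrr"; and r{0,3} matches r occurring zero, one, two, or three times in a row. It follows that the \d* described earlier is equivalent to \d{0,}, \d+ is equivalent to \d{1,}, and \d? is equivalent to \d{0,1}.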
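A short sketch can confirm the brace notation and the stated equivalences. The sample strings are hypothetical, and re.fullmatch is used here (rather than findall) so that the zero-length case of r{0,3} can be checked directly.

```python
import re

# Hypothetical sample: runs of "r" of different lengths.
sample = "r rr rrr rrrr"

exactly_two = re.findall(r'\br{2}\b', sample)   # exactly two r's
two_or_more = re.findall(r'\br{2,}\b', sample)  # two or more r's

# r{0,3} accepts zero to three r's when the whole string is tested.
up_to_three = [bool(re.fullmatch(r'r{0,3}', s)) for s in ['', 'r', 'rrr', 'rrrr']]

# The shorthand quantifiers behave like their brace forms.
numbers = "308-114, 26 and 35"
same_plus = re.findall(r'\d+', numbers) == re.findall(r'\d{1,}', numbers)
same_star = re.findall(r'\d*', numbers) == re.findall(r'\d{0,}', numbers)
same_qmark = re.findall(r'\d?', numbers) == re.findall(r'\d{0,1}', numbers)

print(exactly_two)   # ['rr']
print(two_or_more)   # ['rr', 'rrr', 'rrrr']
print(up_to_three)   # [True, True, True, False]
print(same_plus, same_star, same_qmark)  # True True True
```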

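2. Using []

Ordinary characters placed inside [] match any one of those characters. For example, [abcd] matches a, b, c, or d, while [abcd]+ matches any string built from those four letters in any combination, such as "adc", "add", "abdc", or "bcdaadbc". [abcd] is equivalent to [abcd]{1}, and [abcd]+ is equivalent to [abcd]{1,}. In addition, [a-z] stands for any one lowercase letter from a to z, and [0-9] stands for any one digit.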
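The character-class behaviour described above can be checked with a small example. The word list is invented for illustration; \b is added so that only whole words built from the listed letters are returned.

```python
import re

# Hypothetical sample text.
sample = "add bad cab 2024 deed"

abcd_words = re.findall(r'\b[abcd]+\b', sample)  # whole words made only of a, b, c, d
lowercase_runs = re.findall(r'[a-z]+', sample)   # runs of lowercase letters
digit_runs = re.findall(r'[0-9]+', sample)       # runs of digits

print(abcd_words)      # ['add', 'bad', 'cab']
print(lowercase_runs)  # ['add', 'bad', 'cab', 'deed']
print(digit_runs)      # ['2024']
```

"deed" is excluded from the first result because the letter e is not in the class [abcd].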

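3. Using ()

To repeat an expression several times, enclose it in () and follow the parentheses with a quantifier. To match a string such as "abc98cdef54r45gsdh56539", in which a "letters + digits" combination repeats many times, we can use ([a-z]+[0-9]+)+, where the "+" after the parentheses repeats the ([a-z]+[0-9]+) group one or more times (of course, \w+ would match it more simply). If we only want to match the "letters + digits" combination repeated two or three times, we need ([a-z]+[0-9]+){2,3}.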
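One practical detail when trying these grouped patterns in Python: if a pattern contains a capturing group, re.findall returns the group's content instead of the whole match. The sketch below therefore also uses the non-capturing form (?:...), which is an addition for illustration and not part of the patterns given in the text.

```python
import re

s = "abc98cdef54r45gsdh56539"

# The whole string is a repetition of "letters + digits" units.
whole = re.fullmatch(r'([a-z]+[0-9]+)+', s)

# With a capturing group, findall returns only the last repetition of the group.
captured = re.findall(r'([a-z]+[0-9]+)+', s)

# With a non-capturing group, findall returns the full match.
full = re.findall(r'(?:[a-z]+[0-9]+)+', s)

# Limiting the repetition to two or three units.
two_or_three = re.findall(r'(?:[a-z]+[0-9]+){2,3}', s)

print(whole is not None)  # True
print(captured)           # ['gsdh56539']
print(full)               # ['abc98cdef54r45gsdh56539']
print(two_or_three)       # ['abc98cdef54r45']
```

In the last search the first three units are consumed greedily, and the single remaining unit "gsdh56539" cannot satisfy the minimum of two repetitions, so it is not returned.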

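Let's look at an example. Given the string below, complete the following search tasks: ① Which names in the string consist of 3 or 4 letters? ② Which names consist of 6 or more letters? ③ Which names begin with J and end with a? ④ Which names begin with J, end with a, and have more than 5 letters? ⑤ Which names begin with J, K, L, or M and have 5 or more letters?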

import re

string = '''
Mary  Michael  Susan  Larry  Christina
Elizabeth   Juliana   Julia   Leo  Jane
Jason  Johansson  John   Bill  Katherine
'''

print(re.findall(r'\b\w{3,4}\b', string))         # ['Mary', 'Leo', 'Jane', 'John', 'Bill']

print(re.findall(r'\b\w{6,}\b', string))          # ['Michael', 'Christina', 'Elizabeth', 'Juliana', 'Johansson', 'Katherine']

print(re.findall(r'\bJ\w*a\b', string))           # ['Juliana', 'Julia']

print(re.findall(r'\bJ\w{4,}a\b', string))        # ['Juliana']

print(re.findall(r'\b[JKLM]\w{4,}\b', string))    # ['Michael', 'Larry', 'Juliana', 'Julia', 'Jason', 'Johansson', 'Katherine']

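III. Greedy or Lazy

As noted earlier, "*" means zero or more and "+" means one or more. Because "*" and "+" can match several characters, they match as many characters as they can, which is why they are called greedy quantifiers.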

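Consider the example below, in which we search a string twice. The first search, re.findall(r'.+', string), returns a list with a single element, the entire string. The second search, re.findall(r'.*', string), returns ["<p>The bill was immediately sent to the House, which voted 308-114 for the override, 26 more than needed. A cheer went up as the House vote was tallied, ending Bush's string of successful vetoes at 35.</p>", '']. This result is a list of two elements: the first is the entire string and the second is an empty string.

The two results differ because "+" means one or more: once the first match reaches the last character ">" of the string, the search is complete. "*" means zero or more: after the first match reaches the last character ">", a second attempt is made at the end of the string, where matching zero characters also counts as a success, so the second search returns one extra empty string.

Both results show that "+" and "*" alike are greedy: each matches as many characters as possible.

import re

string = "<p>The bill was immediately sent to the House, which voted 308-114 for the override, 26 more than needed. A cheer went up as the House vote was tallied, ending Bush's string of successful vetoes at 35.</p>"

print(re.findall(r'.+', string))
print(re.findall(r'.*', string))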

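As another example, \d+ matches numbers such as 308, 114, 26, and 35 in the text, because "+" is greedy and \d+ therefore consumes every run of consecutive digits. If we want to match all the numbers but only one digit character at a time, we need to add "?" after the quantifier.

In contrast to "*" and "+" on their own, a "?" placed after a quantifier makes it a lazy quantifier, which matches as few characters as possible. So \d+? still covers every digit in the text, but each match consists of a single digit character rather than a whole run of digits.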
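The difference between \d+ and \d+? is easy to see on a short fragment. The text below is a shortened, hypothetical variant of the example sentence, used only to keep the output readable.

```python
import re

# Shortened, hypothetical variant of the example sentence.
text = "voted 308-114, 26 more, at 35."

greedy_digits = re.findall(r'\d+', text)   # each match is a whole run of digits
lazy_digits = re.findall(r'\d+?', text)    # each match is a single digit

print(greedy_digits)  # ['308', '114', '26', '35']
print(lazy_digits)    # ['3', '0', '8', '1', '1', '4', '2', '6', '3', '5']
```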

Now look at the next example. Compare how the two expressions '<.*>' and '<.*?>' behave when searching the text below.

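import re

string = '''<p>The bill was immediately sent to the House, which voted 308-114 for the override, 26 more than needed. A cheer went up as the House vote was tallied, ending Bush's string of successful vetoes at 35.</p>'''

# Greedy: from the first "<" to the last ">"
print(re.findall(r'<.*>', string))
# ["<p>The bill was immediately sent to the House, which voted 308-114 for the override, 26 more than needed. A cheer went up as the House vote was tallied, ending Bush's string of successful vetoes at 35.</p>"]

# Lazy: from each "<" to the next ">"
print(re.findall(r'<.*?>', string))
# ['<p>', '</p>']

<.*> matches the entire text. Because ".*" is greedy, <.*> works as follows: it finds the first "<" in the text, then the last ">" in the text, and matches everything between that first "<" and that last ">".

<.*?> matches <p> and </p>. Because ".*?" is lazy, <.*?> works as follows: it finds the first "<" in the text, then the next ">" that follows, and matches everything between that "<" and that next ">"; the search then continues from there in the same way.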
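The contrast becomes even clearer when a text contains several tags. The following sketch uses a made-up snippet of markup (an assumption for illustration, not taken from the corpus).

```python
import re

# Hypothetical markup with several tags.
markup = "<b>bold</b> and <i>italic</i>"

greedy_tags = re.findall(r'<.*>', markup)   # first "<" through the last ">"
lazy_tags = re.findall(r'<.*?>', markup)    # each "<" through the next ">"

print(greedy_tags)  # ['<b>bold</b> and <i>italic</i>']
print(lazy_tags)    # ['<b>', '</b>', '<i>', '</i>']
```

The greedy pattern swallows the text between the tags as well, while the lazy pattern returns each tag on its own, which is usually what is wanted when stripping or counting markup.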