Python-Based Corpus Data Processing (Part 4)


Column "Python玩转语料库数据" (Mastering Corpus Data with Python), Part 4

By Duan Xun (段洵)



Let's learn how to process corpus data with Python.

I. Lists

(1) What a list is

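A list (List) is a sequence object: a collection of data items. For example, a list may contain one or more strings or numbers, and it may also contain other lists or tuples as elements. List data is mutable, which means the elements of a list can be added, modified, and deleted.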

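List elements are written inside square brackets. For example, the list ['We','use','Python'] consists of three string elements, and the list [1,2,3,4,5] consists of five integer elements.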

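The range(x, y) function produces the integers from x to y-1. In Python 3 it returns a range object rather than a list, but a range can be looped over and indexed like a list, and list(range(1, 6)) gives the list [1, 2, 3, 4, 5]. See the code example below: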

list1 = range(1, 6)

for i in list1:
    print(i, '*', i, '=', i * i)

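When the arguments of print() are separated by commas, a space is automatically inserted between them in the output, so the first line printed by the loop above is 1 * 1 = 1.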
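The behavior of range() and print() can be checked with a short sketch. The variable names numbers and line are illustrative and do not come from the original text.

```python
# range() produces its integers lazily; list() materializes them.
numbers = list(range(1, 6))
print(numbers)                          # [1, 2, 3, 4, 5]

# print() separates comma-separated arguments with one space by default.
print(1, '*', 1, '=', 1 * 1)            # 1 * 1 = 1

# The sep keyword argument replaces the default separator.
print(1, '*', 1, '=', 1 * 1, sep='')    # 1*1=1

# Building the default output by hand, for comparison.
line = ' '.join(str(part) for part in (1, '*', 1, '=', 1 * 1))
print(line)                             # 1 * 1 = 1
```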

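(2) List indexing

As with string indexing, we can write [x] or [x:y] after a list variable, where x and y are integers, to access its elements. List indices start at 0, so list[0] returns the first element of the list named list.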

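list[0:x] returns the elements at indices 0 through x-1, that is, the first x elements;

list[x:y] returns the elements at indices x through y-1;

list[x:] returns the elements from index x through the last element;

list[-1] returns the last element of the list.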

Consider the following example:

list1 = range(1, 6)

print(list1[0])       # 1
print(list1[-1])      # 5

for i in list1[0:2]:
    print(i)          # print 1 and 2
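The same slicing rules apply to a list of words. The following is a minimal sketch; the list words is made up for illustration.

```python
words = ['We', 'use', 'Python', 'for', 'corpus', 'work']

first_three = words[0:3]    # indices 0, 1, 2
middle = words[2:4]         # indices 2 and 3
tail = words[4:]            # index 4 through the end
last = words[-1]            # the final element

print(first_three)          # ['We', 'use', 'Python']
print(middle)               # ['Python', 'for']
print(tail)                 # ['corpus', 'work']
print(last)                 # work
```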

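II. Converting between lists and strings

When processing data, we often need to convert list data into string data and back. This section covers the functions commonly used for those conversions.

To convert a string into a list, use the split() method or the list() function. Their basic syntax is: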

string.split()

list(string)

Example:

str1 = '  Life is'
print(str1.split())       # ['Life', 'is']

str2 = '2013-10-06'
print(str2.split('-'))    # ['2013', '10', '06']

string = "Python"
print(list(string))

To convert a list into a string, use the join() method, called on the separator string 'x'. Its basic syntax is:

'x'.join(list)

Example:

list1 = ['Life', 'is', 'short']

print(''.join(list1))           # Lifeisshort
print(' '.join(list1))          # Life is short
print('--'.join(list1))         # Life--is--short
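One detail worth checking is the difference between split() with no argument and split(' '). The sketch below uses a made-up string, raw, with irregular spacing, and shows how split() followed by join() can normalize the whitespace.

```python
raw = '  Life   is  short '

tokens = raw.split()            # split on any run of whitespace
print(tokens)                   # ['Life', 'is', 'short']

pieces = raw.split(' ')         # split on every single space; empty strings are kept
print(pieces)

normalized = ' '.join(tokens)   # rejoin the words with exactly one space
print(normalized)               # Life is short
```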

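III. Common list functions

(1) len()

The len() function returns the length of an object: for a string, the number of characters it contains; for a list, the number of elements.

Example: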

str1 = '''My father's family name being Pirrip, and my Christian name Philip, my infant tongue could make of both names nothing longer or more explicit than Pip. So, I called myself Pip, and came to be called Pip.'''

list1 = str1.split()
print(len(list1))       # 37
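To make the difference between string length and list length concrete, here is a small sketch using a made-up sentence.

```python
sentence = 'Life is short'
words = sentence.split()

print(len(sentence))    # 13 characters, spaces included
print(len(words))       # 3 elements in the list
print(len(words[0]))    # 4 characters in 'Life'
```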

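(2) append()

The append() method adds a new element to a list. The new element is placed at the end of the list.

Example 1: Suppose we need to prefix every line of a text (a poem, say) with a running line number. One possible algorithm is to read the poem into a list, so that the first element is the poem's first line, with index 0, the second element is the second line, with index 1, and so on. The number placed before each line is therefore the element's index plus 1, and the number of the last line equals the length of the list. See the code below.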

# add_line_number.py
# Add a line number to each line of a text.

list0 = []

with open("../texts/poem.txt", "r") as file_in:
    for line in file_in:
        list0.append(line)              # store each line as a list element

# "w" overwrites earlier output, so re-running does not append duplicate lines
with open("../poem2.txt", "w") as file_out:
    for i, line in enumerate(list0):
        line_out = str(i + 1) + '\t' + line    # line number = index + 1
        file_out.write(line_out)
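The numbering logic can be tried without any files. The sketch below is only an illustration: the two lines in lines are made up to stand in for the poem, and enumerate(..., start=1) supplies the index plus 1 directly.

```python
# Made-up stand-in for the lines read from the poem file.
lines = ['Roses are red,\n', 'Violets are blue.\n']

numbered = []
for number, line in enumerate(lines, start=1):
    numbered.append(str(number) + '\t' + line)   # number, tab, original line

print(''.join(numbered), end='')
```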

Example 2: Check each word in a passage and pick out the words whose length is 6 or more. See the code below.

str1 = '''My father's family name being Pirrip, and my Christian name Philip, my infant tongue could make of both names nothing longer or more explicit than Pip. So, I called myself Pip, and came to be called Pip.'''

list1 = str1.split()
list2 = []

for word in list1:
    if len(word) >= 6:
        list2.append(word)

print(list2)
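The loop with append() can also be written as a list comprehension. This is an alternative sketch, not the original author's code; it uses a shortened version of the passage, and note that punctuation attached to a word counts toward its length.

```python
str1 = "My father's family name being Pirrip, and my Christian name Philip"

# Keep only the words with at least 6 characters.
long_words = [word for word in str1.split() if len(word) >= 6]

print(long_words)   # ["father's", 'family', 'Pirrip,', 'Christian', 'Philip']
```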

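(3) set()

The set() function removes duplicate elements, but it converts the list into a set. A set is unordered, so the order of the printed elements may differ from the order in the original list.

Example: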

list3 = ['a', 'c', 'b', 'b', 'a']
print(set(list3))
print(list(set(list3)))    # convert the set back into a list
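Because a set does not keep the original order, it can help to sort the result or to remove duplicates in a way that preserves order. The sketch below shows both; dict.fromkeys() is a technique swapped in here for illustration and relies on dictionaries keeping insertion order (Python 3.7 and later).

```python
list3 = ['a', 'c', 'b', 'b', 'a']

# Option 1: remove duplicates, then sort to get a predictable order.
unique_sorted = sorted(set(list3))
print(unique_sorted)        # ['a', 'b', 'c']

# Option 2: remove duplicates while keeping first-occurrence order.
unique_in_order = list(dict.fromkeys(list3))
print(unique_in_order)      # ['a', 'c', 'b']
```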

(4) pop()

The pop() method removes the last element of a list and returns it.

Example:

list3 = ['a', 'c', 'b', 'b', 'a']

list3.pop()
print(list3)                       # ['a', 'c', 'b', 'b']

list3.pop()
print(list3)                       # ['a', 'c', 'b']
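The value returned by pop() can be stored, and an index can be passed to remove an element from another position. A minimal sketch:

```python
list3 = ['a', 'c', 'b', 'b', 'a']

last = list3.pop()      # removes and returns the last element
print(last)             # a
print(list3)            # ['a', 'c', 'b', 'b']

first = list3.pop(0)    # removes and returns the element at index 0
print(first)            # a
print(list3)            # ['c', 'b', 'b']
```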

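(5) sorted()

The sorted() function returns a new list with the elements sorted. Strings are compared by character code, so upper-case letters sort before lower-case letters.

Example: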

list3 = [12, 1, 8, 5]

print(sorted(list3))       # [1, 5, 8, 12]

list4 = ['a', 'BB', 'Aa', 'ba', 'c', 'A', 'Cb', 'b', 'CC']

print(sorted(list4))    # ['A', 'Aa', 'BB', 'CC', 'Cb', 'a', 'b', 'ba', 'c']
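To sort without regard to case, or in descending order, sorted() accepts the key and reverse arguments. The sketch below reuses the lists from the example above; the result names are illustrative.

```python
list3 = [12, 1, 8, 5]
list4 = ['a', 'BB', 'Aa', 'ba', 'c', 'A', 'Cb', 'b', 'CC']

# Compare the lower-cased form of each string, so case is ignored.
case_insensitive = sorted(list4, key=str.lower)
print(case_insensitive)     # ['a', 'A', 'Aa', 'b', 'ba', 'BB', 'c', 'Cb', 'CC']

# Sort from largest to smallest.
descending = sorted(list3, reverse=True)
print(descending)           # [12, 8, 5, 1]

# sorted() returns a new list; the original is unchanged.
print(list3)                # [12, 1, 8, 5]
```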

(6) count()

The count() method counts how many times a given element occurs in a list.

Example:

list3 = ['a', 'c', 'b', 'b', 'a']
print(list3.count('a'))
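Combining count() with set() gives a simple frequency table. The sketch below also shows collections.Counter from the standard library as an alternative; Counter is not used in the original text and is included only for comparison.

```python
from collections import Counter

list3 = ['a', 'c', 'b', 'b', 'a']

# Count each distinct element once, using count() on the full list.
freq = {item: list3.count(item) for item in set(list3)}
for item in sorted(freq):
    print(item, freq[item])

# Counter does the same tally in a single pass over the list.
counter = Counter(list3)
print(counter['b'])         # 2
```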

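IV. Text-processing examples using lists

(1) Building a word list

Write code that builds an alphabetically sorted word list from the text ge.txt. The task can be carried out as follows: ① read the text line by line, convert each line to lower case, and split it on whitespace into a list of words (list1); ② append the elements of list1 to an initially empty list (list0); ③ repeat steps ① and ② until every word of the text has been added to list0; ④ remove the duplicates from list0 and store the result as a new list (list2); ⑤ sort the elements of list2 alphabetically and store the result as a new list (list3); ⑥ write all elements of list3 out to ge_wordlist.txt.

Example: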

# wordlist1.py

list0 = []                                  # an empty list to store all words

with open("../texts/ge.txt", "r") as file_in:
    for line in file_in:                    # read the text line by line
        line_new = line.lower()             # change the line into lower case
        list1 = line_new.split()            # split the line into words on whitespace

        for word in list1:
            list0.append(word)              # append the words to list0

list2 = list(set(list0))                    # remove the duplicates in list0

list3 = sorted(list2, key=str.lower)        # sort list2 alphabetically

# "w" overwrites earlier output, so re-running does not duplicate the word list
with open("../ge_wordlist.txt", "w") as file_out:
    for word in list3:
        file_out.write(word + '\n')         # write out the words
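The word-list steps can also be wrapped in a function and tried on in-memory lines. This is a sketch under stated assumptions: build_wordlist and sample_lines are hypothetical names and made-up data, and extend() is used in place of the inner append() loop. Punctuation stays attached to words, as in the file-based version.

```python
def build_wordlist(lines):
    """Return the sorted, de-duplicated, lower-cased words found in lines."""
    words = []
    for line in lines:
        # Lower-case the line, split on whitespace, and collect the words.
        words.extend(line.lower().split())
    # Remove duplicates with set(), then sort alphabetically.
    return sorted(set(words))


sample_lines = ['The cat sat.\n', 'the dog SAT\n']   # made-up sample text
wordlist = build_wordlist(sample_lines)
print(wordlist)     # ['cat', 'dog', 'sat', 'sat.', 'the']
```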

(2) Reversing the letters of a word (palindromes)

Example:

w = "word"
w_length = len(w)               # length of the word
index_end = w_length - 1        # length minus 1, i.e. the index of the word's last letter

w_new = []

i = index_end
while i >= 0:
    w_new.append(w[i])          # append the current letter, starting from the last one
    i = i - 1                   # step back to the previous letter

print(''.join(w_new))           # drow
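A slice with a step of -1 gives the same reversed string in one expression, which makes a palindrome check short. The function is_palindrome below is a hypothetical helper written for illustration.

```python
w = 'word'
w_reversed = w[::-1]            # step -1 walks the string from end to start
print(w_reversed)               # drow


def is_palindrome(text):
    """Return True if text reads the same forwards and backwards."""
    return text == text[::-1]


print(is_palindrome('level'))   # True
print(is_palindrome('word'))    # False
```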

(3) Removing empty paragraphs from a text

Example:

# "w" overwrites earlier output, so re-running does not append duplicate text
with open("../texts/ge.txt", "r") as file_in, open("../ge_compact.txt", "w") as file_out:
    for line in file_in:
        # isspace() is True when a string consists only of whitespace
        # characters such as newlines, spaces, or tabs.
        if not line.isspace():
            file_out.write(line)
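The isspace() test can be checked on a few in-memory lines. The list lines below is made up for illustration; note that isspace() returns False for an empty string, although lines read from a file always end with at least a newline character, except possibly the last one.

```python
# Made-up lines: two paragraphs separated by blank and whitespace-only lines.
lines = ['First paragraph.\n', '\n', '   \t\n', 'Second paragraph.\n']

# Keep only the lines that contain something other than whitespace.
kept = [line for line in lines if not line.isspace()]
print(kept)             # ['First paragraph.\n', 'Second paragraph.\n']

print('\n'.isspace())   # True
print(''.isspace())     # False
```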