如何利用深度学习写诗歌（使用Python进行文本生成）-技术圈

翻译：李雪冬

编辑：李雪冬

前言

从短篇小说到写5万字的小说，机器不断涌现出前所未有的词汇。在web上有大量的例子可供开发人员使用机器学习来编写文本，呈现的效果有荒谬的也有令人叹为观止的。
由于自然语言处理(NLP)领域的重大进步，机器能够自己理解上下文和编造故事。

文本生成的例子包括，机器编写了流行小说的整个章节，比如《权力的游戏》和《哈利波特》，取得了不同程度的成功。在本文中，我们将使用python和文本生成的概念来构建一个机器学习模型，可以用莎士比亚的风格来写十四行诗。让我们来看看它!

本文的主要内容

1.什么是文本生成?
2.文本生成的不同步骤。
3.入口的依赖
4.加载数据
5.创建字符/单词映射
6.数据预处理
7.模型创建
8.生成文本
9.尝试不同的模型
10.更多的训练模型
（1）一个更深层次的模型
（2）.一个更广泛的模型
（3）一个超大的模型

什么是文本生成

现在，有大量的数据可以按顺序分类。它们以音频、视频、文本、时间序列、传感器数据等形式存在。针对这样特殊类别的数据，如果两个事件都发生在特定的时间内,A先于B和B先于A是完全不同的两个场景。然而，在传统的机器学习问题中，一个特定的数据点是否被记录在另一个数据点之前是不重要的。这种考虑使我们的序列预测问题有了新的解决方法。

文本是由一个挨着一个的字符组成的，实际中是很难处理的。这是因为在处理文本时，可以训练一个模型来使用之前发生的序列来做出非常准确的预测，但是之前的一个错误的预测有可能使整个句子变得毫无意义。这就是让文本生成器变得棘手的原因!
为了更好地理解代码，请浏览这两篇文章。LSTM背后的理论（链接：https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/）

文本生成的步骤

文本生成通常包括以下步骤:

导入依赖
加载数据
创建映射
数据预处理
模型构建
生成文本

让我们详细地看一下每一个。

导入依赖

import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.utils import np_utils

加载数据

text=(open("/Users/pranjal/Desktop/text_generator/sonnets.txt").read())
text=text.lower()

这里，我们正在加载所有莎士比亚十四行诗的集合，这些十四行诗可以从这里（链接：http://www.gutenberg.org/cache/epub/1041/pg1041.txt）下载。我清理了这个文件以删除开始和结束的学分，并且可以从我的git存储库下载。文本文件被打开并保存在text中。然后将该内容转换为小写，以减少可能单词的数量(稍后将对此进行详细介绍)。

创建映射

映射是在文本中为字符/单词分配任意数字的步骤。这样，所有的惟一字符/单词都映射到一个数字。这一点很重要，因为机器比文本更能理解数字，这使得训练过程更加容易。

characters = sorted(list(set(text))) n_to_char = {n:char for n, char in enumerate(characters)} char_to_n = {char:n for n, char in enumerate(characters)}

我已经创建了一个字典，其中给文本中每个独特的字符分配一个数字。所有独特的字符首先存储在字符中，然后被枚举。

这里还必须注意，我使用了字符级别的映射，而不是单词映射。然而，与基于字符的模型相比，基于单词的模型与其他模型相比具有更高的准确性。这是因为基于字符需要一个更大的网络来学习长期依赖关系，因为它不仅要记住单词的顺序，而且还要学会预测一个语法正确的单词。但是，在基于单词的模型中，后者已经被处理好了。

数据预处理

在构建LSTM模型时，这是最棘手的部分。将手头的数据转换成可供模型训练的格式是一项困难的任务。我会把这个过程分解成小的部分，让你更容易理解。

X = []
Y = []
length = len(text)
seq_length = 100
for i in range(0, length-seq_length, 1):
sequence = text[i:i + seq_length]
label =text[i + seq_length]
X.append([char_to_n[char] for char in sequence])
Y.append(char_to_n[label])

这里，X是我们的训练序列，Y是我们的目标数组。seq_length是我们在预测某个特定字符之前要考虑的字符序列的长度。for循环用于遍历文本的整个长度，并创建这样的序列(存储在X中)和它们的真实值(存储在Y中)，为了更好地弄清楚“真实值”的概念。让我们以一个例子来理解这一点:

对于4的序列长度和文本“hello india”，我们将有X和Y表示如下:

现在，LSTMs接受输入的形式是(number_of_sequence, length_of_sequence, number_of_features)，这不是数组的当前格式。另外，我们需要将数组Y转换成一个one-hot编码格式。

X_modified = np.reshape(X, (len(X), seq_length, 1))
X_modified = X_modified / float(len(characters))
Y_modified = np_utils.to_categorical(Y)

我们首先将数组X重构为所需的维度。然后，我们将X_modified的值进行缩放，这样我们的神经网络就可以更快地训练，并且更少的机会被困在局部最小值中。此外，我们的Y_modified是一个热编码，以删除在映射字符过程中可能引入的任何顺序关系。也就是说，与“z”相比，“a”可能会被分配一个较低的数字，但这并不表示两者之间有任何关系。
最后的数组将是:
点击查看大图

建立模型

model = Sequential()
model.add(LSTM(400, input_shape=(X_modified.shape[1],            X_modified.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(400))
model.add(Dropout(0.2))
model.add(Dense(Y_modified.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

我们正在构建一个具有两个LSTM层的序列模型，每个层有400个单元。第一层需要用输入形状输入。为了使下一个LSTM层能够处理相同的序列，我们输入return_sequence参数为真。此外，设置参数为0.2的dropout层，以检查是否过拟合。最后一层输出one-hot编码的矢量，它提供字符输出。

文本生成

string_mapped = X[99]
for i in range(seq_length):
x = np.reshape(string_mapped,(1,len(string_mapped), 1))
x = x / float(len(characters))
pred_index = np.argmax(model.predict(x, verbose=0))
seq = [n_to_char[value] for value in string_mapped]
string_mapped.append(pred_index)
string_mapped = string_mapped[1:len(string_mapped)]

我们从X数组中的随机一行开始，这是一个100个字符的数组。在此之后，我们的目标是预测x后面的另外100个字符，输入将如同前面一样被重新塑造和缩放，预测下一个具有最大概率的字符。seq用于存储到现在为止预测行为用到的字符串的解码格式。接下来，新字符串被更新，这样第一个字符被删除，新的预测字符被包含进来。您可以在这里找到整个代码。这里提供了训练文件，注释和训练的模型权重供您参考。

尝试不同模型

基线模型：

当训练为1个周期，批大小为100时，给出如下输出:

's the riper should by time decease,
his tender heir might bear his memory:
but thou, contracted toet she the the the the the the the the
thi the the the the the the the the the the the the the the the the the
thi the the the the the the the the the the the the the the the the the
thi the the the the the the the the the the the the the the the the the
thi the the the the the the the the the the the the the the the the the
thi the the the the the the the the th'

这个输出没有多大意义。它只是重复同样的预测，就好像它被困在一个循环中一样。这是因为与我们训练的微型模型相比，语言预测模型太复杂了。
让我们试着训练同样的模型，但是将时间周期变长。

加强训练时间的模型：

这次我们训练我们的模型为100个周期，批大小设置为50。我们至少得到了一个非重复的字符序列，其中包含了相当数量的合理单词。此外，该模型还学会了生成一个类似14行诗的结构。

'The riper should by time decease,
his tender heir might bear his memory:
but thou, contracted to thine own besire,
that in the breath ther doomownd wron to ray,
dorh part nit backn oy steresc douh dxcel;
for that i have beauty lekeng norirness,
for all the foowing of a former sight,
which in the remame douh a foure to his,
that in the very bumees of toue mart detenese;
how ap i am nnw love, he past doth fiamee.
to diserace but in the orsths of are orider,
waie agliemt would have me '

但是，这个模型还不够好，不能产生高质量的内容。所以现在我们要做的是当一个深度学习模型没有产生好的结果时，每个人都会做的事情。建立一个更深层次的架构!

一个更深的模型：

一位机器学习的大牛曾经说过:如果模型做得不好，那就增加层数!我将用我的模型做同样的事情。让我们再添加一个LSTM层，里面有400个单元，然后是一个参数为0.2的dropout层，看看我们得到了什么。

"The riper should by time decease,
his tender heir might bear his memory:
but thou, contracted to the world's false sporoe,
with eyes so dond touls be thy domfornds,
which for memorion of the seasons new;
mike own self-love to shou art constant
how can i then be oy love doth give,
the rose looks fair, but fairer bomments age.
now with the seas that i have seen dychl
that mot to teed of you, if h ho bontent,
and all the ress of the heartai thee;
the world will was iis bound and farty "

结果开始变得有趣。语法实现了自我增强，保持了十四行诗的结构和标点的完整。然而，这仍然需要很大的改进。让我们尝试探索一个更广泛的网络，一个拥有更多单元的网络。

一个更广的模型：

在两个LSTM层中，我将单元的数量增加到700个。这个调整产生了以下的诗歌:

"The riper should by time decease,
his tender heir might bear his memory:
but thou, contracted to thine own bright eyes,
feed'st thy sigh aare so beuider poise,
oor maty dis surre that gairs mot me love thee;
when i braye the would and lays in the eesire.
than thmu disgrmed stand with my bootr still nsentente;
tell ia to thou art thou wilt woon'thy sook,
and touain then tor, give thy soue semping.
whose whod the better of your befaiss nu sante,
though i, suoll oas, and i lose bouh twa"

这个做法并没有令我满意，因为这些词已经失去了意义。但是，有趣的是，这个模型开始尝试建立一些韵脚。模型毕竟是在开始尝试更深层次理解诗歌!但是，对于训练出既有意义又优美的诗歌的目标，我们不能妥协，对吗?
让我们把它放在一个超大的模型中。

一个超大的模型：

我将层数增加到3层，每层有700个单元，并训练了100个周期。产生的结果是一篇宏伟的诗篇。请看:

"The riper should by time decease,
his tender heir might bear his memory:
but thou, contracted to thine own bright eyes,
feed'st thy light's flame with self-substantial fuel,
my beept is she breat oe bath dasehr ill:
tirse do i pine and turfeit day by day,
or gluttoning on all, or all away.
Lxxvi
why is my verse so barren of new pride,
so far from variation or quick change?
why with the time do i not glance aside
to new-found methods, and to compounds strange?
why write i stil"

这篇诗歌不仅用词准确，而且也学会了押韵。我们本可以有一件更宏伟的艺术品，但是作为入门LSTM的一个开始，这就够了。这已经是比大多数人更好地表达了!

参考资料

欢迎交流与转载，公众号：机器学习算法全栈工程师

本文翻译自pranjal seivastava的一篇文章

（链接：https://www.analyticsvidhya.com/blog/2018/03/text-generation-using-python-nlp/）原作者保留版权

end

机器学习算法全栈工程师

一个用心的公众号

长按，识别，加关注

进群，学习，得帮助

你的关注，我们的热度，

我们一定给你学习最大的帮助