Gradient Descent Optimization With Nadam From Scratch
2021-04-24 21:17
Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space. Nadam is an extension of the Adam version of gradient descent that incorporates Nesterov momentum. In this tutorial, you will discover how to implement the Nadam optimization algorithm from scratch, apply it to an objective function, and evaluate the results.
Tutorial overview: Gradient Descent; Nadam Optimization Algorithm; Gradient Descent With Nadam; Two-Dimensional Test Problem; Gradient Descent Optimization With Nadam; Visualization of Nadam Optimization.
Gradient Descent

Gradient descent is an optimization algorithm. The objective function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the objective function for a given set of inputs. The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space. Each iteration then updates the current point by taking a step against the gradient:
x(t) = x(t-1) - step_size * f'(x(t-1))
The steeper the objective function at a given point, the larger the magnitude of the gradient and, in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.
Step size: a hyperparameter that controls how far the algorithm moves in the search space, relative to the gradient, on each iteration.
If the step size is too small, movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optimum.
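To make the update rule concrete, here is a minimal sketch of plain gradient descent on a one-dimensional function. It is illustrative only and not part of the original tutorial; the quadratic function, step size, and iteration count are arbitrary choices for demonstration.
# minimal gradient descent sketch for f(x) = x^2 (illustrative only)
def gradient_descent(derivative, start, step_size, n_iter):
    x = start
    for _ in range(n_iter):
        # x(t) = x(t-1) - step_size * f'(x(t-1))
        x = x - step_size * derivative(x)
    return x

# the derivative of f(x) = x^2 is 2 * x
best = gradient_descent(lambda x: 2.0 * x, start=1.0, step_size=0.1, n_iter=50)
print(best)  # converges toward the optimum at x = 0.0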
Now that we are familiar with the gradient descent optimization algorithm, let's take a look at the Nadam algorithm.
Nadam Optimization Algorithm
Nesterov-accelerated Adaptive Moment Estimation, or the Nadam algorithm, is an extension of the Adaptive Movement Estimation (Adam) optimization algorithm that adds Nesterov's Accelerated Gradient (NAG), or Nesterov momentum, an improved type of momentum. More broadly, the Nadam algorithm is an extension of the gradient descent optimization algorithm. The algorithm was described by Timothy Dozat in the 2016 paper "Incorporating Nesterov Momentum into Adam," although a version of the paper was written in 2015 as a Stanford project report of the same name. Momentum adds an exponentially decaying moving average of the gradient (the first moment) to the gradient descent algorithm, which has the effect of smoothing out noisy objective functions and improving convergence. Adam is an extension of gradient descent that adds the first and second moments of the gradient and automatically adapts a learning rate for each parameter being optimized. NAG is an extension of momentum where the momentum update is performed using the gradient of the projected update to the parameters rather than the gradient at the actual current variable values. In some situations, this has the effect of slowing down the search as the optimum is located, rather than overshooting it.
Nadam is an extension of Adam that uses NAG momentum instead of classical momentum. Let's step through each element of the algorithm. Nadam uses decaying step size (alpha) and first-moment (mu) schedules that can improve performance; for simplicity, we will ignore this aspect for now and assume constant values. First, for each parameter being optimized as part of the search, we must maintain the first and second moments of the gradient, referred to as m and n respectively. They are initialized to 0.0 at the start of the search.
m = 0
n = 0
The algorithm is executed iteratively over time t, starting at t = 1, and each iteration involves calculating a new set of parameter values x, e.g., going from x(t-1) to x(t). It may be easier to understand the algorithm if we focus on updating a single parameter, which generalizes to updating all parameters via vector operations. First, the gradient (partial derivatives) for the current time step is calculated.
g(t) = f'(x(t-1))
Next, the first moment is updated using the gradient and the mu hyperparameter.
m(t) = mu * m(t-1) + (1 - mu) * g(t)
Then the second moment is updated using the nu hyperparameter.
n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
Next, the first moment is bias-corrected using Nesterov momentum.
mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
The second moment is then bias-corrected.
nhat = nu * n(t) / (1 - nu)
Finally, the parameter value for this iteration is calculated.
x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
Where sqrt() is the square root function and eps (epsilon) is a small value such as 1e-8 added to avoid a divide-by-zero error. In review, there are three hyperparameters for the algorithm:
alpha: the initial step size (learning rate); a typical value is 0.002.
mu: the decay factor for the first moment (beta1 in Adam); a typical value is 0.975.
nu: the decay factor for the second moment (beta2 in Adam); a typical value is 0.999.
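Putting these equations together, the sketch below shows what a single Nadam update for one scalar parameter could look like, assuming constant alpha, mu, and nu. This is a minimal illustrative sketch (the helper name nadam_step is not from the original tutorial); the full implementation used in this tutorial is developed step by step below.
from math import sqrt

def nadam_step(x, m, n, g, alpha=0.002, mu=0.975, nu=0.999, eps=1e-8):
    # m(t) = mu * m(t-1) + (1 - mu) * g(t)
    m = mu * m + (1.0 - mu) * g
    # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
    n = nu * n + (1.0 - nu) * g**2
    # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
    mhat = (mu * m / (1.0 - mu)) + ((1.0 - mu) * g / (1.0 - mu))
    # nhat = nu * n(t) / (1 - nu)
    nhat = nu * n / (1.0 - nu)
    # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
    x = x - alpha / (sqrt(nhat) + eps) * mhat
    return x, m, n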
Gradient Descent With Nadam

Two-Dimensional Test Problem

First, let's define a test problem: a simple two-dimensional function that squares the input of each dimension, with valid inputs ranging from -1.0 to 1.0. The objective() function below implements it.
# objective function
def objective(x, y):
return x**2.0 + y**2.0
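We can create a three-dimensional surface plot of this function to get a feeling for the curvature of the response surface. The complete example of plotting the objective function is listed below.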
# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot
# objective function
def objective(x, y):
return x**2.0 + y**2.0
# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
# create 3d axes (figure.gca(projection='3d') was removed in recent Matplotlib versions)
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()
Running the example creates a three-dimensional surface plot of the objective function, showing the familiar bowl shape with the global minimum at f(0, 0) = 0. We can also create a two-dimensional contour plot of the function, which will be helpful later when we want to plot the progress of the search. The example below creates a contour plot of the objective function.
# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot
# objective function
def objective(x, y):
return x**2.0 + y**2.0
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()
Gradient Descent Optimization With Nadam

We can apply gradient descent with Nadam to the test problem. First, we need a function that calculates the derivative of the objective function. The derivative of x^2 is x * 2 in each dimension.
f(x) = x^2
f'(x) = x * 2
The derivative() function below implements this.
# derivative of objective function
def derivative(x, y):
return asarray([x * 2.0, y * 2.0])
Next, we can implement gradient descent optimization with Nadam. First, we select a random point within the bounds of the problem as the starting point for the search. This assumes we have an array that defines the bounds of the search, with one row for each dimension, where the first column defines the minimum and the second column defines the maximum of that dimension.
# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
score = objective(x[0], x[1])
# initialize decaying moving averages
m = [0.0 for _ in range(bounds.shape[0])]
n = [0.0 for _ in range(bounds.shape[0])]
Next, we run a fixed number of iterations of the algorithm, as defined by the n_iter hyperparameter.
...
# run iterations of gradient descent
for t in range(n_iter):
...
...
# calculate gradient g(t)
g = derivative(x[0], x[1])
...
# build a solution one variable at a time
for i in range(x.shape[0]):
...
# m(t) = mu * m(t-1) + (1 - mu) * g(t)
m[i] = mu * m[i] + (1.0 - mu) * g[i]
# n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
# mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
mhat = (mu * m[i] / (1.0 - mu)) + ((1 - mu) * g[i] / (1.0 - mu))
# nhat = nu * n(t) / (1 - nu)
nhat = nu * n[i] / (1.0 - nu)
# x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
# evaluate candidate point
score = objective(x[0], x[1])
# report progress
print('>%d f(%s) = %.5f' % (t, x, score))
All of this can be tied together into a function named nadam() that takes the names of the objective and derivative functions as well as the algorithm hyperparameters, and returns the best solution found at the end of the search along with its evaluation. The complete function is listed below.
# gradient descent algorithm with nadam
def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
score = objective(x[0], x[1])
# initialize decaying moving averages
m = [0.0 for _ in range(bounds.shape[0])]
n = [0.0 for _ in range(bounds.shape[0])]
# run the gradient descent
for t in range(n_iter):
# calculate gradient g(t)
g = derivative(x[0], x[1])
# build a solution one variable at a time
for i in range(bounds.shape[0]):
# m(t) = mu * m(t-1) + (1 - mu) * g(t)
m[i] = mu * m[i] + (1.0 - mu) * g[i]
# n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
# mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
mhat = (mu * m[i] / (1.0 - mu)) + ((1 - mu) * g[i] / (1.0 - mu))
# nhat = nu * n(t) / (1 - nu)
nhat = nu * n[i] / (1.0 - nu)
# x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
# evaluate candidate point
score = objective(x[0], x[1])
# report progress
print('>%d f(%s) = %.5f' % (t, x, score))
return [x, score]
We can then define our hyperparameters and call the nadam() function to optimize the test objective function. In this case, we will use 50 iterations of the algorithm with an alpha of 0.02, a mu of 0.8, and a nu of 0.999, found after a little trial and error.
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# step size
alpha = 0.02
# factor for average gradient
mu = 0.8
# factor for average squared gradient
nu = 0.999
# perform the gradient descent search with nadam
best, score = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)
# summarize the result
print('Done!')
print('f(%s) = %f' % (best, score))
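Tying all of this together, the complete example of gradient descent optimization with Nadam is listed below.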
# gradient descent optimization with nadam for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed
# objective function
def objective(x, y):
return x**2.0 + y**2.0
# derivative of objective function
def derivative(x, y):
return asarray([x * 2.0, y * 2.0])
# gradient descent algorithm with nadam
def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
score = objective(x[0], x[1])
# initialize decaying moving averages
m = [0.0 for _ in range(bounds.shape[0])]
n = [0.0 for _ in range(bounds.shape[0])]
# run the gradient descent
for t in range(n_iter):
# calculate gradient g(t)
g = derivative(x[0], x[1])
# build a solution one variable at a time
for i in range(bounds.shape[0]):
# m(t) = mu * m(t-1) + (1 - mu) * g(t)
m[i] = mu * m[i] + (1.0 - mu) * g[i]
# n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
# mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
mhat = (mu * m[i] / (1.0 - mu)) + ((1 - mu) * g[i] / (1.0 - mu))
# nhat = nu * n(t) / (1 - nu)
nhat = nu * n[i] / (1.0 - nu)
# x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
# evaluate candidate point
score = objective(x[0], x[1])
# report progress
print('>%d f(%s) = %.5f' % (t, x, score))
return [x, score]
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# step size
alpha = 0.02
# factor for average gradient
mu = 0.8
# factor for average squared gradient
nu = 0.999
# perform the gradient descent search with nadam
best, score = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)
print('Done!')
print('f(%s) = %f' % (best, score))
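Running the complete example applies Nadam to the test problem and reports the objective function value at each iteration of the search; only the last ten iterations of output are shown below.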
>40 f([ 5.07445337e-05 -3.32910019e-03]) = 0.00001
>41 f([-1.84325171e-05 -3.00939427e-03]) = 0.00001
>42 f([-6.78814472e-05 -2.69839367e-03]) = 0.00001
>43 f([-9.88339249e-05 -2.40042096e-03]) = 0.00001
>44 f([-0.00011368 -0.00211861]) = 0.00000
>45 f([-0.00011547 -0.00185511]) = 0.00000
>46 f([-0.0001075 -0.00161122]) = 0.00000
>47 f([-9.29922627e-05 -1.38760991e-03]) = 0.00000
>48 f([-7.48258406e-05 -1.18436586e-03]) = 0.00000
>49 f([-5.54299505e-05 -1.00116899e-03]) = 0.00000
Done!
f([-5.54299505e-05 -1.00116899e-03]) = 0.000001
In this case, we can see that a near-optimal solution was found after about 44 iterations of the search, with input values close to 0.0 and 0.0 and an objective value of approximately 0.0.

Visualization of Nadam Optimization

We can plot the progress of the Nadam search on a contour plot of the domain. First, we update the nadam() function to maintain a list of all solutions found during the search and return this list at the end of the search. The updated version of the function with these changes is listed below.
# gradient descent algorithm with nadam
def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
solutions = list()
# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
score = objective(x[0], x[1])
# initialize decaying moving averages
m = [0.0 for _ in range(bounds.shape[0])]
n = [0.0 for _ in range(bounds.shape[0])]
# run the gradient descent
for t in range(n_iter):
# calculate gradient g(t)
g = derivative(x[0], x[1])
# build a solution one variable at a time
for i in range(bounds.shape[0]):
# m(t) = mu * m(t-1) + (1 - mu) * g(t)
m[i] = mu * m[i] + (1.0 - mu) * g[i]
# n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
# mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
mhat = (mu * m[i] / (1.0 - mu)) + ((1 - mu) * g[i] / (1.0 - mu))
# nhat = nu * n(t) / (1 - nu)
nhat = nu * n[i] / (1.0 - nu)
# x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
# evaluate candidate point
score = objective(x[0], x[1])
# store solution
solutions.append(x.copy())
# report progress
print('>%d f(%s) = %.5f' % (t, x, score))
return solutions
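We can then run the search as before, this time retrieving the list of solutions rather than just the final best solution, and plot each solution found during the search as a white dot joined by a line on a contour plot of the objective function.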
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# step size
alpha = 0.02
# factor for average gradient
mu = 0.8
# factor for average squared gradient
nu = 0.999
# perform the gradient descent search with nadam
solutions = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the sample as black circles
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
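Tying this together, the complete example of performing the Nadam optimization on the test problem and plotting the results on a contour plot is listed below.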
# example of plotting the nadam search on a contour plot of the test function
from math import sqrt
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from numpy import meshgrid
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D
# objective function
def objective(x, y):
return x**2.0 + y**2.0
# derivative of objective function
def derivative(x, y):
return asarray([x * 2.0, y * 2.0])
# gradient descent algorithm with nadam
def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
solutions = list()
# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
score = objective(x[0], x[1])
# initialize decaying moving averages
m = [0.0 for _ in range(bounds.shape[0])]
n = [0.0 for _ in range(bounds.shape[0])]
# run the gradient descent
for t in range(n_iter):
# calculate gradient g(t)
g = derivative(x[0], x[1])
# build a solution one variable at a time
for i in range(bounds.shape[0]):
# m(t) = mu * m(t-1) + (1 - mu) * g(t)
m[i] = mu * m[i] + (1.0 - mu) * g[i]
# n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
# mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
mhat = (mu * m[i] / (1.0 - mu)) + ((1 - mu) * g[i] / (1.0 - mu))
# nhat = nu * n(t) / (1 - nu)
nhat = nu * n[i] / (1.0 - nu)
# x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
# evaluate candidate point
score = objective(x[0], x[1])
# store solution
solutions.append(x.copy())
# report progress
print('>%d f(%s) = %.5f' % (t, x, score))
return solutions
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# step size
alpha = 0.02
# factor for average gradient
mu = 0.8
# factor for average squared gradient
nu = 0.999
# perform the gradient descent search with nadam
solutions = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the sample as black circles
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()
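Running the example performs the search as before and then draws the solutions found during the search as white dots over the contour plot of the objective function, tracing the path of the search from the random starting point toward the optimum at the center of the plot.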
Author: 沂水寒城, CSDN blog expert. Research interests: machine learning, deep learning, NLP, and CV.
Blog: http://yishuihancheng.blog.csdn.net