TRFL: A Library of Reinforcement Learning Building Blocks

TRFL (pronounced "truffle") is a library of reinforcement learning building blocks built on top of TensorFlow.

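It is a collection of the key algorithmic components that DeepMind uses internally in many of its successful agents, such as DQN, DDPG, and the Importance Weighted Actor Learner Architecture (IMPALA).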

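The TRFL library includes functions for implementing classical RL algorithms as well as more cutting-edge techniques. The loss functions and other operations it provides are implemented in pure TensorFlow. They are not complete algorithms; rather, they implement the mathematical operations needed when building a fully functional reinforcement learning agent.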

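For value-based reinforcement learning, TRFL provides TensorFlow ops for learning in discrete action spaces, such as TD-learning, Sarsa, Q-learning and their variants, as well as ops for implementing continuous control algorithms such as DPG. TRFL also includes ops for learning distributional value functions.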
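To make the difference between these value-based updates concrete, here is a minimal pure-Python sketch of the one-step targets usually associated with TD-learning, Sarsa and Q-learning. It does not use TRFL: the function names and the sample numbers are invented for illustration, and `pcont_t` is assumed to be the continuation probability (discount) applied to the bootstrapped value.

```python
# Illustrative only: plain-Python versions of three one-step targets.
# The function names and numbers are hypothetical and not part of TRFL.

def td_target(r_t, pcont_t, v_t):
    """TD(0): bootstrap from the value estimate of the next state."""
    return r_t + pcont_t * v_t

def sarsa_target(r_t, pcont_t, q_t, a_t):
    """Sarsa: bootstrap from the Q-value of the action actually taken next."""
    return r_t + pcont_t * q_t[a_t]

def qlearning_target(r_t, pcont_t, q_t):
    """Q-learning: bootstrap from the greedy (maximum) next Q-value."""
    return r_t + pcont_t * max(q_t)

q_next = [1.0, 2.0, 0.0]  # Q-values in the next state, one per action

print(td_target(1.0, 0.5, 4.0))            # 3.0
print(sarsa_target(1.0, 0.5, q_next, 0))   # 1.5
print(qlearning_target(1.0, 0.5, q_next))  # 2.0
```

The only difference between the three is which value is bootstrapped from: a state value, the Q-value of the next action chosen by the behaviour policy, or the maximum Q-value over actions.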

Usage Example

import tensorflow as tf
import trfl

# Q-values for the previous and next timesteps, shape [batch_size, num_actions].
# q_tm1 is a variable so that an optimizer has something to train.
q_tm1 = tf.Variable([[1, 1, 0], [1, 2, 0]], dtype=tf.float32)
q_t = tf.constant([[0, 1, 0], [1, 2, 0]], dtype=tf.float32)

# Action indices, pcontinue and rewards, shape [batch_size].
a_tm1 = tf.constant([0, 1], dtype=tf.int32)
pcont_t = tf.constant([0, 1], dtype=tf.float32)
r_t = tf.constant([1, 1], dtype=tf.float32)

loss, q_learning = trfl.qlearning(q_tm1, a_tm1, r_t, pcont_t, q_t)

In most cases, you will probably only be interested in the loss:

loss, _ = trfl.qlearning(q_tm1, a_tm1, r_t, pcont_t, q_t)

# You can also do this, which returns the identical `loss` tensor:
loss = trfl.qlearning(q_tm1, a_tm1, r_t, pcont_t, q_t).loss

reduced_loss = tf.reduce_mean(loss)

optimizer = tf.train.AdamOptimizer(learning_rate=0.1)
train_op = optimizer.minimize(reduced_loss)

All loss functions in this package follow the convention above: they return a loss tensor together with additional information.
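The return convention can be sketched without TensorFlow. The block below is a rough pure-Python analogue, not TRFL's implementation: `LossOutput`, `QExtra` and `qlearning_sketch` are hypothetical names, the contents of the extra field are a guess at what might be useful, and the half-squared-error form of the loss is an assumption. It reuses the numbers from the usage example to show why both tuple unpacking and `.loss` access work on the same return value.

```python
import collections

# Hypothetical stand-ins for a (loss, extra) return convention.
LossOutput = collections.namedtuple("LossOutput", ["loss", "extra"])
QExtra = collections.namedtuple("QExtra", ["target", "td_error"])

def qlearning_sketch(q_tm1, a_tm1, r_t, pcont_t, q_t):
    """Per-example Q-learning loss over a batch, using plain lists."""
    targets, td_errors, losses = [], [], []
    for q_prev, a, r, pcont, q_next in zip(q_tm1, a_tm1, r_t, pcont_t, q_t):
        target = r + pcont * max(q_next)    # greedy bootstrap target
        td_error = target - q_prev[a]       # error for the action taken
        targets.append(target)
        td_errors.append(td_error)
        losses.append(0.5 * td_error ** 2)  # assumed half squared error
    return LossOutput(losses, QExtra(targets, td_errors))

# Same values as the usage example, written as nested lists.
q_tm1 = [[1.0, 1.0, 0.0], [1.0, 2.0, 0.0]]
q_t = [[0.0, 1.0, 0.0], [1.0, 2.0, 0.0]]
a_tm1 = [0, 1]
pcont_t = [0.0, 1.0]
r_t = [1.0, 1.0]

loss, extra = qlearning_sketch(q_tm1, a_tm1, r_t, pcont_t, q_t)     # unpacking
same_loss = qlearning_sketch(q_tm1, a_tm1, r_t, pcont_t, q_t).loss  # field access

print(loss)            # [0.0, 0.5]
print(extra.td_error)  # [0.0, 1.0]
```

Because a namedtuple is still a tuple, callers that only need the loss can ignore the second element, while callers that want diagnostics can read the extra fields by name.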