Deep reinforcement learning with Pong from pixels#

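Caution

This article is not currently tested because of licensing and installation issues with the underlying gym and atari-py dependencies. Help improve this article by developing an example with a reduced dependency footprint!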

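This tutorial demonstrates how to implement a deep reinforcement learning (RL) agent from scratch with NumPy. The agent uses a policy gradient method to learn to play the Pong video game, taking screen pixels as input. Your Pong agent will use an artificial neural network as its policy and gain experience as it plays.

Pong is a 2D game from 1972 in which two players use "rackets" to play a form of table tennis. Each player moves their racket up and down and tries to hit the ball toward the opponent by touching it. The goal is to hit the ball so that it gets past the opponent's racket (they miss their shot). According to the rules, the first player to reach 21 points wins. In Pong, the RL agent that learns to play against an opponent is displayed on the right.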

Diagram showing operations detailed in this tutorial

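This example is based on the code developed by Andrej Karpathy for the Deep RL Bootcamp in 2017 at UC Berkeley. His blog post from 2016 also provides more background on the mechanics and theory used in Pong RL.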

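Prerequisites#

OpenAI Gym: To help with the game environment, you will use Gym, an open-source Python interface developed by OpenAI that supports many simulated environments and helps with RL tasks.
Python and NumPy: You should have some knowledge of Python, NumPy array manipulation, and linear algebra.
Deep learning and deep RL: You should be familiar with the main concepts of deep learning, which are explained in the Deep learning paper published in 2015 by Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, who are regarded as some of the pioneers of the field. The tutorial will try to guide you through the main concepts of deep RL, and you will find various literature with links to the original sources for your convenience.
Jupyter notebook environments: Because RL experiments can require high computing power, you can run the tutorial in the cloud for free with Binder or Google Colaboratory (which offers free, limited GPU and TPU acceleration).
Matplotlib: Used for plotting images. Check out the installation guide to set it up in your environment.

This tutorial can also be run locally in an isolated environment, such as Virtualenv or conda.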

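Table of contents#

A note on RL and deep RL
Deep RL glossary

Set up Pong
Preprocess frames (the observation)
Create the policy (the neural network) and the forward pass
Set up the update step (backpropagation)
Define the discounted rewards (expected return) function
Train the agent for a number of episodes
Next steps
Appendix
- Notes on RL and deep RL
- How to set up video playback in your Jupyter notebook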

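A note on RL and deep RL#

In RL, your agent learns by trial and error: it interacts with an environment using a so-called policy to gain experience. After taking one action, the agent receives information about its reward (which it may or may not get) and the next observation of the environment. It can then proceed to take another action. This happens over a number of episodes and/or until the task is deemed to be complete.

The agent's policy works by "mapping" the agent's observations to its actions, that is, by associating what the agent observes with the required actions. The overall goal is usually to optimize the agent's policy such that it maximizes the expected rewards from each observation.

For detailed information about RL, there is an introductory book by Richard Sutton and Andrew Barto.

Check out the Appendix at the end of the tutorial for more information.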

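Deep RL glossary#

Below is a concise glossary of deep RL terms you may find useful for the remaining part of the tutorial:

In a finite-horizon world, such as a game of Pong, the learning agent can explore (and exploit) the environment over an episode. It usually takes many episodes for the agent to learn.
The agent interacts with the environment using actions.
After taking an action, the agent receives some feedback through a reward (if there is one), depending on which action it takes and the state it is in. The state contains information about the environment.
The agent's observation is a partial observation of the state. This is the term this tutorial prefers (instead of state).
The agent can choose an action based on cumulative rewards (also known as the value function) and the policy. The cumulative reward function estimates the quality of the observations the agent visits using its policy.
The policy (defined by a neural network) outputs action choices (as (log) probabilities) that should maximize the cumulative rewards from the state the agent is in.
The expected return from an observation, conditional on the action, is called the action-value function. To give more weight to shorter-term rewards than to longer-term ones, you usually use a discount factor (often a floating-point number between 0.9 and 0.99).
The sequence of actions and states (observations) during each policy "run" by the agent is sometimes referred to as a trajectory. Such a sequence yields rewards.

You will train your Pong agent through an "on-policy" method using policy gradients, an algorithm belonging to a family of policy-based methods. Policy gradient methods typically update the parameters of the policy with respect to the long-term cumulative reward using gradient descent, which is widely used in machine learning. Since the goal is to maximize the function (the rewards), not minimize it, the process is also called gradient ascent. In other words, you use a policy for the agent to take actions, and the objective is to maximize the rewards, which you do by computing the gradients and using them to update the parameters in the policy (neural) network.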
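To make the idea of gradient ascent on a policy concrete, here is a minimal sketch on a toy problem that is not part of the Pong setup. It assumes a hypothetical one-parameter policy (`theta`) with two actions, where "up" always earns +1 and "down" always earns -1. The update multiplies the reward by the gradient of the log probability of the sampled action, which for a sigmoid output is `y - p_up`.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


theta = 0.0  # Single policy parameter (hypothetical toy setup).
learning_rate = 0.5

for _ in range(200):
    p_up = sigmoid(theta)
    # Sample an action from the policy: 1 = "up", 0 = "down".
    y = 1 if rng.uniform() < p_up else 0
    # Toy reward: "up" is always good, "down" is always bad.
    reward = 1.0 if y == 1 else -1.0
    # Gradient of the log probability of the sampled action w.r.t. theta.
    grad_log_prob = y - p_up
    # Gradient ascent: move theta in the direction that increases reward.
    theta += learning_rate * reward * grad_log_prob

p_final = sigmoid(theta)
print(f"P(up) after training: {p_final:.3f}")
```

In this toy case every update increases `theta`, so the probability of the rewarded action grows steadily. In Pong the rewards are sparse and noisy, which is why many more episodes are needed.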

Set up Pong#

1. First, install OpenAI Gym (using pip install gym[atari]; this package is currently not available on conda), and import NumPy, Gym, and the necessary modules:

import numpy as np
import gym

Gym can monitor and save the output using the Monitor wrapper:

from gym import wrappers
from gym.wrappers import Monitor

2. Instantiate a Gym environment for the game of Pong:

env = gym.make("Pong-v0")

3. Let's review which actions are available in the Pong-v0 environment:

print(env.action_space)

print(env.get_action_meanings())

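There are 6 actions. However, LEFTFIRE is actually LEFT, RIGHTFIRE is RIGHT, and NOOP is FIRE.

For simplicity, your policy network will have one output: a (log) probability for "moving up" (indexed at 2 or RIGHT). The other available action will be indexed at 3 ("move down" or LEFT).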
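As a rough illustration of how a single output can drive a two-action choice, the sketch below samples action 2 with the given probability and action 3 otherwise. The helper name `sample_action` and the fixed probability 0.8 are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

UP, DOWN = 2, 3  # Pong action indices used in this tutorial.


def sample_action(prob_up, rng):
    # Choose "up" with probability `prob_up`, otherwise "down".
    return UP if rng.uniform() < prob_up else DOWN


# Sample many actions from a fixed, made-up probability to check the mapping.
actions = np.array([sample_action(0.8, rng) for _ in range(10_000)])
frac_up = np.mean(actions == UP)
print(f"Fraction of 'up' actions: {frac_up:.3f}")
```

Sampling, rather than always picking the more likely action, is what lets the agent keep exploring during training.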

4. Gym can save videos of the agent's learning in MP4 format. Wrap Monitor() around the environment by running the following:

env = Monitor(env, "./video", force=True)

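While you can perform all kinds of RL experiments in a Jupyter notebook, rendering images or videos of a Gym environment to visualize how your agent plays Pong after training can be rather challenging. If you want to set up video playback in a notebook, you can find the details in the Appendix at the end of this tutorial.

Preprocess frames (the observation)#

In this section you will set up a function to preprocess the input data (game observation) to make it digestible for the neural network, which can only work with inputs in the form of tensors (multidimensional arrays) of floating-point type.

Your agent will use the frames from the Pong game (pixels from screen frames) as input observations for the policy network. The game observation tells the agent where the ball is before it is fed (with a forward pass) into the neural network (the policy). This is similar to DeepMind's DQN method (which is further discussed in the Appendix).

Pong screen frames are 210x160 pixels over 3 color dimensions (red, green, and blue). The arrays are encoded with uint8 (8-bit integers), and these observations are stored in a Gym Box instance.

1. Check Pong's observations: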

print(env.observation_space)

In Gym, the agent's actions and observations can be part of the Box (n-dimensional) or Discrete (fixed-range integers) classes.

2. You can view a random observation (one frame) by:

1) Setting the random `seed` before initialization (optional).

2) Calling Gym's `reset()` to reset the environment, which returns an initial observation.

3) Using Matplotlib to display the `render`ed observation.

(You can refer to the OpenAI Gym core API for more information about Gym's core classes and methods.)

import matplotlib.pyplot as plt

env.seed(42)
env.reset()
random_frame = env.render(mode="rgb_array")
print(random_frame.shape)
plt.imshow(random_frame)

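To feed the observations into the policy (neural) network, you need to convert them into 1D grayscale vectors of 6,400 (80x80x1) floating-point values. (During training, you will use NumPy's np.ravel() function to flatten these arrays.)

3. Set up a helper function for frame (observation) preprocessing: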

def frame_preprocessing(observation_frame):
    # Crop the frame.
    observation_frame = observation_frame[35:195]
    # Downsample the frame by a factor of 2.
    observation_frame = observation_frame[::2, ::2, 0]
    # Remove the background and apply other enhancements.
    observation_frame[observation_frame == 144] = 0  # Erase the background (type 1).
    observation_frame[observation_frame == 109] = 0  # Erase the background (type 2).
    observation_frame[observation_frame != 0] = 1  # Set the items (rackets, ball) to 1.
    # Return the preprocessed 80x80 frame as a floating-point array
    # (it is flattened to 1D later with `ravel()`).
    return observation_frame.astype(float)

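4. Preprocess the random frame from earlier to test the function. The result is an 80x80 array; once flattened, it becomes the 6,400-element 1D input for the policy network: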

preprocessed_random_frame = frame_preprocessing(random_frame)
plt.imshow(preprocessed_random_frame, cmap="gray")
print(preprocessed_random_frame.shape)
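If Gym is not installed, the preprocessing steps can still be checked on a synthetic frame. The sketch below is an assumption-laden stand-in: `preprocess_copy` is a hypothetical variant that applies the same crop, downsample, and binarize steps but works on a copy, and the "ball" pixel value 236 is made up for the test. Working on a copy matters because slicing a NumPy array returns a view, so in-place assignments would otherwise also change the original frame.

```python
import numpy as np


def preprocess_copy(frame):
    """Same steps as `frame_preprocessing`, applied to a copy of the input."""
    # Crop rows 35..194, keep every second row and column, take one channel.
    frame = frame[35:195:2, ::2, 0].copy()
    # Erase both background colors.
    frame[(frame == 144) | (frame == 109)] = 0
    # Everything else (rackets, ball) becomes 1.
    frame[frame != 0] = 1
    return frame.astype(float)


# Synthetic 210x160 RGB frame filled with a background color.
fake_frame = np.full((210, 160, 3), 144, dtype=np.uint8)
# Hypothetical "ball": a small bright patch.
fake_frame[100:104, 80:82, :] = 236
original = fake_frame.copy()

processed = preprocess_copy(fake_frame)
flat = processed.ravel()
print(processed.shape, flat.shape)
```

The flattened array has 6,400 entries, which matches the input dimension used for the policy network.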

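Create the policy (the neural network) and the forward pass#

Next, you will define the policy as a simple feedforward network that takes a game observation as input and outputs an action probability:

For the input, it will use the Pong video game frames: the preprocessed 1D vectors with 6,400 (80x80) floating-point values.
The hidden layer will compute the weighted sum of inputs using NumPy's dot product function np.dot() for the arrays and then apply a non-linear activation function, such as ReLU.
Then, the output layer will again perform the matrix multiplication of the weight parameters and the hidden layer's output (with np.dot()), and send that information through a sigmoid activation function.
In the end, the policy network will output one action probability (given that observation) for the agent: the probability of the Pong action indexed in the environment at 2 ("moving the racket up").

1. Let's instantiate certain parameters for the input, hidden, and output layers, and start setting up the network model.

Start by creating a random number generator instance for the experiment (seeded for reproducibility):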

rng = np.random.default_rng(seed=12288743)

Then:

Set the input (observation) dimensionality, that is, the size of your preprocessed screen frames:

D = 80 * 80

Set the number of hidden layer neurons:

H = 200

Instantiate your policy (neural) network model as an empty dictionary:

model = {}

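In a neural network, weights are important adjustable parameters that the network fine-tunes by forward and backward propagating the data.

2. Using a technique called Xavier initialization, set up the network model's initial weights with NumPy's Generator.standard_normal(), which returns random numbers from a standard normal distribution, as well as np.sqrt():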

model["W1"] = rng.standard_normal(size=(H, D)) / np.sqrt(D)
model["W2"] = rng.standard_normal(size=H) / np.sqrt(H)
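The scaling by the square root of the input size is meant to keep the hidden layer's pre-activations at a reasonable scale. The following sketch checks this numerically under one assumption that is not in the tutorial: the input is a standard normal vector rather than a real difference frame.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

D, H = 80 * 80, 200

# Same scaling as used for model["W1"].
W1 = rng.standard_normal(size=(H, D)) / np.sqrt(D)

# Hypothetical input with unit-variance entries.
x = rng.standard_normal(size=D)
h_pre = W1 @ x

print("std of W1:", W1.std(), "expected:", 1 / np.sqrt(D))
print("std of pre-activations:", h_pre.std())
```

Without the division by `np.sqrt(D)`, the pre-activations would have a standard deviation of roughly 80 for such an input, which would push the later sigmoid toward saturation.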

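3. Your policy network starts by randomly initializing the weights and feeds the input data (frames) forward from the input layer through a hidden layer to the output layer. This process is called the forward pass or forward propagation, and is outlined in the function policy_forward():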

def policy_forward(x, model):
    # Matrix-multiply the weights by the input in the one and only hidden layer.
    h = np.dot(model["W1"], x)
    # Apply non-linearity with ReLU.
    h[h < 0] = 0
    # Calculate the "dot" product in the outer layer.
    # The input for the sigmoid function is called logit.
    logit = np.dot(model["W2"], h)
    # Apply the sigmoid function (non-linear activation).
    p = sigmoid(logit)
    # Return the probability of action 2 ("move up")
    # and the hidden "state" that you need for backpropagation.
    return p, h

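Note that there are two activation functions for determining non-linear relationships between inputs and outputs. These non-linear functions are applied to the output of the layers:

Rectified linear unit (ReLU): defined as h[h<0] = 0 above. It returns 0 for negative inputs and the same value if it's positive.
Sigmoid: defined below as sigmoid(). It "wraps" the last layer's output and returns an action probability in the (0, 1) range.

4. Define the sigmoid function separately with NumPy's np.exp() for computing exponentials: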

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
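As a quick sanity check of the forward pass, the sketch below runs a randomly initialized network on a made-up input and inspects the outputs. It re-creates the pieces locally (with hypothetical names `toy_model` and `forward`) so that it runs on its own; the logic mirrors `policy_forward()`.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 80 * 80, 200

toy_model = {
    "W1": rng.standard_normal(size=(H, D)) / np.sqrt(D),
    "W2": rng.standard_normal(size=H) / np.sqrt(H),
}


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, model):
    # Hidden layer: weighted sum followed by ReLU.
    h = np.maximum(np.dot(model["W1"], x), 0)
    # Output layer: weighted sum (the logit) followed by sigmoid.
    p = sigmoid(np.dot(model["W2"], h))
    return p, h


x = rng.standard_normal(D)  # Stand-in for a flattened difference frame.
p, h = forward(x, toy_model)
print("P(up):", p, "hidden shape:", h.shape)
```

The network returns a single scalar probability and a vector of 200 non-negative hidden activations, which is what the training loop stores for backpropagation.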

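Set up the update step (backpropagation)#

During learning in your deep RL algorithm, you use the action log probabilities (given an observation) and the discounted returns (for example, +1 or -1 in Pong) and perform the backward pass, or backpropagation, to update the parameters (the policy network's weights).

1. Let's define the backward pass function (policy_backward()) with the help of NumPy's functions for array multiplication: np.dot() (matrix multiplication), np.outer() (outer product computation), and np.ravel() (to flatten arrays into 1D arrays):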

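def policy_backward(eph, epdlogp, model):
    # Note: `epx` (the stacked episode observations) is not a parameter; it is
    # read from the global scope, where the training loop defines it.
    # Gradient with respect to the output-layer weights.
    dW2 = np.dot(eph.T, epdlogp).ravel()
    # Backpropagate into the hidden layer.
    dh = np.outer(epdlogp, model["W2"])
    # Apply the ReLU gradient: no signal flows through inactive units.
    dh[eph <= 0] = 0
    # Gradient with respect to the hidden-layer weights.
    dW1 = np.dot(dh.T, epx)
    # Return the gradients for both weight arrays (the update happens later).
    return {"W1": dW1, "W2": dW2}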

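Using the intermediate hidden "states" of the network (eph) and the gradients of action log probabilities (epdlogp) for an episode, the policy_backward function propagates the gradients back through the policy network and returns the weight gradients. The weights themselves are updated later, in the training loop.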
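One way to gain confidence in a hand-written backward pass is a finite-difference gradient check. The sketch below is an illustration under stated assumptions, not part of the original experiment: it uses tiny, made-up dimensions, random data, and a hypothetical `backward` function that matches `policy_backward()` except that the observations `epx` are passed in explicitly. The objective is the advantage-weighted log-likelihood of the sampled actions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 6, 5, 4  # Tiny, hypothetical sizes: time steps, inputs, hidden units.


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, model):
    h = np.maximum(np.dot(model["W1"], x), 0)
    return sigmoid(np.dot(model["W2"], h)), h


def backward(epx, eph, epdlogp, model):
    # Same math as `policy_backward`, with `epx` passed in explicitly.
    dW2 = np.dot(eph.T, epdlogp).ravel()
    dh = np.outer(epdlogp, model["W2"])
    dh[eph <= 0] = 0
    dW1 = np.dot(dh.T, epx)
    return {"W1": dW1, "W2": dW2}


def objective(model, epx, ys, adv):
    # Advantage-weighted log-likelihood of the actions that were taken.
    total = 0.0
    for x, y, a in zip(epx, ys, adv):
        p, _ = forward(x, model)
        total += a * (y * np.log(p) + (1 - y) * np.log(1 - p))
    return total


model = {
    "W1": rng.standard_normal((H, D)) / np.sqrt(D),
    "W2": rng.standard_normal(H) / np.sqrt(H),
}
epx = rng.standard_normal((T, D))   # Made-up observations.
ys = rng.integers(0, 2, size=T)     # Made-up action labels (1 = "up").
adv = rng.standard_normal(T)        # Made-up advantages.

# Analytic gradients from the backward pass.
outs = [forward(x, model) for x in epx]
probs = np.array([p for p, _ in outs])
eph = np.vstack([h for _, h in outs])
epdlogp = ((ys - probs) * adv).reshape(-1, 1)
grads = backward(epx, eph, epdlogp, model)


def numeric_grad(name, eps=1e-6):
    # Central finite differences, one weight at a time.
    g = np.zeros_like(model[name])
    for idx in np.ndindex(*model[name].shape):
        old = model[name][idx]
        model[name][idx] = old + eps
        up = objective(model, epx, ys, adv)
        model[name][idx] = old - eps
        down = objective(model, epx, ys, adv)
        model[name][idx] = old
        g[idx] = (up - down) / (2 * eps)
    return g


max_err = max(np.abs(grads[k] - numeric_grad(k)).max() for k in model)
print(f"Largest analytic-vs-numeric difference: {max_err:.2e}")
```

A small difference suggests that the analytic gradients point in the direction that increases the objective, which is what gradient ascent needs.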

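2. When applying backpropagation during agent training, you will need to save several variables for each episode. Let's instantiate empty lists to store them: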

# All preprocessed observations for the episode.
xs = []
# All hidden "states" (from the network) for the episode.
hs = []
# All gradients of the action log probabilities
# (with respect to the output logit) for the episode.
dlogps = []
# All rewards for the episode.
drs = []

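You will reset these variables manually at the end of each episode during training, after they are "full", and reshape them with NumPy's np.vstack(). This is demonstrated in the training stage toward the end of the tutorial.

3. Next, to perform gradient ascent when optimizing the agent's policy, it is common to use deep learning optimizers (you're performing optimization with gradients). In this example, you'll use RMSProp, an adaptive optimization method. Let's set a discounting factor (a decay rate) for the optimizer: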

decay_rate = 0.99

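4. You will also need to store the gradients (with the help of NumPy's np.zeros_like()) for the optimization step during training:

First, save the update buffers that add up gradients over a batch: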

grad_buffer = {k: np.zeros_like(v) for k, v in model.items()}

Second, store the RMSProp memory for the optimizer for gradient ascent:

rmsprop_cache = {k: np.zeros_like(v) for k, v in model.items()}
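To see what the RMSProp cache does, here is a minimal sketch of a single update on two made-up weights. The gradient values are hypothetical; the update rule itself is the one used later in the training loop (gradient ascent, so the step is added).

```python
import numpy as np

decay_rate = 0.99
learning_rate = 1e-4
eps = 1e-5  # Small constant to avoid division by zero.

w = np.array([0.5, -0.5])
g = np.array([2.0, -0.01])  # Hypothetical accumulated gradients.
cache = np.zeros_like(w)

# Running average of squared gradients.
cache = decay_rate * cache + (1 - decay_rate) * g ** 2
# Scale each step by the root of that average.
step = learning_rate * g / (np.sqrt(cache) + eps)
w_new = w + step

print("step:", step)
```

Although the two gradients differ by a factor of 200, the resulting steps have nearly the same magnitude. This per-parameter scaling is the "adaptive" part of the method.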

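Define the discounted rewards (expected return) function#

In this section, you will set up a function for computing discounted rewards (discount_rewards()), that is, the expected return from an observation. The function takes a 1D array of rewards as input and uses NumPy's np.zeros_like() to allocate the output.

To give more weight to shorter-term rewards than to longer-term ones, you will use a discount factor (gamma) that is often a floating-point number between 0.9 and 0.99.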

gamma = 0.99


def discount_rewards(r, gamma):
    discounted_r = np.zeros_like(r)
    running_add = 0
    # From the last reward to the first...
    for t in reversed(range(0, r.size)):
        # ...reset the reward sum
        if r[t] != 0:
            running_add = 0
        # ...compute the discounted reward
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r
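A short worked example may help to see what the function returns. The sketch below repeats the function so that it runs on its own and applies it to a made-up reward sequence with `gamma = 0.5` (chosen only to keep the arithmetic easy to follow). The running sum is reset at every non-zero reward because in Pong a non-zero reward marks the end of a rally.

```python
import numpy as np


def discount_rewards(r, gamma):
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(range(0, r.size)):
        # A non-zero reward ends a rally, so restart the sum there.
        if r[t] != 0:
            running_add = 0
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r


# Two rallies: one won at step 2, one lost at step 4.
rewards = np.array([0.0, 0.0, 1.0, 0.0, -1.0])
discounted = discount_rewards(rewards, gamma=0.5)
print(discounted)  # [ 0.25  0.5   1.   -0.5  -1.  ]
```

Note that the input should be a floating-point array: with an integer array, `np.zeros_like()` would create an integer output and the fractional values would be truncated.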

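Train the agent for a number of episodes#

This section covers how to set up the training process during which your agent will learn to play Pong using its policy.

The pseudocode for the policy gradient method for Pong:

Instantiate the policy (your neural network) and randomly initialize the weights in the policy network.
Initialize a random observation.
Repeat over a number of episodes:
- Input an observation into the policy network and output action probabilities for the agent (forward propagation).
- The agent takes an action for each observation, observes the received rewards, and collects trajectories (over a predefined number of episodes or batch size) of state-action experiences.
- Compute the cross-entropy (with a positive sign, since you need to maximize the rewards and not minimize the loss).
- For every batch of episodes:
  - Calculate the gradients of your action log probabilities using the cross-entropy.
  - Compute the cumulative return and, to give more weight to shorter-term rewards than to longer-term ones, use a discount factor.
  - Multiply the gradients of the action log probabilities by the discounted rewards (the "advantage").
  - Perform gradient ascent (backpropagation) to optimize the policy network's parameters (its weights).
    - Maximize the probability of actions that lead to high rewards.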

Diagram showing operations detailed in this tutorial

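You can stop the training at any time and/or check the saved MP4 videos of the gameplay in the ./video directory on your disk. You can set the maximum number of episodes to whatever is most appropriate for your setup.

1. For demo purposes, let's limit the number of episodes for training to 3. If you are using hardware acceleration (CPUs and GPUs), you can increase the number to 1,000 or beyond. For comparison, Andrej Karpathy's original experiment took about 8,000 episodes.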

max_episodes = 3

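2. Set the batch size and the learning rate values:

The batch size dictates how often (in episodes) the model performs a parameter update. It is the number of times your agent can collect the state-action trajectories. At the end of the collection, you can perform the maximization of the action-probability multiples.
The learning rate helps limit the magnitude of weight updates to prevent them from overcorrecting.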

batch_size = 3
learning_rate = 1e-4

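3. Set the game rendering default variable for Gym's render method (it is used to display the observation and is optional, but can be useful during debugging):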

render = False

4. Set the agent's initial (random) observation by calling reset():

observation = env.reset()

5. Initialize the previous observation:

prev_x = None

6. Initialize the reward variables and the episode count:

running_reward = None
reward_sum = 0
episode_number = 0

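7. To simulate motion between the frames, set the single input frame (x) for the policy network as the difference between the current and previous preprocessed frames: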

def update_input(prev_x, cur_x, D):
    if prev_x is not None:
        x = cur_x - prev_x
    else:
        x = np.zeros(D)
    return x
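The effect of the frame difference is easiest to see on a tiny example. The sketch below repeats the function so that it runs on its own and uses two made-up 6-pixel "frames" in which a single bright pixel moves one position to the right.

```python
import numpy as np


def update_input(prev_x, cur_x, D):
    if prev_x is not None:
        x = cur_x - prev_x
    else:
        x = np.zeros(D)
    return x


D_toy = 6  # Hypothetical, tiny input size for illustration.
frame_a = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
frame_b = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])

first_input = update_input(None, frame_a, D_toy)       # No previous frame yet.
second_input = update_input(frame_a, frame_b, D_toy)   # Difference of frames.
print(first_input)
print(second_input)
```

The first input is all zeros, and the second contains -1 where the pixel was and +1 where it is now. Static parts of the scene cancel out, so the network mostly sees what moved.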

8. Finally, start the training loop, using the functions you have predefined:


while episode_number < max_episodes:
    # (For rendering.)
    if render:
        env.render()

    # 1. Preprocess the observation (a game frame) and flatten with NumPy's `ravel()`.
    cur_x = frame_preprocessing(observation).ravel()

    # 2. Instantiate the observation for the policy network
    x = update_input(prev_x, cur_x, D)
    prev_x = cur_x

    # 3. Perform the forward pass through the policy network using the observations
    # (preprocessed frames as inputs) and store the action log probabilities
    # and hidden "states" (for backpropagation) during the course of each episode.
    aprob, h = policy_forward(x, model)
    # 4. Let the action indexed at `2` ("move up") be that probability
    # if it's higher than a randomly sampled value
    # or use action `3` ("move down") otherwise.
    action = 2 if rng.uniform() < aprob else 3

    # 5. Cache the observations and hidden "states" (from the network)
    # in separate variables for backpropagation.
    xs.append(x)
    hs.append(h)

    # 6. Compute the gradients of action log probabilities:
    # - If the action was to "move up" (index `2`):
    y = 1 if action == 2 else 0
    # - The cross-entropy:
    # `y*log(aprob) + (1 - y)*log(1-aprob)`
    # or `log(aprob)` if y = 1, else: `log(1 - aprob)`.
    # (Recall: you used the sigmoid function (`1/(1+np.exp(-x))`) to output
    # `aprob` action probabilities.)
    # - Then the gradient: `y - aprob`.
    # 7. Append the gradients of your action log probabilities.
    dlogps.append(y - aprob)
    # 8. Take an action and update the parameters with Gym's `step()`
    # function; obtain a new observation.
    observation, reward, done, info = env.step(action)
    # 9. Update the total sum of rewards.
    reward_sum += reward
    # 10. Append the reward for the previous action.
    drs.append(reward)

    # After an episode is finished:
    if done:
        episode_number += 1
        # 11. Collect and reshape stored values with `np.vstack()` of:
        # - Observation frames (inputs),
        epx = np.vstack(xs)
        # - hidden "states" (from the network),
        eph = np.vstack(hs)
        # - gradients of action log probabilities,
        epdlogp = np.vstack(dlogps)
        # - and received rewards for the past episode.
        epr = np.vstack(drs)

        # 12. Reset the stored variables for the new episode:
        xs = []
        hs = []
        dlogps = []
        drs = []

        # 13. Discount the rewards for the past episode using the helper
        # function you defined earlier...
        discounted_epr = discount_rewards(epr, gamma)
        # ...and normalize them because they have high variance
        # (this is explained below.)
        discounted_epr -= np.mean(discounted_epr)
        discounted_epr /= np.std(discounted_epr)

        # 14. Multiply the discounted rewards by the gradients of the action
        # log probabilities (the "advantage").
        epdlogp *= discounted_epr
        # 15. Use the gradients to perform backpropagation and gradient ascent.
        grad = policy_backward(eph, epdlogp, model)
        # 16. Save the policy gradients in a buffer.
        for k in model:
            grad_buffer[k] += grad[k]
        # 17. Use the RMSProp optimizer to perform the policy network
        # parameter (weight) update once every `batch_size` episodes
        # (here: every 3 episodes).
        if episode_number % batch_size == 0:
            for k, v in model.items():
                # The gradient.
                g = grad_buffer[k]
                # Use the RMSProp discounting factor.
                rmsprop_cache[k] = (
                    decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g ** 2
                )
                # Update the policy network with a learning rate
                # and the RMSProp optimizer using gradient ascent
                # (hence, there's no negative sign)
                model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)
                # Reset the gradient buffer at the end.
                grad_buffer[k] = np.zeros_like(v)

        # 18. Track a running (exponentially weighted) mean of the episode rewards.
        running_reward = (
            reward_sum
            if running_reward is None
            else running_reward * 0.99 + reward_sum * 0.01
        )
        print(
            "Resetting the Pong environment. Episode total reward: {} Running mean: {}".format(
                reward_sum, running_reward
            )
        )

        # 19. Set the agent's initial observation by calling Gym's `reset()` function
        # for the next episode and setting the reward sum back to 0.
        reward_sum = 0
        observation = env.reset()
        prev_x = None

    # 20. Display the output during training.
    if reward != 0:
        print(
            "Episode {}: Game finished. Reward: {}...".format(episode_number, reward)
            + ("" if reward == -1 else " POSITIVE REWARD!")
        )

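A few notes:

If you have previously run an experiment and want to repeat it, your Monitor instance may still be running, which may throw an error the next time you try to train the agent. Therefore, you should first shut down Monitor by calling env.close(), which you can do by uncommenting and running the cell below: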

# env.close()

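In Pong, if a player doesn't hit the ball back, they receive a negative reward (-1) and the other player gets a +1 reward. The rewards that the agent receives by playing Pong have a significant variance. Therefore, it's best practice to normalize them by subtracting the mean (using np.mean()) and dividing by the standard deviation (using NumPy's np.std()).
When using only NumPy, the deep RL training process, including backpropagation, spans several lines of code that may appear quite long. One of the main reasons for this is that you're not using a deep learning framework with an automatic differentiation library, which usually simplifies such experiments. This tutorial shows how to perform everything from scratch, but you can also use one of many Python-based frameworks with "autodiff" and "autograd", which you will learn about at the end of the tutorial.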
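The normalization mentioned in the first note can be sketched on a small, made-up array of discounted rewards. The small constant added to the standard deviation is an assumption of this sketch, not something the training loop above does; it guards against division by zero in the unlikely case that all values are identical.

```python
import numpy as np

# Hypothetical discounted rewards for a short episode.
returns = np.array([0.25, 0.5, 1.0, -0.5, -1.0])

# Shift to zero mean and scale to unit standard deviation.
normalized = (returns - returns.mean()) / (returns.std() + 1e-8)
print(normalized.mean(), normalized.std())

# Edge case: identical values would otherwise produce a division by zero.
flat = np.zeros(3)
safe = (flat - flat.mean()) / (flat.std() + 1e-8)
print(safe)
```

After normalization, roughly half of the actions in an episode are encouraged and half are discouraged, which tends to make the gradient estimates less noisy.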

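Next steps#

You may notice that training an RL agent takes a long time if you increase the number of episodes from 100 to 500 or 1,000+, depending on the hardware (CPUs and GPUs) you are using for this task.

Policy gradient methods can learn a task if you give them a lot of time, and optimization in RL is a challenging problem. Training agents to learn to play Pong or any other task can be sample-inefficient and require a lot of episodes. You may also notice in your training output that even after hundreds of episodes, the rewards may have high variance.

In addition, as with many deep learning-based algorithms, you should take into account the large number of parameters that your policy has to learn. In Pong, this number adds up to 1 million or more with 200 nodes in the hidden layer of the network and an input dimension of 6,400 (80x80). Therefore, adding more CPUs and GPUs to assist with training can always be an option.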
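The parameter count quoted above follows directly from the layer sizes. The short calculation below assumes the network defined in this tutorial, with no bias terms:

```python
D = 80 * 80  # Input dimension (flattened 80x80 frame).
H = 200      # Number of hidden units.

n_w1 = H * D   # Weights between the input and the hidden layer.
n_w2 = H       # Weights between the hidden layer and the single output.
total = n_w1 + n_w2

print(f"W1: {n_w1:,}  W2: {n_w2:,}  total: {total:,}")
```

Almost all of the parameters sit in the first layer, so the input resolution is the main driver of the model size.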

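You can use a much more advanced policy gradient-based algorithm, which can help speed up training, reduce the sensitivity to hyperparameters, and resolve other issues. For example, Proximal Policy Optimization (PPO), developed by John Schulman et al. in 2017, was combined with "self-play" to train the OpenAI Five agent over 10 months to play Dota 2 at a competitive level. Of course, if you apply these methods to smaller Gym environments, it should take hours, not months, to train.

In general, there are many RL challenges and possible solutions, and you can explore some of them in Reinforcement learning, fast and slow by Matthew Botvinick, Sam Ritter, Jane X. Wang, Zeb Kurth-Nelson, Charles Blundell, and Demis Hassabis (2019).

If you want to learn more about deep RL, you should check out the following free educational material:

Spinning Up in Deep RL: developed by OpenAI.
Deep RL lectures taught by practitioners at DeepMind and UC Berkeley.
RL lectures taught by David Silver (DeepMind, UCL).

Building a neural network from scratch with NumPy is a great way to learn more about NumPy and about deep learning. However, for real-world applications you should use specialized frameworks, such as PyTorch, JAX, TensorFlow, or MXNet, that provide NumPy-like APIs, have built-in automatic differentiation and GPU support, and are designed for high-performance numerical computing and machine learning.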

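Appendix#

Notes on RL and deep RL#

In supervised deep learning for tasks such as image recognition, language translation, or text classification, you're more likely to use a lot of labeled data. However, in RL, agents typically don't receive direct explicit feedback indicating correct or wrong actions. They rely on other signals, such as rewards.
Deep RL combines RL with deep learning. The field had its first major success in more complex environments, such as video games, in 2013, a year after the AlexNet breakthrough in computer vision. Volodymyr Mnih and colleagues at DeepMind published a research paper called Playing Atari with deep reinforcement learning (updated in 2015), which showed that they were able to train an agent that could play several classic games from the Arcade Learning Environment at a human level. Their RL algorithm, called a deep Q-network (DQN), used convolutional layers in a neural network that approximated Q learning and used experience replay.
Unlike the simple policy gradient method that you used in this example, DQN uses a type of "off-policy" value-based method (that approximates Q learning), while the original AlphaGo uses policy gradients and Monte Carlo tree search.
Policy gradients with function approximation, such as neural networks, were written about by Richard Sutton et al. in 2000. They were influenced by a number of previous works, including statistical gradient-following algorithms, such as REINFORCE (Ronald Williams, 1992), as well as backpropagation (Geoffrey Hinton, 1986), which helps deep learning algorithms learn. RL with neural-network function approximation was introduced in the 1990s in research by Gerald Tesauro (Temporal difference learning and TD-Gammon, 1995), who worked with IBM on an agent that learned to play backgammon in 1992, and Long-Ji Lin (Reinforcement learning for robots using neural networks, 1993).
Since 2013, researchers have come up with many notable approaches for learning to solve complex tasks using deep RL, such as AlphaGo for the game of Go (David Silver et al., 2016), AlphaZero, which mastered Go, chess, and shogi with self-play (David Silver et al., 2017-2018), OpenAI Five for Dota 2 with self-play (OpenAI, 2019), and AlphaStar for StarCraft 2, which used an actor-critic algorithm with experience replay, self-imitation learning, and policy distillation (Oriol Vinyals et al., 2019). In addition, there have been other experiments, such as deep RL for Battlefield 1 by engineers at Electronic Arts/DICE.
One of the reasons why video games are popular in deep RL research is that, unlike real-world experiments, such as RL with remote-controlled helicopters (Pieter Abbeel et al., 2006), virtual simulations can offer safer testing environments.
If you're interested in learning about the implications of deep RL for other fields, such as neuroscience, you can refer to a paper by Matthew Botvinick et al. (2020).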

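How to set up video playback in your Jupyter notebook#

If you're using Binder, a free Jupyter notebook-based tool, you can set up the Docker image and add freeglut3-dev, xvfb, and x11-utils to the apt.txt configuration file to install the initial dependencies. Then, to binder/environment.yml under channels, add gym, pyvirtualdisplay, and anything else you may need, such as python=3.7, pip, and jupyterlab. Check the following post for more information.
If you're using Google Colaboratory (another free Jupyter notebook-based tool), you can enable video playback of the game environments by installing and setting up X virtual frame buffer/Xvfb, X11, FFmpeg, PyVirtualDisplay, PyOpenGL, and other dependencies, as described below.

If you're using Google Colaboratory, run the following commands in a notebook cell to help with video playback: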

# Install Xvfb and X11 dependencies.
!apt-get install -y xvfb x11-utils > /dev/null 2>&1
# To work with videos, install FFmpeg.
!apt-get install -y ffmpeg > /dev/null 2>&1
# Install PyVirtualDisplay for visual feedback and other libraries/dependencies.
!pip install pyvirtualdisplay PyOpenGL PyOpenGL-accelerate > /dev/null 2>&1

Then, add this Python code:

# Import the virtual display module.
from pyvirtualdisplay import Display
# Import ipythondisplay and HTML from IPython for image and video rendering.
from IPython import display as ipythondisplay
from IPython.display import HTML

# Initialize the virtual buffer at 400x300 (adjustable size).
# With Xvfb, you should set `visible=False`.
display = Display(visible=False, size=(400, 300))
display.start()

# Check that no display is present.
# If no displays are present, the expected output is `:0`.
!echo $DISPLAY

# Define a helper function to display videos in Jupyter notebooks.
# (Source: https://star-ai.github.io/Rendering-OpenAi-Gym-in-Colaboratory/)

import sys
import math
import glob
import io
import base64

def show_any_video(mp4video=0):
    mp4list = glob.glob('video/*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[mp4video]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        ipythondisplay.display(HTML(data='''<video alt="test" autoplay
                                            loop controls style="height: 400px;">
                                            <source src="data:video/mp4;base64,{0}" type="video/mp4" />
                                            </video>'''.format(encoded.decode('ascii'))))

    else:
        print('Could not find the video!')

If you want to view the last (very quick) gameplay inside a Jupyter notebook and implemented the show_any_video() function earlier, run this inside a cell:
```
show_any_video(-1)
```
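If you're following the instructions in this tutorial in a local environment on Linux or macOS, you can add most of the code into one Python (.py) file. Then, you can run your Gym experiment through python your-code.py in your terminal. To enable rendering, you can use the command-line interface by following the official OpenAI Gym documentation (make sure you have Gym and Xvfb installed, as described in the guide).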