This tutorial is adapted from the work of Andrew Trask (with the author's permission).




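It is advisable to read the deep learning paper published in 2015 by Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, who are regarded as some of the pioneers of the field. You should also consider reading Andrew Trask's Grokking Deep Learning, which teaches deep learning with NumPy.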


This tutorial can be run locally in an isolated environment, such as Virtualenv or conda. You can use Jupyter Notebook or JupyterLab to run each notebook cell. Don't forget to set up NumPy and Matplotlib.


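1. Load the MNIST dataset

In this section, you will download the zipped MNIST dataset files originally stored on Yann LeCun's website. Then, you will transform them into 4 files of NumPy array type using built-in Python modules. Finally, you will split the arrays into training and test sets.

1. Define a variable that stores the file names of the MNIST training/test images and labels in a dictionary: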

data_sources = {
    "training_images": "train-images-idx3-ubyte.gz",  # 60,000 training images.
    "test_images": "t10k-images-idx3-ubyte.gz",  # 10,000 test images.
    "training_labels": "train-labels-idx1-ubyte.gz",  # 60,000 training labels.
    "test_labels": "t10k-labels-idx1-ubyte.gz",  # 10,000 test labels.
}

2. Load the data. First check whether the data is stored locally; if not, download it.

import requests
import os

data_dir = "../_data"
os.makedirs(data_dir, exist_ok=True)

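base_url = ""  # The download URL is missing from this copy; set it to the MNIST mirror you use.
request_opts = {}  # Assumption: no extra options are passed to requests.get().

for fname in data_sources.values():
    fpath = os.path.join(data_dir, fname)
    if not os.path.exists(fpath):
        print("Downloading file: " + fname)
        resp = requests.get(base_url + fname, stream=True, **request_opts)
        resp.raise_for_status()  # Ensure download was successful
        with open(fpath, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=128):
                fh.write(chunk)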

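3. Decompress the 4 files and create 4 ndarrays, saving them into a dictionary. Each original image is of size 28x28, and neural networks normally expect a 1D vector input; therefore, you also need to flatten each image into a vector of 28 x 28 = 784 values.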

import gzip
import numpy as np

mnist_dataset = {}

# Images
for key in ("training_images", "test_images"):
    with gzip.open(os.path.join(data_dir, data_sources[key]), "rb") as mnist_file:
        mnist_dataset[key] = np.frombuffer(
            mnist_file.read(), np.uint8, offset=16
        ).reshape(-1, 28 * 28)
# Labels
for key in ("training_labels", "test_labels"):
    with gzip.open(os.path.join(data_dir, data_sources[key]), "rb") as mnist_file:
        mnist_dataset[key] = np.frombuffer(mnist_file.read(), np.uint8, offset=8)
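
The offset arguments in the step above skip the header of each IDX file. As a minimal sketch, assuming the IDX layout of a big-endian header (a magic number followed by one 32-bit size per dimension, which gives 16 bytes for image files and 8 bytes for label files), the following builds a tiny in-memory stand-in for an image file and parses it the same way. The buffer, its sizes, and its pixel values are made up for illustration and are not MNIST data.

```python
import struct

import numpy as np

# Build a tiny stand-in for an IDX3 image file: a 16-byte big-endian header
# (magic number, image count, rows, columns) followed by raw uint8 pixels.
num_images, rows, cols = 2, 28, 28
header = struct.pack(">IIII", 2051, num_images, rows, cols)
pixels = (np.arange(num_images * rows * cols) % 256).astype(np.uint8)
raw = header + pixels.tobytes()

# Skip the header with offset=16 and flatten each image to 784 values.
images = np.frombuffer(raw, np.uint8, offset=16).reshape(-1, rows * cols)
print(images.shape)
```

A label file would be handled the same way with offset=8, because its header holds only the magic number and the item count.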

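4. Split the data into training and test sets using the standard notation of x for data and y for labels, calling the training and test set images x_train and x_test, and the labels y_train and y_test:

x_train, y_train = mnist_dataset["training_images"], mnist_dataset["training_labels"]
x_test, y_test = mnist_dataset["test_images"], mnist_dataset["test_labels"]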

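5. You can confirm that the shapes of the image arrays are (60000, 784) and (10000, 784) for the training and test sets, respectively, and that the label shapes are (60000,) and (10000,):

print("The shape of training images: {} and training labels: {}".format(
    x_train.shape, y_train.shape))
print("The shape of test images: {} and test labels: {}".format(
    x_test.shape, y_test.shape))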
The shape of training images: (60000, 784) and training labels: (60000,)
The shape of test images: (10000, 784) and test labels: (10000,)

6. You can inspect some images using Matplotlib:

import matplotlib.pyplot as plt

# Take the 60,000th image (indexed at 59,999) from the training set,
# reshape from (784, ) to (28, 28) to have a valid shape for displaying purposes.
mnist_image = x_train[59999, :].reshape(28, 28)
# Set the color mapping to grayscale to have a black background.
plt.imshow(mnist_image, cmap="gray")
# Display the image.
plt.show()

# Display 5 random images from the training set.
num_examples = 5
seed = 147197952744
rng = np.random.default_rng(seed)

fig, axes = plt.subplots(1, num_examples)
for sample, ax in zip(rng.choice(x_train, size=num_examples, replace=False), axes):
    ax.imshow(sample.reshape(28, 28), cmap="gray")



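Note: You can also inspect a sample image as an array by printing x_train[59999]. The output is long and contains 8-bit integers; an excerpt:

         0,   0,  38,  48,  48,  22,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,  62,  97, 198, 243, 254, 254, 212,  27,   0,   0,   0,   0,

# Display the label of the 60,000th image (indexed at 59,999) from the training set.
print(y_train[59999])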

2. Preprocess the data



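  • Normalizing the image data: a feature scaling procedure that can speed up neural network training by standardizing the distribution of the input data.

  • One-hot/categorical encoding of the image labels.

In practice, you can use different types of floating-point precision depending on your goals; the Nvidia and Google Cloud blog posts give more information on this.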


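The image data contain 8-bit integers in the [0, 255] interval, each representing a color value between 0 and 255.

You will normalize them into floating-point arrays in the [0, 1] interval by dividing them by 255.

1. Check that the vectorized image data has type uint8: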

print("The data type of training images: {}".format(x_train.dtype))
print("The data type of test images: {}".format(x_test.dtype))
The data type of training images: uint8
The data type of test images: uint8

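2. Normalize the arrays by dividing them by 255 (which promotes the data type from uint8 to float64), and assign the training and test image data, x_train and x_test, to training_images and test_images, respectively. To reduce the model training and evaluation time in this example, only a subset of the training and test images is used: training_images and test_images each contain only 1,000 samples, taken from the complete datasets of 60,000 and 10,000 images. You can control these values by changing training_sample and test_sample below, up to their maximums of 60,000 and 10,000.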

training_sample, test_sample = 1000, 1000
training_images = x_train[0:training_sample] / 255
test_images = x_test[0:test_sample] / 255

3. Confirm that the image data has changed to the floating-point format:

print("The data type of training images: {}".format(training_images.dtype))
print("The data type of test images: {}".format(test_images.dtype))
The data type of training images: float64
The data type of test images: float64

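Note: You can also check that normalization was successful by printing training_images[0] in a notebook cell. The long output should contain an array of floating-point numbers: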

       0.        , 0.        , 0.01176471, 0.07058824, 0.07058824,
       0.07058824, 0.49411765, 0.53333333, 0.68627451, 0.10196078,
       0.65098039, 1.        , 0.96862745, 0.49803922, 0.        ,
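
To see the dtype promotion on a small scale, here is a short sketch with a made-up three-pixel array. The float32 variant at the end is only an assumption about what a lower-precision setup could look like; the tutorial itself keeps float64.

```python
import numpy as np

# A made-up stand-in for a few uint8 pixel values.
sample_pixels = np.array([0, 51, 255], dtype=np.uint8)

# True division promotes uint8 to float64 and maps the values into [0, 1].
normalized = sample_pixels / 255
print(normalized.dtype, normalized.min(), normalized.max())

# Optional lower-precision variant (an illustrative assumption, not used in the tutorial).
normalized_32 = (sample_pixels / 255).astype(np.float32)
print(normalized_32.dtype)
```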


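You will use one-hot encoding to embed each digit label as an all-zero vector, created with np.zeros(), with a 1 placed at the label's index. As a result, your label data will be arrays with 1.0 (or 1.) in the position of each image label.

Since there are 10 labels (from 0 to 9) in total, your arrays will look similar to this:

array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0.])

1. Confirm that the image label data are integers with dtype uint8: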

print("The data type of training labels: {}".format(y_train.dtype))
print("The data type of test labels: {}".format(y_test.dtype))
The data type of training labels: uint8
The data type of test labels: uint8

2. Define a function that performs one-hot encoding on arrays:

def one_hot_encoding(labels, dimension=10):
    # Compare each label with the values 0..dimension-1 via broadcasting.
    # Each row of the result has a single True at the label's index.
    one_hot_labels = labels[..., None] == np.arange(dimension)[None]
    # Return one-hot encoded labels.
    return one_hot_labels.astype(np.float64)

3. Encode the labels and assign the values to new variables:

training_labels = one_hot_encoding(y_train[:training_sample])
test_labels = one_hot_encoding(y_test[:test_sample])

4. Check that the data type has changed to floating point:

print("The data type of training labels: {}".format(training_labels.dtype))
print("The data type of test labels: {}".format(test_labels.dtype))
The data type of training labels: float64
The data type of test labels: float64

5. Examine a few encoded labels:

print(training_labels[0])
print(training_labels[1])
print(training_labels[2])
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
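
The prose above describes one-hot encoding in terms of np.zeros(), while the function in step 2 uses a broadcast comparison. As a rough sketch, the two approaches can be checked against each other; one_hot_with_zeros is a hypothetical helper written only for this comparison, and the three labels are the ones shown in the output above.

```python
import numpy as np

def one_hot_with_zeros(labels, dimension=10):
    # Hypothetical helper: start from an all-zero array and
    # place a 1.0 at each row's label index.
    encoded = np.zeros((len(labels), dimension))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

labels = np.array([5, 0, 4], dtype=np.uint8)

# The broadcast comparison used by the tutorial's one_hot_encoding function.
broadcast = (labels[..., None] == np.arange(10)[None]).astype(np.float64)
explicit = one_hot_with_zeros(labels)

print(np.array_equal(explicit, broadcast))
```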




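3. Build and train a small neural network from scratch


Afterwards, you will construct the building blocks of a simple deep learning model in Python and NumPy and train it to identify handwritten digits from the MNIST dataset with a certain level of accuracy.

Neural network building blocks with NumPy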

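  • Layers: These building blocks work as data filters: they process data and learn representations from inputs to better predict the target outputs.

    You will use 1 hidden layer in your model to pass the inputs forward (forward propagation) and to propagate the gradients/error derivatives of a loss function backward (backpropagation). The model therefore has three layers: input, hidden, and output.

    In the hidden (middle) and output (last) layers, the neural network model computes the weighted sum of its inputs. To compute it, you will use NumPy's matrix-multiplication function (the "dot multiply", np.dot(layer, weights)).

    Note: For simplicity, the bias term is omitted in this example (there is no np.dot(layer, weights) + bias).

  • Weights: These are important adjustable parameters that the neural network fine-tunes by propagating the data forward and backward. They are optimized through a process called gradient descent. Before the model training starts, the weights are randomly initialized with NumPy's Generator.random().


  • Activation function: Deep learning models are capable of determining non-linear relationships between inputs and outputs; the non-linear functions that make this possible are usually applied to the output of each layer.

    You will use a rectified linear unit (ReLU) on the hidden layer's output (for example, relu(np.dot(layer, weights))).

  • Regularization: This technique helps prevent the neural network model from overfitting.

    In this example, you will use a method called dropout (also known as dilution), which randomly sets a number of features in a layer to 0. You will define it with NumPy's Generator.integers() method and apply it to the hidden layer of the network.

  • Loss function: This computation determines the quality of predictions by comparing the image labels (the truth) with the predicted values in the final layer's output.

    For simplicity, you will use a basic total squared error computed with NumPy's np.sum() function (for example, np.sum((final_layer_output - image_labels) ** 2)).

  • Accuracy: This metric measures how accurately the network predicts on data it has not seen.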
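
As a minimal sketch of the dropout step described above, assuming a keep probability of 0.5: Generator.integers(low=0, high=2) draws a mask of 0s and 1s, and multiplying the surviving activations by 2 keeps the expected magnitude of the layer's output unchanged. The seed and the stand-in activations are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # Arbitrary seed for this sketch.

# Stand-in for a hidden layer's output after ReLU (all ones, for easy reading).
activations = np.ones(100_000)

# Each entry is 0 or 1 with equal probability, so about half the units are dropped.
dropout_mask = rng.integers(low=0, high=2, size=activations.shape)

# Scale the kept units by 2 (1 / keep probability) to preserve the expected sum.
dropped = activations * dropout_mask * 2

print(np.unique(dropout_mask))
print(round(float(dropped.mean()), 2))
```

Without the factor of 2, the next layer would receive roughly half the signal during training compared with evaluation, where no mask is applied.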



Diagram showing operations detailed in this tutorial (The input image is passed into a Hidden layer that creates a weighted sum of outputs. The weighted sum is passed to the Non-linearity, then regularization and into the output layer. The output layer creates a prediction which can then be compared to existing data. The errors are used to calculate the loss function and update weights in the hidden layer and output layer.)

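  • The input layer:

    It is the input for the network: the previously preprocessed data that is loaded from training_images into layer_0.

  • The hidden (middle) layer:

    layer_1 takes the output from the previous layer and matrix-multiplies the input by the weights (weights_1) with NumPy's np.dot().

    This output is then passed through the ReLU activation function for non-linearity, after which dropout is applied to help avoid overfitting.

  • The output (last) layer:

    layer_2 ingests the output from layer_1 and repeats the same "dot multiply" process with weights_2.

    The final output returns 10 scores, one for each of the 0-9 digit labels. The network model ends with a layer of size 10: a 10-dimensional vector.

  • Forward propagation, backpropagation, training loop:




  1. Measure the error by comparing the real label of an image (the truth) with the prediction of the model.

  2. Differentiate the loss function.

  3. Ingest the gradients with respect to the output, and backpropagate them with respect to the inputs through the layer(s).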
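
The differentiation step can be checked numerically. The following is a sketch under stated assumptions: it uses small made-up arrays in place of layer_1, weights_2, and a label, and compares the analytic gradient of the total squared error with a finite-difference estimate. The names hidden, weights, label, and loss are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # Arbitrary seed for this sketch.

hidden = rng.random(4)            # Stand-in for layer_1.
weights = rng.random((4, 3))      # Stand-in for weights_2.
label = np.array([0.0, 1.0, 0.0])  # Stand-in for a one-hot label.

def loss(w):
    # Total squared error between the label and the layer's output.
    return np.sum((label - np.dot(hidden, w)) ** 2)

# Analytic gradient: d(loss)/d(weights) = -2 * outer(hidden, label - output).
delta = label - np.dot(hidden, weights)
analytic_grad = -2 * np.outer(hidden, delta)

# Finite-difference estimate, one weight at a time (central differences).
eps = 1e-6
numeric_grad = np.zeros_like(weights)
for idx in np.ndindex(weights.shape):
    step = np.zeros_like(weights)
    step[idx] = eps
    numeric_grad[idx] = (loss(weights + step) - loss(weights - step)) / (2 * eps)

print(np.allclose(analytic_grad, numeric_grad, atol=1e-6))
```

Adding learning_rate * np.outer(hidden, delta) to the weights therefore moves them against the gradient; the constant factor of 2 is effectively folded into the learning rate.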





1. Start by creating a new random number generator, providing a seed for reproducibility:

seed = 884736743
rng = np.random.default_rng(seed)

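2. For the hidden layer, define the ReLU activation function for forward propagation and the derivative of ReLU that is used during backpropagation: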

# Define ReLU that returns the input if it's positive and 0 otherwise.
def relu(x):
    return (x >= 0) * x

# Set up a derivative of the ReLU function that returns True (1) for a
# non-negative input and False (0) otherwise.
def relu2deriv(output):
    return output >= 0
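
The two functions can be tried on a small made-up array. This is only an illustrative check, not a change to the tutorial's code: it shows that values which have already passed through relu are never negative, so relu2deriv returns True for all of them, whereas a strict comparison (a common alternative convention, shown here as an assumption) marks only the units that were actually active.

```python
import numpy as np

def relu(x):
    return (x >= 0) * x

def relu2deriv(output):
    return output >= 0

# Made-up pre-activation values for illustration.
pre_activation = np.array([-2.0, -0.5, 0.0, 1.5])
activated = relu(pre_activation)

# Every value is >= 0 after ReLU, so this mask is True everywhere.
print(relu2deriv(activated))

# A strict comparison marks only the positive (active) units.
print(activated > 0)
```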

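3. Set default values for certain hyperparameters, such as:

  • Learning rate: learning_rate helps limit the magnitude of weight updates to prevent them from overcorrecting.

  • Epochs (iterations): epochs is the number of complete passes (forward and backward propagations) of the data through the network. This parameter can positively or negatively affect the results. The higher the number of epochs, the longer the learning process may take. Because this is a computationally intensive task, we have chosen a very low number of epochs (20). To get meaningful results, you should choose a much larger number.

  • Size of the hidden (middle) layer in the network: hidden_size. Different sizes of the hidden layer can affect the results during training and testing.

  • Size of the input: pixels_per_image. You have established that the image input is 784 pixels (28x28).

  • Number of labels: num_labels indicates the number of outputs of the output layer, which holds the predictions for the 10 (0 to 9) handwritten digit labels.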

learning_rate = 0.005
epochs = 20
hidden_size = 100
pixels_per_image = 784
num_labels = 10

4. Initialize the weight matrices that will be used in the hidden and output layers with random values:

weights_1 = 0.2 * rng.random((pixels_per_image, hidden_size)) - 0.1
weights_2 = 0.2 * rng.random((hidden_size, num_labels)) - 0.1
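
As a quick sanity check of the initialization above, the sketch below confirms that 0.2 * rng.random(...) - 0.1 yields values in the interval [-0.1, 0.1) and traces the array shapes through one forward pass without dropout. The dummy_image array is a made-up stand-in for a single preprocessed image, and np.maximum is used here as an equivalent way to write ReLU.

```python
import numpy as np

rng = np.random.default_rng(884736743)
pixels_per_image, hidden_size, num_labels = 784, 100, 10

# Generator.random() samples from [0, 1), so the weights fall in [-0.1, 0.1).
weights_1 = 0.2 * rng.random((pixels_per_image, hidden_size)) - 0.1
weights_2 = 0.2 * rng.random((hidden_size, num_labels)) - 0.1

# Made-up stand-in for one flattened, normalized image.
dummy_image = rng.random(pixels_per_image)

# Input (784,) -> hidden (100,) -> output scores (10,).
hidden = np.maximum(np.dot(dummy_image, weights_1), 0)
scores = np.dot(hidden, weights_2)

print(weights_1.min() >= -0.1, weights_1.max() <= 0.1)
print(hidden.shape, scores.shape)
```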

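5. Set up the neural network's learning experiment with a training loop and start the training process. Note that the model is evaluated against the test set at each epoch to track its performance over the training epochs.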


# To store training and test set losses and accurate predictions
# for visualization.
store_training_loss = []
store_training_accurate_pred = []
store_test_loss = []
store_test_accurate_pred = []

# This is a training loop.
# Run the learning experiment for a defined number of epochs (iterations).
for j in range(epochs):

    # Training step #

    # Set the initial loss/error and the number of accurate predictions to zero.
    training_loss = 0.0
    training_accurate_predictions = 0

    # For all images in the training set, perform a forward pass
    # and backpropagation and adjust the weights accordingly.
    for i in range(len(training_images)):
        # Forward propagation/forward pass:
        # 1. The input layer:
        #    Initialize the training image data as inputs.
        layer_0 = training_images[i]
        # 2. The hidden layer:
        #    Take in the training image data into the middle layer by
        #    matrix-multiplying it by randomly initialized weights.
        layer_1 = np.dot(layer_0, weights_1)
        # 3. Pass the hidden layer's output through the ReLU activation function.
        layer_1 = relu(layer_1)
        # 4. Define the dropout function for regularization.
        dropout_mask = rng.integers(low=0, high=2, size=layer_1.shape)
        # 5. Apply dropout to the hidden layer's output.
        layer_1 *= dropout_mask * 2
        # 6. The output layer:
        #    Ingest the output of the middle layer into the final layer
        #    by matrix-multiplying it by randomly initialized weights.
        #    Produce a 10-dimension vector with 10 scores.
        layer_2 = np.dot(layer_1, weights_2)

        # Backpropagation/backward pass:
        # 1. Measure the training error (loss function) between the actual
        #    image labels (the truth) and the prediction by the model.
        training_loss += np.sum((training_labels[i] - layer_2) ** 2)
        # 2. Increment the accurate prediction count.
        training_accurate_predictions += int(
            np.argmax(layer_2) == np.argmax(training_labels[i])
        )
        # 3. Differentiate the loss function/error.
        layer_2_delta = training_labels[i] - layer_2
        # 4. Propagate the gradients of the loss function back through the hidden layer.
        layer_1_delta = np.dot(weights_2, layer_2_delta) * relu2deriv(layer_1)
        # 5. Apply the dropout to the gradients.
        layer_1_delta *= dropout_mask
        # 6. Update the weights for the middle and input layers
        #    by multiplying them by the learning rate and the gradients.
        weights_1 += learning_rate * np.outer(layer_0, layer_1_delta)
        weights_2 += learning_rate * np.outer(layer_1, layer_2_delta)

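    # Store training set losses and accurate predictions.
    store_training_loss.append(training_loss)
    store_training_accurate_pred.append(training_accurate_predictions)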

    # Evaluation step #

    # Evaluate model performance on the test set at each epoch.

    # Unlike the training step, the weights are not modified for each image
    # (or batch). Therefore the model can be applied to the test images in a
    # vectorized manner, eliminating the need to loop over each image
    # individually:

    results = relu(test_images @ weights_1) @ weights_2

    # Measure the error between the actual label (truth) and prediction values.
    test_loss = np.sum((test_labels - results) ** 2)

    # Measure prediction accuracy on test set
    test_accurate_predictions = np.sum(
        np.argmax(results, axis=1) == np.argmax(test_labels, axis=1)
    )

    # Store test set losses and accurate predictions.
    store_test_loss.append(test_loss)
    store_test_accurate_pred.append(test_accurate_predictions)

    # Summarize error and accuracy metrics at each epoch
    print(
        f"Epoch: {j}\n"
        f"  Training set error: {training_loss / len(training_images):.3f}\n"
        f"  Training set accuracy: {training_accurate_predictions / len(training_images)}\n"
        f"  Test set error: {test_loss / len(test_images):.3f}\n"
        f"  Test set accuracy: {test_accurate_predictions / len(test_images)}"
    )
Epoch: 0
  Training set error: 0.898
  Training set accuracy: 0.397
  Test set error: 0.680
  Test set accuracy: 0.582
Epoch: 1
  Training set error: 0.656
  Training set accuracy: 0.633
  Test set error: 0.607
  Test set accuracy: 0.641
Epoch: 2
  Training set error: 0.592
  Training set accuracy: 0.68
  Test set error: 0.569
  Test set accuracy: 0.679
Epoch: 3
  Training set error: 0.556
  Training set accuracy: 0.7
  Test set error: 0.541
  Test set accuracy: 0.708
Epoch: 4
  Training set error: 0.534
  Training set accuracy: 0.732
  Test set error: 0.526
  Test set accuracy: 0.729
Epoch: 5
  Training set error: 0.515
  Training set accuracy: 0.715
  Test set error: 0.500
  Test set accuracy: 0.739
Epoch: 6
  Training set error: 0.495
  Training set accuracy: 0.748
  Test set error: 0.487
  Test set accuracy: 0.753
Epoch: 7
  Training set error: 0.483
  Training set accuracy: 0.769
  Test set error: 0.486
  Test set accuracy: 0.747
Epoch: 8
  Training set error: 0.473
  Training set accuracy: 0.776
  Test set error: 0.473
  Test set accuracy: 0.752
Epoch: 9
  Training set error: 0.460
  Training set accuracy: 0.788
  Test set error: 0.462
  Test set accuracy: 0.762
Epoch: 10
  Training set error: 0.465
  Training set accuracy: 0.769
  Test set error: 0.462
  Test set accuracy: 0.767
Epoch: 11
  Training set error: 0.443
  Training set accuracy: 0.801
  Test set error: 0.456
  Test set accuracy: 0.775
Epoch: 12
  Training set error: 0.448
  Training set accuracy: 0.795
  Test set error: 0.455
  Test set accuracy: 0.772
Epoch: 13
  Training set error: 0.438
  Training set accuracy: 0.787
  Test set error: 0.453
  Test set accuracy: 0.778
Epoch: 14
  Training set error: 0.446
  Training set accuracy: 0.791
  Test set error: 0.450
  Test set accuracy: 0.779
Epoch: 15
  Training set error: 0.441
  Training set accuracy: 0.788
  Test set error: 0.452
  Test set accuracy: 0.772
Epoch: 16
  Training set error: 0.437
  Training set accuracy: 0.786
  Test set error: 0.453
  Test set accuracy: 0.772
Epoch: 17
  Training set error: 0.436
  Training set accuracy: 0.794
  Test set error: 0.449
  Test set accuracy: 0.778
Epoch: 18
  Training set error: 0.433
  Training set accuracy: 0.801
  Test set error: 0.450
  Test set accuracy: 0.774
Epoch: 19
  Training set error: 0.429
  Training set accuracy: 0.785
  Test set error: 0.436
  Test set accuracy: 0.784

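The training process may take several minutes, depending on a number of factors, such as the processing power of the machine you are running the experiment on and the number of epochs. To reduce the waiting time, you can change the epochs variable from 20 to a lower number, reset the runtime (which will reset the weights), and run the notebook cells again.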
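
The evaluation step in the loop above applies the model to all test images at once instead of looping over them. As a small sketch with made-up array sizes and random stand-in data, the per-image and vectorized computations can be compared directly:

```python
import numpy as np

rng = np.random.default_rng(0)  # Arbitrary seed for this sketch.

def relu(x):
    return (x >= 0) * x

# Small made-up stand-ins for the test images and the two weight matrices.
images = rng.random((5, 8))
weights_1 = 0.2 * rng.random((8, 4)) - 0.1
weights_2 = 0.2 * rng.random((4, 3)) - 0.1

# One image at a time, as in the training step (without dropout).
looped = np.array(
    [np.dot(relu(np.dot(image, weights_1)), weights_2) for image in images]
)

# All images at once, as in the evaluation step.
vectorized = relu(images @ weights_1) @ weights_2

print(np.allclose(looped, vectorized))
```

The vectorized form is possible only because the weights stay fixed during evaluation; during training they change after every image, so each forward pass depends on the previous update.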


epoch_range = np.arange(epochs) + 1  # Starting from 1

# The training set metrics.
training_metrics = {
    "accuracy": np.asarray(store_training_accurate_pred) / len(training_images),
    "error": np.asarray(store_training_loss) / len(training_images),
}

# The test set metrics.
test_metrics = {
    "accuracy": np.asarray(store_test_accurate_pred) / len(test_images),
    "error": np.asarray(store_test_loss) / len(test_images),
}

# Display the plots.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
for ax, metrics, title in zip(
    axes, (training_metrics, test_metrics), ("Training set", "Test set")
):
    # Plot the metrics
    for metric, values in metrics.items():
        ax.plot(epoch_range, values, label=metric.capitalize())
    ax.set_title(title)
    ax.set_xlabel("Epochs")
    ax.legend()
plt.show()




You have learned how to build and train a simple feed-forward neural network from scratch using just NumPy to classify handwritten MNIST digits.


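To further improve the model, you could consider one or more of the following:

  • Increase the training sample size from 1,000 to a higher number (up to 60,000).

  • Use mini-batches and reduce the learning rate.

  • Alter the architecture by introducing more hidden layers to make the network deeper.

  • Combine the cross-entropy loss function with a softmax activation function in the last layer.

  • Introduce convolutional layers: replace the feed-forward network with a convolutional neural network architecture.

  • Train for longer with a higher number of epochs and add more regularization techniques, such as early stopping, to prevent overfitting.

  • Introduce a validation set for an unbiased evaluation of the model fit.

  • Apply batch normalization for faster and more stable training.

  • Tune other parameters, such as the learning rate and hidden layer size.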
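
As one illustration of the suggestions above, here is a rough sketch of a softmax activation paired with a cross-entropy loss. It is not part of the tutorial's model: the function names softmax and cross_entropy, the eps constant, and the example scores are all assumptions made for this sketch.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

def cross_entropy(probabilities, one_hot_labels, eps=1e-12):
    # Mean negative log-probability assigned to the true class.
    log_probs = np.log(probabilities + eps)
    return -np.mean(np.sum(one_hot_labels * log_probs, axis=-1))

# Made-up output scores for two samples with three classes each.
scores = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
labels = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

probabilities = softmax(scores)
loss = cross_entropy(probabilities, labels)

# Per-sample gradient of the loss with respect to the scores
# (before averaging over the batch).
score_gradient = probabilities - labels

print(probabilities.sum(axis=1))
```

With this pairing, the per-sample gradient with respect to the output scores is simply the predicted probabilities minus the one-hot labels, which is one reason the combination is widely used for classification.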

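Building a neural network from scratch with NumPy is a great way to learn more about NumPy and about deep learning. For real-world applications, however, you should use specialized frameworks, such as PyTorch, JAX, TensorFlow, or MXNet, that provide NumPy-like APIs, have built-in automatic differentiation and GPU support, and are designed for high-performance numerical computing and machine learning.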


(Credit to hsjeong5 for demonstrating how to download MNIST without the use of external libraries.)