Reading and writing files#

This page covers common use cases; for the full collection of I/O routines, see Input and output.

Reading text and CSV files#

With no missing values#

Use numpy.loadtxt.
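
As a minimal sketch of the call named above, the snippet below parses a small comma-separated table with no missing values. The in-memory `io.StringIO` buffer and its contents are assumptions standing in for a real file on disk.

```python
import io

import numpy as np

# In-memory stand-in for a CSV file that has no missing values.
text = io.StringIO("1,2,3\n4,5,6\n7,8,9\n")

# delimiter="," splits each line on commas; the default dtype is float.
arr = np.loadtxt(text, delimiter=",")
print(arr)
```

A file name or path can be passed in place of the buffer.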

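With missing values#

Use numpy.genfromtxt.

numpy.genfromtxt will either

return a masked array **masking out missing values** (if usemask=True), or
**fill in the missing values** with the value specified in filling_values (the default is np.nan for floats and -1 for integers).

With non-whitespace delimiters#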

>>> with open("csv.txt", "r") as f:
...     print(f.read())
1, 2, 3
4,, 6
7, 8, 9

Masked-array output#

>>> np.genfromtxt("csv.txt", delimiter=",", usemask=True)
masked_array(
  data=[[1.0, 2.0, 3.0],
        [4.0, --, 6.0],
        [7.0, 8.0, 9.0]],
  mask=[[False, False, False],
        [False,  True, False],
        [False, False, False]],
  fill_value=1e+20)

Array output#

>>> np.genfromtxt("csv.txt", delimiter=",")
array([[ 1.,  2.,  3.],
       [ 4., nan,  6.],
       [ 7.,  8.,  9.]])

Array output, specified fill-in value#

>>> np.genfromtxt("csv.txt", delimiter=",", dtype=np.int8, filling_values=99)
array([[ 1,  2,  3],
       [ 4, 99,  6],
       [ 7,  8,  9]], dtype=int8)

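Whitespace-delimited#

numpy.genfromtxt can also parse whitespace-delimited data files that have missing values if

**Each field has a fixed width**: Use the width as the delimiter argument.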

# File with width=4. The data does not have to be justified (for example,
# the 2 in row 1), the last column can be less than width (for example, the 6
# in row 2), and no delimiting character is required (for instance 8888 and 9
# in row 3)

>>> with open("fixedwidth.txt", "r") as f:
...    data = (f.read())
>>> print(data)
1   2      3
44      6
7   88889

# Showing spaces as ^
>>> print(data.replace(" ","^"))
1^^^2^^^^^^3
44^^^^^^6
7^^^88889

>>> np.genfromtxt("fixedwidth.txt", delimiter=4)
array([[1.000e+00, 2.000e+00, 3.000e+00],
       [4.400e+01,       nan, 6.000e+00],
       [7.000e+00, 8.888e+03, 9.000e+00]])

**A special value (e.g. "x") indicates a missing field**: Use it as the missing_values argument.

>>> with open("nan.txt", "r") as f:
...     print(f.read())
1 2 3
44 x 6
7  8888 9

>>> np.genfromtxt("nan.txt", missing_values="x")
array([[1.000e+00, 2.000e+00, 3.000e+00],
       [4.400e+01,       nan, 6.000e+00],
       [7.000e+00, 8.888e+03, 9.000e+00]])

**You want to skip the rows with missing values**: Set invalid_raise=False.

>>> with open("skip.txt", "r") as f:
...     print(f.read())
1 2   3
44    6
7 888 9

>>> np.genfromtxt("skip.txt", invalid_raise=False)  
__main__:1: ConversionWarning: Some errors were detected !
    Line #2 (got 2 columns instead of 3)
array([[  1.,   2.,   3.],
       [  7., 888.,   9.]])

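**The delimiter whitespace character is different from the whitespace that indicates missing data**. For instance, if columns are delimited by \t, then missing data will be recognized if it consists of one or more spaces.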

>>> with open("tabs.txt", "r") as f:
...    data = (f.read())
>>> print(data)
1       2       3
44              6
7       888     9

# Tabs vs. spaces
>>> print(data.replace("\t","^"))
1^2^3
44^ ^6
7^888^9

>>> np.genfromtxt("tabs.txt", delimiter="\t", missing_values=" +")
array([[  1.,   2.,   3.],
       [ 44.,  nan,   6.],
       [  7., 888.,   9.]])

Read a file in .npy or .npz format#

Choices:

Use numpy.load. It can read files generated by any of numpy.save, numpy.savez, or numpy.savez_compressed.
Use memory mapping. See numpy.lib.format.open_memmap.
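
The two choices above can be sketched side by side. The temporary directory and the file name `example.npy` are assumptions made only so the snippet is self-contained.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

# Hypothetical scratch location; any writable path would do.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "example.npy")
np.save(path, np.arange(12, dtype=np.int32).reshape(3, 4))

# Choice 1: read the whole array into memory.
loaded = np.load(path)

# Choice 2: map the file; data is read from disk only when it is accessed.
mapped = open_memmap(path, mode="r")
row = np.array(mapped[1])  # copy a single row into ordinary memory
print(loaded.shape, row)
```

Memory mapping is mainly worthwhile when the file is large and only part of it is needed at a time.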

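Write to a file to be read back by NumPy#

Binary#

Use numpy.save, or numpy.savez or numpy.savez_compressed to store multiple arrays.

For security and portability, set allow_pickle=False unless the dtype contains Python objects, which requires pickling.

Masked arrays can't currently be saved, nor can other arbitrary array subclasses.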
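
A minimal sketch of the binary round trip described above. The file names are hypothetical, and the last part, which stores a masked array as separate data and mask arrays, is only one possible workaround and not something this page prescribes.

```python
import os
import tempfile

import numpy as np

tmpdir = tempfile.mkdtemp()  # hypothetical scratch directory
a = np.arange(6).reshape(2, 3)
b = np.linspace(0.0, 1.0, 5)

# Single array: .npy file, with pickling disabled on both sides.
single = os.path.join(tmpdir, "a.npy")
np.save(single, a, allow_pickle=False)
a_back = np.load(single, allow_pickle=False)

# Several arrays: compressed .npz archive, keyed by keyword name.
bundle = os.path.join(tmpdir, "arrays.npz")
np.savez_compressed(bundle, a=a, b=b)
with np.load(bundle, allow_pickle=False) as npz:
    names = sorted(npz.files)
    b_back = npz["b"]

# Assumed workaround for masked arrays: save data and mask separately.
m = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
masked_path = os.path.join(tmpdir, "masked.npz")
np.savez(masked_path, data=m.data, mask=np.ma.getmaskarray(m))
with np.load(masked_path) as npz:
    rebuilt = np.ma.masked_array(npz["data"], mask=npz["mask"])
```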

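Human-readable#

numpy.save and numpy.savez create binary files. To **write a human-readable file**, use numpy.savetxt. The array can only be 1- or 2-dimensional, and there is no `savetxtz` for writing multiple arrays.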
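
The following sketch writes a small 2-D array as text and reads it back. The `io.StringIO` buffer, the format string, and the column names in the header are illustrative assumptions; a file path works the same way.

```python
import io

import numpy as np

data = np.array([[1.5, 2.0], [3.25, 4.0]])

# Write two decimals per value, comma-separated, with a commented header line.
buf = io.StringIO()
np.savetxt(buf, data, fmt="%.2f", delimiter=",", header="x,y")
text = buf.getvalue()
print(text)

# numpy.loadtxt skips the "#"-prefixed header line by default.
back = np.loadtxt(io.StringIO(text), delimiter=",")
```

Note that the chosen `fmt` limits the precision stored in the file, so a text round trip is exact only when the values fit that format.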

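Large arrays#

See Write or read large arrays.

Read an arbitrarily formatted binary file ("binary blob")#

Use a structured array.

Example:

The .wav file header is a 44-byte block preceding data_size bytes of the actual sound data: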

chunk_id         "RIFF"
chunk_size       4-byte unsigned little-endian integer
format           "WAVE"
fmt_id           "fmt "
fmt_size         4-byte unsigned little-endian integer
audio_fmt        2-byte unsigned little-endian integer
num_channels     2-byte unsigned little-endian integer
sample_rate      4-byte unsigned little-endian integer
byte_rate        4-byte unsigned little-endian integer
block_align      2-byte unsigned little-endian integer
bits_per_sample  2-byte unsigned little-endian integer
data_id          "data"
data_size        4-byte unsigned little-endian integer

The .wav file header as a NumPy structured dtype:

wav_header_dtype = np.dtype([
    ("chunk_id", (bytes, 4)), # flexible-sized scalar type, item size 4
    ("chunk_size", "<u4"),    # little-endian unsigned 32-bit integer
    ("format", "S4"),         # 4-byte string, alternate spelling of (bytes, 4)
    ("fmt_id", "S4"),
    ("fmt_size", "<u4"),
    ("audio_fmt", "<u2"),     #
    ("num_channels", "<u2"),  # .. more of the same ...
    ("sample_rate", "<u4"),   #
    ("byte_rate", "<u4"),
    ("block_align", "<u2"),
    ("bits_per_sample", "<u2"),
    ("data_id", "S4"),
    ("data_size", "<u4"),
    #
    # the sound data itself cannot be represented here:
    # it does not have a fixed size
])

# f is assumed to be a .wav file already opened in binary mode ("rb").
header = np.fromfile(f, dtype=wav_header_dtype, count=1)[0]

This .wav example is for illustration; to read a .wav file in real life, use Python's built-in module wave.

(Adapted from Pauli Virtanen, Advanced NumPy, licensed under CC BY 4.0.)
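
To try the structured-dtype approach end to end, the sketch below first writes a short silent .wav file with Python's wave module and then reads its header back. The file name, sample rate, and sample count are assumptions chosen for illustration, and the sketch relies on the wave module writing the plain 44-byte PCM header shown above.

```python
import os
import tempfile
import wave

import numpy as np

wav_header_dtype = np.dtype([
    ("chunk_id", "S4"), ("chunk_size", "<u4"), ("format", "S4"),
    ("fmt_id", "S4"), ("fmt_size", "<u4"), ("audio_fmt", "<u2"),
    ("num_channels", "<u2"), ("sample_rate", "<u4"), ("byte_rate", "<u4"),
    ("block_align", "<u2"), ("bits_per_sample", "<u2"),
    ("data_id", "S4"), ("data_size", "<u4"),
])

# Write 100 silent 16-bit mono samples to a hypothetical scratch file.
path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(np.zeros(100, dtype="<i2").tobytes())

with open(path, "rb") as f:
    # The first 44 bytes are the header, read as one structured record.
    header = np.fromfile(f, dtype=wav_header_dtype, count=1)[0]
    # The sound data follows; its length comes from the header itself.
    samples = np.fromfile(f, dtype="<i2", count=int(header["data_size"]) // 2)

print(header["chunk_id"], header["sample_rate"], samples.shape)
```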

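Write or read large arrays#

**Arrays too large to fit in memory** can be treated like ordinary in-memory arrays using memory mapping.

Raw array data written with numpy.ndarray.tofile or numpy.ndarray.tobytes can be read with numpy.memmap: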
```
array = np.memmap("mydata/myarray.arr", mode="r", dtype=np.int16, shape=(1024, 1024))
```
Files output by numpy.save (that is, using the NumPy format) can be read using numpy.load with the mmap_mode keyword argument:
```
large_array[some_slice] = np.load("path/to/small_array", mmap_mode="r")
```
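
Both memory-mapping routes above can be exercised on a small array. In this sketch the file names and the 4x4 shape are assumptions; with raw data the dtype and shape must be supplied by the reader, while a .npy file carries them in its header.

```python
import os
import tempfile

import numpy as np

tmpdir = tempfile.mkdtemp()  # hypothetical scratch directory
raw_path = os.path.join(tmpdir, "raw.arr")
npy_path = os.path.join(tmpdir, "saved.npy")

source = np.arange(16, dtype=np.int16).reshape(4, 4)
source.tofile(raw_path)    # raw bytes only: no dtype or shape is stored
np.save(npy_path, source)  # NumPy format: dtype and shape are in the header

# Raw data: dtype and shape have to be given explicitly.
raw_view = np.memmap(raw_path, mode="r", dtype=np.int16, shape=(4, 4))

# .npy data: dtype and shape are recovered from the file.
npy_view = np.load(npy_path, mmap_mode="r")

corner = np.array(raw_view[:2, :2])  # only the touched region is read
last_row = np.array(npy_view[-1])
```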

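Memory mapping lacks features such as data chunking and compression; more full-featured formats and libraries usable with NumPy include:

**HDF5**: h5py or PyTables.
**Zarr**: see the Zarr documentation.
**NetCDF**: scipy.io.netcdf_file.

For the tradeoffs among memmap, Zarr, and HDF5, see pythonspeed.com.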

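Write files for reading by other (non-NumPy) tools#

Formats such as HDF5, Zarr, and NetCDF, described in Write or read large arrays, can be used to **exchange data** with other tools.

Write or read a JSON file#

NumPy arrays and most NumPy scalars are **not** directly JSON serializable. Instead, use a custom json.JSONEncoder for NumPy types, which can be found using your favorite search engine.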
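
One possible shape for such an encoder is sketched below. The class name `NumpyJSONEncoder` is hypothetical, and the sketch assumes it is acceptable to turn arrays into nested lists and NumPy scalars into plain Python scalars; dtype information is not preserved, and complex values are not handled.

```python
import json

import numpy as np


class NumpyJSONEncoder(json.JSONEncoder):
    """Hypothetical encoder: arrays become lists, NumPy scalars become Python scalars."""

    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        if isinstance(obj, np.generic):
            return obj.item()
        # Defer to the base class, which raises TypeError for unknown types.
        return super().default(obj)


payload = {"values": np.arange(3), "mean": np.float32(1.0), "count": np.int64(3)}
encoded = json.dumps(payload, cls=NumpyJSONEncoder)
decoded = json.loads(encoded)

# Reading back yields plain lists; convert explicitly if an array is needed.
restored = np.asarray(decoded["values"])
```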

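Save/restore using a pickle file#

Avoid when possible; pickles are not secure against erroneous or maliciously constructed data.

Use numpy.save and numpy.load. Set allow_pickle=False, unless the array dtype includes Python objects, in which case pickling is required.

numpy.load and the pickle module also support unpickling files created with NumPy 1.26.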
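
The effect of allow_pickle can be seen with a small object array. This is a sketch with a hypothetical file name; it assumes a NumPy version in which numpy.load defaults to allow_pickle=False.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "objects.npy")  # hypothetical path

# An object array holds arbitrary Python objects, so saving it needs pickling.
obj_arr = np.empty(2, dtype=object)
obj_arr[0] = {"a": 1}
obj_arr[1] = [1, 2, 3]

try:
    np.save(path, obj_arr, allow_pickle=False)
    save_refused = False
except ValueError:
    save_refused = True

np.save(path, obj_arr)  # numpy.save allows pickling by default

try:
    np.load(path)  # pickling is not allowed here by default
    load_refused = False
except ValueError:
    load_refused = True

# Opt in only for files from a trusted source.
trusted = np.load(path, allow_pickle=True)
```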

Convert from a pandas DataFrame to a NumPy array#

See pandas.DataFrame.to_numpy (or pandas.Series.to_numpy for a single column).

Save/restore using `tofile` and `fromfile`#

In general, prefer numpy.save and numpy.load.

numpy.ndarray.tofile and numpy.fromfile lose information on endianness and precision, so they are unsuitable for anything but scratch storage.
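
The loss of information can be made concrete with a short sketch. The file name is hypothetical; the point is that the file holds only raw bytes, so the reader has to know the dtype, byte order, and shape from somewhere else.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "scratch.bin")  # hypothetical path

# Explicitly little-endian 16-bit integers: 8 bytes in total.
original = np.array([[1, 2], [3, 4]], dtype="<i2")
original.tofile(path)

# Default dtype is float64: the 8 bytes are misread as a single float.
misread = np.fromfile(path)

# Wrong byte order: every value comes back byte-swapped.
swapped = np.fromfile(path, dtype=">i2")

# Correct only when dtype, byte order, and shape are all supplied by hand.
correct = np.fromfile(path, dtype="<i2").reshape(2, 2)
print(misread.shape, swapped, correct)
```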