NEP 8 — A proposal for adding groupby functionality to NumPy

Author: Travis Oliphant
Contact: oliphant@enthought.com
Date: 2010-04-27
Status: Deferred

Executive summary

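NumPy provides tools for handling data and doing calculations in much the same way that relational algebra allows. However, the common group-by functionality is not easy to express. The reduce methods of NumPy's ufuncs are a natural place to put this groupby behavior. This NEP describes two additional methods for ufuncs (reduceby and reducein) and two additional functions (segment and edges) that help add this functionality.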

Example use case

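Suppose you have a NumPy structured array containing information about the number of purchases at several stores over multiple days. To be concrete, the data-type of the structured array is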

dt = [('year', 'i2'), ('month', 'i1'), ('day', 'i1'), ('time', float),
      ('store', 'i4'), ('SKU', 'S6'), ('number', 'i4')]

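Suppose there is a 1-d NumPy array of this data-type, and you would like to compute various statistics (max, min, mean, sum, etc.) on the number of products sold, grouped by product, by month, by store, and so on.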

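Currently, this can be done by using reduce methods on the number field of the array, combined with in-place sorting, unique with return_inverse=True, bincount, and similar tools. However, for such a common data-analysis need, it would be better to have a standard and more direct way to get the results.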
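
To make the current workaround concrete, here is a minimal sketch using only existing NumPy calls (unique with return_inverse=True, bincount, and reduceat on sorted data). The sample records and the variable names (sales, skus, totals, maxima) are invented for illustration and do not come from the proposal.

```python
import numpy as np

dt = [('year', 'i2'), ('month', 'i1'), ('day', 'i1'), ('time', float),
      ('store', 'i4'), ('SKU', 'S6'), ('number', 'i4')]

# Made-up sample records, for illustration only.
sales = np.array([
    (2010, 4, 1, 9.5, 1, b'A00001', 3),
    (2010, 4, 1, 10.0, 2, b'A00001', 5),
    (2010, 4, 2, 11.0, 1, b'B00002', 2),
    (2010, 5, 3, 12.0, 1, b'A00001', 4),
], dtype=dt)

# Map every record to an integer group label, one label per distinct SKU.
skus, inverse = np.unique(sales['SKU'], return_inverse=True)

# Sums and means per SKU via bincount.
totals = np.bincount(inverse, weights=sales['number'])
counts = np.bincount(inverse)
means = totals / counts

# Maxima per SKU: sort records by label, then reduce each contiguous run.
order = np.argsort(inverse, kind='stable')
starts = np.flatnonzero(np.diff(inverse[order], prepend=-1))
maxima = np.maximum.reduceat(sales['number'][order], starts)

print(skus.tolist())    # [b'A00001', b'B00002']
print(totals.tolist())  # [12.0, 2.0]
print(means.tolist())   # [4.0, 2.0]
print(maxima.tolist())  # [5, 2]
```

Each statistic needs its own combination of helper calls, which is the indirection the proposal aims to remove.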

Proposed Ufunc Methods

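It is proposed to add two new reduce-style methods to ufuncs: reduceby and reducein. The reducein method is intended to be an easier-to-use version of reduceat, while the reduceby method is intended to provide group-by capability on reductions.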

reducein

<ufunc>.reducein(arr, indices, axis=0, dtype=None, out=None)

Perform a local reduce with slices specified by pairs of indices.

The reduction occurs along the provided axis, using the provided
data-type to calculate intermediate results, storing the result into
the array out (if provided).

The indices array provides the start and end indices for the
reduction.  If the length of the indices array is odd, then the
final index provides the beginning point for the final reduction
and the ending point is the end of arr.

This generalizes, along the given axis, the behavior:

[<ufunc>.reduce(arr[indices[2*i]:indices[2*i+1]])
        for i in range(len(indices)//2)]

This assumes indices is of even length.

Example:
   >>> a = [0,1,2,4,5,6,9,10]
   >>> add.reducein(a,[0,3,2,5,-2])
   [3, 11, 19]

   Notice that sum(a[0:3]) = 3; sum(a[2:5]) = 11; and sum(a[-2:]) = 19
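
The behavior described above can be emulated for 1-d input with the existing reduce method. The following is only a sketch: reducein is a proposed method that does not exist in NumPy, and the helper name reducein_sketch is hypothetical. It ignores the axis, dtype, and out arguments.

```python
import numpy as np

def reducein_sketch(ufunc, arr, indices):
    """Hypothetical 1-d emulation of the proposed <ufunc>.reducein."""
    arr = np.asarray(arr)
    indices = list(indices)
    # Each consecutive (start, end) pair selects one slice to reduce.
    results = [ufunc.reduce(arr[indices[2 * i]:indices[2 * i + 1]])
               for i in range(len(indices) // 2)]
    # An odd trailing index starts a final slice that runs to the end of arr.
    if len(indices) % 2:
        results.append(ufunc.reduce(arr[indices[-1]:]))
    return np.array(results)

a = [0, 1, 2, 4, 5, 6, 9, 10]
result = reducein_sketch(np.add, a, [0, 3, 2, 5, -2])
print(result.tolist())  # [3, 11, 19]
```

Unlike reduceat, the slices are given as explicit start and end pairs, so they may overlap, as a[0:3] and a[2:5] do here.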

reduceby

<ufunc>.reduceby(arr, by, dtype=None, out=None)

Perform a reduction in arr over unique non-negative integers in by.


Let N=arr.ndim and M=by.ndim.  Then, by.shape[:N] == arr.shape.
In addition, let I be an N-length index tuple, then by[I]
contains the location in the output array for the reduction to
be stored.  Notice that if N == M, then by[I] is a non-negative
integer, while if N < M, then by[I] is an array of indices into
the output array.

The reduction is computed on groups specified by unique indices
into the output array. The index is either the single
non-negative integer (if N == M) or, if N < M, the entire
(M-N+1)-length index by[I] considered as a whole.
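
For the simplest case, N == M, where every element of by is a single non-negative integer label, the grouping can be approximated with the existing ufunc.at method. This is a sketch under stated assumptions, not the proposed implementation: reduceby does not exist in NumPy, the helper name reduceby_sketch is hypothetical, the dtype and out arguments are ignored, and filling unused output slots with the ufunc's identity is a choice made here that the proposal does not specify.

```python
import numpy as np

def reduceby_sketch(ufunc, arr, by):
    """Hypothetical emulation of the proposed <ufunc>.reduceby for N == M."""
    arr = np.asarray(arr)
    by = np.asarray(by)
    if by.shape != arr.shape:
        raise ValueError("this sketch only covers by.shape == arr.shape")
    if ufunc.identity is None:
        raise ValueError("this sketch needs a ufunc with an identity value")
    # One output slot per label 0..by.max(); unused slots keep the identity
    # (an assumption of this sketch).
    out = np.full(by.max() + 1, ufunc.identity, dtype=arr.dtype)
    # ufunc.at applies the operation unbuffered, so repeated labels accumulate.
    ufunc.at(out, by.ravel(), arr.ravel())
    return out

number = np.array([3, 5, 2, 4, 1])
store = np.array([0, 1, 0, 2, 1])
totals = reduceby_sketch(np.add, number, store)
products = reduceby_sketch(np.multiply, number, store)
print(totals.tolist())    # [5, 6, 4]
print(products.tolist())  # [6, 5, 4]
```

Ufuncs without an identity, such as maximum, would need a different strategy, for example sorting by label and reducing each run with reduceat.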

Proposed functions

segment
edges