NEP 8: A proposal for adding groupby functionality to NumPy

Author:

Travis Oliphant

Contact:

oliphant@enthought.com

Date:

2010-04-27

Status:

Deferred

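Executive summary

NumPy provides tools for handling data and doing calculations in much the same way that relational algebra allows. However, the common group-by functionality is not easily handled. The reduce methods of NumPy's ufuncs are a natural place to put this groupby behavior. This NEP describes two additional methods for ufuncs (reduceby and reducein) and two additional functions (segment and edges) that can help add this functionality.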

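Example use case

Suppose you have a NumPy structured array containing information about the number of purchases at several stores over multiple days. To be concrete, the structured array data-type is: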

dt = [('year', 'i2'), ('month', 'i1'), ('day', 'i1'), ('time', float),
      ('store', 'i4'), ('SKU', 'S6'), ('number', 'i4')]

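Suppose there is a 1-d NumPy array of this data-type and you would like to compute various statistics (max, min, mean, sum, etc.) on the number of products sold, grouped by product, by month, by store, and so on.

Currently, this can be done by using reduce methods on the number field of the array, coupled with in-place sorting, unique with return_inverse=True, bincount, and so on. However, for such a common data-analysis need, it would be nice to have standard and more direct ways to get the results.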
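The current approach described above can be sketched with existing NumPy calls. The sample records below are made up for illustration, and the variable names (sales, labels, totals, maxima) are assumptions rather than anything defined by this NEP.

```python
import numpy as np

# Structured dtype from the use case, with type codes written as strings.
dt = np.dtype([('year', 'i2'), ('month', 'i1'), ('day', 'i1'), ('time', float),
               ('store', 'i4'), ('SKU', 'S6'), ('number', 'i4')])

# Made-up sample records, for illustration only.
sales = np.array([
    (2010, 4, 1, 9.5, 1, b'A00001', 3),
    (2010, 4, 1, 10.0, 2, b'A00001', 5),
    (2010, 4, 2, 11.25, 1, b'B00002', 2),
    (2010, 5, 3, 14.0, 1, b'A00001', 4),
    (2010, 5, 3, 15.5, 2, b'B00002', 7),
], dtype=dt)

# Map every record to the integer label of its store.
stores, labels = np.unique(sales['store'], return_inverse=True)

# Sum of 'number' per store: bincount adds the weights that share a label.
totals = np.bincount(labels, weights=sales['number'])

# Maximum of 'number' per store: sort by label, then reduce each run of
# equal labels with reduceat.
order = np.argsort(labels, kind='stable')
starts = np.flatnonzero(np.diff(labels[order], prepend=-1))
maxima = np.maximum.reduceat(sales['number'][order], starts)

print(stores.tolist(), totals.tolist(), maxima.tolist())
```

Each statistic needs its own combination of helper calls here, which is the indirection the proposed methods are meant to remove.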

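Ufunc methods proposed

It is proposed to add two new reduce-style methods to the ufuncs: reduceby and reducein. The reducein method is intended to be a simpler-to-use version of reduceat, while the reduceby method is intended to provide group-by capability on reductions.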

reducein

<ufunc>.reducein(arr, indices, axis=0, dtype=None, out=None)

Perform a local reduce with slices specified by pairs of indices.

The reduction occurs along the provided axis, using the provided
data-type to calculate intermediate results, storing the result into
the array out (if provided).

The indices array provides the start and end indices for the
reduction.  If the length of the indices array is odd, then the
final index provides the beginning point for the final reduction
and the ending point is the end of arr.

Along the given axis, this generalizes the behavior:

[<ufunc>.reduce(arr[indices[2*i]:indices[2*i+1]])
        for i in range(len(indices)//2)]

The expression above assumes indices is of even length.

Example:
   >>> a = [0,1,2,4,5,6,9,10]
   >>> add.reducein(a,[0,3,2,5,-2])
   [3, 11, 19]

   Notice that sum(a[0:3]) = 3; sum(a[2:5]) = 11; and sum(a[-2:]) = 19
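Since reducein is only proposed and does not exist in NumPy, the behavior in the docstring above can be approximated for 1-d input with a small helper. The name reducein_sketch is hypothetical, and the helper ignores the axis, dtype, and out arguments.

```python
import numpy as np

def reducein_sketch(ufunc, arr, indices):
    """Hypothetical 1-d stand-in for the proposed <ufunc>.reducein."""
    arr = np.asarray(arr)
    indices = list(indices)
    # Odd-length index list: the last entry starts a slice that runs to the end.
    if len(indices) % 2:
        indices.append(None)
    # Reduce each (start, stop) pair independently; pairs may overlap.
    return [ufunc.reduce(arr[start:stop])
            for start, stop in zip(indices[0::2], indices[1::2])]

a = [0, 1, 2, 4, 5, 6, 9, 10]
result = reducein_sketch(np.add, a, [0, 3, 2, 5, -2])
print([int(x) for x in result])  # [3, 11, 19]
```

Unlike reduceat, each slice has an explicit end point here, so overlapping or non-contiguous ranges need no extra bookkeeping.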

reduceby

<ufunc>.reduceby(arr, by, dtype=None, out=None)

Perform a reduction in arr over unique non-negative integers in by.


Let N=arr.ndim and M=by.ndim.  Then, by.shape[:N] == arr.shape.
In addition, let I be an N-length index tuple, then by[I]
contains the location in the output array for the reduction to
be stored.  Notice that if N == M, then by[I] is a non-negative
integer, while if N < M, then by[I] is an array of indices into
the output array.

The reduction is computed on groups specified by unique indices
into the output array. The index is either the single
non-negative integer if N == M or, if N < M, the entire
(M-N+1)-length index by[I] considered as a whole.
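For the simplest case, N == M, the grouping described above can be approximated today with the existing ufunc.at method. This is a minimal sketch under stated assumptions: reduceby_sketch is a hypothetical name, only ufuncs with an identity are handled, by must be non-empty, and the N < M case is not covered.

```python
import numpy as np

def reduceby_sketch(ufunc, arr, by):
    """Hypothetical stand-in for the proposed <ufunc>.reduceby when N == M."""
    arr = np.asarray(arr)
    by = np.asarray(by)
    if by.shape != arr.shape:
        raise ValueError("this sketch only covers by.shape == arr.shape")
    if ufunc.identity is None:
        raise ValueError("this sketch needs a ufunc with an identity")
    # One output slot per group label, pre-filled with the ufunc's identity.
    out = np.full(by.max() + 1, ufunc.identity, dtype=arr.dtype)
    # ufunc.at applies the operation unbuffered, so repeated labels accumulate.
    ufunc.at(out, by.ravel(), arr.ravel())
    return out

number = np.array([3, 5, 2, 4, 7])
store = np.array([0, 1, 0, 0, 1])  # output slot for each element of number

sums = reduceby_sketch(np.add, number, store)
prods = reduceby_sketch(np.multiply, number, store)
print(sums.tolist(), prods.tolist())  # [9, 12] [24, 35]
```

A label that never occurs in by simply keeps the identity value in its output slot in this sketch; the NEP text does not say how the proposed method would treat that case.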

Functions proposed

  • segment

  • edges