NEP 8 — A proposal for adding groupby functionality to NumPy
- Author:
Travis Oliphant
- Contact:
- Date:
2010-04-27
- Status:
Deferred
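Executive summary
NumPy provides tools for handling data and doing calculations in much the same way as relational algebra allows. However, the common group-by functionality is not easily handled. The reduce methods of NumPy's ufuncs are a natural place to put this groupby behavior. This NEP describes two additional methods for ufuncs (reduceby and reducein) and two additional functions (segment and edges) that can help add this functionality.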
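Example use case
Suppose you have a NumPy structured array containing information about the number of purchases at several stores over multiple days. To be clear, the structured array data-type is: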
dt = [('year', 'i2'), ('month', 'i1'), ('day', 'i1'), ('time', float),
      ('store', 'i4'), ('SKU', 'S6'), ('number', 'i4')]
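Suppose there is a 1-d NumPy array of this data-type and you would like to compute various statistics (max, min, mean, sum, etc.) on the number of products sold, by product, by month, by store, etc.
Currently, this can be done by using reduce methods on the number field of the array, coupled with in-place sorting, unique with return_inverse=True, bincount, and so on. However, for such a common data-analysis need, it would be nice to have a standard and more direct way to get the results.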
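The workaround described above can be sketched as follows. This is one possible combination of existing NumPy calls, not the author's code; the records in `sales` are made up purely for illustration, and the choice to group by `store` is an assumption.

```python
import numpy as np

dt = [('year', 'i2'), ('month', 'i1'), ('day', 'i1'), ('time', float),
      ('store', 'i4'), ('SKU', 'S6'), ('number', 'i4')]

# Made-up records, for illustration only.
sales = np.array([
    (2010, 1, 5, 9.5, 1, b'A00001', 3),
    (2010, 1, 6, 10.0, 2, b'A00001', 5),
    (2010, 2, 1, 11.25, 1, b'B00002', 2),
    (2010, 2, 3, 14.0, 1, b'A00001', 4),
    (2010, 2, 9, 16.5, 2, b'B00002', 7),
], dtype=dt)

# Sum per store: map each record to a group id, then accumulate with bincount.
stores, inverse = np.unique(sales['store'], return_inverse=True)
total_by_store = np.bincount(inverse, weights=sales['number'])

# Max per store: sort by store, find where each group starts, then reduceat.
order = np.argsort(sales['store'], kind='stable')
sorted_store = sales['store'][order]
starts = np.flatnonzero(np.r_[True, sorted_store[1:] != sorted_store[:-1]])
max_by_store = np.maximum.reduceat(sales['number'][order], starts)

print(stores)          # unique store ids
print(total_by_store)  # total number sold per store (float, due to weights)
print(max_by_store)    # largest single purchase per store
```

Note that the sum and the max need two different techniques here (bincount only sums), which is the kind of indirection the proposal aims to remove.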
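Ufunc methods proposed
It is proposed to add two new reduce-style methods to the ufuncs: reduceby and reducein. The reducein method is intended to be a simpler-to-use version of reduceat, while the reduceby method is intended to provide group-by capability on reductions.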
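For context, here is a small sketch of how the existing reduceat method behaves when asked for overlapping slices. The array and index values are chosen only for illustration. Because reduceat treats every index as the start of a segment, the caller has to supply extra indices and then discard the unwanted entries.

```python
import numpy as np

a = np.array([0, 1, 2, 4, 5, 6, 9, 10])

# We want the reductions of a[0:3], a[2:5] and a[6:].
# reduceat reduces a[idx[i]:idx[i+1]] for every i, and returns the single
# element a[idx[i]] whenever idx[i] >= idx[i+1].
idx = [0, 3, 2, 5, 6]
every_segment = np.add.reduceat(a, idx)

# Only every other entry corresponds to a slice we asked for.
wanted = every_segment[::2]

print(every_segment)
print(wanted)
```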
reducein
<ufunc>.reducein(arr, indices, axis=0, dtype=None, out=None)
Perform a local reduce with slices specified by pairs of indices.
The reduction occurs along the provided axis, using the provided
data-type to calculate intermediate results, storing the result into
the array out (if provided).
The indices array provides the start and end indices for the
reduction. If the length of the indices array is odd, then the
final index provides the beginning point for the final reduction
and the ending point is the end of arr.
Along the given axis, this generalizes the behavior:
[<ufunc>.reduce(arr[indices[2*i]:indices[2*i+1]])
 for i in range(len(indices)//2)]
This expression assumes that indices is of even length.
Example:
>>> a = [0,1,2,4,5,6,9,10]
>>> add.reducein(a,[0,3,2,5,-2])
[3, 11, 19]
Notice that sum(a[0:3]) = 3; sum(a[2:5]) = 11; and sum(a[-2:]) = 19
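Since reducein is only proposed and does not exist in NumPy, the docstring above can be emulated with a small helper. The function name `reducein_sketch` is hypothetical, and this minimal sketch assumes a 1-d input and ignores the axis, dtype, and out parameters.

```python
import numpy as np

def reducein_sketch(ufunc, arr, indices):
    """Hypothetical 1-d emulation of the proposed <ufunc>.reducein."""
    arr = np.asarray(arr)
    indices = list(indices)
    if len(indices) % 2:
        # Odd length: the final index starts the last reduction,
        # which then runs to the end of arr (a stop of None).
        indices.append(None)
    starts = indices[0::2]
    stops = indices[1::2]
    return [ufunc.reduce(arr[start:stop]).item()
            for start, stop in zip(starts, stops)]

a = [0, 1, 2, 4, 5, 6, 9, 10]
result = reducein_sketch(np.add, a, [0, 3, 2, 5, -2])
print(result)  # [3, 11, 19]
```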
reduceby
<ufunc>.reduceby(arr, by, dtype=None, out=None)
Perform a reduction in arr over unique non-negative integers in by.
Let N=arr.ndim and M=by.ndim. Then, by.shape[:N] == arr.shape.
In addition, let I be an N-length index tuple, then by[I]
contains the location in the output array for the reduction to
be stored. Notice that if N == M, then by[I] is a non-negative
integer, while if N < M, then by[I] is an array of indices into
the output array.
The reduction is computed on groups specified by unique indices
into the output array. The index is either the single
non-negative integer if N == M or if N < M, the entire
(M-N+1)-length index by[I] considered as a whole.
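The N == M case described above can be approximated today with the existing ufunc.at method. The helper name `reduceby_sketch` is hypothetical. This sketch assumes that `by` is non-empty, that the ufunc has an identity (as add and multiply do), and that the N < M case and the dtype and out parameters are left out.

```python
import numpy as np

def reduceby_sketch(ufunc, arr, by):
    """Hypothetical emulation of the proposed <ufunc>.reduceby for N == M."""
    arr = np.asarray(arr)
    by = np.asarray(by)
    if arr.shape != by.shape:
        raise ValueError("this sketch only handles by.shape == arr.shape")
    if ufunc.identity is None:
        raise ValueError("this sketch needs a ufunc with an identity")
    # One output slot per group id in 0..by.max(); groups with no members
    # keep the identity value.
    out = np.full(by.max() + 1, ufunc.identity, dtype=arr.dtype)
    # Unbuffered in-place operation: repeated indices accumulate.
    ufunc.at(out, by.ravel(), arr.ravel())
    return out

values = [1, 2, 3, 4, 5]
groups = [0, 1, 0, 2, 1]
sums = reduceby_sketch(np.add, values, groups)
products = reduceby_sketch(np.multiply, values, groups)
print(sums)
print(products)
```

Here group 0 collects the values 1 and 3, group 1 collects 2 and 5, and group 2 collects 4.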
Functions proposed
segment
edges