NEP 8 — A proposal for adding groupby functionality to NumPy
- Author:
Travis Oliphant
- Contact:
- Date:
2010-04-27
- Status:
Deferred
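Executive summary
NumPy provides tools for handling data and doing calculations in much the same way as relational algebra allows. However, the common group-by functionality is not easily handled. The reduce methods of NumPy's ufuncs are a natural place to put this groupby behavior. This NEP describes two additional methods for ufuncs (reduceby and reducein) and two additional functions (segment and edges) which can help add this functionality.
Example use case
Suppose you have a NumPy structured array containing information about the number of purchases at several stores over multiple days. To be clear, the structured array data-type is: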
dt = [('year', 'i2'), ('month', 'i1'), ('day', 'i1'), ('time', float),
      ('store', 'i4'), ('SKU', 'S6'), ('number', 'i4')]
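Suppose there is a 1-d NumPy array of this data-type and you would like to compute various statistics (max, min, mean, sum, etc.) on the number of products sold, grouped by product, by month, by store, and so on.
Currently, this can be done by using reduce methods on the number field of the array, combined with in-place sorting, unique with return_inverse=True, bincount, and similar tools. However, for such a common data-analysis need, it would be nice to have standard and more direct ways to get the results.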
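As an illustration of the existing workaround described above, here is a minimal sketch that groups by store using unique with return_inverse=True, bincount, and reduceat. The records are made up for the example, and the type codes are written as strings so the data-type can be passed to NumPy directly.

```python
import numpy as np

dt = [('year', 'i2'), ('month', 'i1'), ('day', 'i1'), ('time', float),
      ('store', 'i4'), ('SKU', 'S6'), ('number', 'i4')]

# Made-up records, for illustration only.
sales = np.array([
    (2010, 4, 1, 9.5, 10, b'A00001', 3),
    (2010, 4, 1, 10.0, 20, b'A00001', 5),
    (2010, 4, 2, 11.0, 10, b'B00002', 2),
    (2010, 5, 3, 12.5, 20, b'A00001', 7),
    (2010, 5, 3, 13.0, 10, b'A00001', 4),
], dtype=dt)

# Map each record to a dense group label 0..k-1, one per distinct store.
stores, labels = np.unique(sales['store'], return_inverse=True)

# Sum per group: bincount with weights adds up 'number' for each label.
totals = np.bincount(labels, weights=sales['number'])

# Max per group: sort by label, then reduce each contiguous run of labels.
order = np.argsort(labels, kind='stable')
starts = np.flatnonzero(np.diff(labels[order], prepend=-1))
maxima = np.maximum.reduceat(sales['number'][order], starts)
```

Each statistic needs its own combination of helpers (bincount for sums, sort plus reduceat for maxima), which is the indirection this proposal aims to remove.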
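Ufunc methods proposed
It is proposed to add two new reduce-style methods to the ufuncs: reduceby and reducein. The reducein method is intended to be a simpler-to-use version of reduceat, while the reduceby method is intended to provide group-by capability on reductions.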
reducein
<ufunc>.reducein(arr, indices, axis=0, dtype=None, out=None)
Perform a local reduce with slices specified by pairs of indices.
The reduction occurs along the provided axis, using the provided
data-type to calculate intermediate results, storing the result into
the array out (if provided).
The indices array provides the start and end indices for the
reduction. If the length of the indices array is odd, then the
final index provides the beginning point for the final reduction
and the ending point is the end of arr.
This generalizes, along the given axis, the behavior:
[<ufunc>.reduce(arr[indices[2*i]:indices[2*i+1]])
 for i in range(len(indices)//2)]
This assumes indices is of even length.
Example:
>>> a = [0,1,2,4,5,6,9,10]
>>> add.reducein(a,[0,3,2,5,-2])
[3, 11, 19]
Notice that sum(a[0:3]) = 3; sum(a[2:5]) = 11; and sum(a[-2:]) = 19
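The behavior described above can be approximated today with plain slicing. The following is only a sketch for the 1-d case: reducein_1d is a hypothetical helper name, not part of NumPy, and it ignores the axis, dtype, and out arguments of the proposed method.

```python
import numpy as np

def reducein_1d(ufunc, arr, indices):
    """Hypothetical 1-d emulation of the proposed <ufunc>.reducein."""
    arr = np.asarray(arr)
    indices = list(indices)
    if len(indices) % 2:
        # Odd length: the last index starts a slice running to the end of arr.
        indices.append(len(arr))
    # Pair up (start, stop) indices and reduce each slice independently.
    pairs = zip(indices[0::2], indices[1::2])
    return [ufunc.reduce(arr[start:stop]) for start, stop in pairs]

a = [0, 1, 2, 4, 5, 6, 9, 10]
result = reducein_1d(np.add, a, [0, 3, 2, 5, -2])
```

Unlike reduceat, the slices here may overlap (a[0:3] and a[2:5] share an element), which is what the explicit start and end pairs make possible.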
reduceby
<ufunc>.reduceby(arr, by, dtype=None, out=None)
Perform a reduction in arr over unique non-negative integers in by.
Let N=arr.ndim and M=by.ndim. Then, by.shape[:N] == arr.shape.
In addition, let I be an N-length index tuple, then by[I]
contains the location in the output array for the reduction to
be stored. Notice that if N == M, then by[I] is a non-negative
integer, while if N < M, then by[I] is an array of indices into
the output array.
The reduction is computed on groups specified by unique indices
into the output array. The index is either the single
non-negative integer if N == M or if N < M, the entire
(M-N+1)-length index by[I] considered as a whole.
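For the N == M case, the grouping described above can be sketched with the existing ufunc.at method. This is an assumption-laden illustration, not the proposed implementation: reduceby_same_ndim is a hypothetical name, the sketch only supports ufuncs that have an identity, and filling output slots that no element maps to with that identity is a choice made here, since the text does not say how such slots should be handled.

```python
import numpy as np

def reduceby_same_ndim(ufunc, arr, by):
    """Hypothetical emulation of <ufunc>.reduceby for the N == M case."""
    arr = np.asarray(arr)
    by = np.asarray(by)
    if by.shape != arr.shape:
        raise ValueError("by must have the same shape as arr when N == M")
    if ufunc.identity is None:
        raise ValueError("this sketch needs a ufunc with an identity")
    # Assumption: slots never referenced in `by` keep the ufunc identity.
    out = np.full(by.max() + 1, ufunc.identity, dtype=arr.dtype)
    # ufunc.at is unbuffered, so repeated indices accumulate correctly.
    ufunc.at(out, by.ravel(), arr.ravel())
    return out

arr = np.array([3, 5, 2, 7, 4])
by = np.array([0, 2, 0, 2, 0])
sums = reduceby_same_ndim(np.add, arr, by)
products = reduceby_same_ndim(np.multiply, arr, by)
```

Here by[I] names the output slot for arr[I], so elements 3, 2, and 4 are reduced into slot 0 and elements 5 and 7 into slot 2.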
Functions proposed
segment
edges