python - Is pandas.DataFrame.groupby Guaranteed To Be Stable?
I've noticed that there are several uses of pd.DataFrame.groupby followed by an apply that implicitly assume groupby is stable. That is, if a and b are instances of the same group and, before grouping, a appeared before b, then a will appear before b after the grouping as well.
I think there are several answers implicitly using this, but, to be concrete, here is one using groupby + cumsum.
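To make the property in question concrete, here is a minimal sketch on made-up data (the column names key, val and running are hypothetical, not taken from any linked answer). It only shows the behavior being asked about; it does not by itself prove a guarantee.

```python
import pandas as pd

# Hypothetical toy data: rows of groups "a" and "b" are interleaved.
df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "val": [1, 10, 2, 20, 3],
})

# Running total within each group; the result is aligned to df's original index.
# It is only meaningful if rows inside a group keep their original order.
df["running"] = df.groupby("key")["val"].cumsum()
print(df)

# Inspect the row labels of each group to see the within-group order.
group_rows = {name: group.index.tolist() for name, group in df.groupby("key")}
print(group_rows)
```

If groupby reordered rows inside a group, the running totals would be accumulated in a different order than the rows appear in df.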
Is there anything actually promising this behavior? The documentation only states:
Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.
Also, since pandas has indices, the functionality could theoretically be achieved without this guarantee (albeit in a more cumbersome way).
Although the docs don't state this, internally groupby uses a stable sort when generating the groups.
See:
- https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#l291
- https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#l4356
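As a rough illustration of why a stable sort gives order-preserving groups, the sketch below sorts some made-up integer group codes with NumPy's stable mergesort and splits the result into groups. This is not the pandas implementation, only an assumption-laden toy version of the idea; the names keys, order and groups are hypothetical.

```python
import numpy as np

# Hypothetical group codes, one per row (row positions are 0..4).
keys = np.array([1, 0, 1, 0, 1])

# A stable sort keeps rows with equal keys in their original relative order.
order = np.argsort(keys, kind="mergesort")
print(order)

# Split the sorted row positions wherever the key changes.
sorted_keys = keys[order]
boundaries = np.flatnonzero(np.diff(sorted_keys)) + 1
groups = np.split(order, boundaries)
print([g.tolist() for g in groups])
```

Within each resulting group the row positions are increasing, which is exactly the stability property the question asks about.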
As I mentioned in the comments, this is important if you consider transform, which returns a Series with its index aligned to the original df. If the sorting didn't preserve order, alignment would have to do additional work, because the Series would need to be sorted before being assigned. In fact, this is mentioned in the comments in the source:
_algos.groupsort_indexer implements counting sort and it is at least O(ngroups), where
    ngroups = prod(shape)
    shape = map(len, keys)
That is, linear in the number of combinations (cartesian product) of unique values of the groupby keys. This can be huge when doing a multi-key groupby.
np.argsort(kind='mergesort') is O(count x log(count)), where count is the length of the data frame.
Both algorithms are stable sorts, and that is necessary for the correctness of groupby operations, e.g. consider:
df.groupby(key)[col].transform('first')
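A small sketch of that last example, on made-up data (the column names key and col are placeholders for whatever the real frame uses): the value transform('first') broadcasts to each row depends on which row comes first inside its group, so the result is only well defined if within-group order is preserved.

```python
import pandas as pd

# Hypothetical toy frame; groups "a" and "b" are interleaved.
df = pd.DataFrame({
    "key": ["b", "a", "b", "a"],
    "col": [1, 2, 3, 4],
})

# Each row receives the first value of col seen in its group,
# and the result carries the same index as df.
first = df.groupby("key")["col"].transform("first")
print(first.tolist())
```

Here the first row of group "b" has col 1 and the first row of group "a" has col 2, so every "b" row gets 1 and every "a" row gets 2.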