sum vs np.nansum weirdness while summing columns with same name on a pandas DataFrame
Taking inspiration from the discussion here (Merge columns within a DataFrame that have the same name), I tried the method suggested and, while it works when using the function sum(), it doesn't when using np.nansum:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.rand(100, 4), columns=['a', 'a', 'b', 'b'], index=pd.date_range('2011-1-1', periods=100))
    print(df.head(3))
The sum() case:

    print(df.groupby(df.columns, axis=1).apply(sum, axis=1).head(3))

                       a         b
    2011-01-01  1.328933  1.678469
    2011-01-02  1.878389  1.343327
    2011-01-03  0.964278  1.302857
The np.nansum() case:

    print(df.groupby(df.columns, axis=1).apply(np.nansum, axis=1).head(3))

    a    [1.32893299939, 1.87838886222, 0.964278430632,...
    b    [1.67846885234, 1.34332662587, 1.30285727348, ...
    dtype: object
Any idea why?
The issue is that np.nansum converts its input to a numpy array, so it loses the column and index information (sum doesn't do this). As a result, the groupby doesn't get back any label information when constructing the output, so the output is just a Series of numpy arrays.
Specifically, the source code for np.nansum calls the _replace_nan function. In turn, the source code for _replace_nan checks whether the input is an array, and converts it to one if it's not.
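The difference in return types can be checked directly, without any groupby. This is a minimal sketch on a made-up two-row frame (the name `frame` and its values are illustrative only, not from the question):

```python
import numpy as np
import pandas as pd

# Small illustrative frame with duplicate column names and one NaN.
frame = pd.DataFrame(
    [[1.0, 2.0], [np.nan, 3.0]],
    columns=["a", "a"],
    index=["x", "y"],
)

# The pandas reduction keeps the row labels: the result is a Series.
pandas_total = frame.sum(axis=1)

# np.nansum converts its input to a plain array first, so the result
# is an ndarray with no index attached.
numpy_total = np.nansum(frame, axis=1)

print(type(pandas_total).__name__, list(pandas_total.index))
print(type(numpy_total).__name__, numpy_total.tolist())
```

Both calls compute the same row totals; only the pandas one returns something that groupby can reassemble into a labelled DataFrame.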
All hope isn't lost though. You can easily replicate np.nansum with pandas functions. Specifically, use sum followed by fillna:

    df.groupby(df.columns, axis=1).sum().fillna(0)
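The sum should ignore NaN's and just sum the non-null values. The only case where you'll get back NaN (with the pandas version used here) is if all the values being summed are NaN, which is why fillna is required. Note that you could also do the fillna before the groupby, i.e. df.fillna(0).groupby(...).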
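To check that filling before or after the sum gives the same result, here is a hedged sketch. It makes two assumptions that are not in the answer above: recent pandas releases deprecate `axis=1` in groupby, so the frame is transposed and grouped on the index instead, and `min_count=1` is passed so that an all-NaN group yields NaN as described (newer pandas would otherwise return 0 directly).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 2, np.nan, 4],
     [np.nan, np.nan, np.nan, 3, 3],
     [np.nan, np.nan, -1, 2, np.nan]],
    columns=list("aaabb"),
)

# Group same-named columns by transposing and grouping on the index.
# min_count=1 makes an all-NaN group sum to NaN instead of 0.
summed = df.T.groupby(level=0).sum(min_count=1).T

# Option 1: fill the NaN results after summing.
fill_after = summed.fillna(0)

# Option 2: fill the NaN inputs before summing.
fill_before = df.fillna(0).T.groupby(level=0).sum().T

print(summed)
print(fill_after)
print(fill_before)
```

Either order gives the same totals here, since replacing NaN with 0 does not change a sum that already skips NaN.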
If you do want to use np.nansum, you can recast the result as a pd.Series. This will likely impact performance, as constructing a Series can be relatively expensive, and you'll be doing it multiple times:

    df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))
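The same idea, reattaching the index to what np.nansum returns, can also be written without groupby. This is only a sketch of an alternative, not the method above: it selects each group of same-named columns with a boolean mask and builds the output frame once. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 2, np.nan, 4],
     [np.nan, np.nan, np.nan, 3, 3],
     [np.nan, np.nan, -1, 2, np.nan]],
    columns=list("aaabb"),
)

# For each distinct column name, select all columns carrying that name
# and reduce them row-wise with np.nansum (which returns a bare ndarray).
row_sums = {
    name: np.nansum(df.loc[:, df.columns == name], axis=1)
    for name in df.columns.unique()
}

# Reattach the original row index when assembling the result.
merged = pd.DataFrame(row_sums, index=df.index)
print(merged)
```

This constructs one DataFrame at the end rather than one Series per group.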
Example Computations

For some example computations, I'll be using the following simple DataFrame, which includes NaN values (your example data doesn't):

    df = pd.DataFrame([[1,2,2,np.nan,4],[np.nan,np.nan,np.nan,3,3],[np.nan,np.nan,-1,2,np.nan]], columns=list('aaabb'))

         a    a    a    b    b
    0  1.0  2.0  2.0  NaN  4.0
    1  NaN  NaN  NaN  3.0  3.0
    2  NaN  NaN -1.0  2.0  NaN
Using sum without fillna:

    df.groupby(df.columns, axis=1).sum()

         a    b
    0  5.0  4.0
    1  NaN  6.0
    2 -1.0  2.0
Using sum and fillna:

    df.groupby(df.columns, axis=1).sum().fillna(0)

         a    b
    0  5.0  4.0
    1  0.0  6.0
    2 -1.0  2.0
Comparing to the fixed np.nansum method:

    df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))

         a    b
    0  5.0  4.0
    1  0.0  6.0
    2 -1.0  2.0