sum vs np.nansum weirdness while summing columns with the same name on a pandas DataFrame
Taking inspiration from the discussion here (Merge columns within a DataFrame that have the same name), I tried the method suggested there. It works when using the function sum(), but it doesn't when using np.nansum:
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.rand(100, 4), columns=['a', 'a', 'b', 'b'], index=pd.date_range('2011-1-1', periods=100))
    print(df.head(3))

sum() case:
    print(df.groupby(df.columns, axis=1).apply(sum, axis=1).head(3))

                       a         b
    2011-01-01  1.328933  1.678469
    2011-01-02  1.878389  1.343327
    2011-01-03  0.964278  1.302857

np.nansum() case:
    print(df.groupby(df.columns, axis=1).apply(np.nansum, axis=1).head(3))

    a    [1.32893299939, 1.87838886222, 0.964278430632,...
    b    [1.67846885234, 1.34332662587, 1.30285727348, ...
    dtype: object

Any idea why?
The issue is that np.nansum converts its input to a numpy array, so it loses the column information (sum doesn't do this). As a result, the groupby doesn't get any column information back when constructing the output, so the output is just a Series of numpy arrays.
Specifically, the source code for np.nansum calls the _replace_nan function. In turn, the source code for _replace_nan checks whether the input is an array, and converts it to one if it's not.
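The loss of pandas metadata can be seen directly, without any groupby. The snippet below is a minimal sketch on a small made-up frame (the data and variable names are illustrative, not from the question): the pandas reduction keeps the index, while np.nansum hands back a bare numpy array.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with duplicate column names and some NaNs.
df = pd.DataFrame(
    [[1, 2, np.nan], [np.nan, np.nan, 3]],
    columns=["a", "a", "b"],
    index=pd.date_range("2011-01-01", periods=2),
)

pandas_result = df.sum(axis=1)        # pandas Series, row index preserved
numpy_result = np.nansum(df, axis=1)  # plain ndarray, row index dropped

print(type(pandas_result).__name__)
print(type(numpy_result).__name__)
```

Because the second result carries no index, groupby has nothing to align on when it assembles the per-group pieces.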
All hope isn't lost though. You can easily replicate np.nansum with pandas functions. Specifically, use sum followed by fillna:
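    df.groupby(df.columns, axis=1).sum().fillna(0)

The sum ignores NaNs and just sums the non-null values. In the pandas version used here, the only case where you get NaN back is when all of the values being summed are NaN, which is why the fillna is required (pandas 0.22 and later return 0 for an all-NaN sum by default). Note that you could also do the fillna before the groupby, i.e. df.fillna(0).groupby....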
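Recent pandas releases deprecate the axis=1 argument to groupby, so the one-liner above may warn or fail depending on the installed version. As a hedged alternative (a substituted variant, not the form used in the answer), the same grouping can be sketched by transposing, grouping on the index labels, and transposing back:

```python
import numpy as np
import pandas as pd

# Same shape of data as the answer's example: duplicate column names, some NaNs.
df = pd.DataFrame(
    [[1, 2, 2, np.nan, 4],
     [np.nan, np.nan, np.nan, 3, 3],
     [np.nan, np.nan, -1, 2, np.nan]],
    columns=list("aaabb"),
)

# Transpose so the duplicate column labels become index labels,
# group on those labels, sum each group, then transpose back.
# fillna(0) covers pandas versions where an all-NaN sum yields NaN.
summed = df.T.groupby(level=0).sum().T.fillna(0)
print(summed)
```

The transpose costs a copy, so on very wide or very long frames it is worth timing this against the alternatives.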
If you really want to use np.nansum, you can recast the result as a pd.Series. This will likely impact performance, as constructing a Series can be relatively expensive, and you'll be doing it multiple times:
    df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))

Example computations
For the example computations, I'll be using the following simple DataFrame, which includes NaN values (your example data doesn't):
    df = pd.DataFrame([[1,2,2,np.nan,4],[np.nan,np.nan,np.nan,3,3],[np.nan,np.nan,-1,2,np.nan]], columns=list('aaabb'))

         a    a    a    b    b
    0  1.0  2.0  2.0  NaN  4.0
    1  NaN  NaN  NaN  3.0  3.0
    2  NaN  NaN -1.0  2.0  NaN

Using sum without fillna:
    df.groupby(df.columns, axis=1).sum()

         a    b
    0  5.0  4.0
    1  NaN  6.0
    2 -1.0  2.0

Using sum and fillna:
    df.groupby(df.columns, axis=1).sum().fillna(0)

         a    b
    0  5.0  4.0
    1  0.0  6.0
    2 -1.0  2.0

Comparing to the fixed np.nansum method:
    df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))

         a    b
    0  5.0  4.0
    1  0.0  6.0
    2 -1.0  2.0
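To check that the two approaches agree without relying on groupby along the column axis, the comparison can be sketched as below. The helper nansum_by_column_name is hypothetical (it is not part of pandas or numpy); it applies np.nansum to each set of same-named columns and reattaches the row index, which is the same idea as the pd.Series recast above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 2, np.nan, 4],
     [np.nan, np.nan, np.nan, 3, 3],
     [np.nan, np.nan, -1, 2, np.nan]],
    columns=list("aaabb"),
)


def nansum_by_column_name(frame):
    """Hypothetical helper: np.nansum across columns that share a name."""
    return pd.DataFrame(
        {
            # Boolean mask selects every column carrying this label.
            name: np.nansum(frame.loc[:, frame.columns == name].to_numpy(), axis=1)
            for name in frame.columns.unique()
        },
        index=frame.index,  # reattach the row index that np.nansum dropped
    )


via_numpy = nansum_by_column_name(df)
# pandas-only route: sum same-named columns, treating NaN as missing.
via_pandas = df.T.groupby(level=0).sum().T.fillna(0)

print(np.allclose(via_numpy.to_numpy(), via_pandas.to_numpy()))
```

Building one DataFrame at the end, rather than one Series per group, avoids some of the repeated construction cost mentioned earlier, though this has not been benchmarked here.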