sum vs np.nansum weirdness while summing columns with the same name on a pandas DataFrame - python


Taking inspiration from a discussion here (on merging columns within a dataframe that have the same name), I tried the method suggested and, while it works when using the function sum(), it doesn't when using np.nansum:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(100, 4), columns=['a', 'a', 'b', 'b'],
                  index=pd.date_range('2011-1-1', periods=100))
print(df.head(3))

sum() case:

print(df.groupby(df.columns, axis=1).apply(sum, axis=1).head(3))

                   a         b
2011-01-01  1.328933  1.678469
2011-01-02  1.878389  1.343327
2011-01-03  0.964278  1.302857

np.nansum() case:

print(df.groupby(df.columns, axis=1).apply(np.nansum, axis=1).head(3))

a    [1.32893299939, 1.87838886222, 0.964278430632,...
b    [1.67846885234, 1.34332662587, 1.30285727348, ...
dtype: object

Any idea why?

The issue is that np.nansum converts its input to a numpy array, so it loses the column information (sum doesn't do this). As a result, the groupby doesn't have the column information when constructing the output, and the output is a Series of numpy arrays.

Specifically, the source code of np.nansum calls the _replace_nan function. In turn, the source code of _replace_nan checks whether the input is an array, and converts it to one if it's not.
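You can see this loss of metadata directly, without any groupby involved. A small sketch (the example frame is my own, not from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [2.0, 3.0]})

# np.nansum coerces its input to a plain ndarray before summing,
# so the row index is discarded in the result:
out = np.nansum(df, axis=1)
print(type(out))   # <class 'numpy.ndarray'>
print(out)         # [3. 3.]

# DataFrame.sum, by contrast, returns a Series and keeps the index:
print(type(df.sum(axis=1)))
```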

All hope isn't lost though. You can replicate np.nansum with pandas functions: use sum followed by fillna:

df.groupby(df.columns, axis=1).sum().fillna(0) 

The sum should ignore NaNs and sum the non-null values. The only case where you'll get NaN is when all the values being summed are NaN, which is why the fillna is required. Note that you could also do the fillna before the groupby, i.e. df.fillna(0).groupby....
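A runnable sketch of both orderings, using my own small example frame. Recent pandas versions deprecate the axis=1 argument to groupby, so this adaptation groups the transpose by its index (the duplicated column labels) instead; min_count=1 reproduces the behavior described above, where an all-NaN group sums to NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 2.0, np.nan],
                   [np.nan, np.nan, 5.0]],
                  columns=['a', 'a', 'b'])

# Transpose-based equivalent of groupby(df.columns, axis=1).sum():
# min_count=1 keeps NaN when every value in a group is NaN.
summed = df.T.groupby(level=0).sum(min_count=1).T
print(summed)            # row 1, column 'a' is NaN: all its values were NaN
print(summed.fillna(0))  # fillna after the groupby

# Filling before grouping reaches the same result:
assert summed.fillna(0).equals(df.fillna(0).T.groupby(level=0).sum().T)
```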

If you want to use np.nansum, you can recast the result as a pd.Series. This will likely impact performance, as constructing a Series can be relatively expensive, and you'll be doing it multiple times:

df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))
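The same recast idea, written against the transpose so it also runs on recent pandas versions where groupby's axis=1 is deprecated (this is an adaptation of the one-liner above, with my own example frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, 4.0],
                   [np.nan, np.nan, 3.0]],
                  columns=['a', 'a', 'b'])

# Group the transposed frame by its index (the duplicate column labels),
# run np.nansum per group, and rebuild a Series so the row labels survive:
result = df.T.groupby(level=0).apply(
    lambda g: pd.Series(np.nansum(g, axis=0), index=g.columns)
).T
print(result)   # all-NaN groups come out as 0.0, matching np.nansum
```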

Example computations

For the example computations, I'll be using the following simple dataframe, which includes NaN values (your example data doesn't):

df = pd.DataFrame([[1, 2, 2, np.nan, 4],
                   [np.nan, np.nan, np.nan, 3, 3],
                   [np.nan, np.nan, -1, 2, np.nan]],
                  columns=list('aaabb'))

     a    a    a    b    b
0  1.0  2.0  2.0  NaN  4.0
1  NaN  NaN  NaN  3.0  3.0
2  NaN  NaN -1.0  2.0  NaN

Using sum without fillna:

df.groupby(df.columns, axis=1).sum()

     a    b
0  5.0  4.0
1  NaN  6.0
2 -1.0  2.0

Using sum and fillna:

df.groupby(df.columns, axis=1).sum().fillna(0)

     a    b
0  5.0  4.0
1  0.0  6.0
2 -1.0  2.0

Comparing with the fixed np.nansum method:

df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))

     a    b
0  5.0  4.0
1  0.0  6.0
2 -1.0  2.0
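The three results above can be verified end to end with one script. As in the earlier sketches, the grouping is written against the transpose so it runs on recent pandas where groupby's axis=1 is deprecated; the variable names are my own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 2, np.nan, 4],
                   [np.nan, np.nan, np.nan, 3, 3],
                   [np.nan, np.nan, -1, 2, np.nan]],
                  columns=list('aaabb'))

plain = df.T.groupby(level=0).sum(min_count=1).T   # NaN survives an all-NaN group
filled = plain.fillna(0)                           # sum + fillna
nansummed = df.T.groupby(level=0).apply(           # np.nansum recast as a Series
    lambda g: pd.Series(np.nansum(g, axis=0), index=g.columns)
).T

expected = pd.DataFrame({'a': [5.0, 0.0, -1.0], 'b': [4.0, 6.0, 2.0]})
assert np.isnan(plain.loc[1, 'a'])   # the all-NaN group, before fillna
assert filled.equals(expected)
assert nansummed.equals(expected)    # both fixes agree
print(filled)
```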
