python - Is this approach "vectorized"? Used against a medium dataset it is relatively slow


I have a data frame:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': np.random.randn(9),
                       'b': ['foo', 'bar', 'blah'] * 3,
                       'c': np.random.randn(9)})

and this function:

    def my_test2(row, x):
        if x == 'foo':
            blah = 10
        if x == 'bar':
            blah = 20
        if x == 'blah':
            blah = 30
        return (row['a'] % row['c']) + blah

I create three new columns like this:

    df['value_foo'] = df.apply(my_test2, axis=1, x='foo')
    df['value_bar'] = df.apply(my_test2, axis=1, x='bar')
    df['value_blah'] = df.apply(my_test2, axis=1, x='blah')

It runs OK, but when I make my_test2 more complex and expand df to several thousand rows it becomes slow. Is the approach above what is described as "vectorized"? How can I speed things up?

As Andrew, Ami Tavory and Sohier Dane have mentioned in the comments, there are two "slow" things in your solution:

  1. .apply() is slow because it loops in Python under the hood.
  2. .apply(..., axis=1) is extremely slow (even compared to .apply(..., axis=0)) because it loops on a row-by-row basis; a small illustration follows this list.
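To make point 2 concrete, a row-wise apply is roughly equivalent to the explicit Python loop below. This is a simplified sketch (not pandas' actual implementation) and it reuses df and my_test2 from the question; the key cost is that the Python function is called once per row and each row is first materialized as a Series:

    # Roughly what df.apply(my_test2, axis=1, x='foo') has to do:
    results = []
    for idx in df.index:
        row = df.loc[idx]                        # build one row as a Series (not free)
        results.append(my_test2(row, 'foo'))     # one Python-level call per row
    df['value_foo'] = results

A vectorized expression, by contrast, performs the same arithmetic once over whole columns in compiled code.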

Here is a vectorized approach:

    In [74]: d = {
       ....:   'foo': 10,
       ....:   'bar': 20,
       ....:   'blah': 30
       ....: }

    In [75]: d
    Out[75]: {'bar': 20, 'blah': 30, 'foo': 10}

    In [76]: for k, v in d.items():
       ....:     df['value_{}'.format(k)] = df.a % df.c + v
       ....:

    In [77]: df
    Out[77]:
              a     b         c  value_bar  value_blah  value_foo
    0 -0.747164   foo  0.438713  20.130262   30.130262  10.130262
    1 -0.185182   bar  0.047253  20.003828   30.003828  10.003828
    2  1.622818  blah -0.730215  19.432174   29.432174    9.432174
    3  0.117658   foo  1.530249  20.117658   30.117658  10.117658
    4  2.536363   bar -0.100726  19.917499   29.917499    9.917499
    5  1.128002  blah  0.350663  20.076014   30.076014  10.076014
    6  0.059516   foo  0.638910  20.059516   30.059516  10.059516
    7 -1.184688   bar  0.073781  20.069590   30.069590  10.069590
    8  1.440576  blah -2.231575  19.209001   29.209001    9.209001
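One small refinement you may want (my own sketch, not part of the original answer): df.a % df.c is identical for all three columns, so it can be computed once outside the loop and reused:

    mod = df.a % df.c                         # the shared part, computed once
    for k, v in d.items():
        df['value_{}'.format(k)] = mod + v    # only the added offset differs per column

For three columns the saving is minor, but it grows with the number of keys in d.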

Timing against a 90K-row DF:

    In [80]: big = pd.concat([df] * 10**4, ignore_index=True)

    In [81]: big.shape
    Out[81]: (90000, 3)

    In [82]: %%timeit
       ....: big['value_foo'] = big.apply(my_test2, axis=1, x='foo')
       ....: big['value_bar'] = big.apply(my_test2, axis=1, x='bar')
       ....: big['value_blah'] = big.apply(my_test2, axis=1, x='blah')
       ....:
    1 loop, best of 3: 10.5 s per loop

    In [83]: big = pd.concat([df] * 10**4, ignore_index=True)

    In [84]: big.shape
    Out[84]: (90000, 3)

    In [85]: %%timeit
       ....: for k, v in d.items():
       ....:     big['value_{}'.format(k)] = big.a % big.c + v
       ....:
    100 loops, best of 3: 7.24 ms per loop
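If you want to reproduce this comparison outside IPython (where %%timeit is not available), a standalone sketch along the lines below should give the same qualitative picture; the helper function names are my own, and absolute numbers will vary with your machine and pandas version:

    import timeit

    import numpy as np
    import pandas as pd

    def my_test2(row, x):
        if x == 'foo':
            blah = 10
        if x == 'bar':
            blah = 20
        if x == 'blah':
            blah = 30
        return (row['a'] % row['c']) + blah

    df = pd.DataFrame({'a': np.random.randn(9),
                       'b': ['foo', 'bar', 'blah'] * 3,
                       'c': np.random.randn(9)})
    big = pd.concat([df] * 10**4, ignore_index=True)
    d = {'foo': 10, 'bar': 20, 'blah': 30}

    def with_apply():
        # one .apply(axis=1) pass per key, as in the question
        for k in d:
            big['value_{}'.format(k)] = big.apply(my_test2, axis=1, x=k)

    def vectorized():
        # one whole-column expression per key
        for k, v in d.items():
            big['value_{}'.format(k)] = big.a % big.c + v

    print('apply:      {:.2f} s'.format(timeit.timeit(with_apply, number=1)))
    print('vectorized: {:.4f} s'.format(timeit.timeit(vectorized, number=1)))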

Conclusion: the vectorized approach is roughly 1,450 times faster...

