python - Is this approach "vectorized"? Used against a medium-sized dataset it is relatively slow
I have a data frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(9), 'b': ['foo', 'bar', 'blah'] * 3, 'c': np.random.randn(9)})
and this function:

def my_test2(row, x):
    if x == 'foo':
        blah = 10
    if x == 'bar':
        blah = 20
    if x == 'blah':
        blah = 30
    return (row['a'] % row['c']) + blah
I am creating 3 new columns like this:

df['value_foo'] = df.apply(my_test2, axis=1, x='foo')
df['value_bar'] = df.apply(my_test2, axis=1, x='bar')
df['value_blah'] = df.apply(my_test2, axis=1, x='blah')
It runs OK, but when I make my_test2 more complex and expand df to several thousand rows it gets slow. Is what I do above what is described as "vectorized"? How can I speed things up?
As Andrew, Ami Tavory and Sohier Dane have mentioned in the comments, there are two "slow" things in your solution:

- .apply() uses slow Python-level loops under the hood.
- .apply(..., axis=1) is extremely slow (even compared to .apply(..., axis=0)), because it loops on a row-by-row basis (roughly sketched below).
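As a rough illustration only (this is not pandas' actual implementation; it just reuses df and my_test2 from the question), row-wise .apply() boils down to one Python function call plus one Series construction per row:

# rough sketch of the per-row work implied by df.apply(my_test2, axis=1, x='foo')
results = []
for idx in df.index:
    row = df.loc[idx]                      # a Series is built for every row
    results.append(my_test2(row, 'foo'))   # one Python-level call per row
df['value_foo'] = results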
Here is a vectorized approach:
In [74]: d = {
   ....:     'foo': 10,
   ....:     'bar': 20,
   ....:     'blah': 30
   ....: }

In [75]: d
Out[75]: {'bar': 20, 'blah': 30, 'foo': 10}

In [76]: for k, v in d.items():
   ....:     df['value_{}'.format(k)] = df.a % df.c + v
   ....:

In [77]: df
Out[77]:
          a     b         c  value_bar  value_blah  value_foo
0 -0.747164   foo  0.438713  20.130262   30.130262  10.130262
1 -0.185182   bar  0.047253  20.003828   30.003828  10.003828
2  1.622818  blah -0.730215  19.432174   29.432174   9.432174
3  0.117658   foo  1.530249  20.117658   30.117658  10.117658
4  2.536363   bar -0.100726  19.917499   29.917499   9.917499
5  1.128002  blah  0.350663  20.076014   30.076014  10.076014
6  0.059516   foo  0.638910  20.059516   30.059516  10.059516
7 -1.184688   bar  0.073781  20.069590   30.069590  10.069590
8  1.440576  blah -2.231575  19.209001   29.209001   9.209001
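For completeness, here is the same idea as a plain self-contained script (outside IPython). The expression df.a % df.c + v operates on whole columns at once, so the only Python-level loop is over the three dictionary entries, not over the rows:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(9),
                   'b': ['foo', 'bar', 'blah'] * 3,
                   'c': np.random.randn(9)})

d = {'foo': 10, 'bar': 20, 'blah': 30}

# one vectorized column operation per key: no per-row Python calls
for k, v in d.items():
    df['value_{}'.format(k)] = df.a % df.c + v

print(df)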
Timing against a 90K-row DataFrame:
In [80]: big = pd.concat([df] * 10**4, ignore_index=True)

In [81]: big.shape
Out[81]: (90000, 3)

In [82]: %%timeit
   ....: big['value_foo'] = big.apply(my_test2, axis=1, x='foo')
   ....: big['value_bar'] = big.apply(my_test2, axis=1, x='bar')
   ....: big['value_blah'] = big.apply(my_test2, axis=1, x='blah')
   ....:
1 loop, best of 3: 10.5 s per loop

In [83]: big = pd.concat([df] * 10**4, ignore_index=True)

In [84]: big.shape
Out[84]: (90000, 3)

In [85]: %%timeit
   ....: for k, v in d.items():
   ....:     big['value_{}'.format(k)] = big.a % big.c + v
   ....:
100 loops, best of 3: 7.24 ms per loop
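If you want to reproduce the comparison outside IPython, here is a sketch using the standard timeit module (exact numbers will differ from the 10.5 s and 7.24 ms above, which come from the original %%timeit runs):

import timeit

import numpy as np
import pandas as pd

def my_test2(row, x):
    if x == 'foo':
        blah = 10
    if x == 'bar':
        blah = 20
    if x == 'blah':
        blah = 30
    return (row['a'] % row['c']) + blah

df = pd.DataFrame({'a': np.random.randn(9),
                   'b': ['foo', 'bar', 'blah'] * 3,
                   'c': np.random.randn(9)})
big = pd.concat([df] * 10**4, ignore_index=True)
d = {'foo': 10, 'bar': 20, 'blah': 30}

def with_apply():
    # row-wise: one Python call per row, per key
    for k in d:
        big['value_{}'.format(k)] = big.apply(my_test2, axis=1, x=k)

def vectorized():
    # column-wise: three whole-column operations
    for k, v in d.items():
        big['value_{}'.format(k)] = big.a % big.c + v

print('apply      : %.4f s per run' % timeit.timeit(with_apply, number=1))
print('vectorized : %.4f s per run' % (timeit.timeit(vectorized, number=10) / 10))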
Conclusion: the vectorized approach is roughly 1450 times faster...
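One design note, not part of the original answer: the loop over d.items() works because x is a fixed constant for each new column. If you instead wanted a single column whose added constant depends on each row's own 'b' value, the lookup itself can stay vectorized with Series.map (d and df as above):

# hypothetical variant: one result column, constant picked per row from column 'b'
df['value'] = df.a % df.c + df['b'].map(d)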