python - efficient loop over numpy array -


versions of question have been asked have not found satisfactory answer.

problem: given large numpy vector, find indices of vector elements duplicated (a variation of comparison tolerance).

so problem ~o(n^2) , memory bound (at least current algorithm point of view). wonder why whatever tried python 100x or more slower equivalent c code.

import numpy np n = 10000 vect = np.arange(float(n)) vect[n/2] = 1 vect[n/4] = 1 dupl = [] print("init done") counter = 0 in range(n):     j in range(i+1, n):         if vect[i] == vect[j]:             dupl.append(j)             counter += 1  print("counter =", counter) print(dupl) # simplicity, code ignores repeated indices  # can trimmed later. ref output # counter = 3 # [2500, 5000, 5000] 

i tried using numpy iterators worse (~ x4-5) http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html

using n=10,000 i'm getting 0.1 sec in c, 12 sec in python (code above), 40 sec in python using np.nditer, 50 sec in python using np.ndindex. pushed n=160,000 , timing scales n^2 expected.

python highly-dynamic, slow, language. idea in numpy use vectorization, , avoid explicit loops. in case, can use np.equal.outer. can start with

a = np.equal.outer(vect, vect) 

now, example, find sum:

 >>> np.sum(a)  10006 

to find indices of i equal, can do

np.fill_diagonal(a, 0)  >>> np.nonzero(np.any(a, axis=0))[0] array([   1, 2500, 5000]) 

timing

def find_vec():     = np.equal.outer(vect, vect)     s = np.sum(a)     np.fill_diagonal(a, 0)     return np.sum(a), np.nonzero(np.any(a, axis=0))[0]  >>> %timeit find_vec() 1 loops, best of 3: 214 ms per loop  def find_loop():     dupl = []     counter = 0     in range(n):         j in range(i+1, n):              if vect[i] == vect[j]:                  dupl.append(j)                  counter += 1     return dupl  >>> % timeit find_loop() 1 loops, best of 3: 8.51 s per loop 

Comments

Popular posts from this blog

java - Jasper subreport showing only one entry from the JSON data source when embedded in the Title band -

mapreduce - Resource manager does not transit to active state from standby -

serialization - Convert Any type in scala to Array[Byte] and back -