python - faster alternative to numpy.where? -
i have 3d array filled integers 0 n. need list of indices corresponding array equal 1, 2, 3, ... n. can np.where follows:
n = 300 shape = (1000,1000,10) data = np.random.randint(0,n+1,shape) indx = [np.where(data == i_id) i_id in range(1,data.max()+1)]
but quite slow. according question fast python numpy functionality? should possible speed index search quite lot, haven't been able transfer methods proposed there problem of getting actual indices. best way speed above code?
as add-on: want store indices later, makes sense use np.ravel_multi_index reduce size saving 3 indices 1, i.e. using:
indx = [np.ravel_multi_index(np.where(data == i_id), data.shape) i_id in range(1, data.max()+1)]
which closer e.g. matlab's find function. can directly incorporated in solution doesn't use np.where?
i think standard vectorized approach problem end being memory intensive – int64 data, require o(8 * n * data.size) bytes, or ~22 gigs of memory example gave above. i'm assuming not option.
you might make progress using sparse matrix store locations of unique values. example:
import numpy np scipy.sparse import csr_matrix def compute_m(data): cols = np.arange(data.size) return csr_matrix((cols, (data.ravel(), cols)), shape=(data.max() + 1, data.size)) def get_indices_sparse(data): m = compute_m(data) return [np.unravel_index(row.data, data.shape) row in m]
this takes advantage of fast code within sparse matrix constructor organize data in useful way, constructing sparse matrix row i
contains indices flattened data equals i
.
to test out, i'll define function straightforward method:
def get_indices_simple(data): return [np.where(data == i) in range(0, data.max() + 1)]
the 2 functions give same results same input:
data_small = np.random.randint(0, 100, size=(100, 100, 10)) all(np.allclose(i1, i2) i1, i2 in zip(get_indices_simple(data_small), get_indices_sparse(data_small))) # true
and sparse method order of magnitude faster simple method dataset:
data = np.random.randint(0, 301, size=(1000, 1000, 10)) %time ind = get_indices_simple(data) # cpu times: user 14.1 s, sys: 638 ms, total: 14.7 s # wall time: 14.8 s %time ind = get_indices_sparse(data) # cpu times: user 881 ms, sys: 301 ms, total: 1.18 s # wall time: 1.18 s %time m = compute_m(data) # cpu times: user 216 ms, sys: 148 ms, total: 365 ms # wall time: 363 ms
the other benefit of sparse method matrix m
ends being compact , efficient way store relevant information later use, mentioned in add-on part of question. hope that's useful!
edit: realized there bug in initial version: failed if values in range didn't appear in data: that's fixed above.
Comments
Post a Comment