What is the most efficient way to check if a value exists in a NumPy array?

python performance numpy

I have a very large NumPy array.

I want to check to see if a value exists in the 1st column of the array. I've got a bunch of homegrown ways (e.g. iterating through each row and checking), but given the size of the array I'd like to find the most efficient method.

Thanks!

You might use binary search if the first column is in non-decreasing order, or consider sorting it first if you do more than, let's say, 10 searches.
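A minimal sketch of the binary-search idea from this comment, assuming the first column is already sorted in ascending order. The helper name contains_sorted and the sample data are hypothetical; np.searchsorted does the binary search.

```python
import numpy as np

def contains_sorted(sorted_col, value):
    """Return True if value occurs in sorted_col (assumed sorted ascending)."""
    i = np.searchsorted(sorted_col, value)  # leftmost insertion point
    # Guard against i == len(sorted_col), which happens when value exceeds every element
    return bool(i < len(sorted_col) and sorted_col[i] == value)

my_array = np.array([[1, 10], [3, 20], [6, 30], [9, 40]])  # hypothetical data
first_col = my_array[:, 0]

print(contains_sorted(first_col, 6))    # True
print(contains_sorted(first_col, 7))    # False
print(contains_sorted(first_col, 100))  # False
```

Each lookup is O(log n) once the column is sorted, so the one-time sort only pays off when you search the same column repeatedly.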

agf

How about

if value in my_array[:, col_num]:
    do_whatever

Edit: I think __contains__ is implemented in such a way that this is the same as @detly's version

You know, I've been using numpy's any() function so heavily recently, I completely forgot about plain old in.

Okay, this is (a) more readable and (b) about 40% faster than my answer.

In principle, value in … can be faster than any(… == value), because it can iterate over the array elements and stop whenever the value is encountered (as opposed to calculating whether each array element is equal to the value, and then checking whether one of the boolean results is true).

@EOL Really? In Python, any is short-circuiting; is it not in NumPy?

Things have changed since. Note that in the future @detly's answer will become the only working solution; currently a warning is thrown. For more, see stackoverflow.com/questions/40659212/…

eduardosufan

The most obvious to me would be:

np.any(my_array[:, 0] == value)

Hi @detly, can you add more explanation? It seems very obvious to you, but to a beginner like me it is not. My instinct tells me that this might be the solution I'm looking for, but I could not try it without examples :D

@jameshwartlopez my_array[:, 0] gives you all the rows (indicated by :) and, for each row, the 0th element, i.e. the first column. This is a simple one-dimensional array, for example [1, 3, 6, 2, 9]. If you use the == operator in numpy with a scalar, it will do element-wise comparison and return a boolean numpy array of the same shape as the array. So [1, 3, 6, 2, 9] == 3 gives [False, True, False, False, False]. Finally, np.any checks whether any of the values in this array are True.
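The steps described in this comment can be tried directly. This is a small sketch; the two-column array is made-up sample data whose first column matches the [1, 3, 6, 2, 9] example.

```python
import numpy as np

# Hypothetical 5x2 array; the first column is [1, 3, 6, 2, 9]
my_array = np.array([[1, 7], [3, 7], [6, 7], [2, 7], [9, 7]])

first_col = my_array[:, 0]   # all rows, column 0
mask = first_col == 3        # element-wise comparison against a scalar
found = bool(np.any(mask))   # True if at least one element matched

print(first_col)  # [1 3 6 2 9]
print(mask)       # [False  True False False False]
print(found)      # True
```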

HYRY

To check multiple values, you can use numpy.in1d(), which is an element-wise function version of the Python keyword in. If your data is sorted, you can use numpy.searchsorted():

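import numpy as np
data = np.array([1,4,5,5,6,8,8,9])
values = [2,3,4,6,7]
# Element-wise membership test: is each value present in data?
print(np.in1d(values, data))

# Binary search; data must be sorted in ascending order
index = np.searchsorted(data, values)
print(data[index] == values)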

+1 for the less well-known numpy.in1d() and for the very fast searchsorted().

@eryksun: Yeah, interesting. Same observation, here…

Note that the final line will throw an IndexError if any element of values is larger than the greatest value of data, so that requires specific attention.

@fuglede It's possible to replace index with index % len(data) or np.append(index[:-1],0) equivalently in this case.
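One way to sidestep the IndexError discussed in these comments is to clamp the indices before indexing. This is only a sketch: the extra value 10 is added here purely to trigger the out-of-range case, and np.minimum is used in place of the modulo trick mentioned above.

```python
import numpy as np

data = np.array([1, 4, 5, 5, 6, 8, 8, 9])  # must be sorted ascending
values = np.array([2, 3, 4, 6, 7, 10])     # 10 is larger than max(data)

index = np.searchsorted(data, values)      # 10 maps to len(data), which is out of range
safe_index = np.minimum(index, len(data) - 1)  # clamp to the last valid position
found = data[safe_index] == values

print(found)  # [False False  True  True False False]
```

Clamping is safe because a value larger than everything in data can never equal the last element it gets compared against.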

np.in1d() is limited to 1-D NumPy arrays. If you want to check whether multiple values are in a multidimensional NumPy array, use the np.isin() function.
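A short illustration of np.isin() on a 2-D array; the array grid and the list values are made-up sample data.

```python
import numpy as np

grid = np.array([[1, 4, 5],
                 [6, 8, 9]])   # hypothetical 2-D data
values = [2, 4, 6]

# Which elements of grid are among values? The result has grid's shape.
mask = np.isin(grid, values)
print(mask)
# [[False  True False]
#  [ True False False]]

# Which of the values occur anywhere in grid?
each_found = np.isin(values, grid)
print(each_found)  # [False  True  True]
```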

Lukas Mandrake

Fascinating. I needed to improve the speed of a series of loops that must perform matching index determination in this same way. So I decided to time all the solutions here, along with some riffs.

Here are my speed tests for Python 2.7.10:

import timeit
timeit.timeit('N.any(N.in1d(sids, val))', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

18.86137104034424

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = [20010401010101+x for x in range(1000)]')

15.061666011810303

timeit.timeit('N.in1d(sids, val)', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

11.613027095794678

timeit.timeit('N.any(val == sids)', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

7.670552015304565

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

5.610057830810547

timeit.timeit('val == sids', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

1.6632978916168213

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = set([20010401010101+x for x in range(1000)])')

0.0548710823059082

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = dict(zip([20010401010101+x for x in range(1000)],[True,]*1000))')

0.054754018783569336

Very surprising! Orders of magnitude difference!

To summarize, if you just want to know whether something's in a 1D list or not:

19s N.any(N.in1d(numpy array))

15s x in (list)

8s N.any(x == numpy array)

6s x in (numpy array)

.1s x in (set or a dictionary)

If you want to know where something is in the list as well (order is important):

12s N.in1d(x, numpy array)

2s x == (numpy array)

Joelmob

Adding to @HYRY's answer: in1d seems to be the fastest for NumPy. This is using NumPy 1.8 and Python 2.7.6.

In this test in1d was fastest; however, 10 in a looks cleaner:

from numpy import arange, in1d  # IPython session; %timeit is an IPython magic
a = arange(0,99999,3)
%timeit 10 in a
%timeit in1d(a, 10)

10000 loops, best of 3: 150 µs per loop
10000 loops, best of 3: 61.9 µs per loop

Constructing a set is slower than calling in1d, but checking if the value exists is a bit faster:

s = set(range(0, 99999, 3))
%timeit 10 in s

10000000 loops, best of 3: 47 ns per loop

The comparison isn't fair. You need to count the cost of converting an array to a set. OP starts with a NumPy array.

I didn't mean to compare the methods like that, so I edited the post to point out the cost of creating a set. If you already have a Python set, there is no big difference.
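To make the conversion cost raised in these comments visible, one could time the set construction together with the lookup. This is a rough sketch using the standard timeit module; the variable names and the repetition count are arbitrary, and no timings are quoted because they depend on the machine and NumPy version.

```python
import timeit
import numpy as np

a = np.arange(0, 99999, 3)

# Membership test directly on the array
t_array = timeit.timeit(lambda: 10 in a, number=100)

# Build the set on every call, then look up: what you pay starting from an array
t_build_and_lookup = timeit.timeit(lambda: 10 in set(a.tolist()), number=100)

# Lookup only, with the set built once beforehand
s = set(a.tolist())
t_lookup_only = timeit.timeit(lambda: 10 in s, number=100)

print(t_array, t_build_and_lookup, t_lookup_only)
```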

Loochie

The most convenient way, in my opinion, is:

(Val in X[:, col_num])

where Val is the value that you want to check for and X is the array. In your example, suppose you want to check whether the value 8 exists in the third column. Simply write

(8 in X[:, 2])

This will return True if 8 is there in the third column, else False.
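For instance, with a small made-up array X (an assumption for illustration, not the asker's data):

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 8],
              [7, 8, 9]])  # hypothetical data

in_third = 8 in X[:, 2]   # third column is [3, 8, 9]
in_first = 8 in X[:, 0]   # first column is [1, 4, 7]

print(in_third)  # True
print(in_first)  # False
```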

Simon Klein

If you are looking for a list of integers, you may use indexing for doing the work. This also works with nd-arrays, but seems to be slower. It may be better when doing this more than once.

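import numpy as np

def valuesInArray(values, array):
    values = np.asanyarray(values)
    array = np.asanyarray(array)
    # np.int was removed in NumPy 1.24; accept any integer dtype instead
    assert np.issubdtype(array.dtype, np.integer) and np.issubdtype(values.dtype, np.integer)

    # Lookup table: matches[v] is True for every v in values (non-negative integers only)
    matches = np.zeros(max(array.max(), values.max()) + 1, dtype=np.bool_)
    matches[values] = True

    # Indexing the table with the array gives a boolean mask of the array's shape
    res = matches[array]

    return np.any(res), res
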
array = np.random.randint(0, 1000, (10000,3))
values = np.array((1,6,23,543,222))

matched, matches = valuesInArray(values, array)

By using numba and njit, I could speed this up by roughly 10x.
