Better way to shuffle two numpy arrays in unison

python numpy random shuffle numpy-ndarray

I have two numpy arrays of different shapes, but with the same length (leading dimension). I want to shuffle each of them, such that corresponding elements continue to correspond -- i.e. shuffle them in unison with respect to their leading indices.

This code works, and illustrates my goals:

def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
    shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
    permutation = numpy.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b

For example:

>>> a = numpy.asarray([[1, 1], [2, 2], [3, 3]])
>>> b = numpy.asarray([1, 2, 3])
>>> shuffle_in_unison(a, b)
(array([[2, 2],
       [1, 1],
       [3, 3]]), array([2, 1, 3]))

However, this feels clunky, inefficient, and slow, and it requires making a copy of the arrays -- I'd rather shuffle them in-place, since they'll be quite large.

Is there a better way to go about this? Faster execution and lower memory usage are my primary goals, but elegant code would be nice, too.

One other thought I had was this:

def shuffle_in_unison_scary(a, b):
    rng_state = numpy.random.get_state()
    numpy.random.shuffle(a)
    numpy.random.set_state(rng_state)
    numpy.random.shuffle(b)

This works...but it's a little scary, as I see little guarantee it'll continue to work -- it doesn't look like the sort of thing that's guaranteed to survive across numpy version, for example.

Six years later, I'm amused and surprised by how popular this question proved to be. And in a bit of delightful coincidence, for Go 1.10 I contributed math/rand.Shuffle to the standard library. The design of the API makes it trivial to shuffle two arrays in unison, and doing so is even included as an example in the docs.

This is a different programming language however.

Íhor Mé

Your can use NumPy's array indexing:

def unison_shuffled_copies(a, b):
    assert len(a) == len(b)
    p = numpy.random.permutation(len(a))
    return a[p], b[p]

This will result in creation of separate unison-shuffled arrays.

This does create copies, as it uses advanced indexing. But of course it is faster than the original.

@mtrw: The mere fact that the original arrays are untouched does not outrule that the returned arrays are views of the same data. But they are indeed not, since NumPy views are not flexible enough to support permuted views (this wouldn't be desirable either).

@Sven - I really have to learn about views. @Dat Chu - I just tried

>>> t = timeit.Timer(stmt = "<function>(a,b)", setup = "import numpy as np; a,b = np.arange(4), np.arange(4*20).reshape((4,20))")>>> t.timeit()

and got 38 seconds for the OP's version, and 27.5 seconds for mine, for 1 million calls each.

I really like the simplicity and readability of this, and advanced indexing continues to surprise and amaze me; for that this answer readily gets +1. Oddly enough, though, on my (large) datasets, it is slower than my original function: my original takes ~1.8s for 10 iterations, and this takes ~2.7s. Both numbers are quite consistent. The dataset I used to test has a.shape is (31925, 405) and b.shape is (31925,).

Maybe, the slowness has to do with the fact that you're not doing things in-place, but are instead creating new arrays. Or with some slowness related to how CPython parses array-indexes.

James

X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y, random_state=0)

To learn more, see http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html

This solution creates copies ("The original arrays are not impacted"), whereas the author's "scary" solution doesn't.

You can choose any style as you like

Sven Marnach

Your "scary" solution does not appear scary to me. Calling shuffle() for two sequences of the same length results in the same number of calls to the random number generator, and these are the only "random" elements in the shuffle algorithm. By resetting the state, you ensure that the calls to the random number generator will give the same results in the second call to shuffle(), so the whole algorithm will generate the same permutation.

If you don't like this, a different solution would be to store your data in one array instead of two right from the beginning, and create two views into this single array simulating the two arrays you have now. You can use the single array for shuffling and the views for all other purposes.

Example: Let's assume the arrays a and b look like this:

a = numpy.array([[[  0.,   1.,   2.],
                  [  3.,   4.,   5.]],

                 [[  6.,   7.,   8.],
                  [  9.,  10.,  11.]],

                 [[ 12.,  13.,  14.],
                  [ 15.,  16.,  17.]]])

b = numpy.array([[ 0.,  1.],
                 [ 2.,  3.],
                 [ 4.,  5.]])

We can now construct a single array containing all the data:

c = numpy.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
# array([[  0.,   1.,   2.,   3.,   4.,   5.,   0.,   1.],
#        [  6.,   7.,   8.,   9.,  10.,  11.,   2.,   3.],
#        [ 12.,  13.,  14.,  15.,  16.,  17.,   4.,   5.]])

Now we create views simulating the original a and b:

a2 = c[:, :a.size//len(a)].reshape(a.shape)
b2 = c[:, a.size//len(a):].reshape(b.shape)

The data of a2 and b2 is shared with c. To shuffle both arrays simultaneously, use numpy.random.shuffle(c).

In production code, you would of course try to avoid creating the original a and b at all and right away create c, a2 and b2.

This solution could be adapted to the case that a and b have different dtypes.

Re: the scary solution: I just worry that arrays of different shapes could (conceivably) yield different numbers of calls to the rng, which would cause divergence. However, I think you are right that the current behavior is perhaps unlikely to change, and a very simple doctest does make confirming correct behavior very easy...

I like your suggested approach, and could definitely arrange to have a and b start life as a unified c array. However, a and b will need to be contiguous shortly after shuffling (for efficient transfer to a GPU), so I think that, in my particular case, I'd end up making copies of a and b anyway. :(

@Josh: Note that numpy.random.shuffle() operates on arbitrary mutable sequences, such as Python lists or NumPy arrays. The array shape does not matter, only the length of the sequence. This is very unlikely to change in my opinion.

I didn't know that. That makes me much more comfortable with it. Thank you.

@SvenMarnach : I posted an answer below. Can you comment on whether you think it makes sense/ is a good way to do it?

connor

Very simple solution:

randomize = np.arange(len(x))
np.random.shuffle(randomize)
x = x[randomize]
y = y[randomize]

the two arrays x,y are now both randomly shuffled in the same way

This is equivalent to mtrw's solution. Your first two lines are just generating a permutation, but that can be done in one line.

Massinissa

James wrote in 2015 an sklearn solution which is helpful. But he added a random state variable, which is not needed. In the below code, the random state from numpy is automatically assumed.

X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y)

benjaminjsanders

from np.random import permutation
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data #numpy array
y = iris.target #numpy array

# Data is currently unshuffled; we should shuffle 
# each X[i] with its corresponding y[i]
perm = permutation(len(X))
X = X[perm]
y = y[perm]

This is essentially identical to the top-voted answer.

Isaac B

Shuffle any number of arrays together, in-place, using only NumPy.

import numpy as np


def shuffle_arrays(arrays, set_seed=-1):
    """Shuffles arrays in-place, in the same order, along axis=0

    Parameters:
    -----------
    arrays : List of NumPy arrays.
    set_seed : Seed value if int >= 0, else seed is random.
    """
    assert all(len(arr) == len(arrays[0]) for arr in arrays)
    seed = np.random.randint(0, 2**(32 - 1) - 1) if set_seed < 0 else set_seed

    for arr in arrays:
        rstate = np.random.RandomState(seed)
        rstate.shuffle(arr)

And can be used like this

a = np.array([1, 2, 3, 4, 5])
b = np.array([10,20,30,40,50])
c = np.array([[1,10,11], [2,20,22], [3,30,33], [4,40,44], [5,50,55]])

shuffle_arrays([a, b, c])

A few things to note:

The assert ensures that all input arrays have the same length along their first dimension.

Arrays shuffled in-place by their first dimension - nothing returned.

Random seed within positive int32 range.

If a repeatable shuffle is needed, seed value can be set.

After the shuffle, the data can be split using np.split or referenced using slices - depending on the application.

beautiful solution, this worked perfect for me. Even with arrays of 3+ axis

This is the correct answer. There is no reason to use the global np.random when you can pass around random state objects.

One RandomState could be used outside of the loop. See Adam Snaider's answer

@bartolo-otrit, the choice that has to be made in the for loop is whether to reassign or reseed random state. With the number of arrays being passed into a shuffling function expected to be small, I wouldn't expect a performance difference between the two. But yes, rstate could be assigned outside the loop and reseeded inside the loop on each iteration.

mohammad hassan bigdeli shamlo

you can make an array like:

s = np.arange(0, len(a), 1)

then shuffle it:

np.random.shuffle(s)

now use this s as argument of your arrays. same shuffled arguments return same shuffled vectors.

x_data = x_data[s]
x_label = x_label[s]

Really, this is the best solution, and should be the accepted one! It even works for many (more than 2) arrays at the same time. The idea is simple: just shuffle the index list [0, 1, 2, ..., n-1] , and then reindex the arrays' rows with the shuffled indexes. Nice!

sziraqui

There is a well-known function that can handle this:

from sklearn.model_selection import train_test_split
X, _, Y, _ = train_test_split(X,Y, test_size=0.0)

Just setting test_size to 0 will avoid splitting and give you shuffled data. Though it is usually used to split train and test data, it does shuffle them too.
From documentation

Split arrays or matrices into random train and test subsets Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and application to input data into a single call for splitting (and optionally subsampling) data in a oneliner.

I can not believe I never thought of this. Your answer is brilliant.

Has something changed in sklearn? This solution is not working for me and throwing a ValueError.

I don't see any changes in this function. Check if you are passing correct data type (any array-like type will work) and also check if the arrays have same shape.

andrea m.

This seems like a very simple solution:

import numpy as np
def shuffle_in_unison(a,b):

    assert len(a)==len(b)
    c = np.arange(len(a))
    np.random.shuffle(c)

    return a[c],b[c]

a =  np.asarray([[1, 1], [2, 2], [3, 3]])
b =  np.asarray([11, 22, 33])

shuffle_in_unison(a,b)
Out[94]: 
(array([[3, 3],
        [2, 2],
        [1, 1]]),
 array([33, 22, 11]))

Adam Snaider

One way in which in-place shuffling can be done for connected lists is using a seed (it could be random) and using numpy.random.shuffle to do the shuffling.

# Set seed to a random number if you want the shuffling to be non-deterministic.
def shuffle(a, b, seed):
   np.random.seed(seed)
   np.random.shuffle(a)
   np.random.seed(seed)
   np.random.shuffle(b)

That's it. This will shuffle both a and b in the exact same way. This is also done in-place which is always a plus.

EDIT, don't use np.random.seed() use np.random.RandomState instead

def shuffle(a, b, seed):
   rand_state = np.random.RandomState(seed)
   rand_state.shuffle(a)
   rand_state.seed(seed)
   rand_state.shuffle(b)

When calling it just pass in any seed to feed the random state:

a = [1,2,3,4]
b = [11, 22, 33, 44]
shuffle(a, b, 12345)

Output:

>>> a
[1, 4, 2, 3]
>>> b
[11, 44, 22, 33]

Edit: Fixed code to re-seed the random state

This code does not work. RandomState changes state on the first call and a and b are not shuffled in unison.

@BrunoKlein You are right. I fixed the post to re-seed the random state. Also, even though it is not in unison in the sense of both lists being shuffled at the same time, they are in unison in the sense that both are shuffled in the same way, and it also doesn't require more memory to hold a copy of the lists (which OP mentions in his question)

monolith

Say we have two arrays: a and b.

a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = np.array([[9,1,1],[6,6,6],[4,2,0]])

We can first obtain row indices by permutating first dimension

indices = np.random.permutation(a.shape[0])
[1 2 0]

Then use advanced indexing. Here we are using the same indices to shuffle both arrays in unison.

a_shuffled = a[indices[:,np.newaxis], np.arange(a.shape[1])]
b_shuffled = b[indices[:,np.newaxis], np.arange(b.shape[1])]

This is equivalent to

np.take(a, indices, axis=0)
[[4 5 6]
 [7 8 9]
 [1 2 3]]

np.take(b, indices, axis=0)
[[6 6 6]
 [4 2 0]
 [9 1 1]]

Why not just a[indices,:] or b[indices,:]?

DaveP

If you want to avoid copying arrays, then I would suggest that instead of generating a permutation list, you go through every element in the array, and randomly swap it to another position in the array

for old_index in len(a):
    new_index = numpy.random.randint(old_index+1)
    a[old_index], a[new_index] = a[new_index], a[old_index]
    b[old_index], b[new_index] = b[new_index], b[old_index]

This implements the Knuth-Fisher-Yates shuffle algorithm.

codinghorror.com/blog/2007/12/the-danger-of-naivete.html has made me wary of implementing my own shuffle algorithms; it is in part responsible for my asking this question. :) However, you are very right to point out that I should consider using the Knuth-Fisher-Yates algorithm.

Well spotted, I've fixed the code now. Anyway, I think the basic idea of in-place shuffling is scalable to an arbitrary number of arrays an avoids making copies.

The code is still incorrect (it won't even run). To make it work, replace len(a) by reversed(range(1, len(a))). But it won't be very efficient anyway.

momo668

Shortest and easiest way in my opinion, use seed:

random.seed(seed)
random.shuffle(x_data)
# reset the same seed to get the identical random sequence and shuffle the y
random.seed(seed)
random.shuffle(y_data)

ajfbiw.s

With an example, this is what I'm doing:

combo = []
for i in range(60000):
    combo.append((images[i], labels[i]))

shuffle(combo)

im = []
lab = []
for c in combo:
    im.append(c[0])
    lab.append(c[1])
images = np.asarray(im)
labels = np.asarray(lab)

This is more or less equivalent to combo = zip(images, labels); shuffle(combo); im, lab = zip(*combo), just slower. Since you are using Numpy anyway, a yet much faster solution would be to zip the arrays using Numpy combo = np.c_[images, labels], shuffle, and unzip again images, labels = combo.T. Assuming that labels and images are one-dimensional Numpy arrays of the same length to begin with, this will be easily the fastest solution. If they are multi-dimensional, see my answer above.

Ok that makes sense. Thanks! @SvenMarnach

Ivo

I extended python's random.shuffle() to take a second arg:

def shuffle_together(x, y):
    assert len(x) == len(y)

    for i in reversed(xrange(1, len(x))):
        # pick an element in x[:i+1] with which to exchange x[i]
        j = int(random.random() * (i+1))
        x[i], x[j] = x[j], x[i]
        y[i], y[j] = y[j], y[i]

That way I can be sure that the shuffling happens in-place, and the function is not all too long or complicated.

Tonechas

Just use numpy...

First merge the two input arrays 1D array is labels(y) and 2D array is data(x) and shuffle them with NumPy shuffle method. Finally split them and return.

import numpy as np

def shuffle_2d(a, b):
    rows= a.shape[0]
    if b.shape != (rows,1):
        b = b.reshape((rows,1))
    S = np.hstack((b,a))
    np.random.shuffle(S)
    b, a  = S[:,0], S[:,1:]
    return a,b

features, samples = 2, 5
x, y = np.random.random((samples, features)), np.arange(samples)
x, y = shuffle_2d(train, test)

Diego Velez

most solutions above work, however if you have column vectors you have to transpose them first. here is an example

def shuffle(self) -> None:
    """
    Shuffles X and Y
    """
    x = self.X.T
    y = self.Y.T
    p = np.random.permutation(len(x))
    self.X = x[p].T
    self.Y = y[p].T

Better way to shuffle two numpy arrays in unison

Follow WeChat

Want to stay one step ahead of the latest teleworks?

相似问题

Platform

Support

Contact US