Find unique rows in numpy.array

ghz 7months ago ⋅ 121 views

I need to find unique rows in a numpy.array.

For example:

>>> a # I have
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])
>>> new_a # I want to get to
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 0]])

I know that i can create a set and loop over the array, but I am looking for an efficient pure numpy solution. I believe that there is a way to set data type to void and then I could just use numpy.unique, but I couldn't figure out how to make it work.

Answers

To find unique rows in a NumPy array efficiently, you can indeed use a combination of NumPy's structured arrays and the numpy.unique function. Here’s a step-by-step guide to achieve this:

  1. Convert each row of the array into a single element of a structured array.
  2. Use numpy.unique to find the unique rows.
  3. Convert the result back to the original array format.

Here’s how you can do it:

import numpy as np

def unique_rows(a):
    # View the array as a structured array with each row as a single element
    a_view = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
    
    # Find the unique elements in the structured array
    _, idx = np.unique(a_view, return_index=True)
    
    # Sort the indices to preserve the order of first occurrence
    unique_a = a[np.sort(idx)]
    
    return unique_a

# Example usage
a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

new_a = unique_rows(a)
print(new_a)

Explanation

  1. Viewing as a structured array:

    • np.ascontiguousarray(a) ensures that the array is contiguous in memory, which is necessary for viewing it as a structured array.
    • .view(np.dtype((np.void, a.dtype.itemsize * a.shape[1]))) treats each row as a single element of a void type, where the size of each element is the total number of bytes in a row (a.dtype.itemsize * a.shape[1]).
  2. Finding unique rows:

    • np.unique(a_view, return_index=True) returns the unique elements and the indices of the first occurrences of these unique elements. The a_view array, being of void type, is compared element-wise to determine uniqueness.
  3. Preserving order:

    • np.sort(idx) sorts the indices of the unique rows to maintain the order of their first appearance in the original array.
    • a[np.sort(idx)] uses these sorted indices to extract the unique rows in the original order.

This method is efficient and leverages the power of NumPy to perform the operation in a vectorized manner without explicit loops in Python, making it suitable for large arrays.