I need to find unique rows in a numpy.array.
For example:
>>> a  # I have
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])
>>> new_a  # I want to get to
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 0]])
I know that I can create a set and loop over the array, but I am looking for an efficient pure NumPy solution. I believe that there is a way to set the data type to void and then just use numpy.unique, but I couldn't figure out how to make it work.
Answers
To find unique rows in a NumPy array efficiently, you can indeed view each row as a single element of void (raw bytes) type and pass that view to the numpy.unique function. Here’s a step-by-step guide to achieve this:
- Convert each row of the array into a single element of a void-type array.
- Use numpy.unique to find the unique rows.
- Convert the result back to the original array format.
Here’s how you can do it:
import numpy as np

def unique_rows(a):
    # View the array so that each row becomes a single void (raw bytes) element
    a_view = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
    # Find the unique elements in the void view, keeping the index of each first occurrence
    _, idx = np.unique(a_view, return_index=True)
    # Sort the indices to preserve the order of first occurrence
    unique_a = a[np.sort(idx)]
    return unique_a

# Example usage
a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])
new_a = unique_rows(a)
print(new_a)
Explanation
- Viewing as a void array: np.ascontiguousarray(a) ensures that the array is contiguous in memory, which is necessary for reinterpreting each row as one block of bytes. .view(np.dtype((np.void, a.dtype.itemsize * a.shape[1]))) treats each row as a single element of void type, where the size of each element is the total number of bytes in a row (a.dtype.itemsize * a.shape[1]).
- Finding unique rows: np.unique(a_view, return_index=True) returns the unique elements and the indices of the first occurrences of these unique elements. Because a_view is of void type, each row is compared as a whole block of bytes, so two rows count as equal only if their bytes are identical.
- Preserving order: np.sort(idx) sorts the indices of the unique rows to maintain the order of their first appearance in the original array. a[np.sort(idx)] uses these sorted indices to extract the unique rows in the original order.
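To make the three steps above concrete, here is a small sketch that traces the intermediate values for the example array. The names row_bytes and first_seen are introduced only for illustration, and the shape shown assumes the array is 2-D and C-contiguous.

```python
import numpy as np

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

# Step 1: reinterpret each row as one void element of row_bytes bytes.
row_bytes = a.dtype.itemsize * a.shape[1]
a_view = np.ascontiguousarray(a).view(np.dtype((np.void, row_bytes)))
print(a_view.shape)  # (5, 1): one void element per row

# Step 2: indices of the first occurrence of each distinct row.
_, idx = np.unique(a_view, return_index=True)

# Step 3: sort the indices so rows come back in order of first appearance.
first_seen = np.sort(idx)
print(first_seen)     # [0 1 4]
print(a[first_seen])  # rows 0, 1 and 4 of the original array
```

Rows 2 and 3 are dropped because they repeat rows 1 and 0 respectively.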
This method avoids explicit Python loops: the work is done by a single vectorized numpy.unique call, which makes it suitable for large arrays.
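As a possible alternative, NumPy 1.13 and later accept an axis argument in numpy.unique, which compares whole rows without a manual void view. This is a different route from the void-view method described above and is shown only as a sketch. The helper name unique_rows_axis is hypothetical. Note that numpy.unique returns the rows in sorted order, so the order of first occurrence still has to be restored through return_index.

```python
import numpy as np

def unique_rows_axis(a):
    # Hypothetical helper: let np.unique compare whole rows via axis=0,
    # then reorder by first occurrence instead of sorted order.
    _, idx = np.unique(a, axis=0, return_index=True)
    return a[np.sort(idx)]

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

sorted_rows = np.unique(a, axis=0)   # unique rows, sorted lexicographically
ordered_rows = unique_rows_axis(a)   # unique rows, in order of first appearance
print(sorted_rows)
print(ordered_rows)
```

If the order of the rows does not matter, np.unique(a, axis=0) on its own is enough.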