nearest neighbour search 4D space python fast - vectorization


For each observation in X (there are 20) I want to get the k (here: 3) nearest neighbours. How can I make this fast enough to handle 3 to 4 million rows? Is it possible to speed up the loop over the elements, e.g. via NumPy, Numba, or some kind of vectorization?

A naive loop in Python:

import numpy as np
from sklearn.neighbors import KDTree

n_points = 20
d_dimensions = 4
k_neighbours = 3

rng = np.random.RandomState(0)
X = rng.random_sample((n_points, d_dimensions))
print(X)
tree = KDTree(X, leaf_size=2, metric='euclidean')

for element in X:
    print('********')
    print(element)

    # to query only the first row instead:
    # element = X[:1]
    # print(element)

    # potential optimization: query_radius
    # https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.query_radius
    dist, ind = tree.query([element], k=k_neighbours, return_distance=True,
                           dualtree=False, breadth_first=False, sort_results=True)

    # indices of the 3 closest neighbours
    print(ind)
    # [[0 9 1]]  !! includes self (the element that was searched for)

    # distances to the 3 closest neighbours
    print(dist)
    # [[0.         0.38559188 0.40997835]]  !! includes self

    # the actual elements behind the indices
    print(X[ind])

    # after removing self
    print(X[ind][0][1:])

Ideally, the output is a pandas.DataFrame of the following structure:

lat_1      long_1      lat_2       long_2      neighbours_list
0.5488135  0.71518937  0.60276338  0.54488318  [[0.61209572 0.616934   0.94374808 0.6818203 ] [0.4236548  0.64589411 0.43758721 0.891773  ]]

edit

For now, I have a pandas-based implementation:

df = df.dropna()  # sometimes only part of the tuple (either left or right) is defined
X = df[['lat1', 'long1', 'lat2', 'long2']]
tree = KDTree(X, leaf_size=4, metric='euclidean')

k_neighbours = 3

def neighbors_as_list(row, index, complete_list):
    dist, ind = index.query([[row['lat1'], row['long1'], row['lat2'], row['long2']]],
                            k=k_neighbours, return_distance=True,
                            dualtree=False, breadth_first=False, sort_results=True)
    return complete_list.values[ind][0][1:]

df['neighbors'] = df.apply(neighbors_as_list, index=tree, complete_list=X, axis=1)
df.head()

But this is very slow.

edit 2

Sure, here is a pandas version:

import numpy as np
import pandas as pd

from sklearn.neighbors import KDTree
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
#n_points = 4_000_000
n_points = 20
d_dimensions = 4
k_neighbours = 3

X = rng.random_sample((n_points, d_dimensions))
X


df = pd.DataFrame(X)
df = df.reset_index(drop=False)
df.columns = ['id_str', 'lat_1', 'long_1', 'lat_2', 'long_2']
df.id_str = df.id_str.astype(object)
display(df.head())

tree = cKDTree(df[['lat_1', 'long_1', 'lat_2', 'long_2']])
dist, ind = tree.query(X, k=k_neighbours, n_jobs=-1)

display(dist)
print(df[['lat_1', 'long_1', 'lat_2', 'long_2']].shape)
print(X[ind].shape)
X[ind]

# fails with
# AssertionError: Shape of new values must be compatible with manager shape
df['neighbors'] = X[ind]
df

But it fails as I cannot re-assign the result.

Answer

To make your code fast enough to handle 3 to 4 million rows, we can focus on several optimization strategies, both in the computation itself and in how the data is handled. The key areas for improvement are more efficient data structures, avoiding Python-level loops, and leveraging parallelism.

Optimization Strategy

1. Switching to cKDTree from KDTree

You've already made the right choice by switching from sklearn's KDTree to cKDTree from scipy.spatial. cKDTree is implemented in compiled C++ and is typically much faster for large batched queries than the sklearn tree.
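As a rough illustration, here is a minimal benchmark sketch (absolute numbers vary with machine and library versions):

import timeit

import numpy as np
from scipy.spatial import cKDTree
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X = rng.random_sample((100_000, 4))

sk_tree = KDTree(X, metric='euclidean')
c_tree = cKDTree(X)

# one batched k=3 query over all points with each implementation
t_sk = timeit.timeit(lambda: sk_tree.query(X, k=3), number=1)
t_c = timeit.timeit(lambda: c_tree.query(X, k=3), number=1)
print(f'sklearn KDTree: {t_sk:.2f}s   scipy cKDTree: {t_c:.2f}s')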

2. Vectorization and Efficient Querying

Currently, you're using apply with a row-wise operation in pandas, which is very slow: it loops over every row in Python and issues a separate tree query for each. To speed this up, query the entire dataset in a single call so the per-point loop runs in compiled code, as sketched below.
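A minimal sketch of the two patterns side by side:

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
coords = rng.random_sample((10_000, 4))
tree = cKDTree(coords)

# slow: one Python-level query per row (this is what df.apply() amounts to)
# results = [tree.query(p, k=3) for p in coords]

# fast: a single batched query; the loop over points runs in compiled code
dist, ind = tree.query(coords, k=3)
print(ind.shape)  # (10000, 3)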

3. Parallelizing with n_jobs

You're using n_jobs=-1, which lets cKDTree.query spread the work across all CPU cores and can give a significant speedup on multi-core systems. Be aware that recent SciPy releases renamed this parameter to workers and later removed the old n_jobs alias.
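A version-tolerant call, as a sketch:

import numpy as np
from scipy.spatial import cKDTree

coords = np.random.RandomState(0).random_sample((10_000, 4))
tree = cKDTree(coords)

try:
    # SciPy >= 1.6: the parameter is called `workers`
    dist, ind = tree.query(coords, k=3, workers=-1)
except TypeError:
    # older SciPy: fall back to the deprecated `n_jobs` name
    dist, ind = tree.query(coords, k=3, n_jobs=-1)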

4. Avoiding Redundant Data Access

You are trying to assign the neighbors directly into a pandas DataFrame, but a DataFrame column must be one-dimensional: a (n, k, d) array cannot be assigned as-is and has to be split into one object per row.
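A minimal sketch of the failure and the workaround:

import numpy as np
import pandas as pd

neighbours = np.arange(24, dtype=float).reshape(3, 2, 4)  # (rows, k, dims)
df = pd.DataFrame({'lat_1': [0.1, 0.2, 0.3]})

# df['neighbours'] = neighbours      # fails: a column cannot hold a 3-D block
df['neighbours'] = list(neighbours)  # one (k, dims) array per row works
print(df)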

Optimized Implementation

Here is an optimized version of your code that should work for large datasets like 3–4 million rows. We will:

  • Use cKDTree for faster querying.
  • Avoid apply by querying the tree for all data at once.
  • Efficiently assign the result back to the DataFrame.

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# Generate a random dataset (or load your data)
rng = np.random.RandomState(0)
n_points = 4_000_000  # or adjust to your actual dataset size
d_dimensions = 4
k_neighbours = 3

X = rng.random_sample((n_points, d_dimensions))

# Create DataFrame (assuming 'lat_1', 'long_1', 'lat_2', 'long_2' columns)
df = pd.DataFrame(X, columns=['lat_1', 'long_1', 'lat_2', 'long_2'])
df['id_str'] = df.index.astype(str)  # Add an identifier for each row

# Build the cKDTree for fast nearest neighbor querying
tree = cKDTree(df[['lat_1', 'long_1', 'lat_2', 'long_2']])

# Query for k-nearest neighbors for all points at once
dist, ind = tree.query(X, k=k_neighbours, workers=-1)  # `workers` since SciPy 1.6; older versions use `n_jobs`

# Extract the neighbor coordinates, excluding self (always column 0 here)
neighbors = X[ind[:, 1:]]  # slice the indices first to avoid materialising the self column

# Add the neighbors to the DataFrame.
# A column must be 1-D, so store one nested Python list per row instead of a 3-D array.
df['neighbors'] = neighbors.tolist()

# Show the DataFrame
print(df.head())

Explanation of Changes

  1. Generating Data: We create random sample data with the shape (4_000_000, 4) (or whatever your actual dataset size is).
  2. cKDTree Construction: We build a cKDTree from the lat_1, long_1, lat_2, and long_2 columns for efficient querying.
  3. Querying Neighbors: The query method is called once for all points in the dataset, and workers=-1 (n_jobs=-1 on older SciPy) parallelizes the computation across all available cores.
  4. Neighbor Extraction: The indices returned by query include the point itself as its own nearest neighbor, so we slice the index array with [:, 1:] before gathering coordinates (a quick sanity check follows this list).
  5. Assigning to DataFrame: Instead of assigning a multidimensional array to a single column, which pandas rejects, we convert the array to nested Python lists with tolist() and assign one list of neighbor coordinates per row.
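The sanity check, as a small self-contained sketch:

import numpy as np
from scipy.spatial import cKDTree

pts = np.random.RandomState(0).random_sample((1_000, 4))
dist, ind = cKDTree(pts).query(pts, k=3)

assert np.allclose(dist[:, 0], 0.0)                    # each point's nearest neighbor is itself
assert np.array_equal(ind[:, 0], np.arange(len(pts)))  # holds as long as all points are distinct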

Performance Considerations

  1. Scalability: This solution scales to 3–4 million rows: tree construction is roughly O(n log n), and the single batched query dominates the runtime. If the data comes from a file, load it once and keep it in memory rather than re-reading it.
  2. Parallelism: The n_jobs=-1 argument in the query method allows cKDTree to utilize all available CPU cores. This can provide a significant speedup when dealing with large datasets.
  3. Memory Usage: Make sure your system can hold the dataset, the tree, and the query output. For 4 million rows the (n, 2, 4) float64 neighbor array alone is about 4e6 × 2 × 4 × 8 bytes ≈ 256 MB, and the per-row Python lists cost considerably more; chunk the queries if you hit memory limits (see the sketch below).
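If memory does become the constraint, the query can be chunked. A minimal sketch (query_in_chunks and the chunk size are illustrative, not SciPy API):

import numpy as np
from scipy.spatial import cKDTree

def query_in_chunks(tree, points, k, chunk_size=500_000):
    # Run tree.query over fixed-size slices to bound peak memory usage.
    dist = np.empty((len(points), k))
    ind = np.empty((len(points), k), dtype=np.intp)
    for start in range(0, len(points), chunk_size):
        stop = start + chunk_size
        dist[start:stop], ind[start:stop] = tree.query(
            points[start:stop], k=k, workers=-1)
    return dist, ind

# usage: dist, ind = query_in_chunks(tree, X, k=k_neighbours)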

Expected Output

For each row in your dataset, you will have a neighbors column containing the nearest neighbors, excluding the point itself. Note that querying with k=3 and dropping self leaves 2 true neighbors; query with k=k_neighbours + 1 if you need k true neighbors after removing self. The format will be:

lat_1      long_1      lat_2      long_2      neighbors
0.5488135  0.71518937  0.60276338  0.54488318  [[0.4236548, 0.64589411, 0.43758721, 0.891773], ...]

Notes:

  • Handling Data Size: If your dataset size exceeds available RAM, you may want to consider using out-of-core processing or breaking the problem into smaller chunks and processing them sequentially.
  • Further Optimizations: For extremely large datasets, you may consider leveraging libraries like Dask or Modin for distributed computing, or GPU-accelerated libraries like cuML (part of the RAPIDS AI suite) for nearest neighbor search if you have access to GPUs; a sketch follows.
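For the GPU route, cuML exposes a scikit-learn-style estimator. A hedged sketch, assuming a CUDA-capable GPU and an installed RAPIDS cuml package:

import numpy as np
from cuml.neighbors import NearestNeighbors  # RAPIDS; requires a CUDA GPU

X = np.random.RandomState(0).random_sample((1_000_000, 4)).astype(np.float32)

nn = NearestNeighbors(n_neighbors=4)  # k + 1 so the self-match can be dropped
nn.fit(X)
dist, ind = nn.kneighbors(X)

neighbors = X[ind[:, 1:]]  # drop the self column, as in the CPU version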

Let me know if you need further optimizations or have any questions about the implementation!