For each observation in X (there are 20) I want to get the k(3) nearest neighbors. How to make this fast to support up to 3 to 4 million rows? Is it possible to speed up the loop iterating over the elements? Maybe via numpy, numba or some kind of vectorization?
A naive loop in python:
import numpy as np
from sklearn.neighbors import KDTree
n_points = 20
d_dimensions = 4
k_neighbours = 3
rng = np.random.RandomState(0)
X = rng.random_sample((n_points, d_dimensions))
print(X)
tree = KDTree(X, leaf_size=2, metric='euclidean')
for element in X:
print('********')
print(element)
# when simply using the first row
#element = X[:1]
#print(element)
# potential optimization: query_radius https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.query_radius
dist, ind = tree.query([element], k=k_neighbours, return_distance=True, dualtree=False, breadth_first=False, sort_results=True)
# indices of 3 closest neighbors
print(ind)
#[[0 9 1]] !! includes self (element that was searched for)
print(dist) # distances to 3 closest neighbors
#[[0. 0.38559188 0.40997835]] !! includes self (element that was searched for)
# actual returned elements for index:
print(X[ind])
## after removing self
print(X[ind][0][1:])
Optimally the output is a pandas.DataFrame of the following structure:
lat_1,long_1,lat_2,long_2,neighbours_list
0.5488135,0.71518937,0.60276338,0.54488318, [[0.61209572 0.616934 0.94374808 0.6818203 ][0.4236548 0.64589411 0.43758721 0.891773]
edit
For now, I have a pandas-based implementation:
df = df.dropna() # there are sometimes only parts of the tuple (either left or right) defined
X = df[['lat1', 'long1', 'lat2', 'long2']]
tree = KDTree(X, leaf_size=4, metric='euclidean')
k_neighbours = 3
def neighbors_as_list(row, index, complete_list):
dist, ind = index.query([[row['lat1'], row['long1'], row['lat2'], row['long2']]], k=k_neighbours, return_distance=True, dualtree=False, breadth_first=False, sort_results=True)
return complete_list.values[ind][0][1:]
df['neighbors'] = df.apply(neighbors_as_list, index=tree, complete_list=X, axis=1)
df.head()
But this is very slow.
edit 2
Sure, here is a pandas version:
import numpy as np
import pandas as pd
from sklearn.neighbors import KDTree
from scipy.spatial import cKDTree
rng = np.random.RandomState(0)
#n_points = 4_000_000
n_points = 20
d_dimensions = 4
k_neighbours = 3
X = rng.random_sample((n_points, d_dimensions))
X
df = pd.DataFrame(X)
df = df.reset_index(drop=False)
df.columns = ['id_str', 'lat_1', 'long_1', 'lat_2', 'long_2']
df.id_str = df.id_str.astype(object)
display(df.head())
tree = cKDTree(df[['lat_1', 'long_1', 'lat_2', 'long_2']])
dist,ind=tree.query(X, k=k_neighbours,n_jobs=-1)
display(dist)
print(df[['lat_1', 'long_1', 'lat_2', 'long_2']].shape)
print(X[ind_out].shape)
X[ind_out]
# fails with
# AssertionError: Shape of new values must be compatible with manager shape
df['neighbors'] = X[ind_out]
df
But it fails as I cannot re-assign the result.
Answer
To make your code faster and handle up to 3 to 4 million rows efficiently, we can focus on several optimization strategies, both for the computation and the way we handle the data. The key areas for improvement include using more efficient data structures, avoiding loops as much as possible, and leveraging parallelism.
Optimization Strategy
1. Switching to cKDTree
from KDTree
You've already made the right choice by switching from KDTree
to cKDTree
(from scipy.spatial
). cKDTree
is implemented in C and much faster for large datasets compared to KDTree
from sklearn
.
2. Vectorization and Efficient Querying
Currently, you're using apply
with a row-wise operation in pandas, which is very slow because it is essentially looping over each row and making a query to the KDTree. To speed up the process, we can query the entire dataset at once and avoid row-wise operations.
3. Parallelizing with n_jobs
You're using n_jobs=-1
, which allows cKDTree
to parallelize the computation across multiple CPU cores. This can provide significant speedups on multi-core systems.
4. Avoiding Redundant Data Access
You are trying to assign the neighbors directly into a pandas DataFrame, but pandas
doesn't like multidimensional arrays as simple assignments. We need to ensure the data we are assigning is in the correct format.
Optimized Implementation
Here is an optimized version of your code that should work for large datasets like 3–4 million rows. We will:
- Use
cKDTree
for faster querying. - Avoid
apply
by querying the tree for all data at once. - Efficiently assign the result back to the DataFrame.
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree
# Generate a random dataset (or load your data)
rng = np.random.RandomState(0)
n_points = 4_000_000 # or adjust to your actual dataset size
d_dimensions = 4
k_neighbours = 3
X = rng.random_sample((n_points, d_dimensions))
# Create DataFrame (assuming 'lat_1', 'long_1', 'lat_2', 'long_2' columns)
df = pd.DataFrame(X, columns=['lat_1', 'long_1', 'lat_2', 'long_2'])
df['id_str'] = df.index.astype(str) # Add an identifier for each row
# Build the cKDTree for fast nearest neighbor querying
tree = cKDTree(df[['lat_1', 'long_1', 'lat_2', 'long_2']])
# Query for k-nearest neighbors for all points at once
dist, ind = tree.query(X, k=k_neighbours, n_jobs=-1)
# Extract the neighbors (excluding self by slicing [1:])
neighbors = X[ind][:, 1:] # Excluding the first one (self)
# Add the neighbors to the DataFrame
# To avoid direct assignment issues, we'll convert the neighbors into a list of arrays
df['neighbors'] = [neighbors[i].tolist() for i in range(neighbors.shape[0])]
# Show the DataFrame
print(df.head())
Explanation of Changes
- Generating Data: We create random sample data with the shape
(4_000_000, 4)
(or whatever your actual dataset size is). cKDTree
Construction: We build acKDTree
from thelat_1
,long_1
,lat_2
, andlong_2
columns for efficient querying.- Querying Neighbors: The
query
method is called once for all points in the dataset.n_jobs=-1
allows parallel computation across all available cores. - Neighbor Extraction: The output of
query
includes indices of the neighbors, but it also includes the point itself. We slice[:, 1:]
to exclude the point itself. - Assigning to DataFrame: Instead of directly assigning multidimensional arrays to a single column, which causes issues in pandas, we convert each row’s neighbors into a list of arrays and assign it as a list of lists.
Performance Considerations
- Scalability: This solution should scale efficiently for large datasets (3–4 million rows). The bottleneck will likely be the disk I/O if your data is coming from a file, so ensure the data is kept in memory as much as possible.
- Parallelism: The
n_jobs=-1
argument in thequery
method allowscKDTree
to utilize all available CPU cores. This can provide a significant speedup when dealing with large datasets. - Memory Usage: Make sure your system has enough memory to hold both the dataset and the output of the
query
method. For large datasets, you may need to monitor memory usage and possibly chunk the data if you hit memory limitations.
Expected Output
For each row in your dataset, you will have a column neighbors
that contains the 3 nearest neighbors, excluding the point itself. The format will be:
lat_1 long_1 lat_2 long_2 neighbors
0.5488135 0.71518937 0.60276338 0.54488318 [[0.4236548 0.64589411 0.43758721 0.891773], ...]
Notes:
- Handling Data Size: If your dataset size exceeds available RAM, you may want to consider using out-of-core processing or breaking the problem into smaller chunks and processing them sequentially.
- Further Optimizations: For extremely large datasets, you may consider leveraging libraries like
Dask
orModin
for distributed computing, or useGPU-accelerated libraries
likecuML
(part of the RAPIDS AI suite) for nearest neighbor search if you have access to GPUs.
Let me know if you need further optimizations or have any questions about the implementation!