KDE in python with different mu, sigma / mapping a function to an array
I have a 2-dimensional array of values that I would like to perform a Gaussian KDE on, with a catch: the points are assumed to have different variances. For that, I have a second 2-dimensional array (with the same shape) that is the variance of the Gaussian to be used for each point. In the simple example,
import numpy as np
data = np.array([[0.4,0.2],[0.1,0.5]])
sigma = np.array([[0.05,0.1],[0.02,0.3]])
there would be four gaussians, the first of which is centered at x=0.4 with σ=0.05. Note: Actual data is much larger than 2x2
I am looking for one of two things:
- A Gaussian KDE solver that will allow for bandwidth to change for each point
or
- A way to map the results of each Gaussian into a 3-dimensional array, with each Gaussian evaluated across a range of points (say, evaluate each center/σ pair along np.linspace(0,1,101)). In this case, I could e.g. have the KDE value at x=0.5 by taking outarray[:,:,51].
Answer
In your case, you're dealing with a kernel density estimate (KDE) where each point has its own variance, i.e., a different bandwidth for each Gaussian. Unfortunately, the standard scipy.stats.gaussian_kde
implementation assumes a single bandwidth for all points. However, you can either:
- Use a Custom KDE that allows for varying bandwidths (i.e., for each point).
- Evaluate each Gaussian manually for each point across a grid, constructing a 3D array that contains the KDE evaluated at various locations for each point.
1. Custom KDE with Varying Bandwidth
If you're open to writing a custom solution, you can modify the standard KDE to incorporate different variances (bandwidths) for each data point. You can do this by creating a weighted sum of Gaussian kernels, each with its own standard deviation.
The general formula for a Gaussian kernel centered at ( x_i ) with bandwidth ( \sigma_i ) is:
[ K(x, x_i, \sigma_i) = \frac{1}{\sqrt{2 \pi \sigma_i^2}} \exp\left(-\frac{(x - x_i)^2}{2 \sigma_i^2}\right) ]
You can then sum these kernels for all points to get the KDE.
Custom KDE Implementation:
import numpy as np
import matplotlib.pyplot as plt
def gaussian_kernel(x, xi, sigma):
"""
Compute Gaussian kernel for a point `x` centered at `xi` with standard deviation `sigma`.
"""
return (1 / np.sqrt(2 * np.pi * sigma**2)) * np.exp(-0.5 * ((x - xi) / sigma)**2)
def kde_with_varying_bandwidth(data, sigma, grid):
"""
Perform Kernel Density Estimation (KDE) with different bandwidths (sigmas) for each point.
data: 2D array, where each column is a set of data points (e.g., n x m)
sigma: 2D array with same shape as `data`, representing standard deviations for each point
grid: 1D array of points at which to evaluate the KDE
Returns: 2D array with KDE values at each point in the grid for each data point
"""
n_points = data.shape[0] # Number of points in the dataset
kde_values = np.zeros((len(grid), n_points)) # Store KDE values for each grid point and data point
for i in range(n_points):
for j, x in enumerate(grid):
kde_values[j, i] = np.sum(gaussian_kernel(x, data[i], sigma[i]))
return kde_values
# Example data
data = np.array([[0.4, 0.2], [0.1, 0.5]]) # Data points
sigma = np.array([[0.05, 0.1], [0.02, 0.3]]) # Standard deviations (variances)
grid = np.linspace(0, 1, 101) # Evaluation points
# Perform KDE
kde_values = kde_with_varying_bandwidth(data, sigma, grid)
# Plot the result for each data point
for i in range(data.shape[0]):
plt.plot(grid, kde_values[:, i], label=f'Point {i+1}')
plt.legend()
plt.title('Gaussian KDE with Varying Bandwidth')
plt.xlabel('x')
plt.ylabel('KDE')
plt.show()
Explanation:
gaussian_kernel(x, xi, sigma)
: This function calculates the Gaussian kernel for a given pointx
with centerxi
and standard deviationsigma
.kde_with_varying_bandwidth(data, sigma, grid)
: This function calculates the KDE at each point ingrid
by summing the contributions of the individual kernels for each point indata
.
The result is a 2D array kde_values
where each row corresponds to a point in grid
, and each column corresponds to one of the original data points. You can then extract the KDE value for any particular center and bandwidth pair.
2. Manual Evaluation and 3D Array Construction
If you'd like to manually map the evaluation of each Gaussian kernel, you can construct a 3D array where each slice corresponds to the evaluation of a specific Gaussian at all points in the grid.
Here’s how to do this:
import numpy as np
def evaluate_gaussians(data, sigma, grid):
"""
Manually evaluate each Gaussian for each point in the grid.
data: 2D array (n_points x m), where each column represents a data point
sigma: 2D array with the same shape as `data`, representing standard deviations for each point
grid: 1D array of points at which to evaluate the Gaussian functions
Returns: 3D array where each slice corresponds to a Gaussian evaluated on the grid
"""
n_points = data.shape[0] # Number of data points
n_grid = len(grid) # Number of points in the grid
result = np.zeros((n_points, n_grid, data.shape[1])) # 3D array to store the results
for i in range(data.shape[1]): # Iterate over each dimension (i.e., each column of data)
for j in range(n_points):
result[j, :, i] = gaussian_kernel(grid, data[j, i], sigma[j, i])
return result
# Example data
data = np.array([[0.4, 0.2], [0.1, 0.5]]) # Data points
sigma = np.array([[0.05, 0.1], [0.02, 0.3]]) # Standard deviations (variances)
grid = np.linspace(0, 1, 101) # Evaluation points
# Perform KDE evaluation
result = evaluate_gaussians(data, sigma, grid)
# Example: Plot the result for a particular dimension
import matplotlib.pyplot as plt
for i in range(data.shape[1]):
plt.plot(grid, result[:, 50, i], label=f'Dimension {i+1}')
plt.legend()
plt.title('Evaluation of Gaussian Kernels')
plt.xlabel('x')
plt.ylabel('KDE')
plt.show()
Explanation:
- Manual Evaluation: In
evaluate_gaussians
, each Gaussian kernel is evaluated for each point in the grid. The result is stored in a 3D array, where each slice (result[j, :, i]
) corresponds to the evaluation of the Gaussian for the j-th data point, i-th dimension (column). - You can extract values for any particular center/bandwidth by indexing into the resulting 3D array.
Summary:
- If you want a fast, general KDE with varying bandwidths, you can write your own KDE function as shown above, using a sum of Gaussian kernels with custom bandwidths.
- If you'd like to manually evaluate each Gaussian for each center and bandwidth pair, you can create a 3D array that stores these evaluations and extract them based on your requirements.
Let me know if you'd like further clarification or optimization based on your actual use case!