Odd-even sort: Incorrect results when using multiple blocks in CUDA
I am new to PyCUDA and trying to implement the Odd-even sort using PyCUDA.
I managed to run it successfully on arrays of up to 2048 elements (using a single thread block), but as soon as I tried to use multiple thread blocks, the results were no longer correct. I suspect this is a synchronization problem, but I have no idea how to fix it.
bricksort_src = """
__global__
void bricksort(int *in, int *out, int n){
    int tid = threadIdx.x + (blockIdx.x * blockDim.x);
    if((tid * 2) < n) out[tid * 2] = in[tid * 2];
    if((tid * 2 + 1) < n) out[tid * 2 + 1] = in[tid * 2 + 1];
    __syncthreads();

    // odd and even are used for adjusting the index
    // to avoid out-of-index exception
    int odd, even, alter;
    odd = ((n + 2) % 2) != 0;
    even = ((n + 2) % 2) == 0;

    // alter is used for alternating between the odd and even phases
    alter = 0;

    for(int i = 0; i < n; i++){
        int idx = tid * 2 + alter;
        int adjust = alter == 0 ? odd : even;
        if(idx < (n - adjust)){
            int f, s;
            f = out[idx];
            s = out[idx + 1];
            if (f > s){
                out[idx] = s;
                out[idx + 1] = f;
            }
        }
        __syncthreads();
        alter = 1 - alter;
    }
}
"""
bricksort_ker = SourceModule(source=bricksort_src)
bricksort = bricksort_ker.get_function("bricksort")

np.random.seed(0)
arr = np.random.randint(0, 10, 2**11).astype('int32')
iar = gpuarray.to_gpu(arr)
oar = gpuarray.empty_like(iar)

n = iar.size
num_threads = np.ceil(n / 2)
if (num_threads < 1024):
    blocksize = int(num_threads)
    gridsize = 1
else:
    blocksize = 1024
    gridsize = int(np.ceil(num_threads / blocksize))

bricksort(iar, oar, np.int32(n),
          block=(blocksize, 1, 1),
          grid=(gridsize, 1, 1))
Answer
The issue you're facing stems from the lack of synchronization across thread blocks during the odd-even sort. Your implementation uses __syncthreads() to synchronize threads within each block, but that does nothing to synchronize threads across different blocks.
Problem Breakdown:
- Synchronization Issue: CUDA threads within a block can synchronize with each other using __syncthreads(), but threads from different blocks cannot communicate or synchronize in the same way. This is crucial because your odd-even sort alternates phases, and the results written by one block may need to be read by another block in the next phase.
- Parallel Sorting: odd-even sort runs multiple phases in which adjacent pairs are compared and swapped in parallel. When a pair straddles a block boundary, the sort breaks unless the data exchange between blocks is managed between phases.
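To make the cross-block dependence concrete, here is a small host-side sketch (plain Python, no GPU required; the helper name cross_block_pairs and the sizes are my own, chosen to match the 1024-thread blocks used below) that reproduces the kernel's indexing and lists the comparisons whose right-hand element is handled by a thread in a different block:

```python
def cross_block_pairs(n, blocksize, alter):
    """Reproduce the kernel's indexing: thread tid compares the pair
    (idx, idx + 1) with idx = tid * 2 + alter, while element e is
    copied by thread e // 2. Return the pairs that span two blocks."""
    crossing = []
    for tid in range((n + 1) // 2):
        idx = tid * 2 + alter
        if idx + 1 >= n:
            continue
        comparing_block = tid // blocksize
        owning_block = ((idx + 1) // 2) // blocksize
        if comparing_block != owning_block:
            crossing.append((idx, idx + 1))
    return crossing

# With 4096 elements and 1024 threads per block (two blocks):
print(cross_block_pairs(4096, 1024, 0))  # even phase: []
print(cross_block_pairs(4096, 1024, 1))  # odd phase: [(2047, 2048)]
```

The pair (2047, 2048) is exactly where a 2048-element, single-block run stops working once a second block is added: __syncthreads() in block 0 cannot wait for block 1 to finish writing element 2048.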
Solution:
One approach to fix this is to:
- Handle Data Exchange: since you are working with multiple blocks, you need to ensure that after each phase of sorting the results are visible to all blocks before the next phase begins.
- Use Multiple Kernels: shared memory only helps within a block and does not solve the inter-block communication problem, and CUDA offers no easy inter-block synchronization inside a kernel. Instead, launch one kernel per phase: kernel launches on the same stream execute in order, so each phase sees the completed results of the previous one, and cudaDeviceSynchronize() (cuda.Context.synchronize() in PyCUDA) makes the host wait for the GPU when needed.
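The launch-per-phase pattern can be sketched on the CPU first (plain NumPy, no GPU; bricksort_phase is a hypothetical stand-in for one kernel launch) to check that alternating phases plus buffer swapping really sorts:

```python
import numpy as np

def bricksort_phase(src, dst, alter):
    """CPU stand-in for one kernel launch: run one odd or even phase,
    reading from src and writing to dst (double buffering)."""
    dst[:] = src                          # copy-through for unpaired elements
    for idx in range(alter, len(src) - 1, 2):
        f, s = src[idx], src[idx + 1]
        dst[idx], dst[idx + 1] = (s, f) if f > s else (f, s)

rng = np.random.default_rng(0)
a = rng.integers(0, 100, 257).astype(np.int32)  # odd length on purpose
expected = np.sort(a)
buf = np.empty_like(a)
for phase in range(len(a)):                     # one "launch" per phase
    bricksort_phase(a, buf, phase % 2)
    a, buf = buf, a                             # swap buffers between phases
print(np.array_equal(a, expected))              # True
```

The dst[:] = src line matters: elements that are not compared in a given phase must still be written to the output buffer, or they would be lost when the buffers are swapped.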
Here's an adjusted version of your code using multiple kernel launches. Each kernel will run one phase of the sorting and then synchronize the device:
Updated Code with Multiple Kernel Launches:
from pycuda.compiler import SourceModule
import pycuda.driver as cuda
import pycuda.autoinit  # initializes the CUDA context
import numpy as np
from pycuda import gpuarray
from math import ceil

# Odd-Even Sort Kernel: each launch performs a single phase
bricksort_src = """
__global__
void bricksort(int *in, int *out, int n, int alter){
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // adjust keeps the pair (idx, idx + 1) inside the array bounds
    int odd  = (n % 2 != 0);
    int even = (n % 2 == 0);
    int idx = tid * 2 + alter;
    int adjust = alter == 0 ? odd : even;

    if (idx < n - adjust) {
        int f = in[idx];
        int s = in[idx + 1];
        if (f > s) {
            out[idx] = s;
            out[idx + 1] = f;
        } else {
            out[idx] = f;
            out[idx + 1] = s;
        }
    }

    // Elements that are not part of any pair in this phase must still be
    // copied to the output, or they would be lost when the buffers swap:
    // element 0 in odd phases, and the last element when n - alter is odd.
    if (tid == 0) {
        if (alter == 1) out[0] = in[0];
        if ((n - alter) % 2 != 0) out[n - 1] = in[n - 1];
    }
}
"""
# Compile the kernel
bricksort_ker = SourceModule(bricksort_src)
bricksort = bricksort_ker.get_function("bricksort")
# Generate data for sorting
np.random.seed(0)
arr = np.random.randint(0, 10, 2**11).astype('int32')
# GPU arrays
iar = gpuarray.to_gpu(arr)
oar = gpuarray.empty_like(iar)
n = iar.size
num_threads = ceil(n / 2)
# Determine block size and grid size
if num_threads < 1024:
    blocksize = int(num_threads)
    gridsize = 1
else:
    blocksize = 1024
    gridsize = ceil(num_threads / blocksize)
# Launch the kernel once per phase (n phases at most)
for phase in range(n):
    # Run the odd or even phase
    bricksort(iar, oar, np.int32(n), np.int32(phase % 2),
              block=(blocksize, 1, 1), grid=(gridsize, 1, 1))
    # Synchronize the device after each phase
    cuda.Context.synchronize()
    # Swap input and output for the next phase
    iar, oar = oar, iar
# Copy the sorted data back to host
sorted_arr = iar.get()
# Output the result
print("Sorted Array:", sorted_arr)
Explanation of Changes:
- Kernel Phase Argument: the kernel now accepts an additional alter argument that selects the phase. It alternates between 0 and 1 across launches (alter = 0 compares pairs starting at even indices, alter = 1 pairs starting at odd indices).
- Multiple Kernel Launches: instead of doing everything within a single kernel, the kernel is launched once per phase. Launches on the same stream execute in order, which provides the inter-block ordering that __syncthreads() cannot.
- Synchronization: cuda.Context.synchronize() makes the host wait until the GPU has finished the phase; strictly speaking it is only required before reading the results back, since the launches themselves are already ordered.
- Swapping Input and Output Arrays: after each phase the input and output arrays are swapped, so the result of one phase becomes the input of the next.
Optimizations:
- Efficient Memory Usage: only two arrays (iar and oar) are used, and they are swapped after each phase.
- Multiple Kernel Launches: rather than trying to coordinate multiple blocks inside a single kernel, the task is broken into smaller steps by launching the kernel once per phase.
Performance Considerations:
- Grid and Block Size: choose these based on the number of elements to sort and the limits of your GPU (current devices allow at most 1024 threads per block).
- Number of Phases: odd-even sort requires at most n phases to sort n elements, but in practice fewer phases may be enough for convergence, depending on the data, so an early-exit check can save work.
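The "at most n phases" bound, and the possibility of early convergence, can be checked with a short host-side experiment (plain Python/NumPy; phases_until_sorted is an illustrative helper, not part of the GPU code):

```python
import numpy as np

def phases_until_sorted(a):
    """Run alternating odd-even phases in place and return how many
    were needed before the array became sorted."""
    a = np.array(a).copy()
    n = len(a)
    for phase in range(n + 1):
        if np.all(a[:-1] <= a[1:]):       # already sorted: stop early
            return phase
        for idx in range(phase % 2, n - 1, 2):
            if a[idx] > a[idx + 1]:
                a[idx], a[idx + 1] = a[idx + 1], a[idx]
    return n + 1

rng = np.random.default_rng(0)
a = rng.integers(0, 10, 200)
print(phases_until_sorted(a), "phases for 200 elements (bound: 200)")
```

For random data the count is typically well below n, which is why stopping early once no more swaps occur can pay off.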
This approach keeps each sorting phase fully parallel while using the ordering of kernel launches to provide the inter-block synchronization that a single kernel cannot.