PyFFTW performance on multidimensional arrays


I have an nD array, say of dimensions (144, 522720), and I need to compute its FFT.

PyFFTW seems slower than numpy and scipy, which is NOT expected.

Am I doing something obviously wrong?

Below is my code:

import numpy
import scipy.fft
import pyfftw
import time

n1 = 144
n2 = 522720
loops = 2

pyfftw.config.NUM_THREADS = 4
pyfftw.config.PLANNER_EFFORT = 'FFTW_ESTIMATE'
# pyfftw.config.PLANNER_EFFORT = 'FFTW_MEASURE'

Q_1 = pyfftw.empty_aligned([n1, n2], dtype='float64')
Q_2 = pyfftw.empty_aligned([n1, n2], dtype='complex128')
Q_ref = pyfftw.empty_aligned([n1, n2], dtype='complex128')

# repeat a few times to see if pyfft planner helps
for i in range(0,loops):
    Q_1 = numpy.random.rand(n1,n2)

    s1 = time.time()
    Q_ref = numpy.fft.fft(Q_1, axis=0)
    print('NUMPY - elapsed time: ', time.time() - s1, 's.')

    s1 = time.time()
    Q_2 = scipy.fft.fft(Q_1, axis=0)
    print('SCIPY - elapsed time: ', time.time() - s1, 's.')
    print('Equal = ', numpy.allclose(Q_2, Q_ref))

    s1 = time.time()
    Q_2 = pyfftw.interfaces.numpy_fft.fft(Q_1, axis=0)
    print('PYFFTW NUMPY - elapsed time = ', time.time() - s1, 's.')
    print('Equal = ', numpy.allclose(Q_2, Q_ref))

    s1 = time.time()
    Q_2 = pyfftw.interfaces.scipy_fftpack.fft(Q_1, axis=0)
    print('PYFFTW SCIPY - elapsed time = ', time.time() - s1, 's.')
    print('Equal = ', numpy.allclose(Q_2, Q_ref))

    s1 = time.time()
    fft_object = pyfftw.builders.fft(Q_1, axis=0)
    Q_2 = fft_object()
    print('FFTW PURE Elapsed time = ', time.time() - s1, 's')
    print('Equal = ', numpy.allclose(Q_2, Q_ref))

Answer

The fact that PyFFTW is slower than numpy or scipy in your code is unusual: PyFFTW usually performs better on larger arrays because it wraps FFTW (the "Fastest Fourier Transform in the West"), which is highly optimized for a wide range of hardware. That said, several factors can lead to suboptimal performance. Let's go through your code and look at potential issues and optimizations.

1. PyFFTW Configuration:

  • You have set pyfftw.config.NUM_THREADS = 4, but it's worth noting that the performance benefit of multithreading in PyFFTW depends on the hardware (e.g., number of available CPU cores) and the size of the data. If the data is relatively small or your CPU has fewer than 4 cores, multithreading may not result in a performance boost and could even cause overhead.

  • Try running the benchmark with NUM_THREADS = 1 to see if threading is the issue; for transforms of this shape a single thread is often just as fast or faster. A short timing sketch is shown after the snippet below.

pyfftw.config.NUM_THREADS = 1
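
  • As a quick check, you can time the same transform for several thread counts via pyfftw.builders (a minimal sketch; actual timings depend entirely on your CPU, and on many machines a single thread wins for this shape):

    import time
    import numpy
    import pyfftw

    n1, n2 = 144, 522720
    a = pyfftw.empty_aligned((n1, n2), dtype='complex128')
    a[:] = numpy.random.rand(n1, n2)

    for nthreads in (1, 2, 4):
        # Plan with the fast FFTW_ESTIMATE default, varying only the thread count
        plan = pyfftw.builders.fft(a, axis=0, threads=nthreads)
        start = time.time()
        plan()  # execute the planned transform
        print(f'{nthreads} thread(s): {time.time() - start:.3f}s')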

2. FFTW Planner Effort:

  • You are using the default FFTW_ESTIMATE planner effort. This keeps planning fast, but it usually does not produce the fastest possible transform. For larger arrays, FFTW_MEASURE can yield better performance because FFTW actually times several candidate strategies and picks the best one (see the sketch after this list):

    pyfftw.config.PLANNER_EFFORT = 'FFTW_MEASURE'

  • You may also experiment with FFTW_PATIENT, which can sometimes beat FFTW_MEASURE in specific cases, but it makes the planning stage even slower.
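
  • The trade-off is easy to see by timing planning and execution separately (a minimal sketch; note that FFTW_MEASURE planning on an array of this size can itself take a noticeable amount of time on the first run):

    import time
    import numpy
    import pyfftw

    n1, n2 = 144, 522720
    a = pyfftw.empty_aligned((n1, n2), dtype='complex128')
    a[:] = numpy.random.rand(n1, n2)

    for effort in ('FFTW_ESTIMATE', 'FFTW_MEASURE'):
        t0 = time.time()
        plan = pyfftw.builders.fft(a, axis=0, planner_effort=effort)  # planning
        t1 = time.time()
        plan()                                                        # execution
        t2 = time.time()
        print(f'{effort}: planning {t1 - t0:.3f}s, execution {t2 - t1:.3f}s')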

3. Array Alignment:

  • PyFFTW benefits from aligned memory. You allocate your arrays with pyfftw.empty_aligned(), which is good, but note that in your loop the statement Q_1 = numpy.random.rand(n1, n2) rebinds Q_1 to a fresh, unaligned numpy array, so the aligned buffer you allocated is never actually used. Fill the existing buffer in place instead (Q_1[:] = numpy.random.rand(n1, n2)), or re-align an arbitrary array with pyfftw.byte_align().

  • If performance is still not satisfactory, try using pyfftw.interfaces.cache.enable() and check if this improves performance by caching FFTW plans.
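
  • A minimal sketch of both ideas together (plan caching plus an explicit alignment check with pyfftw's helpers):

    import numpy
    import pyfftw

    # Cache the plans created by the pyfftw.interfaces functions so repeated
    # calls with the same shape/dtype reuse them instead of re-planning
    pyfftw.interfaces.cache.enable()
    pyfftw.interfaces.cache.set_keepalive_time(30)  # keep idle plans for 30 s

    Q_1 = numpy.random.rand(144, 522720)
    print('SIMD aligned:', pyfftw.is_byte_aligned(Q_1))

    # numpy.random.rand makes no alignment guarantee; re-align a copy if needed
    Q_1 = pyfftw.byte_align(Q_1)
    Q_2 = pyfftw.interfaces.numpy_fft.fft(Q_1, axis=0)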

4. Comparing FFT Implementations:

  • You are comparing PyFFTW against numpy.fft and scipy.fft in your code. Note that neither of them uses FFTW by default: recent versions of both ship their own pocketfft-based implementations (numpy's in C, scipy's in C++), so any difference you see between the two is down to those implementations and their wrappers, not to a different backend.

  • If you want scipy.fft to actually run on FFTW, register pyfftw as its backend (pyfftw.interfaces.scipy_fft) via scipy.fft.set_backend; otherwise be explicit about which backend each call uses so that the comparison stays consistent. A sketch of this is shown below.
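
  • For example (a minimal sketch assuming reasonably recent versions, roughly scipy >= 1.4 and pyfftw >= 0.12):

    import numpy
    import scipy.fft
    import pyfftw

    Q_1 = numpy.random.rand(144, 522720)
    Q_ref = numpy.fft.fft(Q_1, axis=0)          # numpy's own (pocketfft) result

    # Inside this context, scipy.fft.fft is executed by FFTW via pyfftw
    with scipy.fft.set_backend(pyfftw.interfaces.scipy_fft):
        Q_2 = scipy.fft.fft(Q_1, axis=0)

    print('Equal =', numpy.allclose(Q_2, Q_ref))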

5. Use pyfftw.builders:

  • When you use pyfftw.interfaces.numpy_fft.fft, it mimics numpy.fft.fft and (unless the interfaces cache is enabled) sets the transform up again on every call, so it does not exploit the pre-planning that pyfftw.builders offers for repeated FFT calls.

  • Using pyfftw.builders.fft is the recommended approach for better performance in repetitive FFT computations. The pyfftw.builders interface allows for the pre-planning of FFT computations, which is more efficient when you're running multiple FFTs on the same size data.

  • Correct Use of pyfftw.builders:

    fft_object = pyfftw.builders.fft(Q_1, axis=0)
    Q_2 = fft_object()
    

    This approach should yield the best performance, especially for repeated FFT calls, provided the plan is built once (outside any timed region) and then reused; see the sketch below.
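
  • A fuller sketch of the pre-planning pattern (assuming the data can be stored as complex128 with a fixed shape): build the plan once from a template array, then pass each new data set to the plan object, which copies it into the plan's internal buffer and executes the pre-planned transform:

    import numpy
    import pyfftw

    n1, n2 = 144, 522720

    # Plan once (this is the expensive step, especially with FFTW_MEASURE)
    template = pyfftw.empty_aligned((n1, n2), dtype='complex128')
    fft_object = pyfftw.builders.fft(template, axis=0)

    # Reuse the plan for every new batch of data of the same shape/dtype
    new_data = pyfftw.empty_aligned((n1, n2), dtype='complex128')
    for _ in range(2):
        new_data[:] = numpy.random.rand(n1, n2)
        Q_2 = fft_object(new_data)  # copy the data in, run the planned FFT
        print(numpy.allclose(Q_2, numpy.fft.fft(new_data, axis=0)))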

6. Benchmarking Across Different Libraries:

  • It's also important to ensure that your benchmark is consistent. The following code structure can help you isolate the results more effectively by ensuring the data is the same across all libraries:

    import numpy
    import scipy.fft
    import pyfftw
    import time
    
    n1 = 144
    n2 = 522720
    loops = 2
    
    pyfftw.config.NUM_THREADS = 1
    pyfftw.config.PLANNER_EFFORT = 'FFTW_MEASURE'
    
    # Initialize the data
    Q_1 = numpy.random.rand(n1, n2)
    
    # Alias the pyfftw interface functions and build the FFTW plan once
    fft_numpy = pyfftw.interfaces.numpy_fft.fft
    fft_scipy = pyfftw.interfaces.scipy_fftpack.fft
    fft_fftw = pyfftw.builders.fft(Q_1, axis=0)
    
    # Benchmark
    for i in range(loops):
        # Numpy FFT
        start_time = time.time()
        Q_ref = numpy.fft.fft(Q_1, axis=0)
        print(f'NUMPY - elapsed time: {time.time() - start_time:.6f}s.')
    
        # Scipy FFT
        start_time = time.time()
        Q_2 = scipy.fft.fft(Q_1, axis=0)
        print(f'SCIPY - elapsed time: {time.time() - start_time:.6f}s.')
        print(f'Equal = {numpy.allclose(Q_2, Q_ref)}')
    
        # PyFFTW with numpy interface
        start_time = time.time()
        Q_2 = fft_numpy(Q_1, axis=0)
        print(f'PYFFTW NUMPY - elapsed time: {time.time() - start_time:.6f}s.')
        print(f'Equal = {numpy.allclose(Q_2, Q_ref)}')
    
        # PyFFTW with scipy interface
        start_time = time.time()
        Q_2 = fft_scipy(Q_1, axis=0)
        print(f'PYFFTW SCIPY - elapsed time: {time.time() - start_time:.6f}s.')
        print(f'Equal = {numpy.allclose(Q_2, Q_ref)}')
    
        # Pure PyFFTW FFTW builder
        start_time = time.time()
        Q_2 = fft_fftw()
        print(f'FFTW PURE - elapsed time: {time.time() - start_time:.6f}s.')
        print(f'Equal = {numpy.allclose(Q_2, Q_ref)}')
    
  • Key Adjustments:

    • The pyfftw.builders.fft object is created before the loop to avoid recreating it on each iteration.
    • The results are consistently compared using numpy.allclose to ensure the results match across all libraries.
    • NUM_THREADS = 1 and FFTW_MEASURE are used so that the timings reflect single-threaded execution with better planning for large arrays.

7. Check Hardware and Environment:

  • Make sure that your hardware has enough resources (CPU cores, memory, etc.) to benefit from the optimizations offered by PyFFTW.
  • If you're running on a multi-core machine, consider checking CPU utilization to verify if PyFFTW is utilizing multiple cores. For smaller arrays, threading may not provide a significant benefit.
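
  • A quick way to inspect the environment before benchmarking (a small sketch using only the standard library and pyfftw's own attributes):

    import os
    import pyfftw

    print('CPU cores reported by the OS:', os.cpu_count())
    print('FFTW SIMD alignment (bytes) :', pyfftw.simd_alignment)
    print('pyfftw threads configured   :', pyfftw.config.NUM_THREADS)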

Summary of Suggestions:

  • Use pyfftw.config.NUM_THREADS = 1 and check performance again.
  • Set pyfftw.config.PLANNER_EFFORT = 'FFTW_MEASURE'.
  • Use pyfftw.builders.fft for repeated FFTs.
  • Ensure that you are comparing FFTs across different libraries correctly and consistently.
  • Check hardware and multithreading settings.

By adjusting these configurations, you should see an improvement in the performance of PyFFTW relative to numpy and scipy.