In early 2010, I took a course in Digital Signal Processing. For a course project, I chose to implement a parallel FFT algorithm. I've had an interest in the Fourier transform ever since I came upon the JPEG image format (circa 1998) and its use of the Discrete Cosine Transform (DCT).
My FFT implementation is based on the Pease algorithm. Decomposition is done by decimation-in-frequency (DIF): the x[n] terms are combined first, the outputs are rearranged, and then twiddle factors and butterfly operations are applied as usual. The DIF butterfly is illustrated as follows:
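In code, a single radix-2 DIF butterfly forms the sum and difference first and applies the twiddle factor only to the difference output. A minimal sketch in C (the function name is mine, not from the original project):

```c
#include <complex.h>

/* One radix-2 DIF butterfly (sketch; name is mine). The sum and
 * difference are formed first; the twiddle factor w multiplies only
 * the difference output. */
void dif_butterfly(double complex *a, double complex *b, double complex w)
{
    double complex sum  = *a + *b;
    double complex diff = (*a - *b) * w;   /* twiddle on the lower leg */
    *a = sum;
    *b = diff;
}
```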
The idea behind the Pease implementation is that the Cooley-Tukey algorithm can be mapped in alternate ways by varying the order of the butterfly computations. The Pease structure uses a constant-geometry interconnect: every stage of the FFT is performed with an identical interconnect pattern for all butterfly operations. A typical radix-2 Cooley-Tukey DIF algorithm with 8 sample points can be transformed into this constant-geometry structure as follows:
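The constant-geometry wiring can be stated compactly: at every stage, butterfly k of an N-point transform reads inputs (k, k + N/2) and writes outputs (2k, 2k + 1). A small sketch of that indexing (my naming, not the original code's):

```c
/* Constant-geometry (Pease) wiring: identical for every stage.
 * Butterfly k of an n-point transform reads inputs (k, k + n/2)
 * and writes outputs (2k, 2k + 1). Names are mine. */
void pease_indices(int k, int n, int in[2], int out[2])
{
    in[0]  = k;          /* upper input  */
    in[1]  = k + n / 2;  /* lower input  */
    out[0] = 2 * k;      /* upper output */
    out[1] = 2 * k + 1;  /* lower output */
}
```

For n = 8 this reproduces the assignments in the figure: butterfly 0 maps (0,4)→(0,1), butterfly 1 maps (1,5)→(2,3), and so on.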
The indices of the butterfly operations are fixed for every stage. This precludes an in-place computation, in which the output overwrites the input in memory as it is computed, but the method requires only an auxiliary array equal in size to the input to hold intermediate results.
The parallel implementation divides the computational task equally among the CPUs and/or cores of the machine: each thread is assigned an equal share of the work. The input time-domain samples and output frequency-domain samples assigned to each thread remain constant from stage to stage. The input is processed breadth-first, in natural order, through the DIF stages, leaving the output in bit-reversed order.
Like the standard Cooley-Tukey algorithm, this requires log2(N) butterfly stages, where N is the total number of input points and must be a power of 2. The maximum number of threads is therefore N/2, and the N/2 butterfly operations per stage must divide evenly among the threads used. Each stage is computed in multiples of two complex samples–a multiple of a single butterfly operation. In terms of the code, the number of complex points assigned to a processor is the step size.
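These size and thread-count constraints can be captured in a small planning helper; this is a hypothetical sketch, not part of the original code:

```c
/* Hypothetical planning helper capturing the constraints above:
 * n must be a power of two, and the n/2 butterflies per stage must
 * divide evenly among the threads. Returns the stage count (log2 n)
 * or -1 if a constraint is violated; *step receives the number of
 * complex points assigned to each thread. */
int plan_fft(int n, int nthreads, int *step)
{
    if (n < 2 || (n & (n - 1)) != 0)
        return -1;                        /* not a power of two */
    if (nthreads < 1 || nthreads > n / 2 || (n / 2) % nthreads != 0)
        return -1;                        /* uneven thread split */
    *step = n / nthreads;                 /* complex points per thread */
    int stages = 0;
    for (int m = n; m > 1; m >>= 1)
        stages++;                         /* log2(n) butterfly stages */
    return stages;
}
```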
For example: an FFT of 8 samples computed by 4 threads would make each thread responsible for a single butterfly operation–thus a step size of 2. Thread #1 would be assigned input points (0,4) and output points (0,1) for each stage. Thread #2: (1,5)→(2,3), Thread #3: (2,6)→(4,5), Thread #4: (3,7)→(6,7), corresponding to the figure above. After each stage, the output array becomes the input for the next stage, and the former input becomes the output. Each thread simply swaps its “read” and “write” pointers to designate the input and output arrays for each subsequent stage. If there is an odd number of stages, the auxiliary storage space will contain the completed FFT, and the original input memory is freed. This procedure only requires pointers to be updated and avoids expensive copies from one memory location to another.
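The ping-pong buffering can be sketched as follows (names are mine; the twiddle arithmetic is elided and only the constant-geometry data movement is shown):

```c
#include <complex.h>

/* Sketch of the ping-pong buffering (names are mine). Each stage
 * reads from `rd` and writes to `wr`, then the pointers swap; no
 * samples are ever copied between the two arrays. Only the
 * constant-geometry data movement is shown here. */
double complex *run_stages(double complex *x, double complex *aux,
                           int n, int stages)
{
    double complex *rd = x, *wr = aux;
    for (int s = 0; s < stages; s++) {
        for (int k = 0; k < n / 2; k++) {
            wr[2 * k]     = rd[k];          /* sum slot  */
            wr[2 * k + 1] = rd[k + n / 2];  /* diff slot */
        }
        double complex *t = rd;             /* swap read/write */
        rd = wr;
        wr = t;
    }
    return rd;  /* with an odd stage count, this is `aux` */
}
```

Returning the final read pointer is what lets the caller free whichever of the two arrays does not hold the result.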
Once the parallel FFT calculations are complete, the parent process performs a single in-place pass using the classic Gold-Rader bit-reversal algorithm.
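The Gold-Rader permutation maintains the bit-reversed counterpart j of each index i by propagating a reversed carry, swapping each pair exactly once. A standard formulation (not the project's exact code):

```c
#include <complex.h>

/* Gold-Rader in-place bit-reversal permutation (standard form).
 * j tracks the bit-reversed counterpart of i; the while loop
 * propagates a carry through the reversed bits. */
void bit_reverse(double complex *x, int n)
{
    int j = 0;
    for (int i = 0; i < n - 1; i++) {
        if (i < j) {                 /* swap each pair exactly once */
            double complex t = x[i];
            x[i] = x[j];
            x[j] = t;
        }
        int k = n >> 1;
        while (k <= j) {             /* reversed-carry increment */
            j -= k;
            k >>= 1;
        }
        j += k;
    }
}
```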
OpenMP is a standard API for multi-threaded applications. This parallel FFT implementation synchronizes the threads after each stage of computation to avoid contention for memory access–eliminating the need for mutual-exclusion locks (mutexes).
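The stage-level synchronization can be sketched with one OpenMP barrier per stage; everything below is illustrative rather than the original source, and the butterfly arithmetic is again elided in favor of the data movement:

```c
#include <complex.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch builds without -fopenmp; in that
 * case it is only valid for nthreads == 1. */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Each thread handles a fixed block of butterflies per stage; the
 * barrier guarantees no thread reads stage s+1 input before every
 * thread has finished writing stage s output. */
void parallel_stages(double complex *x, double complex *aux,
                     int n, int stages, int nthreads)
{
    #pragma omp parallel num_threads(nthreads)
    {
        double complex *rd = x, *wr = aux;   /* per-thread pointers */
        int tid = omp_get_thread_num();
        int bpt = (n / 2) / nthreads;        /* butterflies per thread */
        for (int s = 0; s < stages; s++) {
            for (int k = tid * bpt; k < (tid + 1) * bpt; k++) {
                wr[2 * k]     = rd[k];          /* butterfly slots;  */
                wr[2 * k + 1] = rd[k + n / 2];  /* arithmetic elided */
            }
            #pragma omp barrier              /* wait for whole stage */
            double complex *t = rd;
            rd = wr;
            wr = t;
        }
    }
}
```

One barrier per stage suffices because all of a stage's reads and writes complete before any thread passes it; the pointer swap after the barrier is done identically by every thread.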
Each thread is assigned to a specific CPU. This affinity is controlled by setting the GOMP_CPU_AFFINITY environment variable, which is specific to the GNU libgomp implementation of OpenMP and therefore non-portable. Though the OpenMP standard does not specify affinity functionality, most implementations provide their own way of pinning threads to specific CPUs. Otherwise there is no guarantee that the underlying operating system will make use of all available CPUs.
In 2010, the benchmarks I ran showed some speed-up in computation, but I was expecting a bit more from the parallelization. The bit-reversal stage, which was not done in parallel, took up a non-trivial amount of the total calculation time. Some further gain could be eked out by implementing a more efficient bit-reversal algorithm or by including the bit-reversal in the parallelization.
The other main oversight of this algorithm is that it ignores locality of reference across the memory hierarchy: main memory, processor caches, and CPU registers. In many cases memory access slows the computation more than the arithmetic operations do. This is likely what prevents greater speed-ups as the number of CPUs increases. Performing as many calculations as possible on a set of points before returning to slower main memory could yield an even greater speed-up.
For your academic curiosity, the code has been posted to GitHub (cgp-fft).