Writing Your Own FFT in Python and on an STM32
In this post, we’ll be diving into the process of writing our own custom Fast Fourier Transform (FFT) implementation from scratch. The FFT is a foundational algorithm in digital signal processing, used to convert time-domain signals into their frequency-domain representation. While many libraries and tools offer ready-made FFT functions, coding one yourself is a great way to build a deeper understanding of how it works under the hood.
I’ll start by lightly covering a bit of the theory, but I’ll discuss implementing the algorithm itself in more detail. From there, we’ll jump into writing the code in Python, which allows us to test and visualize the results quickly. Then, we’ll bring it down to the hardware level by implementing the same FFT logic on an STM32 microcontroller, where things like memory usage and performance really start to matter.
The goal here isn’t to create the fastest or most optimized FFT, but to build something simple.
Additional Resources:
FFT Overview
Note: I am not an expert in FFT, DSP, or anything closely related. Think of this post more as a personal notebook that I’m sharing with the world. I’m not going to dive into the theory of the FFT in any great detail. Instead, I’ll focus on implementing it. My implementation is not optimized for performance; it's simply a straightforward FFT meant for learning purposes and for trying it out in both Python and on the STM32.
With that said, here’s the “style” of FFT I’m targeting:
I will not assume the input signal is real-valued, so I won’t rely on any symmetry in the frequency spectrum.
On the STM32, I’ll only support 512 samples. The Python version will be more flexible, since memory constraints aren’t an issue there.
I’ll implement a decimation-in-time (DIT) FFT, so the output frequency bins appear in order from lowest to highest.
I’ll use the iterative approach to the FFT algorithm, as it is easier to implement on the STM32 compared to the recursive approach. It is also generally faster and more memory-efficient.
I’ll implement the radix-2 Cooley-Tukey FFT algorithm.
I am not applying a window to the data before the FFT. I probably should, but that’s not the goal of this post.
Next we will start with a really brief overview of the Fast Fourier Transform (FFT) and to do that we have to actually start with the Discrete Fourier Transform (DFT). The DFT is the mathematical definition that transforms a signal from the time domain to the frequency domain. The FFT is a fast and efficient algorithm to compute the DFT. With that said, we should start by looking at the equation for the DFT:
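$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{kn}, \qquad W_N = e^{-j 2\pi / N}$$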
Where:
X[k] - the k-th output frequency bin
x[n] - the n-th input time sample
W_N^{kn} - the complex exponential that applies the necessary rotation to project the time sample into the frequency domain; this is often called the twiddle factor
N - the total number of samples in the DFT
k - the frequency bin number
n - the time sample number
In the end, the DFT converts time-domain samples into frequency components using only additions and multiplications with complex exponentials (phase rotations). However, many of these operations are repetitive. The FFT is a fast algorithm that reorganizes and reuses these calculations, reducing the total number of operations and making it much more efficient for large signals.
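The redundancy comes from two properties of the twiddle factors, periodicity and half-period symmetry:

$$W_N^{k+N} = W_N^{k}, \qquad W_N^{k+N/2} = -W_N^{k}$$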
We can represent the DFT as a matrix of twiddle factors multiplied by the input sample vector, producing an output vector of frequency components. In this matrix form, many of the twiddle factors are repeated or related by symmetry. By taking advantage of this redundancy and cleverly rearranging the computations, we arrive at the FFT which is a faster algorithm to compute the same result. A visual representation of this process can be seen in the case of an 8-point FFT:
Where the x[ .. ] values on the left side are your time samples but rearranged in bit-reversed order and the X[ .. ] values on the right side are your frequency bins (in the correct order).
Lines = signal paths (no difference between solid and dashed lines)
Arrows = multiplication of the signal on the line with the value under the arrow
Dots = addition of the connecting lines/signals
You may notice a few repeating patterns such as the one shown below. This is called the FFT butterfly (specifically the Radix-2 butterfly) and it repeats throughout the diagram - in fact the diagram is exclusively made up of these butterflies.
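In equation form, a single radix-2 butterfly takes two inputs A and B plus a twiddle factor W and produces two outputs:

$$A' = A + W \cdot B, \qquad B' = A - W \cdot B$$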
While I find these FFT diagrams great for visualizing the theory/mathematics of the FFT, I find that they don’t make it super obvious how to actually implement the algorithm. For example, in the image below I can easily calculate the value of X[3] by following the paths in the FFT diagram. It’s a fairly easy process, but I don’t really know how to do this programmatically.
Now, to be fair, it’s possible that I personally just don’t have the mathematics or algorithm design experience to look at these diagrams and easily implement them myself. So maybe this is really trivial. However, I will do my best to walk through a different approach that will hopefully also resonate with you.
Now if we look at the FFT diagrams, we should see that there are these clear vertical “transition” lines between the butterflies. The input of the next butterfly only uses the output of the previous butterfly. There are no other signals coming from before these transition lines affecting the next butterflies. There is a clear and clean transition between them.
Below is a diagram of what it would look like if there wasn’t this clean transition between butterflies. We can see some signals inside the first butterflies being a part of the calculations for the later butterflies.
Why is this interesting? It means we do not need extra memory holding previous values for future calculations. Furthermore, we can see that there is a direct path between x[a] and X[a], and all of the other sums and multiplications just manipulate the value stored in the “a” path. Every input into the FFT has an equivalent direct path between its x[ ] and X[ ], and all of the other sums and multiplications just manipulate the value stored.
What this leads to is the idea that we can perform the FFT in the same memory where our time sample data is saved. Put another way, the FFT operates in place: it reuses the same memory space to store intermediate results, avoiding the need for additional storage. This is a really valuable characteristic of the FFT.
With this in mind I am going to redraw the FFT diagram showing the same memory and its value after each FFT butterfly.
Earlier, I identified a “transition” line between butterflies. This transition line is what separates the (more commonly named) stages from one another. In the above diagram we see stages 1, 2, and 3, where at the end of each stage I show the memory array and the “value” each element of the array holds. For example, Mem[2] at the end of stage 1 will hold the value corresponding to:
Mem[2] + (W_2^0) Mem[3]
Where Mem[2] was the value of memory element 2 before stage 1, Mem[3] was the value of memory element 3 before stage 1, and (W_2^0) is the twiddle factor.
What I’m trying to highlight is that each FFT stage operates only on the data in memory, without any knowledge of where the values originally came from.
From here I am going to remove the butterfly parts of the diagram and just focus on how the memory changes from stage to stage. We are transitioning from a diagram where we could visually see patterns to, hopefully, a diagram where we can more easily quantify these patterns programmatically.
If it wasn’t obvious before, the calculation for each memory element has the following form:
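$$\text{Mem}[a] \;=\; \text{Mem}[a] \;\pm\; W \cdot \text{Mem}[b]$$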
With these changes in how we present the FFT, we now only care about which memory element we are at, what the values in Mem[a] and Mem[b] are, which twiddle factor to use, and whether we are adding or subtracting.
Putting the 8-point FFT into a table form where each table is a different FFT stage, we can now look at the patterns numerically:
Pattern 1:
Starting at
Stage 1, we see that Mem[a] = Mem[0] for both Mem[0] and Mem[1] elements >> difference in elements is 1
Stage 2, we see that Mem[a] = Mem[0] for both Mem[0] and Mem[2] elements >> difference in elements is 2
Stage 3, we see that Mem[a] = Mem[0] for both Mem[0] and Mem[4] elements >> difference in elements is 4
We see this pattern repeated for the other memory elements. We can relate the difference in elements to the stage number as:
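$$\text{difference} = 2^{\,\text{stage}-1} = \frac{2^{\text{stage}}}{2} = 2^{\text{stage}} \gg 1 = (1 \ll \text{stage}) \gg 1$$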
Where stage is the current stage number. I have written the equation a couple of different ways, including 2^stage >> 1, which is bit-shifting to the right and is equivalent to dividing by 2. In fact, you might take that even further by computing 2^stage itself with a shift: (1 << stage) >> 1.
Pattern 2: For the twiddle factors we also see a similar pattern, where the superscript repeats every 2^(stage-1) and always starts at 0. The subscript is 2^stage for all twiddles in the stage. I haven’t yet addressed how to calculate the twiddle factors, but I will suggest that there are other ways to handle them that might be better when programming in Python or on the STM32. More on this later, but we do see a clear pattern.
Pattern 3: We can also notice that the:
First time Mem[a] = Mem[0] it is an addition “+”
Second time Mem[a] = Mem[0] it is a subtraction “-”
This pattern repeats for other values of Mem[a] if we start from Mem[0] and process the memory elements in sequential order to the last memory element (Mem[7] in our example).
Pattern 4: There is also a pattern between Mem[a] and Mem[b], where Mem[b] is the memory element 2^(stage-1) elements away from Mem[a], as shown below.
Pattern 5: I think the last thing to note is that the number of required stages is log2(Number of samples). So in our example we have 8 samples = 2^3 samples which gives us the number of stages as log2(8) = log2(2^3) = 3 stages.
So let’s now put together some pseudo-code for what we think we need to do:
We know that we will have to go through each stage:
Loop stage until log2(N)
Where N is the number of samples.
In each stage, we will have to loop through each of the memory elements:
Loop stage until log2(N)
Loop k until N
Where k is the variable we will loop through the memory with.
Now let’s look at Patterns 1, 3, and 4. These patterns give us the idea that these operations happen in pairs. For example, if we start at Mem[0] we have a similar operation on the memory element 2^(stage-1) elements away. It is the same operation except with a subtraction rather than an addition. The Mem[b] value is also always 2^(stage-1) away from Mem[a].
Loop stage until log2(N)
Loop k until N incrementing 2^(stage)
Loop j until 2^(stage-1)
Calculate Mem[ k+j ] = Mem[ k+j ] + W * Mem[ k+j + 2^(stage-1) ]
Calculate Mem[ k+j + 2^(stage-1)] = Mem[ k+j ] - W * Mem[ k+j + 2^(stage-1) ]
Where j and k are used to loop through each memory element in a specific order. Note that we have changed the Loop k to increment by 2^(stage). Similarly, notice that Loop j increments until 2^(stage-1). The combination of these two loops is key for performing the operations in the correct order. One caution: both calculations use the original (pre-update) values of Mem[ k+j ] and Mem[ k+j + 2^(stage-1) ], so when we write real code we will need temporary variables to avoid overwriting Mem[ k+j ] before the second calculation uses it.
Let’s walk through the 8-point FFT example:
We are at stage 1, meaning 2^(stage) = 2 and 2^(stage-1) = 1. Thus we have something like:
Loop stage until log2(N)
Loop k until N incrementing 2
Loop j until 1
Calculate Mem[ k+j ] = ….
Calculate Mem[ k+j + 1] = ….
1) k = 0, j = 0: Calculate Mem[0], Mem[1]
2) k = 2, j = 0: Calculate Mem[2], Mem[3]
3) k = 4, j = 0: Calculate Mem[4], Mem[5]
4) k = 6, j = 0: Calculate Mem[6], Mem[7]
We are at stage 2, meaning 2^(stage) = 4 and 2^(stage-1) = 2. Thus we have something like:
Loop stage until log2(N)
Loop k until N incrementing 4
Loop j until 2
Calculate Mem[ k+j ] = ….
Calculate Mem[ k+j + 2] = ….
1) k = 0, j = 0: Calculate Mem[0], Mem[2]
2) k = 0, j = 1: Calculate Mem[1], Mem[3]
3) k = 4, j = 0: Calculate Mem[4], Mem[6]
4) k = 4, j = 1: Calculate Mem[5], Mem[7]
We are at stage 3, meaning 2^(stage) = 8 and 2^(stage-1) = 4. Thus we have something like:
Loop stage until log2(N)
Loop k until N incrementing 8
Loop j until 4
Calculate Mem[ k+j ] = ….
Calculate Mem[ k+j + 4] = ….
1) k = 0, j = 0: Calculate Mem[0], Mem[4]
2) k = 0, j = 1: Calculate Mem[1], Mem[5]
3) k = 0, j = 2: Calculate Mem[2], Mem[6]
4) k = 0, j = 3: Calculate Mem[3], Mem[7]
The one detail that is missing relates to which twiddle factor to use. From Pattern 2 we can see that the superscript follows j while the subscript is equal to 2^(stage):
Loop stage until log2(N)
Loop k until N incrementing 2^(stage)
Loop j until 2^(stage-1)
Calculate Mem[ k+j ] = Mem[ k+j ] + W^{ j }_{2^stage} * Mem[ k+j + 2^(stage-1) ]
Calculate Mem[ k+j + 2^(stage-1) ] = Mem[ k+j ] - W^{ j }_{2^stage} * Mem[ k+j + 2^(stage-1) ]
Because the twiddle factor represents a phase rotation, and the amount of rotation between each butterfly in a given stage is consistent, we can start with a twiddle factor of 1 (which is W⁰) and repeatedly rotate it by multiplying with the stage’s base twiddle factor. This way, each new twiddle factor in the stage is just the previous one multiplied by a constant rotation, reducing the number of exponentials we need to compute. This is the approach we will take with our Python code.
For the STM32 code we will pre-calculate the twiddle factors and then use the method shown in the above pseudo-code.
The twiddle factor is calculated using the following equation:
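$$W_N^{k} = e^{-j 2\pi k / N}$$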
Which gives us the twiddle factor as a complex number. If we want it broken into real and imaginary (useful for when we code the STM32), we can use the following equation:
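$$W_N^{k} = \cos\!\left(\frac{2\pi k}{N}\right) - j \sin\!\left(\frac{2\pi k}{N}\right)$$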
I think this gives us enough background on what we will attempt to write in our Python and STM32 code.
Implementing FFT with Python
Before starting, note that I am using Python 3.13 along with the following libraries:
NumPy for array handling and math
Matplotlib for plotting
For my IDE, I’m using Spyder, since its layout is very similar to MATLAB’s, which makes it great for this kind of work. To see how I set up my workspace, see Using Spyder with Python Virtual Environment.
Let’s start by importing the required libraries for this tutorial:
```python
import numpy as np
import matplotlib.pyplot as plt
```
Next, we’ll create an example script with a signal composed of two sine waves: one at 1 kHz and the other at 6 kHz. The system will use a sampling rate of 100 kHz, and the FFT will be performed on 2⁹ = 512 samples. We’ll pass these samples into our custom FFT function called fft_iterative( signal , numberOfSamples ), where signal represents the y-axis data and numberOfSamples is the number of samples in the FFT (must be a power of 2). After completing the FFT, we’ll calculate the frequency values for each bin and plot the results.
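A minimal sketch of that script, using the fft_iterative( ) and fft_freq( ) functions we’ll build below:

```python
import numpy as np
import matplotlib.pyplot as plt

Fs = 100_000              # 100 kHz sampling rate
N = 2**9                  # 512 samples
t = np.arange(N) / Fs     # time axis

# test signal: 1 kHz + 6 kHz sine waves
signal = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 6000 * t)

X = fft_iterative(signal, N)    # our custom FFT
freqs = fft_freq(N, 1 / Fs)     # frequency value of each bin

plt.plot(freqs, np.abs(X))
plt.xlabel("Frequency (Hz)")
plt.ylabel("Magnitude")
plt.show()
```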
Now let’s start working on our FFT function fft_iterative( ).
1. We need to convert the input signal array into an array of complex numbers. Since the FFT operates on complex values, we have to make sure our NumPy array is set up to handle them properly. We can accomplish this by creating a new NumPy array from the old one but setting its dtype to complex:
```python
X = np.array(x, dtype=complex)
```
Where lowercase “x” is our input signal array passed into the fft_iterative( ) function, and uppercase “X” is our new complex array. Note that our input signal array could itself be complex if we wanted.
2. The next thing we have to do is reorder our time sample data in bit-reversed order. This just means we flip the order of the bits in each index. So, if we had 8 samples (with indices 0 to 7), we’d convert each index to binary, reverse the bits, and use that as the new position. For example, index 3 is 011 in binary; flip it to 110, which is 6, so the sample value held in Mem[3] moves to Mem[6]. Another example of bit-reversal is shown in the image below.
In Python we will create a function called bit_reverse_indices( ), to which we pass the number of samples; it will generate an array of the indices to which each memory element needs to be moved.
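A simple sketch of how bit_reverse_indices( ) might look:

```python
def bit_reverse_indices(N):
    bits = int(np.log2(N))       # number of address bits (N must be a power of 2)
    indices = []
    for i in range(N):
        rev = 0
        for b in range(bits):
            if i & (1 << b):                 # found a 1 at bit position b...
                rev |= 1 << (bits - 1 - b)   # ...mirror it to the other side
        indices.append(rev)
    return np.array(indices)
```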
In short, this function counts from 0 to N-1 and, for each count, walks from the least significant bit to the most significant bit. If it finds a 1 in the binary number, it moves it to the mirrored position on the other side of the number (reversing it). The function returns an array of numbers representing the bit-reversed indices. We can then use the following lines to actually reorder the array into this new bit-reversed order:
```python
indices = bit_reverse_indices(N)
X = X[indices]
```
Where X is our time samples.
3. With our time sample data in bit-reversed order and our input NumPy array of type complex, we can start implementing the core parts of our FFT. We will start by creating the different loops like we have in our pseudo-code from the previous section.
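The loop structure might look like this:

```python
for s in range(1, int(np.log2(N)) + 1):   # each FFT stage
    m = 2**s                              # 2^(stage)
    for k in range(0, N, m):              # step through memory in blocks of m
        for j in range(m // 2):           # each butterfly within the block
            ...                           # butterfly calculations go here
```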
Where s is the stage number, m is 2^(stage), and m//2 is equivalent to 2^(stage-1).
The final step is to perform the calculations:
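Using temporaries u and t for the butterfly (explained below), the complete fft_iterative( ) function might look like this:

```python
def fft_iterative(x, N):
    X = np.array(x, dtype=complex)            # complex working array
    X = X[bit_reverse_indices(N)]             # reorder input in bit-reversed order
    for s in range(1, int(np.log2(N)) + 1):
        m = 2**s                              # 2^(stage)
        wm = np.exp(-2j * np.pi / m)          # base twiddle factor for this stage
        for k in range(0, N, m):
            w = 1 + 0j                        # start each block at W^0 = 1
            for j in range(m // 2):
                t = w * X[k + j + m // 2]     # W * Mem[b]
                u = X[k + j]                  # Mem[a], saved before it is overwritten
                X[k + j] = u + t
                X[k + j + m // 2] = u - t
                w *= wm                       # rotate to the next twiddle factor
    return X
```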
Some things to note:
We use the variables u and t to hold the values of Mem[a] and W*Mem[b] before we add/subtract them together. If we didn’t do that, when we calculate X[k+j] we would change the value of X[k+j] that was needed in order to calculate X[k+j+m//2].
The code line w *= wm advances our twiddle factor phase rotation to the next position. Also, we start our twiddle factor at w = 1, which is equivalent to W⁰.
4. Now that we have the FFT output, which gives us the amplitude (and possibly phase) information for each frequency bin, the next step is to map those bin indices to real-world frequencies. The FFT result is essentially an array where each index corresponds to a specific frequency component of the input signal. However, the bin number by itself doesn’t directly tell us the frequency in hertz.
To interpret the results meaningfully when plotting, we need to convert each bin number into a frequency value using the sample rate and the total number of points in the FFT.
This is typically done with a simple formula for positive frequencies:
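$$f[\text{index}] = \frac{\text{index}}{N \cdot T_s} = \text{index} \cdot \frac{F_s}{N}$$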
And for negative frequencies:
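$$f[\text{index}] = \frac{\text{index} - N}{N \cdot T_s}$$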
Where index is the index of our FFT bin, and N is the number of time samples/frequency bins. It should be noted that the FFT output starts at 0 Hz and goes up to the sampling frequency (Fs). The second half of the FFT, from Fs/2 to Fs, actually represents negative frequencies due to the way complex exponentials work. So when we talk about "negative frequency," we're just shifting that upper half of the spectrum (Fs/2 to Fs) from 0 Hz down to –Fs/2.
The custom function fft_freq( numberOfSamples, samplePeriod ) will return an array of frequency values converted from the frequency bin numbers:
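A sketch, where Ts is the sample period (samplePeriod):

```python
def fft_freq(N, Ts):
    freqs = np.zeros(N)
    for i in range(N):
        if i < N // 2:
            freqs[i] = i / (N * Ts)           # positive frequencies: 0 to Fs/2
        else:
            freqs[i] = (i - N) / (N * Ts)     # negative frequencies: -Fs/2 to 0
    return freqs
```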
5. Just to round out the FFT we will also create a custom function to generate the IFFT.
The IFFT is almost the same as the FFT, but with two key differences: the twiddle factor uses +j instead of –j, effectively reversing the direction of phase rotation, and the final result is scaled by 1/N to undo the summation scaling. Aside from that, the structure of the algorithm remains the same.
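A sketch of the IFFT (I’ll call it ifft_iterative( ) here), which only changes the sign of the twiddle exponent and adds the 1/N scaling:

```python
def ifft_iterative(X, N):
    x = np.array(X, dtype=complex)
    x = x[bit_reverse_indices(N)]
    for s in range(1, int(np.log2(N)) + 1):
        m = 2**s
        wm = np.exp(+2j * np.pi / m)          # +j reverses the phase rotation
        for k in range(0, N, m):
            w = 1 + 0j
            for j in range(m // 2):
                t = w * x[k + j + m // 2]
                u = x[k + j]
                x[k + j] = u + t
                x[k + j + m // 2] = u - t
                w *= wm
    return x / N                              # undo the summation scaling
```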
Putting all of this code together and plotting:
Implementing FFT on STM32
A lot of what was done in Python carries over to this example on the STM32; however, we do have to pay extra attention to:
Handling complex numbers
Memory usage
Using the ADC to measure a real sine wave
Processing speed, even though we are measuring slow signals
For the next part, we are going to implement this FFT in C on an STM32H723 (NUCLEO-H723ZG) using STM32CubeIDE and the HAL libraries. These steps should be similar for most other STM32 microcontrollers, though there may be small differences depending on the specific device or peripheral configuration.
We’ll write a program that uses the microcontroller’s ADC to sample a sine wave. The ADC will collect a total of 1024 samples and store them in memory using DMA. To process the data in chunks, the DMA will be set up in double-buffer (or half-transfer) mode. When the first half of the buffer (512 samples) is filled, the DMA will trigger an interrupt. In that interrupt, we’ll convert the raw ADC values into floating-point voltage values. Then, we’ll rearrange the data using bit-reversal and store it into a separate 512-sample buffer. After that, we’ll perform a 512-point FFT and print the frequency spectrum to the serial console. Once the DMA finishes storing the second half of the buffer (samples 512 to 1023), it will trigger another interrupt. We’ll repeat the same process for this second half: convert the samples, apply bit-reversal, perform the FFT, and print the results. This structure allows continuous data acquisition and processing without missing samples, making it suitable for real-time or near-real-time analysis.
Let’s begin by setting up our project and pinout configuration:
Pinout and Configuration
The first step is creating a new STM32 project: select the microcontroller, give it a name, and use C as the targeted language.
If it is not already opened by default, double click the .ioc file from the project explorer to use the STM32CubeIDE configuration tool. The next steps will take place in the Pinout and Configuration tool provided by STM32CubeIDE:
For debugging we will select PA14 and PA13 as SWCLK and SWDIO respectively. To configure these we will go to the Trace and Debug category >> Debug >> Serial Wire
For the ADC, we’ll use ADC1 channel 15, which is located on pin PA3 for this microcontroller. To enable the ADC channel, check the IN15 Single-ended box in your configuration tool.
Next, we’ll set up the DMA. Go to the DMA Settings tab and click the Add button to add a new DMA request. Change the DMA mode to Circular, so that once the buffer is fully filled, the DMA doesn’t stop—it wraps around and starts writing again from index 0. This allows for continuous data acquisition.
For the address increment settings, check the Memory Increment Address option. This tells the DMA to increment the memory pointer as it fills the buffer with ADC values. Do not check the Peripheral Increment Address option, because the address of the ADC1 data register remains fixed.
Since our ADC resolution will be set to 16 bits (maximum resolution), set the Data Width for both the Peripheral and Memory to Half Word (which corresponds to 16 bits).
In the Parameter Settings tab we will configure the Clock Prescaler to be divided by 128, the Resolution to be 16 bits, and Continuous Conversion Mode to be enabled.
We will also set the End of Conversion Mode to end of sequence of conversion (i.e., after multiple conversions, as opposed to just one single conversion) and set the Conversion Data Management Mode to DMA circular mode.
Lastly, we will expand the Rank menu to change the Channel 15 sampling time to 16.5 cycles.
In the Clock Configuration tool of the .ioc file we will set the ADC clock mux to use PLL2P, which gives us a 48 MHz clock to the ADC block.
The ADC kernel clock, called adc_ker_ck, is generated from the 48 MHz PLL2P clock going into the ADC, divided by 2 × prescaler.
Thus the ADC kernel clock frequency is f_adc_ker_ck = 48 MHz / (2 × prescaler) = 48 MHz / (2 × 128) = 187.5 kHz
The ADC conversion time = ( T_smpl + T_sar ) × ( 1 / f_adc_ker_ck )
Where:
T_smpl is the channel sampling time, which is set to 16.5 clock cycles in our example
T_sar is the SAR ADC conversion time, which depends on the number of ADC bits. Since we are using 16-bit resolution, it takes 8.5 clock cycles.
Thus:
ADC conversion time = ( 16.5 + 8.5 ) × ( 1 / 187.5 kHz ) = 133.3 µs, or a 7.5 kHz sample rate
Next we will configure the UART so that we can print out our FFT result to a console.
That should be all of the configurations we need to do. To generate the code go to Project >> Generate Code.
The STM32 FFT Code
Now, working in the main.c file, let’s include the necessary libraries for our project. We’ll include math.h to perform mathematical operations, such as sine and cosine calculations needed for generating twiddle factors. We’ll also include stdio.h and string.h, which will be used for serial printing to the console and for handling string operations.
Next, we’ll define a few constants, notably FFT_BITS and FFT_SIZE. The FFT_SIZE represents the number of samples used in the FFT calculation, and it is computed as FFT_SIZE = 2^FFT_BITS.
We’ll also define a macro to handle the conversion of the ADC’s 16-bit output value to a voltage (as a float), assuming a reference voltage of 3.3 V. This macro will scale the raw ADC value into a corresponding 0–3.3 V level.
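A sketch of these definitions; the macro assumes the full 16-bit range (0–65535) maps to 0–3.3 V:

```c
#define FFT_BITS  9
#define FFT_SIZE  (1 << FFT_BITS)   /* 2^9 = 512 samples */

/* convert a raw 16-bit ADC reading into a voltage, assuming a 3.3 V reference */
#define adc16_to_voltage(raw)  (((float)(raw)) * 3.3f / 65535.0f)
```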
Next, we’ll define our global variables. The first one is an array called adcBuffer_16b, which is used to store raw ADC samples via DMA. This buffer is sized to be twice as large as the number of samples needed for the FFT, allowing us to collect ADC data continuously while processing previous data. When the first half of the buffer is filled (i.e., the first 512 samples), the DMA triggers a half-transfer interrupt to signal that the buffer is halfway full. At that point, our code begins processing those 512 samples with the FFT. While that processing is happening, the ADC and DMA continue collecting data into the second half of the buffer. Once the second half is filled (samples 512–1023), the DMA generates a transfer-complete interrupt. The code can then process that half while the first half is being refilled, and the cycle repeats. This double-buffering scheme ensures continuous sampling and processing without losing data.
Since adcBuffer_16b is an array of type uint16_t, we need to convert the raw ADC samples into voltage values (as float) before performing the FFT. These converted values will be stored in a separate array called fftADC. As we’ll show later, we’ll also perform bit-reversal during this transfer step; reordering the samples as they’re moved from adcBuffer_16b to fftADC in preparation for the FFT. The fftADC will be twice as large as the number of samples needed for the FFT as we need to store both the real and imaginary parts that make up a complex number.
The twiddle array will store all the twiddle factors that are pre-computed for the FFT. This array will also be twice as large as the number of required twiddle factors (FFT_SIZE/2), as we need to store both the real and imaginary parts.
Lastly, we have the fftADC_ready flag to signal to our program that the ADC sample data is converted to voltage, arranged in bit-reversal, moved into fftADC and we are now ready to process the FFT.
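A sketch of these globals:

```c
uint16_t adcBuffer_16b[2 * FFT_SIZE];  /* raw ADC samples, double-buffered by the DMA */
float    fftADC[2 * FFT_SIZE];         /* FFT working buffer, interleaved re/im pairs */
float    twiddle[FFT_SIZE];            /* FFT_SIZE/2 twiddle factors, interleaved re/im */
volatile uint8_t fftADC_ready = 0;     /* set in the DMA callbacks, cleared in the main loop */
```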
Before enabling the ADC and running the FFT, we’ll perform some initial setup. Specifically, we want to precompute the twiddle factors, which will be used during the FFT calculation. In addition, we’ll zero out both the adcBuffer_16b buffer and the fftADC array. While this step isn’t strictly necessary for functionality, it can help with debugging by ensuring that all values are initialized to known states. For that reason, I recommend doing it at this stage.
We will create a function called fft_512samp_init( ) and call it inside the USER CODE 2 section of main.c (i.e., just after our hardware initialization code):
We will place the fft_512samp_init( ) function in the USER CODE 0 section.
As a reminder, we can use the following equation to calculate the twiddle factor:
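$$W_{N}^{k} = e^{-j 2\pi k / N} = \cos\!\left(\frac{2\pi k}{N}\right) - j \sin\!\left(\frac{2\pi k}{N}\right)$$

with N = FFT_SIZE and k running over the FFT_SIZE/2 twiddle factors we need.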
Since we need to handle complex numbers, the twiddle array is allocated at twice the size as the number of twiddle factors that we need (# twiddle = FFT_SIZE / 2, thus we need the twiddle array to be FFT_SIZE long). This allows us to store both the real and imaginary parts of each complex sample in a single interleaved array. The real parts are stored at even indices (0, 2, 4, ...), and the imaginary parts are stored at the corresponding odd indices (1, 3, 5, ...). This format is commonly used in embedded DSP implementations because it simplifies memory access and keeps the data tightly packed. Alternatively, we could have used two separate arrays, one for the real components and one for the imaginary, but interleaving is often more efficient.
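A sketch of fft_512samp_init( ) based on the above:

```c
void fft_512samp_init(void)
{
  /* pre-compute the FFT_SIZE/2 twiddle factors, stored interleaved (re, im) */
  for (uint32_t k = 0; k < FFT_SIZE / 2; k++)
  {
    float angle = -2.0f * (float)M_PI * (float)k / (float)FFT_SIZE;
    twiddle[2 * k]     = cosf(angle);   /* real part at the even index */
    twiddle[2 * k + 1] = sinf(angle);   /* imaginary part at the odd index */
  }

  /* zero the buffers so everything starts in a known state */
  memset(adcBuffer_16b, 0, sizeof(adcBuffer_16b));
  memset(fftADC, 0, sizeof(fftADC));
}
```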
Now that all initial setup is complete, we’ll start the ADC in DMA mode to begin continuous sampling. This is typically done using the HAL_ADC_Start_DMA( ) function, which links the ADC to the DMA controller and begins filling the buffer with ADC samples (placed in the USER CODE 2 section).
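The call itself, assuming the default CubeMX handle name hadc1, looks something like:

```c
HAL_ADC_Start_DMA(&hadc1, (uint32_t *)adcBuffer_16b, 2 * FFT_SIZE);
```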
At this point, the ADC and DMA are running continuously, filling the adcBuffer_16b array with sampled data. When the buffer is half full, an interrupt is triggered, and the DMA calls the HAL_ADC_ConvHalfCpltCallback( ) function. Inside this callback, we begin processing the newly available data. We move the samples from adcBuffer_16b, convert each one to a floating-point voltage, and store the result in the fftADC array in bit-reversed order. To do this, we define a pointer to index 0 of adcBuffer_16b and calculate the correct offset for each sample based on its bit-reversed position using the bit_reverse16( ) function (see below). Then we use the adc16_to_voltage() macro to scale the raw 16-bit ADC value into its corresponding voltage. This process ensures that our FFT input is properly formatted and ready for processing as soon as the data becomes available.
As with the twiddle array, the fftADC array is allocated at twice the size of the FFT length. This allows us to store both the real and imaginary parts of each complex sample in a single interleaved array: the real parts at even indices (0, 2, 4, ...) and the imaginary parts at the corresponding odd indices (1, 3, 5, ...).
The same approach is used when the second half of the buffer is filled; this time the HAL_ADC_ConvCpltCallback( ) function is called. In that case, we define the pointer to start at index FFT_SIZE (i.e., index 512, the first sample of the second half, since indexing starts at 0) and repeat the same operations: bit-reversal, voltage conversion, and storing the data in fftADC. This alternating, double-buffered processing allows continuous sampling and real-time FFT computation without missing any data.
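A sketch of both callbacks:

```c
void HAL_ADC_ConvHalfCpltCallback(ADC_HandleTypeDef *hadc)
{
  uint16_t *src = &adcBuffer_16b[0];            /* first half of the DMA buffer */
  for (uint32_t n = 0; n < FFT_SIZE; n++)
  {
    uint32_t rev = bit_reverse16(n);            /* bit-reversed destination index */
    fftADC[2 * rev]     = adc16_to_voltage(src[n]);  /* real part */
    fftADC[2 * rev + 1] = 0.0f;                      /* imaginary part starts at zero */
  }
  fftADC_ready = 1;
}

void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef *hadc)
{
  uint16_t *src = &adcBuffer_16b[FFT_SIZE];     /* second half of the DMA buffer */
  for (uint32_t n = 0; n < FFT_SIZE; n++)
  {
    uint32_t rev = bit_reverse16(n);
    fftADC[2 * rev]     = adc16_to_voltage(src[n]);
    fftADC[2 * rev + 1] = 0.0f;
  }
  fftADC_ready = 1;
}
```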
(placed in the USER CODE 0 section)
The bit_reverse16( ) function calculates the bit-reversed index of a given value, which is required for correctly ordering input data before performing the FFT. It works by using a series of bitmasking and shifting operations to reverse all 16 bits of the input value in stages; first swapping individual bits, then 2-bit groups, then nibbles (4-bit groups), and finally bytes. After fully reversing the bits, the result is shifted to match the actual bit width of the FFT (e.g., 9 bits for a 512-point FFT).
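A sketch of bit_reverse16( ):

```c
uint32_t bit_reverse16(uint32_t x)
{
  /* swap adjacent bits, then 2-bit pairs, then nibbles, then bytes */
  x = ((x & 0x5555) << 1) | ((x & 0xAAAA) >> 1);
  x = ((x & 0x3333) << 2) | ((x & 0xCCCC) >> 2);
  x = ((x & 0x0F0F) << 4) | ((x & 0xF0F0) >> 4);
  x = ((x & 0x00FF) << 8) | ((x & 0xFF00) >> 8);
  /* all 16 bits are now reversed; shift down so only FFT_BITS bits remain */
  return x >> (16 - FFT_BITS);
}
```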
(placed in the USER CODE 0 section)
The final step inside both the HAL_ADC_ConvHalfCpltCallback() and HAL_ADC_ConvCpltCallback() functions is setting the fftADC_ready flag. This flag acts as a signal to the main while(1) loop; indicating that a new set of ADC samples has been processed and stored in the fftADC buffer, and we’re now ready to calculate the FFT. By using this flag, we avoid doing heavy FFT computation inside the interrupt itself and instead defer it to the main loop, keeping the interrupt handlers fast and efficient.
In our main while(1) loop we have an if statement checking the fftADC_ready flag. If this flag is set, we will perform the FFT (called fft_dit_512samp( ), where the dit refers to the fact that we are doing a decimation-in-time FFT). After the FFT is complete we will print all the data to the serial console using the printFFT_Magnitude_All( ) function and finally clear the fftADC_ready flag.
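A sketch of that loop:

```c
while (1)
{
  if (fftADC_ready)
  {
    fft_dit_512samp();          /* in-place FFT on the fftADC array */
    printFFT_Magnitude_All();   /* print every bin's magnitude over the UART */
    fftADC_ready = 0;
  }
}
```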
The FFT on the STM32 is performed in a very similar way to the Python version, with some notable differences:
In Python, complex numbers are handled natively and can be manipulated just like any other numeric type. In contrast, on the STM32 using C, we represent complex numbers manually by interleaving the real and imaginary parts within the same array. This means that for each complex value, the real part is stored at an even index and the corresponding imaginary part is stored at the next odd index.
In Python, we calculated the base twiddle factor for each FFT stage and, starting from W⁰ = 1, applied successive complex multiplications to generate the necessary rotations during the FFT computation. This approach is flexible and easy to implement in high-level languages, but it can be computationally expensive in real-time applications. On the STM32 using C, we take a different approach by precomputing all the required twiddle factors ahead of time and storing them in an array. During the FFT, we select the appropriate twiddle factor using the stride variable and the loop index j. This saves time during execution by avoiding repeated calls to sine and cosine functions, which is especially important on resource-constrained embedded systems.
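Putting this together, a sketch of what fft_dit_512samp( ) might look like, given the interleaved layout and the precomputed twiddle table described above:

```c
void fft_dit_512samp(void)
{
  /* fftADC already holds the bit-reversed, interleaved (re, im) input */
  for (uint32_t s = 1; s <= FFT_BITS; s++)
  {
    uint32_t m      = 1u << s;          /* 2^(stage) */
    uint32_t half   = m >> 1;           /* 2^(stage-1) */
    uint32_t stride = FFT_SIZE / m;     /* step through the pre-computed twiddle table */

    for (uint32_t k = 0; k < FFT_SIZE; k += m)
    {
      for (uint32_t j = 0; j < half; j++)
      {
        float wr = twiddle[2 * j * stride];       /* twiddle real part */
        float wi = twiddle[2 * j * stride + 1];   /* twiddle imaginary part */

        uint32_t a = 2 * (k + j);                 /* Mem[a], real-part index */
        uint32_t b = 2 * (k + j + half);          /* Mem[b], real-part index */

        /* t = W * Mem[b] (complex multiply) */
        float tr = wr * fftADC[b] - wi * fftADC[b + 1];
        float ti = wr * fftADC[b + 1] + wi * fftADC[b];

        /* u = Mem[a], saved before it is overwritten */
        float ur = fftADC[a];
        float ui = fftADC[a + 1];

        fftADC[a]     = ur + tr;
        fftADC[a + 1] = ui + ti;
        fftADC[b]     = ur - tr;
        fftADC[b + 1] = ui - ti;
      }
    }
  }
}
```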
After executing the FFT, the data in the fftADC array represents the signal in the frequency domain. Each complex value in the array corresponds to a specific frequency bin, containing both the real and imaginary parts of the frequency component. These values can be used to compute the magnitude and phase of each frequency present in the original signal.
Finally, we will print the magnitude of the frequency spectrum to the serial console using the printFFT_Magnitude_All( ) function:
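A sketch, assuming the UART handle is huart3 (the Nucleo board’s virtual COM port UART):

```c
void printFFT_Magnitude_All(void)
{
  char msg[64];
  for (uint32_t k = 0; k < FFT_SIZE; k++)
  {
    float re  = fftADC[2 * k];
    float im  = fftADC[2 * k + 1];
    float mag = sqrtf(re * re + im * im);   /* magnitude of frequency bin k */
    snprintf(msg, sizeof(msg), "%lu, %f\r\n", (unsigned long)k, mag);
    HAL_UART_Transmit(&huart3, (uint8_t *)msg, strlen(msg), HAL_MAX_DELAY);
  }
}
```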
To get snprintf( ) to work with floats, you will need to enable the printf_float option. You can do this in STM32CubeIDE by right-clicking on your project and opening the Properties dialog.
Then expand the C/C++ Build >> Settings >> MCU/MPU Settings and finally check the Use float with printf … checkbox.
With all the pieces in place (ADC configuration, DMA setup, buffer management, twiddle factor initialization, and the FFT implementation), we now have a complete system that continuously samples data, processes it in real time, and transforms it into the frequency domain. In the next section, we’ll test this setup by feeding in known signals and analyzing the FFT output to confirm that everything is working as expected.
If you want to see the complete code for this FFT, check it out on GitHub.
Testing STM32 FFT Code
In this section, we’ll test the FFT code by feeding a known 1 kHz sine wave into the ADC input pin and plotting the FFT results received over the serial console. To do this, we’ll connect a signal generator to the appropriate ADC pin and observe how the system processes and transforms the signal. Unfortunately, due to the time it takes to transmit the FFT output over serial, we need to pause data acquisition during testing. To prevent the buffer from being overwritten while we're still printing, we stop the DMA immediately after the HAL_ADC_ConvHalfCpltCallback() function is triggered. This gives us time to inspect the output without losing any data.
With this change we can run our code, and our serial output looks like this in Tera Term:
By selecting all the serial data and copying it into Excel using Edit → Copy Table, we can generate a plot of the FFT output. The resulting graph shows a peak at DC (which is expected due to the signal’s offset), along with two clear peaks at ±1 kHz, matching the frequency of our test input and confirming that the FFT is working correctly.
This initial test is useful, as it confirms that our FFT is functioning correctly and that there are no unexpected spikes or noise artifacts in the output. However, for practical applications, it may be more useful to extract and print only the most significant frequency component, specifically the peak frequency in the spectrum (excluding the DC bin). This approach reduces the amount of data sent over serial and gives us a quick, high-level view of the dominant frequency present in the signal.
First add the new function called printFFT_Magnitude_MaxPossitive( ) in the USER CODE 0 section:
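A sketch of that function; the bin-to-Hz conversion assumes the ~7.5 kHz sample rate we calculated earlier:

```c
void printFFT_Magnitude_MaxPossitive(void)
{
  uint32_t maxBin = 1;    /* start at bin 1 to exclude the DC bin */
  float    maxMag = 0.0f;
  for (uint32_t k = 1; k < FFT_SIZE / 2; k++)   /* positive-frequency half only */
  {
    float re  = fftADC[2 * k];
    float im  = fftADC[2 * k + 1];
    float mag = sqrtf(re * re + im * im);
    if (mag > maxMag)
    {
      maxMag = mag;
      maxBin = k;
    }
  }

  float freq = (float)maxBin * 7500.0f / (float)FFT_SIZE;   /* bin number -> Hz */

  char msg[64];
  snprintf(msg, sizeof(msg), "%f, %f\r\n", freq, maxMag);
  HAL_UART_Transmit(&huart3, (uint8_t *)msg, strlen(msg), HAL_MAX_DELAY);
}
```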
Then in the main while(1) loop replace the function call printFFT_Magnitude_All( ) with printFFT_Magnitude_MaxPossitive( ) as shown:
Finally, we want to remove the code line which stops the DMA in the HAL_ADC_ConvHalfCpltCallback() function.
With these changes to our program, we can run our code, and our serial output looks like this in Tera Term:
Where the first column is the max positive frequency and the second column is its magnitude.
Conclusion
With this, I’ve shown how to write your own FFT from scratch: first in Python for clarity and experimentation, and then in C on an STM32 microcontroller for embedded applications. This implementation is intentionally kept simple and unoptimized, with a focus on clarity and education rather than speed or efficiency. The goal was to understand the underlying mechanics of the FFT and how to apply them in both high-level and low-level environments.
While there are faster, more advanced libraries available for both Python and embedded systems, building the FFT yourself gives valuable insight into how the algorithm works, how data needs to be arranged, and how performance considerations come into play, especially in resource-constrained systems. If you're just getting started with digital signal processing or embedded programming, this kind of hands-on exploration is a great way to build intuition and confidence.
There’s plenty of room for improvement from here, such as adding fixed-point support, optimizing memory access, or implementing windowing and overlap. But this foundation provides a strong starting point for more advanced DSP work.