Writing Your Own FFT in Python and on an STM32
In this post, we’ll be diving into the process of writing our own custom Fast Fourier Transform (FFT) implementation from scratch. The FFT is a foundational algorithm in digital signal processing, used to convert time-domain signals into their frequency-domain representation. While many libraries and tools offer ready-made FFT functions, coding one yourself is a great way to build a deeper understanding of how it works under the hood.
I’ll start by lightly covering a bit of the theory, but I’ll discuss implementing the algorithm itself in more detail. From there, we’ll jump into writing the code in Python, which allows us to test and visualize the results quickly. Then, we’ll bring it down to the hardware level by implementing the same FFT logic on an STM32 microcontroller, where things like memory usage and performance really start to matter.
The goal here isn’t to create the fastest or most optimized FFT, but to build something simple.
Additional Resources:
FFT Overview
Note: I am not an expert in FFT, DSP, or anything closely related. Think of this post more as a personal notebook that I’m sharing with the world. I’m not going to dive into the theory of the FFT in any great detail. Instead, I’ll focus on implementing it. My implementation is not optimized for performance; it's simply a straightforward FFT meant for learning purposes and for trying it out in both Python and on the STM32.
With that said, here’s the “style” of FFT I’m targeting:
I will not assume the input signal is real-valued, so I won’t rely on any symmetry in the frequency spectrum.
On the STM32, I’ll only support 512 samples. The Python version will be more flexible, since memory constraints aren’t an issue there.
I’ll implement a decimation-in-time (DIT) FFT, so the output frequency bins appear in order from lowest to highest.
I’ll use the iterative approach to the FFT algorithm, as it is easier to implement on the STM32 compared to the recursive approach. It is also generally faster and more memory-efficient.
I’ll implement the radix-2 Cooley-Tukey FFT algorithm.
I am not applying a window to the data before the FFT. I probably should, but that’s not the goal of this post.
Next we will start with a really brief overview of the Fast Fourier Transform (FFT) and to do that we have to actually start with the Discrete Fourier Transform (DFT). The DFT is the mathematical definition that transforms a signal from the time domain to the frequency domain. The FFT is a fast and efficient algorithm to compute the DFT. With that said, we should start by looking at the equation for the DFT:
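$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{kn}, \qquad W_N = e^{-j 2\pi / N}$$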
Where:
X[k] - the k-th output frequency bin
x[n] - the n-th input time sample
W_N^{kn} - the complex exponential that applies the necessary rotation to project the time sample into the frequency domain; this is often called the twiddle factor
N - the total number of samples in the DFT
k - the frequency bin number
n - the time sample number
In the end, the DFT converts time-domain samples into frequency components using only additions and multiplications with complex exponentials (phase rotations). However, many of these operations are repetitive. The FFT is a fast algorithm that reorganizes and reuses these calculations, reducing the total number of operations and making it much more efficient for large signals.
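The redundancy comes from two properties of the twiddle factors, periodicity and half-period symmetry:

$$W_N^{k+N} = W_N^{k}, \qquad W_N^{k+N/2} = -W_N^{k}$$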
We can represent the DFT as a matrix of twiddle factors multiplied by the input sample vector, producing an output vector of frequency components. In this matrix form, many of the twiddle factors are repeated or related by symmetry. By taking advantage of this redundancy and cleverly rearranging the computations, we arrive at the FFT which is a faster algorithm to compute the same result. A visual representation of this process can be seen in the case of an 8-point FFT:
Where the x[ .. ] values on the left side are your time samples but rearranged in bit-reversed order and the X[ .. ] values on the right side are your frequency bins (in the correct order).
Lines = signal paths (no difference between solid and dashed lines)
Arrows = multiplication of the signal on the line with the value under the arrow
Dots = addition of the connecting lines/signals
You may notice a few repeating patterns such as the one shown below. This is called the FFT butterfly (specifically the Radix-2 butterfly) and it repeats throughout the diagram - in fact the diagram is exclusively made up of these butterflies.
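In equation form, a single radix-2 butterfly takes two inputs A and B plus a twiddle factor W and produces two outputs:

$$A' = A + W \cdot B, \qquad B' = A - W \cdot B$$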
While I find these FFT diagrams great for visualizing the theory/mathematics of the FFT, I find that they don’t make it super obvious how to actually implement the algorithm. For example, in the image below I can easily calculate the value of X[3] by following the paths in the FFT diagram. It’s a fairly easy process, but I don’t really know how to do this programmatically.
Now, to be fair, it’s possible that I personally just don’t have the mathematics or algorithm design experience to look at these diagrams and easily implement them myself. So maybe this is really trivial. However, I will do my best to walk through a different approach that will hopefully also resonate with you.
Now if we look at the FFT diagrams, we should see that there are these clear vertical “transition” lines between the butterflies. The input of the next butterfly only uses the output of the previous butterfly. There are no other signals coming from before these transition lines affecting the next butterflies. There is a clear and clean transition between them.
Below is a diagram of what it would look like if there wasn’t this clean transition between butterflies. We can see some signals inside the first butterflies being a part of the calculations for the later butterflies.
Why is this interesting? It means we do not need extra memory holding previous values for future calculations. Furthermore, we can see that there is a direct path between x[a] and X[a], and all of the other sums and multiplications just manipulate the value stored in the “a” path. Every input into the FFT has an equivalent direct path between its x[ ] and X[ ], and all of the other sums and multiplications just manipulate the value stored.
What this leads to is the idea that we can perform the FFT in the same memory where our time sample data is saved. Put another way, the FFT operates in place: it reuses the same memory space to store intermediate results, avoiding the need for additional storage. This is a really valuable characteristic of the FFT.
With this in mind I am going to redraw the FFT diagram showing the same memory and its value after each FFT butterfly.
Earlier, I identified a “transition” line between butterflies. This transition line is what separates the (more commonly named) stages from one another. In the above diagram we see stages 1, 2, and 3, where at the end of each stage I show the memory array and the “value” each element of the array holds. For example, Mem[2] at the end of stage 1 will hold the value corresponding to:
Mem[2] + (W_2^0) Mem[3]
Where Mem[2] was the value of memory element 2 before stage 1, Mem[3] was the value of memory element 3 before stage 1, and (W_2^0) is the twiddle factor.
What I’m trying to highlight is that each FFT stage operates only on the data in memory, without any knowledge of where the values originally came from.
From here I am going to remove the butterfly parts of the diagram and just focus on how the memory changes from stage to stage. We are transitioning from a diagram where we could visually see patterns to, hopefully, a diagram where we can more easily quantify these patterns programmatically.
If it wasn’t obvious before, the calculation for each memory element has the following form:
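$$\text{Mem}[a] \;=\; \text{Mem}[a] \;\pm\; W \cdot \text{Mem}[b]$$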
With these changes in how we present the FFT, we now only care about which memory element we are at, what the values in Mem[a] and Mem[b] are, which twiddle factor to use, and whether we are adding or subtracting.
Putting the 8-point FFT into a table form where each table is a different FFT stage, we can now look at the patterns numerically:
Pattern 1:
Starting at
Stage 1, we see that Mem[a] = Mem[0] for both Mem[0] and Mem[1] elements >> difference in elements is 1
Stage 2, we see that Mem[a] = Mem[0] for both Mem[0] and Mem[2] elements >> difference in elements is 2
Stage 3, we see that Mem[a] = Mem[0] for both Mem[0] and Mem[4] elements >> difference in elements is 4
We see this pattern repeated for the other memory elements. We can relate the difference in elements to the stage number as:
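$$\text{difference} = 2^{\,\text{stage}-1} = \frac{2^{\text{stage}}}{2} = 2^{\text{stage}} \gg 1 = (1 \ll \text{stage}) \gg 1$$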
Where stage is the current stage number. I have written the equation a couple of different ways, including 2^stage >> 1, which is bit-shifting to the right and is equivalent to dividing by 2. In fact, you might take that even further by computing 2^stage itself with a shift: (1 << stage) >> 1.
Pattern 2: For the twiddle factors we also see a similar pattern, where the superscript repeats every 2^(stage-1) and always starts at 0. The subscript is 2^stage for all twiddles in the stage. I haven’t yet addressed how to calculate the twiddle factors, but I will suggest that there are other ways to handle them that might be better when programming in Python or on the STM32. More on this later, but we do see a clear pattern.
Pattern 3: We can also notice that the:
First time Mem[a] = Mem[0] it is an addition “+”
Second time Mem[a] = Mem[0] it is a subtraction “-”
This pattern repeats for other values of Mem[a] if we start from Mem[0] and process the memory elements in sequential order to the last memory element (Mem[7] in our example).
Pattern 4: There is also a pattern between Mem[a] and Mem[b], where Mem[b] is the memory element 2^(stage-1) elements away from Mem[a], as shown below.
Pattern 5: I think the last thing to note is that the number of required stages is log2(Number of samples). So in our example we have 8 samples = 2^3 samples which gives us the number of stages as log2(8) = log2(2^3) = 3 stages.
So let’s now put together some pseudo-code for what we think we need to do:
We know that we will have to go through each stage:
Loop stage until log2(N)
Where N is the number of samples.
In each stage, we will have to loop through each of the memory elements:
Loop stage until log2(N)
Loop k until N
Where k is the variable we will loop through the memory with.
Now let’s look at Patterns 1, 3, and 4. These patterns give us the idea that these operations happen in pairs. For example, if we start at Mem[0] we have a similar operation on the memory element 2^(stage-1) elements away. It is the same operation except with a subtraction rather than an addition. The Mem[b] value is also always 2^(stage-1) away from Mem[a].
Loop stage until log2(N)
Loop k until N incrementing 2^(stage)
Loop j until 2^(stage-1)
Calculate Mem[ k+j ] = Mem[ k+j ] + W * Mem[ k+j + 2^(stage-1) ]
Calculate Mem[ k+j + 2^(stage-1)] = Mem[ k+j ] - W * Mem[ k+j + 2^(stage-1) ]
Where j and k are used to loop through each memory element in a specific order. Note that we have changed the Loop k to increment by 2^(stage). Similarly, notice that Loop j increments until 2^(stage-1). The combination of these two loops is key for performing the operations in the correct order. One caution: both calculations use the original (pre-update) values of Mem[ k+j ] and Mem[ k+j + 2^(stage-1) ], so when we write real code we will need temporary variables to avoid overwriting Mem[ k+j ] before the second calculation uses it.
Let’s walk through the 8-point FFT example:
We are at stage 1, meaning 2^(stage) = 2 and 2^(stage-1) = 1. Thus we have something like:
Loop stage until log2(N)
Loop k until N incrementing 2
Loop j until 1
Calculate Mem[ k+j ] = ….
Calculate Mem[ k+j + 1] = ….
1) k = 0, j = 0: Calculate Mem[0], Mem[1]
2) k = 2, j = 0: Calculate Mem[2], Mem[3]
3) k = 4, j = 0: Calculate Mem[4], Mem[5]
4) k = 6, j = 0: Calculate Mem[6], Mem[7]
We are at stage 2, meaning 2^(stage) = 4 and 2^(stage-1) = 2. Thus we have something like:
Loop stage until log2(N)
Loop k until N incrementing 4
Loop j until 2
Calculate Mem[ k+j ] = ….
Calculate Mem[ k+j + 2] = ….
1) k = 0, j = 0: Calculate Mem[0], Mem[2]
2) k = 0, j = 1: Calculate Mem[1], Mem[3]
3) k = 4, j = 0: Calculate Mem[4], Mem[6]
4) k = 4, j = 1: Calculate Mem[5], Mem[7]
We are at stage 3, meaning 2^(stage) = 8 and 2^(stage-1) = 4. Thus we have something like:
Loop stage until log2(N)
Loop k until N incrementing 8
Loop j until 4
Calculate Mem[ k+j ] = ….
Calculate Mem[ k+j + 4] = ….
1) k = 0, j = 0: Calculate Mem[0], Mem[4]
2) k = 0, j = 1: Calculate Mem[1], Mem[5]
3) k = 0, j = 2: Calculate Mem[2], Mem[6]
4) k = 0, j = 3: Calculate Mem[3], Mem[7]
The one detail that is missing relates to which twiddle factor to use. From Pattern 2 we can see that the superscript follows j while the subscript is equal to 2^(stage):
Loop stage until log2(N)
Loop k until N incrementing 2^(stage)
Loop j until 2^(stage-1)
Calculate Mem[ k+j ] = Mem[ k+j ] + W^{ j }_{2^stage} * Mem[ k+j + 2^(stage-1) ]
Calculate Mem[ k+j + 2^(stage-1) ] = Mem[ k+j ] - W^{ j }_{2^stage} * Mem[ k+j + 2^(stage-1) ]
Because the twiddle factor represents a phase rotation, and the amount of rotation between each butterfly in a given stage is consistent, we can start with a twiddle factor of 1 (which is W⁰) and repeatedly rotate it by multiplying with the stage’s base twiddle factor. This way, each new twiddle factor in the stage is just the previous one multiplied by a constant rotation, reducing the number of exponentials we need to compute. This is the approach we will take with our Python code.
For the STM32 code we will pre-calculate the twiddle factors and then use the method shown in the above pseudo-code.
The twiddle factor is calculated using the following equation:
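$$W_N^{k} = e^{-j 2\pi k / N}$$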
Which gives us the twiddle factor as a complex number. If we want it broken into real and imaginary (useful for when we code the STM32), we can use the following equation:
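$$W_N^{k} = \cos\!\left(\frac{2\pi k}{N}\right) - j \sin\!\left(\frac{2\pi k}{N}\right)$$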
I think this gives us enough background on what we will attempt to write in our Python and STM32 code.
Implementing FFT with Python
Before starting, note that I am using Python 3.13 along with the following libraries:
NumPy for array handling and math
Matplotlib for plotting
For my IDE, I’m using Spyder, since its layout is very similar to MATLAB’s, which makes it great for this kind of work. To see how I set up my workspace, see Using Spyder with Python Virtual Environment.
Let’s start by importing the required libraries for this tutorial:
```python
import numpy as np
import matplotlib.pyplot as plt
```
Next, we’ll create an example script with a signal composed of two sine waves: one at 1 kHz and the other at 6 kHz. The system will use a sampling rate of 100 kHz, and the FFT will be performed on 2⁹ = 512 samples. We’ll pass these samples into our custom FFT function called fft_iterative( signal , numberOfSamples ), where signal represents the y-axis data and numberOfSamples is the number of samples in the FFT (must be a power of 2). After completing the FFT, we’ll calculate the frequency values for each bin and plot the results.
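A minimal sketch of that script, using the fft_iterative( ) and fft_freq( ) functions we’ll build below:

```python
import numpy as np
import matplotlib.pyplot as plt

Fs = 100_000              # 100 kHz sampling rate
N = 2**9                  # 512 samples
t = np.arange(N) / Fs     # time axis

# test signal: 1 kHz + 6 kHz sine waves
signal = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 6000 * t)

X = fft_iterative(signal, N)    # our custom FFT
freqs = fft_freq(N, 1 / Fs)     # frequency value of each bin

plt.plot(freqs, np.abs(X))
plt.xlabel("Frequency (Hz)")
plt.ylabel("Magnitude")
plt.show()
```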
Now let’s start working on our FFT function fft_iterative( ).
1. We need to convert the input signal array into an array of complex numbers. Since the FFT operates on complex values, we have to make sure our NumPy array is set up to handle them properly. We can accomplish this by creating a new NumPy array from the old one but setting its dtype to complex:
```python
X = np.array(x, dtype=complex)
```
Where lowercase “x” is our input signal array passed into the fft_iterative( ) function, and uppercase “X” is our new complex array. Note that our input signal array could itself be complex if we wanted.
2. The next thing we have to do is reorder our time sample data in bit-reversed order. This just means we flip the order of the bits in each index. So, if we had 8 samples (with indices 0 to 7), we’d convert each index to binary, reverse the bits, and use that as the new position. For example, index 3 is 011 in binary; flip it to 110, which is 6, so the sample value held in Mem[3] moves to Mem[6]. Another example of bit-reversal is shown in the image below.
In Python we will create a function called bit_reverse_indices( ), to which we pass the number of samples; it will generate an array of the indices to which each memory element needs to be moved.
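A simple sketch of how bit_reverse_indices( ) might look:

```python
def bit_reverse_indices(N):
    bits = int(np.log2(N))       # number of address bits (N must be a power of 2)
    indices = []
    for i in range(N):
        rev = 0
        for b in range(bits):
            if i & (1 << b):                 # found a 1 at bit position b...
                rev |= 1 << (bits - 1 - b)   # ...mirror it to the other side
        indices.append(rev)
    return np.array(indices)
```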
In short, this function counts from 0 to N-1 and, for each count, walks from the least significant bit to the most significant bit. If it finds a 1 in the binary number, it moves it to the mirrored position on the other side of the number (reversing it). The function returns an array of numbers representing the bit-reversed indices. We can then use the following lines to actually reorder the array into this new bit-reversed order:
```python
indices = bit_reverse_indices(N)
X = X[indices]
```
Where X is our time samples.
3. With our time sample data in bit-reversed order and our input NumPy array of type complex, we can start implementing the core parts of our FFT. We will start by creating the different loops like we have in our pseudo-code from the previous section.
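The loop structure might look like this:

```python
for s in range(1, int(np.log2(N)) + 1):   # each FFT stage
    m = 2**s                              # 2^(stage)
    for k in range(0, N, m):              # step through memory in blocks of m
        for j in range(m // 2):           # each butterfly within the block
            ...                           # butterfly calculations go here
```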
Where s is the stage number, m is 2^(stage), and m//2 is equivalent to 2^(stage-1).
The final step is to perform the calculations:
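Using temporaries u and t for the butterfly (explained below), the complete fft_iterative( ) function might look like this:

```python
def fft_iterative(x, N):
    X = np.array(x, dtype=complex)            # complex working array
    X = X[bit_reverse_indices(N)]             # reorder input in bit-reversed order
    for s in range(1, int(np.log2(N)) + 1):
        m = 2**s                              # 2^(stage)
        wm = np.exp(-2j * np.pi / m)          # base twiddle factor for this stage
        for k in range(0, N, m):
            w = 1 + 0j                        # start each block at W^0 = 1
            for j in range(m // 2):
                t = w * X[k + j + m // 2]     # W * Mem[b]
                u = X[k + j]                  # Mem[a], saved before it is overwritten
                X[k + j] = u + t
                X[k + j + m // 2] = u - t
                w *= wm                       # rotate to the next twiddle factor
    return X
```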
Some things to note:
We use the variables u and t to hold the values of Mem[a] and W*Mem[b] before we add/subtract them together. If we didn’t do that, when we calculate X[k+j] we would change the value of X[k+j] that was needed in order to calculate X[k+j+m//2].
The code line w *= wm advances our twiddle factor phase rotation to the next position. Also, we start our twiddle factor at w = 1, which is equivalent to W⁰.
4. Now that we have the FFT output, which gives us the amplitude (and possibly phase) information for each frequency bin, the next step is to map those bin indices to real-world frequencies. The FFT result is essentially an array where each index corresponds to a specific frequency component of the input signal. However, the bin number by itself doesn’t directly tell us the frequency in hertz.
To interpret the results meaningfully when plotting, we need to convert each bin number into a frequency value using the sample rate and the total number of points in the FFT.
This is typically done with a simple formula for positive frequencies:
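$$f[\text{index}] = \frac{\text{index}}{N \cdot T_s} = \text{index} \cdot \frac{F_s}{N}$$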
And for negative frequencies:
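$$f[\text{index}] = \frac{\text{index} - N}{N \cdot T_s}$$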
Where index is the index of our FFT bin, and N is the number of time samples/frequency bins. It should be noted that the FFT output starts at 0 Hz and goes up to the sampling frequency (Fs). The second half of the FFT, from Fs/2 to Fs, actually represents negative frequencies due to the way complex exponentials work. So when we talk about "negative frequency," we're just shifting that upper half of the spectrum (Fs/2 to Fs) from 0 Hz down to –Fs/2.
The custom function fft_freq( numberOfSamples, samplePeriod ) will return an array of frequency values converted from the frequency bin numbers:
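A sketch, where Ts is the sample period (samplePeriod):

```python
def fft_freq(N, Ts):
    freqs = np.zeros(N)
    for i in range(N):
        if i < N // 2:
            freqs[i] = i / (N * Ts)           # positive frequencies: 0 to Fs/2
        else:
            freqs[i] = (i - N) / (N * Ts)     # negative frequencies: -Fs/2 to 0
    return freqs
```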
5. Just to round out the FFT we will also create a custom function to generate the IFFT.
The IFFT is almost the same as the FFT, but with two key differences: the twiddle factor uses +j instead of –j, effectively reversing the direction of phase rotation, and the final result is scaled by 1/N to undo the summation scaling. Aside from that, the structure of the algorithm remains the same.
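A sketch of the IFFT (I’ll call it ifft_iterative( ) here), which only changes the sign of the twiddle exponent and adds the 1/N scaling:

```python
def ifft_iterative(X, N):
    x = np.array(X, dtype=complex)
    x = x[bit_reverse_indices(N)]
    for s in range(1, int(np.log2(N)) + 1):
        m = 2**s
        wm = np.exp(+2j * np.pi / m)          # +j reverses the phase rotation
        for k in range(0, N, m):
            w = 1 + 0j
            for j in range(m // 2):
                t = w * x[k + j + m // 2]
                u = x[k + j]
                x[k + j] = u + t
                x[k + j + m // 2] = u - t
                w *= wm
    return x / N                              # undo the summation scaling
```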
Putting all of this code together and plotting:
Implementing FFT on STM32
A lot of what was done in Python carries over to this example on the STM32; however, we do have to pay extra attention to:
Handling complex numbers
Memory usage
Using the ADC to measure a real sine wave
Processing speed, even though we are measuring slow signals
For the next part, we are going to implement this FFT in C on an STM32H723 (NUCLEO-H723ZG) using STM32CubeIDE and the HAL libraries. These steps should be similar for most other STM32 microcontrollers, though there may be small differences depending on the specific device or peripheral configuration.
We’ll write a program that uses the microcontroller’s ADC to sample a sine wave. The ADC will collect a total of 1024 samples and store them in memory using DMA. To process the data in chunks, the DMA will be set up in double-buffer (or half-transfer) mode. When the first half of the buffer (512 samples) is filled, the DMA will trigger an interrupt. In that interrupt, we’ll convert the raw ADC values into floating-point voltage values. Then, we’ll rearrange the data using bit-reversal and store it into a separate 512-sample buffer. After that, we’ll perform a 512-point FFT and print the frequency spectrum to the serial console. Once the DMA finishes storing the second half of the buffer (samples 512 to 1023), it will trigger another interrupt. We’ll repeat the same process for this second half: convert the samples, apply bit-reversal, perform the FFT, and print the results. This structure allows continuous data acquisition and processing without missing samples, making it suitable for real-time or near-real-time analysis.
Let’s begin by setting up our project and pinout configuration:
Pinout and Configuration
The first step is creating a new STM32 project: select the microcontroller, give it a name, and use C as the targeted language.
If it is not already opened by default, double click the .ioc file from the project explorer to use the STM32CubeIDE configuration tool. The next steps will take place in the Pinout and Configuration tool provided by STM32CubeIDE:
For debugging we will select PA14 and PA13 as SWCLK and SWDIO respectively. To configure these we will go to the Trace and Debug category >> Debug >> Serial Wire
For the ADC, we’ll use ADC1 channel 15, which is located on pin PA3 for this microcontroller. To enable the ADC channel, check the IN15 Single-ended box in your configuration tool.
Next, we’ll set up the DMA. Go to the DMA Settings tab and click the Add button to add a new DMA request. Change the DMA mode to Circular, so that once the buffer is fully filled, the DMA doesn’t stop—it wraps around and starts writing again from index 0. This allows for continuous data acquisition.
For the address increment settings, check the Memory Increment Address option. This tells the DMA to increment the memory pointer as it fills the buffer with ADC values. Do not check the Peripheral Increment Address option, because the address of the ADC1 data register remains fixed.
Since our ADC resolution will be set to 16 bits (maximum resolution), set the Data Width for both the Peripheral and Memory to Half Word (which corresponds to 16 bits).
In the Parameter Settings tab we will configure the Clock Prescaler to be divided by 128, the Resolution to be 16 bits, and Continuous Conversion Mode to be enabled.
We will also set the End of Conversion Mode to end of sequence of conversion (i.e., after multiple conversions, as opposed to just one single conversion) and set the Conversion Data Management Mode to DMA circular mode.
Lastly, we will expand the Rank menu to change the Channel 15 sampling time to 16.5 cycles.
In the Clock Configuration tool of the .ioc file we will set the ADC clock mux to use PLL2P, which gives us a 48 MHz clock to the ADC block.
The ADC kernel clock, called adc_ker_ck, is generated from the 48 MHz PLL2P clock going into the ADC, divided by 2 × prescaler.
Thus the ADC kernel clock frequency is f_adc_ker_ck = 48 MHz / (2 × prescaler) = 48 MHz / (2 × 128) = 187.5 kHz
The ADC conversion time = ( T_smpl + T_sar ) × ( 1 / f_adc_ker_ck )
Where:
T_smpl is the channel sampling time, which is set to 16.5 clock cycles in our example
T_sar is the SAR ADC conversion time, which depends on the number of ADC bits. Since we are using 16-bit resolution, it takes 8.5 clock cycles.
Thus:
ADC conversion time = ( 16.5 + 8.5 ) × ( 1 / 187.5 kHz ) = 133.3 µs, or a 7.5 kHz sample rate
Next we will configure the UART so that we can print out our FFT result to a console.
That should be all of the configurations we need to do. To generate the code go to Project >> Generate Code.
The STM32 FFT Code
Now, working in the main.c file, let’s include the necessary libraries for our project. We’ll include math.h to perform mathematical operations, such as sine and cosine calculations needed for generating twiddle factors. We’ll also include stdio.h and string.h, which will be used for serial printing to the console and for handling string operations.
Next, we’ll define a few constants, notably FFT_BITS and FFT_SIZE. The FFT_SIZE represents the number of samples used in the FFT calculation, and it is computed as FFT_SIZE = 2^FFT_BITS.
We’ll also define a macro to handle the conversion of the ADC’s 16-bit output value to a voltage (as a float), assuming a reference voltage of 3.3 V. This macro will scale the raw ADC value into a corresponding 0–3.3 V level.
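A sketch of these definitions; the macro assumes the full 16-bit range (0–65535) maps to 0–3.3 V:

```c
#define FFT_BITS  9
#define FFT_SIZE  (1 << FFT_BITS)   /* 2^9 = 512 samples */

/* convert a raw 16-bit ADC reading into a voltage, assuming a 3.3 V reference */
#define adc16_to_voltage(raw)  (((float)(raw)) * 3.3f / 65535.0f)
```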
Next, we’ll define our global variables. The first one is an array called adcBuffer_16b, which is used to store raw ADC samples via DMA. This buffer is sized to be twice as large as the number of samples needed for the FFT, allowing us to collect ADC data continuously while processing previous data. When the first half of the buffer is filled (i.e., the first 512 samples), the DMA triggers a half-transfer interrupt to signal that the buffer is halfway full. At that point, our code begins processing those 512 samples with the FFT. While that processing is happening, the ADC and DMA continue collecting data into the second half of the buffer. Once the second half is filled (samples 512–1023), the DMA generates a transfer-complete interrupt. The code can then process that half while the first half is being refilled, and the cycle repeats. This double-buffering scheme ensures continuous sampling and processing without losing data.
Since adcBuffer_16b is an array of type uint16_t, we need to convert the raw ADC samples into voltage values (as float) before performing the FFT. These converted values will be stored in a separate array called fftADC. As we’ll show later, we’ll also perform bit-reversal during this transfer step; reordering the samples as they’re moved from adcBuffer_16b to fftADC in preparation for the FFT. The fftADC will be twice as large as the number of samples needed for the FFT as we need to store both the real and imaginary parts that make up a complex number.
The twiddle array will store all the twiddle factors that are pre-computed for the FFT. This array will also be twice as large as the number of required twiddle factors (FFT_SIZE/2), as we need to store both the real and imaginary parts.
Lastly, we have the fftADC_ready flag to signal to our program that the ADC sample data is converted to voltage, arranged in bit-reversal, moved into fftADC and we are now ready to process the FFT.
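A sketch of these globals:

```c
uint16_t adcBuffer_16b[2 * FFT_SIZE];  /* raw ADC samples, double-buffered by the DMA */
float    fftADC[2 * FFT_SIZE];         /* FFT working buffer, interleaved re/im pairs */
float    twiddle[FFT_SIZE];            /* FFT_SIZE/2 twiddle factors, interleaved re/im */
volatile uint8_t fftADC_ready = 0;     /* set in the DMA callbacks, cleared in the main loop */
```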
Before enabling the ADC and running the FFT, we’ll perform some initial setup. Specifically, we want to precompute the twiddle factors, which will be used during the FFT calculation. In addition, we’ll zero out both the adcBuffer_16b buffer and the fftADC array. While this step isn’t strictly necessary for functionality, it can help with debugging by ensuring that all values are initialized to known states. For that reason, I recommend doing it at this stage.
We will create a function called fft_512samp_init( ) and call it inside the USER CODE 2 section of main.c (i.e., just after our hardware initialization code):
We will place the fft_512samp_init( ) function in the USER CODE 0 section.
As a reminder, we can use the following equation to calculate the twiddle factor:
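$$W_{N}^{k} = e^{-j 2\pi k / N} = \cos\!\left(\frac{2\pi k}{N}\right) - j \sin\!\left(\frac{2\pi k}{N}\right)$$

with N = FFT_SIZE and k running over the FFT_SIZE/2 twiddle factors we need.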
Since we need to handle complex numbers, the twiddle array is allocated at twice the size as the number of twiddle factors that we need (# twiddle = FFT_SIZE / 2, thus we need the twiddle array to be FFT_SIZE long). This allows us to store both the real and imaginary parts of each complex sample in a single interleaved array. The real parts are stored at even indices (0, 2, 4, ...), and the imaginary parts are stored at the corresponding odd indices (1, 3, 5, ...). This format is commonly used in embedded DSP implementations because it simplifies memory access and keeps the data tightly packed. Alternatively, we could have used two separate arrays, one for the real components and one for the imaginary, but interleaving is often more efficient.
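A sketch of fft_512samp_init( ) based on the above:

```c
void fft_512samp_init(void)
{
  /* pre-compute the FFT_SIZE/2 twiddle factors, stored interleaved (re, im) */
  for (uint32_t k = 0; k < FFT_SIZE / 2; k++)
  {
    float angle = -2.0f * (float)M_PI * (float)k / (float)FFT_SIZE;
    twiddle[2 * k]     = cosf(angle);   /* real part at the even index */
    twiddle[2 * k + 1] = sinf(angle);   /* imaginary part at the odd index */
  }

  /* zero the buffers so everything starts in a known state */
  memset(adcBuffer_16b, 0, sizeof(adcBuffer_16b));
  memset(fftADC, 0, sizeof(fftADC));
}
```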
Now that all initial setup is complete, we’ll start the ADC in DMA mode to begin continuous sampling. This is typically done using the HAL_ADC_Start_DMA( ) function, which links the ADC to the DMA controller and begins filling the buffer with ADC samples (placed in the USER CODE 2 section).
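The call itself, assuming the default CubeMX handle name hadc1, looks something like:

```c
HAL_ADC_Start_DMA(&hadc1, (uint32_t *)adcBuffer_16b, 2 * FFT_SIZE);
```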
At this point, the ADC and DMA are running continuously, filling the adcBuffer_16b array with sampled data. When the buffer is half full, an interrupt is triggered, and the DMA calls the HAL_ADC_ConvHalfCpltCallback( ) function. Inside this callback, we begin processing the newly available data. We move the samples from adcBuffer_16b, convert each one to a floating-point voltage, and store the result in the fftADC array in bit-reversed order. To do this, we define a pointer to index 0 of adcBuffer_16b and calculate the correct offset for each sample based on its bit-reversed position using the bit_reverse16( ) function (see below). Then we use the adc16_to_voltage() macro to scale the raw 16-bit ADC value into its corresponding voltage. This process ensures that our FFT input is properly formatted and ready for processing as soon as the data becomes available.
As with the twiddle array, the fftADC array is allocated at twice the size of the FFT length. This allows us to store both the real and imaginary parts of each complex sample in a single interleaved array: the real parts at even indices (0, 2, 4, ...) and the imaginary parts at the corresponding odd indices (1, 3, 5, ...).
The same approach is used when the second half of the buffer is filled; this time the HAL_ADC_ConvCpltCallback( ) function is called. In that case, we define the pointer to start at index FFT_SIZE (i.e., index 512, the first sample of the second half, since indexing starts at 0) and repeat the same operations: bit-reversal, voltage conversion, and storing the data in fftADC. This alternating, double-buffered processing allows continuous sampling and real-time FFT computation without missing any data.
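A sketch of both callbacks:

```c
void HAL_ADC_ConvHalfCpltCallback(ADC_HandleTypeDef *hadc)
{
  uint16_t *src = &adcBuffer_16b[0];            /* first half of the DMA buffer */
  for (uint32_t n = 0; n < FFT_SIZE; n++)
  {
    uint32_t rev = bit_reverse16(n);            /* bit-reversed destination index */
    fftADC[2 * rev]     = adc16_to_voltage(src[n]);  /* real part */
    fftADC[2 * rev + 1] = 0.0f;                      /* imaginary part starts at zero */
  }
  fftADC_ready = 1;
}

void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef *hadc)
{
  uint16_t *src = &adcBuffer_16b[FFT_SIZE];     /* second half of the DMA buffer */
  for (uint32_t n = 0; n < FFT_SIZE; n++)
  {
    uint32_t rev = bit_reverse16(n);
    fftADC[2 * rev]     = adc16_to_voltage(src[n]);
    fftADC[2 * rev + 1] = 0.0f;
  }
  fftADC_ready = 1;
}
```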
(placed in the USER CODE 0 section)
The bit_reverse16( ) function calculates the bit-reversed index of a given value, which is required for correctly ordering input data before performing the FFT. It works by using a series of bitmasking and shifting operations to reverse all 16 bits of the input value in stages; first swapping individual bits, then 2-bit groups, then nibbles (4-bit groups), and finally bytes. After fully reversing the bits, the result is shifted to match the actual bit width of the FFT (e.g., 9 bits for a 512-point FFT).
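A sketch of bit_reverse16( ):

```c
uint32_t bit_reverse16(uint32_t x)
{
  /* swap adjacent bits, then 2-bit pairs, then nibbles, then bytes */
  x = ((x & 0x5555) << 1) | ((x & 0xAAAA) >> 1);
  x = ((x & 0x3333) << 2) | ((x & 0xCCCC) >> 2);
  x = ((x & 0x0F0F) << 4) | ((x & 0xF0F0) >> 4);
  x = ((x & 0x00FF) << 8) | ((x & 0xFF00) >> 8);
  /* all 16 bits are now reversed; shift down so only FFT_BITS bits remain */
  return x >> (16 - FFT_BITS);
}
```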
(placed in the USER CODE 0 section)
The final step inside both the HAL_ADC_ConvHalfCpltCallback() and HAL_ADC_ConvCpltCallback() functions is setting the fftADC_ready flag. This flag acts as a signal to the main while(1) loop; indicating that a new set of ADC samples has been processed and stored in the fftADC buffer, and we’re now ready to calculate the FFT. By using this flag, we avoid doing heavy FFT computation inside the interrupt itself and instead defer it to the main loop, keeping the interrupt handlers fast and efficient.
In our main while(1) loop we have an if statement checking the fftADC_ready flag. If this flag is set, we will perform the FFT (called fft_dit_512samp( ), where the dit refers to the fact that we are doing a decimation-in-time FFT). After the FFT is complete we will print all the data to the serial console using the printFFT_Magnitude_All( ) function and finally clear the fftADC_ready flag.
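A sketch of that loop:

```c
while (1)
{
  if (fftADC_ready)
  {
    fft_dit_512samp();          /* in-place FFT on the fftADC array */
    printFFT_Magnitude_All();   /* print every bin's magnitude over the UART */
    fftADC_ready = 0;
  }
}
```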
The FFT on the STM32 is performed in a very similar way to the Python version, with some notable differences:
In Python, complex numbers are handled natively and can be manipulated just like any other numeric type. In contrast, on the STM32 using C, we represent complex numbers manually by interleaving the real and imaginary parts within the same array. This means that for each complex value, the real part is stored at an even index and the corresponding imaginary part is stored at the next odd index.
In Python, we calculated the base twiddle factor for each FFT stage and, starting from W⁰ = 1, applied successive complex multiplications to generate the necessary rotations during the FFT computation. This approach is flexible and easy to implement in high-level languages, but it can be computationally expensive in real-time applications. On the STM32 using C, we take a different approach by precomputing all the required twiddle factors ahead of time and storing them in an array. During the FFT, we select the appropriate twiddle factor using the stride variable and the loop index j. This saves time during execution by avoiding repeated calls to sine and cosine functions, which is especially important on resource-constrained embedded systems.
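Putting this together, a sketch of what fft_dit_512samp( ) might look like, given the interleaved layout and the precomputed twiddle table described above:

```c
void fft_dit_512samp(void)
{
  /* fftADC already holds the bit-reversed, interleaved (re, im) input */
  for (uint32_t s = 1; s <= FFT_BITS; s++)
  {
    uint32_t m      = 1u << s;          /* 2^(stage) */
    uint32_t half   = m >> 1;           /* 2^(stage-1) */
    uint32_t stride = FFT_SIZE / m;     /* step through the pre-computed twiddle table */

    for (uint32_t k = 0; k < FFT_SIZE; k += m)
    {
      for (uint32_t j = 0; j < half; j++)
      {
        float wr = twiddle[2 * j * stride];       /* twiddle real part */
        float wi = twiddle[2 * j * stride + 1];   /* twiddle imaginary part */

        uint32_t a = 2 * (k + j);                 /* Mem[a], real-part index */
        uint32_t b = 2 * (k + j + half);          /* Mem[b], real-part index */

        /* t = W * Mem[b] (complex multiply) */
        float tr = wr * fftADC[b] - wi * fftADC[b + 1];
        float ti = wr * fftADC[b + 1] + wi * fftADC[b];

        /* u = Mem[a], saved before it is overwritten */
        float ur = fftADC[a];
        float ui = fftADC[a + 1];

        fftADC[a]     = ur + tr;
        fftADC[a + 1] = ui + ti;
        fftADC[b]     = ur - tr;
        fftADC[b + 1] = ui - ti;
      }
    }
  }
}
```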
After executing the FFT, the data in the fftADC array represents the signal in the frequency domain. Each complex value in the array corresponds to a specific frequency bin, containing both the real and imaginary parts of the frequency component. These values can be used to compute the magnitude and phase of each frequency present in the original signal.
Finally, we will print the magnitude of the frequency spectrum to the serial console using the printFFT_Magnitude_All( ) function:
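A sketch, assuming the UART handle is huart3 (the Nucleo board’s virtual COM port UART):

```c
void printFFT_Magnitude_All(void)
{
  char msg[64];
  for (uint32_t k = 0; k < FFT_SIZE; k++)
  {
    float re  = fftADC[2 * k];
    float im  = fftADC[2 * k + 1];
    float mag = sqrtf(re * re + im * im);   /* magnitude of frequency bin k */
    snprintf(msg, sizeof(msg), "%lu, %f\r\n", (unsigned long)k, mag);
    HAL_UART_Transmit(&huart3, (uint8_t *)msg, strlen(msg), HAL_MAX_DELAY);
  }
}
```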
To get snprintf( ) to work with floats, you will need to enable the printf_float option. You can do this in STM32CubeIDE by right-clicking on your project and opening the Properties dialog.
Then expand the C/C++ Build >> Settings >> MCU/MPU Settings and finally check the Use float with printf … checkbox.
With all the pieces in place (ADC configuration, DMA setup, buffer management, twiddle factor initialization, and the FFT implementation), we now have a complete system that continuously samples data, processes it in real time, and transforms it into the frequency domain. In the next section, we’ll test this setup by feeding in known signals and analyzing the FFT output to confirm that everything is working as expected.
If you want to see the complete code for this FFT, check it out on GitHub.
Testing STM32 FFT Code
In this section, we’ll test the FFT code by feeding a known 1 kHz sine wave into the ADC input pin and plotting the FFT results received over the serial console. To do this, we’ll connect a signal generator to the appropriate ADC pin and observe how the system processes and transforms the signal. Unfortunately, due to the time it takes to transmit the FFT output over serial, we need to pause data acquisition during testing. To prevent the buffer from being overwritten while we're still printing, we stop the DMA immediately after the HAL_ADC_ConvHalfCpltCallback() function is triggered. This gives us time to inspect the output without losing any data.
With this change we can run our code, and our serial output looks like this in Tera Term:
By selecting all the serial data and copying it into Excel using Edit → Copy Table, we can generate a plot of the FFT output. The resulting graph shows a peak at DC (which is expected due to the signal’s offset), along with two clear peaks at ±1 kHz, matching the frequency of our test input and confirming that the FFT is working correctly.
This initial test is useful, as it confirms that our FFT is functioning correctly and that there are no unexpected spikes or noise artifacts in the output. However, for practical applications, it may be more useful to extract and print only the most significant frequency component, specifically the peak frequency in the spectrum (excluding the DC bin). This approach reduces the amount of data sent over serial and gives us a quick, high-level view of the dominant frequency present in the signal.
First add the new function called printFFT_Magnitude_MaxPossitive( ) in the USER CODE 0 section:
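A sketch of that function; the bin-to-Hz conversion assumes the ~7.5 kHz sample rate we calculated earlier:

```c
void printFFT_Magnitude_MaxPossitive(void)
{
  uint32_t maxBin = 1;    /* start at bin 1 to exclude the DC bin */
  float    maxMag = 0.0f;
  for (uint32_t k = 1; k < FFT_SIZE / 2; k++)   /* positive-frequency half only */
  {
    float re  = fftADC[2 * k];
    float im  = fftADC[2 * k + 1];
    float mag = sqrtf(re * re + im * im);
    if (mag > maxMag)
    {
      maxMag = mag;
      maxBin = k;
    }
  }

  float freq = (float)maxBin * 7500.0f / (float)FFT_SIZE;   /* bin number -> Hz */

  char msg[64];
  snprintf(msg, sizeof(msg), "%f, %f\r\n", freq, maxMag);
  HAL_UART_Transmit(&huart3, (uint8_t *)msg, strlen(msg), HAL_MAX_DELAY);
}
```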
Then in the main while(1) loop replace the function call printFFT_Magnitude_All( ) with printFFT_Magnitude_MaxPossitive( ) as shown:
Finally, we want to remove the code line which stops the DMA in the HAL_ADC_ConvHalfCpltCallback() function.
With these changes to our program, we can run our code, and our serial output looks like this in Tera Term:
Where the first column is the max positive frequency and the second column is its magnitude.
Conclusion
With this, I’ve shown how to write your own FFT from scratch: first in Python for clarity and experimentation, and then in C on an STM32 microcontroller for embedded applications. This implementation is intentionally kept simple and unoptimized, with a focus on clarity and education rather than speed or efficiency. The goal was to understand the underlying mechanics of the FFT and how to apply them in both high-level and low-level environments.
While there are faster, more advanced libraries available for both Python and embedded systems, building the FFT yourself gives valuable insight into how the algorithm works, how data needs to be arranged, and how performance considerations come into play, especially in resource-constrained systems. If you're just getting started with digital signal processing or embedded programming, this kind of hands-on exploration is a great way to build intuition and confidence.
There’s plenty of room for improvement from here, such as adding fixed-point support, optimizing memory access, or implementing windowing and overlap. But this foundation provides a strong starting point for more advanced DSP work.