The author finds, as many do, that naive or first-approximation approaches fail within certain constraints and that more complex methods are necessary to achieve simplicity. He finds, as I have, that perceptual and spectral domains are a better space to work in for things that are perceptual and spectral than in the raw data.
What I don't see him get to (might be the next blog post, IDK) is constraints in the use of color - everything is in 'rainbow town' as we say, and it's there that things get chewy.
I'm personally not a fan of emissive green LED light in social spaces. I think it looks terrible and makes people look terrible. Just a personal thing, but putting it into practice with these sorts of systems is challenging as it results in spectral discontinuities and immediately requires the use of more sophisticated color systems.
I'm also about maximum restraint in these systems - if they have flashy tricks, I feel they should do them very very rarely and instead have durational and/or stochastic behavior that keeps a lot in reserve and rewards closer inspection.
I put all this stuff into practice in a permanent audio-reactive LED installation at a food hall/nightclub in Boulder: https://hardwork.party/rosetta-hall-2019/
I really like your LED installation in Rosetta Hall, it looks beautiful!
Another related project that builds on a similar foundation: https://github.com/ledfx/ledfx
I remember thinking really hard on what to do with color. Except like you say mine is pretty much a naive fft.
https://github.com/aleksiy325/PiSpectrumHoodie?tab=readme-ov...
Thanks for reminding me.
- The more filters I added the worse it got. A simple EMA with smoothing gave the best results. Although, your pipeline looks way better than what I came up with!
- I ended up using the Teensy 4.0 which let me do real time FFT and post processing in less than 10ms (I want to say it was ~1ms but I can't recall; it's been a while). If anyone goes down this path I'd heavily recommend checking out the teensy. It removes the need for a raspi or computer. Plus, Paul is an absolute genius and his work is beyond amazing [1].
- I started out with non-addressable LEDs also. I attempted to switch to WS2812's as well, but couldn't find a decent algorithm to make it look good. Yours came out really well! Kudos.
- Putting the leds inside of an LED strip diffuser channel made the biggest difference. I spent so long trying to smooth it out getting it to look good when a simple diffuser was all I needed (I love the paper diffuser you made).
RE: What's Still Missing: I came to a similar conclusion as well. Manually programmed animation sequences are unparalleled. I worked as a stagehand in college and saw what went into their shows. It was insane. I think the only way to have that same WOW factor is via pre-processing. I worked on this before AI was feasible, but if I were to take another stab at it I would attempt to do it with something like TinyML. I don't think real time is possible with this approach. Although, maybe you could buffer the audio with a slight delay? I know what I'll be doing this weekend... lol.
Again, great work. To those who also go down this rabbit hole: good luck.
I haven't seen that done yet. I think it's one of those Dryland myths.
There are plenty of visual experiments with pianists doing this “Rock Band”/“Guitar Hero”-style visualization of notes.
To solve this I tried pre-processing the audio, which only works with recordings obviously. I extract the beats and the chords (using Chordify). I made a basic animation and pulsed the lights to the beat, and mapped the chords to different color palettes.
Some friends and I rushed to put it together as a Burning Man art project and it wasn't perfect, but by the time we launched it felt a lot closer to what I'd imagined. Here's a grainy video of it working at Burning Man: https://www.youtube.com/watch?v=sXVZhv_Xi0I
It works pretty well with most songs that you pick. Just saying there's another way to go somewhere between (1) fully reactive to live audio, and (2) hand designed animations.
I don't think there's an easy bridge to make it work with live audio though unfortunately.
I wonder if transformer tech is close to achieving real-time audio decoding, where you can split a track into its component instruments and drive a light show off of that. Think those fancy Christmastime front yard light shows, as opposed to random colors kind of blinking with what maybe is a beat.
I tried recreating the app (and I can connect via BT to the lights), but writing the audio-reactive code was the hardest part (and I still haven't managed to figure out a good rule of thumb or something). I mainly use it when listening to EDM or club music, so it's always a classic 4/4 time signature at 110-130 BPM, yet it's hard to have the lights react on beat.
But perhaps you'd get better results if more of a ML speech/audio recognition pipeline were included?
Eg. the pipeline could separate out drum beats from piano notes, and present them differently in the visualization?
An autoencoder network trained to minimize perceptual reconstruction loss would probably have the most 'interesting' information at the bottleneck, so that's the layer I'd feed into my LED strip.
This allowed the device to count the beats, and since most modern EDM music is 4/4, you can trigger effects every time something "changes" in the music after syncing once.
And of course, by the time I got it to work perfectly I never looked at it again. As is tradition.
(And it looks like the 7 frequencies are not distributed linearly—perhaps closer to the mel scale.)
I tried using one of the FFT libraries on the Arduino directly but had no luck. The MSGEQ7 chip is nice.
It was fiddly, and probably too inaccurate for a modern audience, but I can't claim it was diabolically hard. Tuning was a faff, but we were more willing to sit and tweak resistor and capacitor values back then.
I think it's more likely going to come from a direct integration with existing synthesis methods, but .. I’m kind of biased when it comes to audio and light synthesizers, having made a few of each…
We have addressed this expert tuning issue with the MagicShifter, which is a product not quite competing with the OP’s work, but very much aligned with it[1]:
.. which is a very fun little light synthesizer capable of POV rendering, in-air text effects, light sequencer programming, MIDI, and so on .. plus, it has a 6DOF sensor with magnetometer and accelerometer, plus touch-sensing and so on .. so you can use it for a lot of great things. We have a mode “BEAT” that you can place on a speaker to get reactive LED strips of a form (quite functional) pretty much micro-mechanically, as in: through the case and thus the sensor, not an ADC, not processing audio - just the levers in between the sensor and the audio source. So - not quite the same, but functionally equivalent in the long run (plus the MagicShifter is battery powered and pocketable, and you can paint your own POV images and so on, but .. whatever..)
The thing is, the limits: yes, there are limits - but like all instruments you need to tune to/from/with those limits. It’s not so much that achieving perfect audio-reactive LEDs is diabolically hard, but rather that making aesthetically/functionally relevant decisions about when to accept those limits requires a bit of gumption.
Humans can be very forgiving with LED/light-based interfaces, if you stack things right. The aesthetics of the thing can go a long way towards providing a great user experience .. and in fact, is important to giving it.
[1] - (okay, you can power a few meters of LED strips with a single MagicShifter, so maybe it is ‘competition’, but whatever..)
Edit: Oh wait, that project needs a PC or Raspberry PI for audio processing. WLED does everything on the ESP32.
https://en.wikipedia.org/wiki/Wicked_problem
Kinda funny, but I am a fan of green LED light to supplement natural light on hot summer days. I can feel the radiant heat from LED lights on my bare skin, and since the human eye is most sensitive to green light, I feel most comfortable with my LED strip set to (0, 255, 0).
“Most people who attempt audio reactive LED strips end up somewhere around here, with a naive FFT method. It works well enough on a screen, where you have millions of pixels and can display a full spectrogram with plenty of room for detail. But on 144 LEDs, the limitations are brutal. On an LED strip, you can't afford to "waste" any pixels and the features you display need to be more perceptually meaningful.”
That can be done with analog electronics, but even half an analog vocoder needs a lot of parts. It's going to be cheaper and more reliable to simulate it in software. This uses entirely IIR filters, which are computationally cheap and calculated one sample at a time, so they have the minimum possible latency. I'd be curious if any LLM actually recognizes that an audio visualizer is half a vocoder instead of jumping straight to the obvious (and higher latency) FFT approach.
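A minimal sketch of that idea in Python, assuming the standard RBJ Audio EQ Cookbook band-pass coefficients (the center frequencies and Q below are illustrative, not from any particular project):

```python
import math

class Biquad:
    """Direct Form I band-pass biquad, processed one sample at a time.

    Coefficients follow the RBJ Audio EQ Cookbook band-pass
    (constant 0 dB peak gain). Running a bank of these plus an
    envelope follower per band is the analysis half of a vocoder.
    """
    def __init__(self, fs, f0, q=4.0):
        w0 = 2 * math.pi * f0 / fs
        alpha = math.sin(w0) / (2 * q)
        a0 = 1 + alpha
        self.b0, self.b1, self.b2 = alpha / a0, 0.0, -alpha / a0
        self.a1, self.a2 = -2 * math.cos(w0) / a0, (1 - alpha) / a0
        self.x1 = self.x2 = self.y1 = self.y2 = 0.0

    def step(self, x):
        # One sample in, one sample out: per-sample latency,
        # no FFT window to wait for.
        y = (self.b0 * x + self.b1 * self.x1 + self.b2 * self.x2
             - self.a1 * self.y1 - self.a2 * self.y2)
        self.x1, self.x2 = x, self.x1
        self.y1, self.y2 = y, self.y1
        return y

# One band-pass per LED band; illustrative center frequencies.
bank = [Biquad(44100, f0) for f0 in (100, 300, 1000, 3000, 8000)]
```

Feeding the input through each filter and tracking each output's envelope gives a coarse spectrum estimate with minimal latency, which is exactly the property the FFT approach gives up.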
I get a cert mismatch on that site, and when clicking the shop link I end up at https://hackerspaceshop.com/ which is advertising an online fax service.
And yeah, I agree with the article. In the past I've also dabbled in audio-reactive LEDs and it's fiendishly difficult to make anything interesting.
Make it react too much and it's chaos; make the algorithm react too little to the audio and it's boring.
And in all cases it's really not easy to see what the LEDs are doing in correspondence with all the complexity of the music.
In short, audio and visual perception do not map perfectly. Humans don't have a linear perception of either so a perfect A to D then D to A conversion yields unsatisfying results.
Effects themselves are written in embedded Javascript and can be layered a bit like photoshop. Currently it only supports driving nanoleaf and wled fixtures, though wled gives you a huge range of options. The effect language is fully exposed so you can easily write your own effects against the real-time audio signals.
It isn't open source though, and still needs better onboarding and tutorials. Currently it's completely free, haven't really decided on if I want to bother trying to monetize any of it. If I were to it would probably just be for DMX and maybe midi support. Or maybe just for an ecosystem of portable hardware.
The classic "Color Organ" from the 70's.
There was a nice paper with an overview last year too, https://arxiv.org/html/2511.13146v1, that introduced RT-STT, which is still being tweaked and built upon in the MSS scene.
The high-quality ones like MDXNet and Demucs usually have at least several seconds of latency, but for something like driving visuals, high quality isn't really needed and the real-time approaches should be fine.
In the end it's "just" chunking streamed audio into windows and predicting which LEDs a window should activate. One can build a complex non-realtime pipeline, generate high-quality training data with it, and then train a much smaller model (maybe even an MLP) to predict just this task.
For my use case I want something fully portable and battery powered anyways. So the audio stuff should happen on the ESP32. (Or on my phone, that might work too)
When the author says:
> Every commercial audio reactive LED strip I've seen does this badly. They use simple volume detection or naive FFTs and call it a day. They don't model human perception on either side, which is why they all look the same.
well no, if they sell, then they are doing just fine until someone comes up with the $next $thing
(Note both the scanner in front of KITT and the visual FX on his dashboard when he speaks, which changes from season to season.)
The wickedness comes from wanting something that works just as well for John Summit as the Grateful Dead as Mozart and Bad Bunny.
But it seems like you could cheat for installations where the type of music is known and go from there. The other cheat is to have a "tap" button, and to pull that data and go from there.
mental note: the thought "it can't be that hard" (when obviously it is) sent me down a rabbit hole for a couple of hours
Everything is relative, though. In terms of maximums, a Pi 4 (for example) can use up to about 7 Watts under load by itself, which adds up fast when operating on batteries.
But a single 1 meter string of 144 WS2812B LEDs can suck down up to around 43 Watts, and 43 is a lot more than 7. :)
Lighting rigs are thirsty. The processing (even if it's the whole Pi) is generally a small drop in the bucket.
In 2016, I bought an LED strip and decided to make it react to music in real time. I figured it would take a few weeks, but it ended up being a rabbit hole. Ten years later, the project has 2.8k GitHub stars, has been covered by Hackaday, and is one of the most popular LED strip visualizer projects available. People have built it into nightclubs, integrated it with Amazon Alexa, and used it as their first electronics project.
I'm still not satisfied with it.

The scroll effect. Colors originate from the center and scroll outward, reacting to music in real time.
I started with non-addressable LED strips where I could control the brightness of the red, green, and blue color channels independently, but not the individual LED pixels. I tried the most obvious thing first: read the audio signal, measure the volume, and make the LEDs brighter when it's louder. These were all relatively straightforward time domain processing methods. Read a short chunk of audio around 10-50ms in duration, low pass filter it, map the intensity to brightness.
I assigned each color channel to a different time constant to get a kind of color effect. One color would respond rapidly to changes in volume, one would respond slowly, and one in the middle. You can get something like this working in an afternoon and it looks okay on an LED strip or a lamp with a single RGB LED.
It gets boring fast. All the interesting frequency information is lost and it works best on punchy electronic music. It is terrible on many other kinds of music where volume is not the most interesting feature. There is no understanding of what kind of sound the system is reacting to, just how loud it is.
I also had to implement adaptive gain control almost immediately. If you set a fixed volume threshold, the visualizer either saturates in a loud room or barely flickers in a quiet one. My favorite way to do this was with exponential smoothing, a simple and effective filter that I used over and over in various parts of the code.
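The idea can be sketched like this; the class name, parameter values, and the simple peak-tracking gain are illustrative, not the project's actual API:

```python
class ExpFilter:
    """Exponential moving average with separate rise and decay rates.

    A fast rise and slow decay makes the output jump on transients
    but fall off smoothly. Illustrative sketch, not the real code.
    """
    def __init__(self, value, alpha_rise=0.5, alpha_decay=0.1):
        self.value = value
        self.alpha_rise = alpha_rise    # fast response when signal rises
        self.alpha_decay = alpha_decay  # slow fall-off when signal drops

    def update(self, x):
        alpha = self.alpha_rise if x > self.value else self.alpha_decay
        self.value = alpha * x + (1.0 - alpha) * self.value
        return self.value

# Adaptive gain: track a slow-decaying estimate of the recent peak
# volume and normalize against it, so the visualizer adapts to both
# loud and quiet rooms instead of using a fixed threshold.
gain = ExpFilter(value=1e-3, alpha_rise=0.9, alpha_decay=0.01)

def normalize_volume(rms):
    peak = gain.update(max(rms, 1e-6))
    return min(rms / peak, 1.0)
```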
Although the time domain visualizer was okay, I found the limited output channels made the result unsatisfying. There is only so much information you can display on three color channels. Eventually, I switched to WS2812 addressable LEDs so that I'd have many more output features to work with.
The earliest prototype, 2017. Non-addressable LEDs, controlling only global brightness per color channel. Before I discovered the mel scale, before addressable LEDs. This is where it started.
The obvious next step was to use frequency domain methods. Collect a short chunk of audio, compute a Fourier transform (a mathematical tool that breaks audio into its individual frequencies), get frequency bins, and map them to LEDs. I had 144 pixels on a one meter strip, so I thought, 144 bins, one per LED. Then render the spectrum.
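The naive mapping looks roughly like this; it's an illustrative sketch of the approach, not the project's actual code:

```python
import numpy as np

N_LEDS = 144

def naive_fft_frame(samples):
    """Map FFT magnitude bins directly onto LEDs, roughly one per pixel.

    Because the bins are linearly spaced in Hz, this is the version
    that concentrates nearly all the energy in a handful of
    low-frequency LEDs and leaves the rest of the strip dark.
    """
    windowed = samples * np.hanning(len(samples))
    spectrum = np.abs(np.fft.rfft(windowed))
    # Crudely resample the linear bins down to the number of LEDs.
    groups = np.array_split(spectrum, N_LEDS)
    levels = np.array([g.mean() for g in groups])
    return levels / (levels.max() + 1e-9)  # brightness per LED, 0..1
```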
It kind of worked. I could tell right away that more of the audio was being captured compared to the volume method. But the result was deeply unsatisfying. Almost all of the energy was concentrated in a handful of LEDs, and most of the strip was dark.
I tried cropping the frequency range to use more of the strip. It helped a little, but I still felt that many of the LEDs were underutilized and that the FFT method was lopsided. I struggled with this for a long time.
Most people who attempt audio reactive LED strips end up somewhere around here, with a naive FFT method. It works well enough on a screen, where you have millions of pixels and can display a full spectrogram with plenty of room for detail. But on 144 LEDs, the limitations are brutal. On an LED strip, you can't afford to "waste" any pixels and the features you display need to be more perceptually meaningful.
Pixel Poverty, Feature Famine, Compression Curse, whatever you want to call it, is the central lesson I learned and the reason LED strip visualization is so difficult. You might think that LED strips are simpler than screen-based visualizers, but the opposite is true. A screen-based visualizer has millions of pixels to work with, but an LED strip has hundreds at most. You can compute tons of audio features and display them all on the screen, and if most of them are uninteresting, it doesn't matter. As long as some of the features resonate with what a human perceives as interesting, the visualization works. On an LED strip, you have to be right about which features are worth displaying.
An LED strip is pixel-poor. A one meter strip might have 144 LEDs. That's it, and there's nowhere to hide. Nearly every single pixel has to be doing something that a human perceives as musically relevant. The margin for error is incredibly narrow.
This is what makes LED strip visualizers fundamentally harder than screen-based ones. I couldn't just display raw signal processing data. I had to understand how humans actually perceive music and build a perceptual model into the pipeline.
I started reading papers from the speech recognition field to understand how their signal processing pipelines worked. Speech recognition has spent decades figuring out how to extract features from audio that match human perception, because if you can't model what a human hears, you can't transcribe what they said, and that's where I found the mel scale.
Humans don't perceive pitch linearly. The perceptual distance between 200Hz and 400Hz feels much larger than the distance between 8000Hz and 8200Hz, even though both spans are 200Hz. Our brains are heavily tuned to the speech band between roughly 300Hz and 3000Hz, and much less interested in frequencies far outside that range.
The mel scale transforms frequencies from Hz into a perceptual space where pitches are equally distant to a human listener. Instead of mapping raw FFT bins to pixels, which spreads the perceptually important frequencies across only a few LEDs, I mapped mel-scaled bins to pixels.
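The conversion uses the standard mel formula, mel(f) = 2595 · log10(1 + f/700). A sketch of how the band edges end up spaced (the frequency range and function names are illustrative):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_bin_edges(f_min, f_max, n_leds):
    """Band edges equally spaced in mel, so each LED covers a
    perceptually equal slice of pitch space."""
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_leds + 1)
    return mel_to_hz(mels)

# Low-frequency bands come out narrow and high-frequency bands wide,
# matching how pitch distances feel to a listener.
edges = mel_bin_edges(200.0, 12000.0, 144)
```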
The difference was night and day. The entire strip lit up. Every LED was doing something meaningful. That was the breakthrough. Everything else built on top of it.
What I realized is that the audio LED visualizer uses much of the same frontend as a traditional speech recognition pipeline. The mel filterbank, which speech systems use to extract perceptually relevant features before feeding them into a recognizer, is exactly what makes the LED strip come alive. I take the output of the mel filterbank and feed it directly into the three visualizations.

The audio visualizer implements most of the frontend of a traditional speech recognition pipeline. Speech recognition continues further into log energy and discrete cosine transforms, but the LED visualizer stops at the mel filter bank output and feeds it directly into the three visualization effects.
The mel scale solved the frequency mapping problem, but the raw output still flickered badly. Features changed too rapidly and the strip looked jittery and unpleasant. I needed the visualization to feel smooth and intentional, not noisy.
I applied exponential smoothing on a per-frequency-bin level, so each frame blends with the previous one. Features change gradually instead of jumping around. This eliminated the flicker without adding perceptible latency.
Then I discovered that convolutions (a mathematical operation that blends neighboring values together) were perfect for spatial smoothing. LED strips are 1D vectors, which makes them an ideal substrate for convolution operations. In university I learned the math of convolutions but the applications felt abstract. On the LED strip, it finally clicked. Different kernels gave me different effects: a narrow kernel for a max-like operation on adjacent pixels, wider kernels for gaussian blur. I could smooth the spectrum, soften transitions, and control how features blended spatially. I still think about convolutions in terms of LED strips today.
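The idea with NumPy, where the kernel size and sigma are illustrative choices:

```python
import numpy as np

def gaussian_kernel(size=9, sigma=2.0):
    """Normalized 1D gaussian kernel for spatial smoothing."""
    x = np.arange(size) - size // 2
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def smooth_strip(levels, kernel):
    """Blur the per-LED levels so features blend spatially.

    mode='same' keeps the output the same length as the strip.
    A narrow kernel merges adjacent pixels; a wide one gives a
    soft gaussian glow.
    """
    return np.convolve(levels, kernel, mode='same')
```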
At this point I realized the visualizer needs perceptual models on both sides of the pipeline. On the input side, the mel scale models how humans perceive sound. On the output side, I needed to model how humans perceive light.
We don't perceive brightness linearly either. A raw linear mapping of audio energy to LED brightness looks wrong because our eyes have a logarithmic response. This led me into gamma correction (adjusting brightness values to match how our eyes actually perceive light) and color theory: RGB, HSV, LAB, sRGB, complementary colors. I learned that mapping frequency content to color is its own rabbit hole, and that getting the color palette right makes a surprising difference in how "musical" the visualization feels.
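A simple lookup table captures the gamma correction idea; the exponent 2.2 is a common choice for LEDs, not necessarily the value this project uses:

```python
import numpy as np

GAMMA = 2.2  # common starting point; the right exponent depends on the LEDs

# Map linear 0-255 brightness to gamma-corrected PWM values, so that
# equal steps in the input look like equal steps in brightness.
gamma_table = np.array(
    [round(255 * (i / 255) ** GAMMA) for i in range(256)], dtype=np.uint8
)

def apply_gamma(pixels):
    """pixels: uint8 array of linear brightness values."""
    return gamma_table[pixels]
```

Mid-range values get pushed down hard (linear 128 maps to roughly 56), which is why uncorrected strips look washed out at the low end.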
I ended up with three visualizations. Spectrum renders the mel-scaled frequency content directly, one LED per perceptual frequency band. Scroll creates a time-scrolling energy wave that originates from the center and scrolls outward, with frequency content mapped to color. Energy pulses outward from the center with increasing sound energy. I wish there were more, but these three work well together.
The scroll effect on addressable LEDs, recorded in 2018 in my dorm room at UBC. A 20 second demo of what the visualizer can do once the mel scale, IIR filters, and convolutions are all working together.
All of this has to work in real time, with no knowledge of what comes next. Longer audio chunks give you higher quality frequency data but add lag. Shorter chunks are fast and responsive but noisy. I ended up using a rolling window that overlaps successive chunks, which gives you better frequency resolution without adding much lag. Finding the right window size took a lot of tweaking.
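The rolling window can be sketched like this; the frame and window sizes are illustrative, not the project's tuned values:

```python
import numpy as np

SAMPLE_RATE = 44100
FRAME = 735     # new samples per update (~17 ms, roughly 60 fps)
WINDOW = 2048   # analysis window, overlapping successive frames

rolling = np.zeros(WINDOW)

def process_frame(new_samples):
    """Slide new audio into the rolling window, analyze the whole window.

    The overlap gives the FFT a full WINDOW of frequency resolution
    while the update latency stays at one FRAME of audio.
    """
    global rolling
    rolling = np.roll(rolling, -len(new_samples))
    rolling[-len(new_samples):] = new_samples
    return np.abs(np.fft.rfft(rolling * np.hanning(WINDOW)))
```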
The project supports two main platforms. On a Raspberry Pi, the Pi handles both audio processing and LED rendering via GPIO. On an ESP8266, the audio processing runs on a PC in Python, and pixel data is streamed to the microcontroller in real time. The whole project is freely available at github.com/scottlawsonbc/audio-reactive-led-strip.

The system in action. Audio is processed on the computer and pixel data is streamed to the LED strip in real time.
The system installed in my living room. LED strips near the ceiling project light upward, and an LED matrix on the table adds another visualization. The laptop streams pixel data to all three strips in real time.
The first version of this project was installed in the Engineering Physics clubhouse at UBC. We used it at parties. It was crude, but people liked it. To reduce the glare from bare LEDs, I hand-crafted diffusers from paper sheets taped into tubes over the strip, giving a softer, more diffuse glow.
The earliest installation, 2017. The Engineering Physics clubhouse at UBC. Non-addressable LEDs, paper tube diffusers, people dancing. This video has a special place in my heart. I loved the smiles and delight it brought my friends and classmates.
After I graduated, I had a bit of time and spent a few weeks polishing the code, writing thorough documentation, and finishing up the project before putting it on GitHub.
It took off in a way I never expected. The project was covered by Hackaday in January 2017 and became popular on Reddit. As of today it has over 2,800 stars and 640 forks on GitHub. It's been used by thousands of people.
One of the first people to try the project was Joey Babcock. He reached out early on and eventually submitted a pull request. I remembered reading a blog post years earlier called The Pull Request Hack, where the central idea was: whenever someone sends you a pull request, give them commit access to your project. I thought, what the heck, I'll try it. So I gave Joey commit access. He became the first maintainer of the project other than me, and I am forever thankful for his efforts responding to issues and keeping the project alive. I couldn't believe the advice actually worked so well.
People started sending me videos of what they built. Richard Birkby integrated the project with his Amazon Echo. In his video, he says "Alexa, tell kitchen lights to show energy" and his room lights up. I was blown away that people were taking my project and using it in ways I had never expected.
Richard Birkby's Amazon Alexa integration. "Alexa, tell kitchen lights to show energy." I never imagined someone would do this with my project.
Another user who does AV at a club sent me a video of the strip in action during a DJ night, with dozens of people dancing in front of a live band. He wrote: "people were very happy... If they only knew this was the 6th Raspberry Pi doing stuff in the bar around them." The LED strip was mounted above the stage for everyone to see, lighting up in real time as the band played.
A nightclub DJ night. The LED strip is mounted above the stage, reacting to the live band in real time. Dozens of people dancing. This video was sent to me by someone on the other side of the world.
A developer in China forked the project, added ESP32 and Home Assistant support, wrote Chinese documentation, and built a custom microphone shield PCB to make setup easier. Someone made a YouTube video about the project because they felt it deserved more recognition. People around the world submitted pull requests adding beat detection, new effects, and code improvements.
The most rewarding part was learning that people used this as their first electronics project. Someone who had never soldered before bought an LED strip and a Raspberry Pi, followed the documentation, and got it working.
It's the only project I've worked on that took on a life of its own.
When a human manually codes an animation sequence for a specific song, the result is dazzling. Every beat and drop is perfectly timed. That hand-coded result is the gold standard, and automatic visualization is still far from it.
The biggest unsolved problem is making it work well on all kinds of music. The visualizer works best on punchy electronic music with clear beats and strong contrast. Vocal-heavy music, jazz, classical piano, guitar, violin all have different frequency and time domain characteristics. One piece of code can't perform well on all of them. They call for different approaches.
The other thing I want to crack is capturing that essential quality of music that makes a human tap their foot. When you listen to a song, you feel something and your body wants to move. Writing code that mimics that response would make the visualizer dramatically better. I haven't figured out how to do it reliably in real time.
I think the future of audio visualization on LED strips will involve a mixture of experts tuned for different genres, likely using neural networks. I have this idea of generating a training dataset by listening to music while holding an accelerometer, and using the relationship between the audio signal and my body's physical response to train an AI-based visualizer. I haven't done it yet. I have lots of ideas and not enough time.
I started this as a fun LED project. I ended up spending years learning how humans perceive pitch, how to smooth noisy signals, how our eyes respond to brightness, and the difficulty of mapping sound onto light through a pixel-poor bottleneck.
Every commercial audio reactive LED strip I've seen does this badly. They use simple volume detection or naive FFTs and call it a day. They don't model human perception on either side, which is why they all look the same.
When the mel scale is tuned and the filters are dialed in and the colors map to the right frequency bands, the strip comes alive. You put on a song and the LEDs feel like they understand the music. People sent me videos from nightclubs on the other side of the world.
It's the hardest thing I've built, and I'm still not done with it.