Audio-reactive WebGPU: from mic FFT to a 64-bin spectrum

I wanted the shader lab to listen. Not as a gimmick — as a second input alongside the mouse and the gyroscope, so the visuals could breathe with whatever's playing in the room. That meant getting microphone audio into a WGSL shader. There's no direct pipe; you go through the Web Audio API, pull frequency data out as a byte array on the CPU, and upload it to the GPU each frame. I ended up building two paths: a coarse one every scene gets for free, and a rich one for a dedicated visualizer.

The audio graph

The graph is small. Request the mic, wrap the stream in a source node, and feed it into an analyser:

const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const ctx = new AudioContext();
const src = ctx.createMediaStreamSource(stream);
const analyser = ctx.createAnalyser();
analyser.fftSize = 256;
src.connect(analyser);
const data = new Uint8Array(analyser.frequencyBinCount); // 128 bins when fftSize is 256

The AnalyserNode is the whole point. With fftSize set to 256, calling getByteFrequencyData fills a 128-element byte array — 128 frequency bins, each 0..255, low frequencies first. I deliberately don't connect the analyser to the audio destination, so nothing plays back; I only want to read it, not hear it.
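To get a feel for what those 128 bins cover, it helps to convert a bin index to a frequency: bin i sits at i * sampleRate / fftSize Hz. The sketch below assumes a 48 kHz context, which is common but not guaranteed (the real value comes from ctx.sampleRate), and binToHz is a hypothetical helper, not something from the lab's code.

```javascript
// Hypothetical helper: centre frequency (Hz) of FFT bin `index`.
// Bins are spaced sampleRate / fftSize apart, starting at 0 Hz.
function binToHz(index, sampleRate, fftSize) {
  return (index * sampleRate) / fftSize;
}

// Assumed 48 kHz context; in the browser, read ctx.sampleRate instead.
const sampleRate = 48000;
const fftSize = 256;

const binWidth = binToHz(1, sampleRate, fftSize); // spacing between bins
const topOfRange = binToHz(fftSize / 2, sampleRate, fftSize); // Nyquist limit
```

Under that assumption each bin is 187.5 Hz wide, so the first 16 bins span roughly 0 to 3 kHz. A different sample rate shifts every boundary proportionally.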

Coarse path: a four-band vec4f every scene multiplies in

The cheap, universal path collapses those 128 bins into four numbers — bass, mid, high, and overall level — by averaging slices of the array:

analyser.getByteFrequencyData(data);
let sBass = 0, sMid = 0, sHigh = 0;
for (let i = 0; i < 16; i++) sBass += data[i];
for (let i = 16; i < 64; i++) sMid += data[i];
for (let i = 64; i < 128; i++) sHigh += data[i];
const bass = sBass / 16 / 255;
const mid = sMid / 48 / 255;
const high = sHigh / 64 / 255;
const level = (sBass + sMid + sHigh) / 128 / 255;

Those four floats get packed into one slot of the shared uniform buffer as an audio vector — bass, mid, high, level. Every scene reads the same uniform layout, so each one can multiply audio into whatever it already animates. The Mandelbulb scales its spin by level. The tunnel amplifies its twist by bass. The particle field injects bass into flow speed and lets mid shift the cool-to-hot color ramp. It's one shared field, and each scene decides how loud the music gets to be. No new pipeline, no new binding — the band data just rides along in the uniforms that were already there.
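As a rough illustration of that packing, here is a minimal sketch. The block size and the audio slot's position (float index 8, byte offset 32) are invented for the example, and packAudio is a hypothetical helper; the lab's real uniform struct will have its own layout. The only hard constraint is that a vec4f must start on a 16-byte boundary.

```javascript
// Assumed layout: 12 floats in total, with the audio vec4f at float index 8
// (byte offset 32, which satisfies vec4f's 16-byte alignment).
const AUDIO_FLOAT_OFFSET = 8;
const uniformData = new Float32Array(12);

// Hypothetical helper: write the four bands into their slot in place.
function packAudio(uniforms, bass, mid, high, level) {
  uniforms.set([bass, mid, high, level], AUDIO_FLOAT_OFFSET);
  return uniforms;
}

packAudio(uniformData, 0.5, 0.25, 0.125, 0.75);

// In the browser, the usual per-frame upload then carries the audio along:
// device.queue.writeBuffer(uniformBuffer, 0, uniformData);
```

Because the bands live in the same CPU-side array as the other uniforms, the existing per-frame writeBuffer call covers them without any extra upload.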

Rich path: 64 bins in a dedicated storage buffer

Four bands is enough to make a scene pulse, but it's not enough to draw the audio. For a real visualizer I wanted per-frequency resolution. So the second path downsamples the 128 FFT bins to 64 by averaging adjacent pairs, then uploads all 64 into a buffer the shader can index.

A uniform buffer is an awkward fit for this. WGSL requires array elements in the uniform address space to sit on a 16-byte stride, so array<f32, 64> can't be declared there without repacking it as vec4f values. A read-only storage buffer has no such restriction, and the shader can index it directly. The key design decision: that buffer only exists for scenes that ask for it. Each scene carries a needsSpectrum flag, and only when it's set does the host allocate the buffer and build a bind-group layout that adds a second binding. The shader declares it explicitly:

@group(0) @binding(1) var<storage, read> spectrum: array<f32, 64>;

Scenes that don't need it keep their original single-binding layout untouched. This is the part I'm happiest with: adding an audio visualizer didn't disturb the five existing pipelines at all. Binding 0 (the shared uniforms) is identical everywhere; binding 1 is purely additive, opt-in per scene.
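The opt-in layout could be sketched roughly like this. The function name is hypothetical, and fragment-only visibility is an assumption (the real scenes may also read the uniforms in the vertex stage). The stage flag is written as its numeric value so the sketch runs outside a browser.

```javascript
// Numeric value of GPUShaderStage.FRAGMENT, spelled out so this runs anywhere.
const FRAGMENT_STAGE = 0x2;

// Hypothetical helper: build bind-group layout entries for one scene.
// Binding 0 is always the shared uniforms; binding 1 is added only on request.
function bindGroupLayoutEntries(scene) {
  const entries = [
    { binding: 0, visibility: FRAGMENT_STAGE, buffer: { type: "uniform" } },
  ];
  if (scene.needsSpectrum) {
    entries.push({
      binding: 1,
      visibility: FRAGMENT_STAGE,
      buffer: { type: "read-only-storage" },
    });
  }
  return entries;
}

// In the browser:
// const layout = device.createBindGroupLayout({
//   entries: bindGroupLayoutEntries(scene),
// });
```

A scene without the flag gets exactly the single-entry layout it had before, which is why the existing pipelines don't need to change.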

Smoothing happens on the CPU

Raw FFT magnitudes flicker hard frame to frame. If you upload them straight, the bars strobe. So the host keeps a persistent 64-element Float32Array (smooth, a typed array so writeBuffer can upload it as-is) and exponentially smooths it toward each frame's target before uploading:

const k = Math.min(Math.max(knob, 0.02), 1);
for (let i = 0; i < smooth.length; i++) {
  const target = (data[i * 2] + data[i * 2 + 1]) / 2 / 255;
  smooth[i] += (target - smooth[i]) * k;
}
device.queue.writeBuffer(spectrumBuffer, 0, smooth);

The smoothing factor is wired to the scene's knob slider: low values give heavily damped, fluid bars; high values snap to the beat. Doing this on the CPU rather than in the shader means the smoothed state survives between frames in plain JS, which is far simpler than ping-ponging buffers on the GPU.
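To put numbers on what the knob does: after n frames of this kind of exponential smoothing, the remaining gap to the target is (1 - k)^n of the original step. The sketch below inverts that to count frames. framesToSettle is a hypothetical helper for illustration only, and it expects k in the clamped range (0, 1].

```javascript
// Frames needed for the smoothed value to cover `fraction` of a step change.
// After n frames the remaining gap is (1 - k)^n, so we solve
// (1 - k)^n <= 1 - fraction for the smallest whole n.
function framesToSettle(k, fraction = 0.9) {
  if (k >= 1) return 1; // k = 1 copies the target in a single frame
  return Math.ceil(Math.log(1 - fraction) / Math.log(1 - k));
}

const damped = framesToSettle(0.02); // knob at its clamped minimum
const snappy = framesToSettle(0.5); // knob at the midpoint
```

At the minimum k of 0.02 a bar needs 114 frames, close to two seconds at 60 fps, to cover 90% of a jump. At 0.5 it takes 4 frames. Since k is applied once per frame, the feel also depends on the display's refresh rate.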

The radial analyzer scene

The dedicated scene — I called it Spectrum — reads that buffer in the fragment shader and draws 64 bars arranged in a full circle. Each pixel's angle maps to a bin index; the bar's outer radius tracks that bin's magnitude. Color lerps from one theme tint to another across the frequency range, low to high. At the centre sits a bass-weighted core disc that pulses with the first eight bins, wrapped in a soft bloom. An outer halo and a faint mirrored ring add depth. To keep the bars from hard-stepping at bin boundaries, the shader samples the spectrum at a continuous index with linear interpolation between neighbours.
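The continuous-index sampling is easier to see on the CPU than in WGSL. This is a JavaScript reference sketch of the idea, not the shader itself. sampleSpectrum is a hypothetical name, and interpolating across the seam between the last bin and the first is an assumption about how the circle closes.

```javascript
// Map an angle (radians) to a continuous bin index and linearly interpolate
// between the two neighbouring bins. Assumption: the circle wraps, so the
// last bin blends back into the first.
function sampleSpectrum(spectrum, angle) {
  const n = spectrum.length;
  // Normalise the angle to a fraction of a full turn in [0, 1).
  const turn = (((angle / (2 * Math.PI)) % 1) + 1) % 1;
  const x = turn * n; // continuous bin index
  const i0 = Math.floor(x) % n; // lower neighbour
  const i1 = (i0 + 1) % n; // upper neighbour, wrapping at the seam
  const t = x - Math.floor(x); // blend weight toward the upper neighbour
  return spectrum[i0] * (1 - t) + spectrum[i1] * t;
}

// Halfway between bin 1 (value 1) and bin 2 (value 0) in a 4-bin spectrum.
const halfway = sampleSpectrum([0, 1, 0, 2], 2 * Math.PI * (1.5 / 4));
```

In WGSL the same arithmetic would typically use floor, fract, and mix, with the angle coming from atan2 on the pixel's centred coordinates.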

Never blank, even silent

A visualizer that's a black void until you grant mic permission feels broken. So when the mic is off, the host fills the same smoothing array with a synthetic idle sweep — a slow sine travelling across the bins plus a low-frequency hump that breathes over time. The scene looks alive immediately; the difference once you enable the mic is that it starts reacting to you instead of to a generator. A one-time inline hint nudges visitors toward enabling the mic, and it doubles as the enable button.
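One way such an idle generator could look is sketched below. Every constant in it (sweep speed, hump width, mix weights) is invented for illustration, and fillIdleSpectrum is a hypothetical name; the lab's actual curve will differ.

```javascript
// Fill `out` with a synthetic idle spectrum for time `t` (seconds).
// All constants are illustrative guesses, not the lab's real values.
function fillIdleSpectrum(out, t) {
  const n = out.length;
  const breathe = 0.6 + 0.4 * Math.sin(t * 0.8); // hump height pulses slowly
  for (let i = 0; i < n; i++) {
    const x = i / (n - 1); // bin position in 0..1
    // Slow sine travelling across the bins over time, remapped to 0..1.
    const sweep = 0.5 + 0.5 * Math.sin(2 * Math.PI * (x * 2 - t * 0.2));
    // Hump concentrated in the lowest bins, falling off like a Gaussian.
    const hump = breathe * Math.exp(-(x * x) / 0.02);
    out[i] = 0.25 * sweep + 0.5 * hump; // stays within 0..0.75
  }
  return out;
}

const idle = fillIdleSpectrum(new Float32Array(64), 1.5);
```

Writing into the same array the mic path uses means the upload code and the shader never need to know which source is active.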

Permission and teardown

Two disciplines keep this clean. First, the mic is request-on-toggle, not request-on-load: nobody wants a permission prompt the instant a page renders. Second, turning audio off, or unmounting the component, tears the whole graph down: stop every track on the stream, close the AudioContext, and null out the analyser and data references. Stopping the tracks is what releases the microphone; a stream left live keeps the mic indicator lit in the browser tab, which reads as spyware. Closing the AudioContext then frees its audio resources. Neither step is optional.
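The teardown steps above might look roughly like this. The shape of the state object (stream, ctx, analyser, data) is an assumption made for the sketch, and teardownAudio is a hypothetical name. The calls it relies on, MediaStream.getTracks(), MediaStreamTrack.stop(), and AudioContext.close(), are standard.

```javascript
// Hypothetical teardown for an assumed state object:
// { stream, ctx, analyser, data }.
// Safe to call more than once; each step is skipped if already done.
function teardownAudio(state) {
  if (state.stream) {
    // Stopping every track releases the microphone.
    for (const track of state.stream.getTracks()) track.stop();
  }
  let closed = Promise.resolve();
  if (state.ctx && state.ctx.state !== "closed") {
    closed = state.ctx.close(); // close() returns a Promise
  }
  // Drop references so nothing keeps reading a dead graph.
  state.stream = null;
  state.ctx = null;
  state.analyser = null;
  state.data = null;
  return closed;
}
```

Calling it from both the toggle handler and the component's unmount path covers the two exits described above, and the null checks make the second call harmless.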

If you want to see both paths live, the lab is at /lab — flip on the mic in any scene to feel the coarse modulation, then switch to Spectrum to watch the FFT become the whole picture.