May 18, 2026 · 5 min read · webgpu · wgsl · compute · shaders

Compute shaders in the browser: 25k GPU particles in WebGPU

A real WebGPU compute pipeline driving a curl-noise particle field: 25,000 particles integrated entirely on the GPU, using the compute stage that WebGL never had.

The /lab page has a metaballs scene, a Mandelbulb, a tunnel, all raymarched in a fragment shader. Those are pixel work: each fragment is independent, the GPU runs the same code per pixel, you get a picture. But none of them simulate anything. The particles scene is different. It runs an actual physics step on 25,000 particles every frame, on the GPU, and WebGL offers no direct way to do that. This is the post about why, and how the pipeline is shaped.

WebGL has no compute, and that cost me a generation

WebGL, even WebGL2, has vertex shaders and fragment shaders and nothing else. WebGL2's transform feedback can capture vertex shader output into a buffer, but there is still no general-purpose "run this kernel over N elements and write the results back to a buffer" primitive. For fourteen years the usual workaround was a hack: encode your simulation state into a floating-point texture, render a fullscreen quad into a second texture, ping-pong between them, and read positions back as texels in the next pass. It works. It is also miserable: you are fighting the rasterizer to do arithmetic, your data layout is dictated by texture dimensions, and getting the result into a vertex buffer means another copy.

WebGPU has a real compute stage. You write a @compute function, you bind a storage buffer, you dispatch a grid of workgroups, and the GPU writes your results straight into a buffer you can then draw from. No texture encoding, no ping-pong, no readback. That single capability is the reason the particles scene exists.

The per-particle struct

Each particle is six floats — 24 bytes — and the WGSL struct mirrors the JS seed layout exactly:

struct Particle {
  pos: vec2f,
  vel: vec2f,
  life: f32,
  seed: f32,
}

On the host I seed PARTICLE_COUNT of these into a Float32Array: positions scattered on a unit disc, zero velocity, a randomized life, and a per-particle seed so each one decorrelates in the noise field. The whole array gets uploaded once with device.queue.writeBuffer and then never touched from JS again — the GPU owns it from that point on.
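
A minimal sketch of that seeding step, not the exact code behind the scene. `seedParticles` is a hypothetical helper, and the life range of 1 to 4 seconds is an assumption; only the six-float layout and the unit-disc scatter come from the description above.

```javascript
// Floats per particle: pos.xy, vel.xy, life, seed (mirrors the WGSL struct).
const FLOATS_PER_PARTICLE = 6;

// Hypothetical helper: builds the initial CPU-side particle data.
// `rand` is any function returning a float in [0, 1), e.g. Math.random.
function seedParticles(count, rand = Math.random) {
  const data = new Float32Array(count * FLOATS_PER_PARTICLE);
  for (let i = 0; i < count; i++) {
    const o = i * FLOATS_PER_PARTICLE;
    // sqrt(rand) spreads points uniformly over the disc's area
    // instead of clustering them at the center.
    const r = Math.sqrt(rand());
    const a = rand() * 2 * Math.PI;
    data[o + 0] = r * Math.cos(a); // pos.x
    data[o + 1] = r * Math.sin(a); // pos.y
    data[o + 2] = 0;               // vel.x
    data[o + 3] = 0;               // vel.y
    data[o + 4] = 1 + rand() * 3;  // life in seconds (range is an assumption)
    data[o + 5] = rand();          // per-particle noise seed
  }
  return data;
}
```

In the browser the result would be uploaded once, along the lines of `device.queue.writeBuffer(particleBuffer, 0, seedParticles(PARTICLE_COUNT))`, where `particleBuffer` stands for whatever name the buffer has.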

The compute pipeline shape

The buffer is created with three usages at once: STORAGE so the compute shader can read and write it, VERTEX so the render pass can draw it directly, and COPY_DST so I can seed it. Then a compute pipeline binds it as read_write storage:

@group(0) @binding(0) var<uniform> u: Uniforms;
@group(0) @binding(1) var<storage, read_write> particles: array<Particle>;
 
@compute @workgroup_size(64)
fn cs_main(@builtin(global_invocation_id) gid: vec3u) {
  let i = gid.x;
  if (i >= arrayLength(&particles)) { return; }
  // integrate particle i ...
}

@workgroup_size(64) means 64 invocations per workgroup, and the host dispatches ceil(particleCount / 64) workgroups: 391 for 25,000 particles. The bounds check matters: the dispatch rounds up, so 391 × 64 = 25,024 invocations run, and the 24 past the end of the array return early.
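
The host-side arithmetic for the buffer and the dispatch can be sketched as below. The helper names are hypothetical. The `USAGE` values are the numeric `GPUBufferUsage` flags from the WebGPU spec, written out only so the sketch runs outside a browser; real code would use `GPUBufferUsage.STORAGE` and friends directly.

```javascript
const WORKGROUP_SIZE = 64;     // must match @workgroup_size(64) in the shader
const BYTES_PER_PARTICLE = 24; // six f32 values

// Numeric GPUBufferUsage flags, copied here so this runs without a GPU.
const USAGE = { COPY_DST: 0x08, VERTEX: 0x20, STORAGE: 0x80 };

// Hypothetical helper: the descriptor that would go to device.createBuffer().
function particleBufferDescriptor(count) {
  return {
    size: count * BYTES_PER_PARTICLE,
    usage: USAGE.STORAGE | USAGE.VERTEX | USAGE.COPY_DST,
  };
}

// Number of workgroups to dispatch so every particle gets an invocation.
function workgroupCount(count) {
  return Math.ceil(count / WORKGROUP_SIZE);
}
```

Keeping `WORKGROUP_SIZE` next to the dispatch math makes it harder for the shader attribute and the host constant to drift apart.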

Each frame the encoder runs the compute pass before the render pass. The order is non-negotiable — the render pass reads the positions the compute pass just wrote:

const cpass = encoder.beginComputePass();
cpass.setPipeline(computePipeline);
cpass.setBindGroup(0, computeBindGroup);
cpass.dispatchWorkgroups(workgroups);
cpass.end();
// ... then beginRenderPass and draw from the same buffer

Curl-noise flow field

Motion comes from a divergence-free vector field. I sample a cheap value-noise scalar potential and take its curl, which gives a field that swirls instead of converging — particles trace clean vortices rather than piling up at sinks:

fn curl(p: vec2f) -> vec2f {
  let e: f32 = 0.05;
  let n1 = vnoise(p + vec2f(0.0,  e));
  let n2 = vnoise(p - vec2f(0.0,  e));
  let n3 = vnoise(p + vec2f( e, 0.0));
  let n4 = vnoise(p - vec2f( e, 0.0));
  return vec2f((n1 - n2), -(n3 - n4)) / (2.0 * e);
}
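
Why this field cannot converge: for a scalar potential ψ, the curl is (∂ψ/∂y, −∂ψ/∂x), and its divergence is ∂²ψ/∂x∂y − ∂²ψ/∂y∂x = 0. A quick CPU check of the same central-difference scheme, as a sketch: since `vnoise` is not shown in this post, a smooth stand-in potential `sin(x) * cos(y)` is assumed in its place.

```javascript
// Stand-in potential (an assumption; the real scene uses value noise).
function potential(x, y) {
  return Math.sin(x) * Math.cos(y);
}

// Same central-difference curl as the WGSL version: (dψ/dy, -dψ/dx).
function curl(x, y, e = 0.05) {
  const n1 = potential(x, y + e);
  const n2 = potential(x, y - e);
  const n3 = potential(x + e, y);
  const n4 = potential(x - e, y);
  return [(n1 - n2) / (2 * e), -(n3 - n4) / (2 * e)];
}

// Central-difference divergence of the curl field at (x, y).
function divergence(x, y, h = 0.05) {
  const dvxdx = (curl(x + h, y)[0] - curl(x - h, y)[0]) / (2 * h);
  const dvydy = (curl(x, y + h)[1] - curl(x, y - h)[1]) / (2 * h);
  return dvxdx + dvydy;
}
```

The divergence comes out as zero up to floating-point rounding, because the two finite-difference operators commute just like the partial derivatives do.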

The integration is a damped Euler step: vel = vel * DAMPING + field * speed * DT, then pos = pos + vel * DT, with a hard speed clamp so a noise spike can never fling a particle off-screen in one tick. When a particle's life runs out or it drifts past a death radius, it respawns on the disc with a fresh life. The system stays alive forever without any host involvement.
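
A CPU reference of that step, as a sketch rather than the shader itself. Every constant here is an illustrative assumption, since the real values of DAMPING, speed, DT, the clamp and the death radius are not listed in this post, and respawning is left to the caller.

```javascript
// Illustrative constants (assumptions, not the scene's real values).
const DT = 1 / 60;
const DAMPING = 0.98;
const SPEED = 1.5;
const MAX_SPEED = 2.0;
const DEATH_RADIUS = 1.5;

// Advances one particle { pos: [x, y], vel: [x, y], life } in place.
// `field` is the flow vector sampled at the particle's position.
// Returns true when the particle should respawn.
function step(p, field) {
  // Damped velocity update driven by the flow field.
  p.vel[0] = p.vel[0] * DAMPING + field[0] * SPEED * DT;
  p.vel[1] = p.vel[1] * DAMPING + field[1] * SPEED * DT;

  // Hard speed clamp: rescale the velocity if it exceeds MAX_SPEED.
  const s = Math.hypot(p.vel[0], p.vel[1]);
  if (s > MAX_SPEED) {
    const k = MAX_SPEED / s;
    p.vel[0] *= k;
    p.vel[1] *= k;
  }

  // Position update uses the freshly updated velocity.
  p.pos[0] += p.vel[0] * DT;
  p.pos[1] += p.vel[1] * DT;
  p.life -= DT;

  return p.life <= 0 || Math.hypot(p.pos[0], p.pos[1]) > DEATH_RADIUS;
}
```

With the clamp in place, the furthest a particle can travel in one tick is `MAX_SPEED * DT`, no matter how large the sampled field value is.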

The pointer as a weak attractor

The mouse position rides in the shared uniform block as a vec2f. In the compute step it becomes a gentle pull, scaled low enough to steer stragglers without collapsing the whole field into a knot:

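// Map the mouse from [0, 1] into the simulation's [-1, 1] space.
let mouseTarget = (u.mouse - vec2f(0.5)) * 2.0;
// Spring-like pull: proportional to the distance from the pointer.
let pull = (mouseTarget - p.pos) * 0.04;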

A click adds a stronger, decaying transient pull toward the tap point. Both are just additional acceleration terms summed into the same velocity update.
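
The two pull terms can be sketched on the CPU as follows. The 0.04 factor and the [0, 1] to [-1, 1] mapping come from the shader lines above; the click's peak strength and its exponential half-life decay are assumptions, since the exact decay curve is not described here.

```javascript
// Acceleration toward the pointer. `mouse` is in [0, 1]^2, `pos` in [-1, 1]^2.
function pointerPull(mouse, pos, strength = 0.04) {
  const tx = (mouse[0] - 0.5) * 2;
  const ty = (mouse[1] - 0.5) * 2;
  return [(tx - pos[0]) * strength, (ty - pos[1]) * strength];
}

// Hypothetical click transient: starts at `peak` and halves every `halfLife` seconds.
function clickStrength(secondsSinceClick, peak = 0.6, halfLife = 0.25) {
  return peak * Math.pow(0.5, secondsSinceClick / halfLife);
}
```

Because both terms are plain accelerations, the click pull is just `pointerPull(clickPos, pos, clickStrength(t))` added to the same velocity update.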

Particles as a vertex buffer

The clever part of the render side: the storage buffer is bound as a per-instance vertex buffer. The render pipeline declares stepMode: "instance" with an arrayStride of 24 bytes, and three attributes (pos at offset 0, vel at offset 8, life at offset 16) pulled straight out of the same Particle layout. The seed at offset 20 is simply not exposed to the vertex stage. The vertex shader draws 6 vertices per instance (a quad), positions the quad at the particle, and colors it by velocity magnitude, mixing from tintA toward tintB. The draw call is one line: pass.draw(6, 25000). No CPU loop, no per-particle uniforms.
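
That layout, written out as the object that would go into the `buffers` array of the render pipeline's vertex state. This is a sketch: the `shaderLocation` numbers are assumptions and have to match whatever `@location` values the vertex shader declares.

```javascript
// Per-instance vertex buffer layout over the 24-byte Particle struct.
const particleVertexLayout = {
  arrayStride: 24,
  stepMode: "instance", // advance once per particle, not once per vertex
  attributes: [
    { shaderLocation: 0, offset: 0, format: "float32x2" },  // pos
    { shaderLocation: 1, offset: 8, format: "float32x2" },  // vel
    { shaderLocation: 2, offset: 16, format: "float32" },   // life
    // seed (offset 20) is not needed for drawing, so it gets no attribute.
  ],
};

// Byte sizes of the formats used above, for a quick sanity check.
const FORMAT_BYTES = { float32: 4, float32x2: 8 };
```

A cheap guard against layout drift is to assert that every attribute's `offset` plus its format size stays within `arrayStride`.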

The fragment shader gives each quad a soft circular falloff and writes premultiplied additive color, so overlapping particles bloom into light instead of stacking opaquely. The canvas is configured alphaMode: "premultiplied" and the blend state is one/one add.
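
As a sketch, the one/one add blend state and the arithmetic it performs per pixel. Using the same factors for the alpha channel is an assumption, as is the clamp to 1.0, which models an 8-bit unorm canvas format.

```javascript
// Would be set as fragment.targets[0].blend in the render pipeline descriptor.
const additiveBlend = {
  color: { srcFactor: "one", dstFactor: "one", operation: "add" },
  alpha: { srcFactor: "one", dstFactor: "one", operation: "add" }, // assumption
};

// CPU model of the blend: out = src * 1 + dst * 1, clamped per channel.
// `src` and `dst` are premultiplied [r, g, b, a] values in [0, 1].
function blendAdd(src, dst) {
  return src.map((s, i) => Math.min(1, s + dst[i]));
}
```

The clamp is why dense clusters saturate toward white: once a channel reaches 1.0, further particles add nothing to it.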

Performance

25,000 particles hold a steady 60 fps on integrated graphics. The compute step is trivially parallel (25k independent integrations is exactly what a GPU eats for breakfast), and the render is one instanced draw with cheap fragments. The cost is dominated by overdraw from the additive glow, not the simulation. I could probably push this to 100k+ before the compute step mattered.

This is the scene I point to when someone asks what WebGPU actually unlocks over WebGL. Not prettier pixels — a whole stage of the pipeline that simply did not exist before.

Try it at /lab, and for the fragment-shader side of the same sandbox see the metaballs writeup.