The 1080p pipeline that kept melting our queue.

Short version: a media pipeline we built was crashing under load. We blamed the renderer for three weeks. The renderer was fine.

Long version below.

The setup

A client needed to render 1080p video clips on demand. User uploads a script and some assets, the system stitches them together with TTS audio, returns a finished mp4. Sounds straightforward. Three stages — input, render, deliver.

We built it on a queue. Each stage was a worker. Each worker pulled jobs, did its thing, pushed to the next queue.
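
A rough sketch of that shape, for illustration only. It assumes Python's standard queue and threading modules in place of a real message broker, one worker per stage, and placeholder handlers; none of the names below come from the actual system.

```python
import queue
import threading

STOP = object()  # sentinel that tells a worker to shut down


def run_stage(handler, inbox, outbox):
    """Pull jobs from inbox, apply handler, push results to outbox."""
    while True:
        job = inbox.get()
        if job is STOP:
            outbox.put(STOP)  # pass the shutdown signal downstream
            break
        outbox.put(handler(job))


def run_pipeline(jobs, handlers):
    """Run one worker thread per stage, with a queue between stages."""
    queues = [queue.Queue() for _ in range(len(handlers) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(h, queues[i], queues[i + 1]))
        for i, h in enumerate(handlers)
    ]
    for t in threads:
        t.start()
    for job in jobs:
        queues[0].put(job)
    queues[0].put(STOP)

    # Drain the final queue until the shutdown signal arrives.
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Hypothetical stand-ins for the input, render, and deliver stages.
stages = [lambda j: j + ":input", lambda j: j + ":render", lambda j: j + ":deliver"]
print(run_pipeline(["clip-a", "clip-b"], stages))
```

Note that nothing in this shape knows how big a job is. Every job looks the same to the queue.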

The symptom

Under light load: fine. Under medium load: fine. Under "real users started using it": the render workers would spike CPU, time out, and the queue would back up until the system was effectively wedged.

What we tried

In order, because hindsight makes everything look obvious:

  1. More workers. Doubled the worker count. The wedge happened later but harder.
  2. Bigger machines. Threw a beefier instance at it. Bought us 30% more headroom and the same eventual wedge.
  3. A different renderer. Spent a week porting from one toolchain to another. Same wedge, different stack trace.
  4. Profiling. Finally. Should've been step one.

The actual problem

Each render job was processing one full clip end-to-end. Some clips were 15 seconds. Some were 90. The 90-second clips weren't 6× slower, as their length alone would suggest. They were 12× slower because of how memory was being held during the render.

Every time a long clip locked a worker, every short clip behind it queued. Workers weren't busy; they were blocked. Adding more workers didn't help because each new worker also got blocked behind a long clip.
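
A toy simulation of that effect, not our production code. It assumes every job is already queued at time 0, workers take jobs first-in-first-out, and a long clip costs 12 units against 1 for a short clip (the 12× ratio above). The function name and the cost units are made up for illustration.

```python
import heapq


def simulate_fifo(job_costs, n_workers):
    """Return how long each job waits before a worker picks it up.

    Jobs are served in list order; each goes to the worker that frees up first.
    """
    free_at = [0.0] * n_workers  # time at which each worker is next free
    heapq.heapify(free_at)
    waits = []
    for cost in job_costs:
        start = heapq.heappop(free_at)  # earliest available worker
        waits.append(start)
        heapq.heappush(free_at, start + cost)
    return waits


# Two long clips ahead of two short ones, two workers.
print(simulate_fifo([12, 12, 1, 1], n_workers=2))
# [0.0, 0.0, 12.0, 12.0]

# Double the workers while real traffic doubles the long clips.
print(simulate_fifo([12, 12, 12, 12, 1, 1], n_workers=4))
# [0.0, 0.0, 0.0, 0.0, 12.0, 12.0]
```

In both runs the short clips wait a full long render before they start, even though each needs only one unit of work.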

The fix

Chunk by duration, not by job.

We split each clip into 10-second chunks at the input stage, ran the chunks through a worker pool independently, and stitched at the output stage. Same total work. But now no single worker held the queue hostage.

The wedge stopped happening. We didn't need bigger machines. We needed smaller jobs.
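
A minimal sketch of the chunking idea, assuming a hypothetical render_chunk(start, end) callable that renders one time range and returns something the output stage can stitch. The function names, the thread pool, and the worker count are illustrative choices, not a description of the real system.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(duration_s, chunk_s=10.0):
    """Cut a clip of duration_s seconds into (start, end) ranges of at most chunk_s."""
    if duration_s <= 0 or chunk_s <= 0:
        raise ValueError("duration_s and chunk_s must be positive")
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)  # last chunk may be shorter
        chunks.append((start, end))
        start = end
    return chunks


def render_in_chunks(duration_s, render_chunk, chunk_s=10.0, max_workers=4):
    """Render each chunk independently and return results in clip order."""
    chunks = split_into_chunks(duration_s, chunk_s)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so stitching order is preserved.
        return list(pool.map(lambda c: render_chunk(*c), chunks))


print(split_into_chunks(15))       # [(0.0, 10.0), (10.0, 15.0)]
print(len(split_into_chunks(90)))  # 9
```

With this split, a 90-second clip becomes nine jobs of the same size as everyone else's, so the largest job any worker can pick up is one 10-second chunk.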

The lesson

The queue isn't where you fix queue problems. The queue is where you see queue problems. The fix is almost always upstream, in how you cut work into pieces.

If your queue is backing up under load, the first question isn't "do I need more workers." It's "are my jobs the same size?"
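
One way to ask that question of real data, sketched under the assumption that you can export per-job processing times in seconds. Measure processing time, not input length: in our case the two did not scale together. The function name and the returned keys are made up for this example.

```python
import statistics


def job_size_spread(processing_times_s):
    """Summarize how uneven job sizes are: median, max, and their ratio."""
    if not processing_times_s:
        raise ValueError("need at least one job")
    median = statistics.median(processing_times_s)
    largest = max(processing_times_s)
    return {
        "median_s": median,
        "max_s": largest,
        "max_over_median": largest / median,
    }


print(job_size_spread([15, 15, 15, 90]))
# {'median_s': 15.0, 'max_s': 90, 'max_over_median': 6.0}
```

A ratio near 1 means the jobs are roughly the same size. A large ratio means a few jobs can hold workers far longer than the rest, which is the pattern that wedged our queue.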
