Week 9: The Bug Sweep Nobody Screenshots

Week 9 of Shipping Every Week: I didn't ship a single feature this week. Last week I switched Powr's entire video pipeline over to a rebuilt one. This week I made sure it was safe to leave on.

When you replace the thing that stores every user's form videos, the scary part isn't the switch. The switch is exciting. The scary part is the week after, when real uploads from real sessions are running through new code, and you find out which edge cases you missed — the ones that never show up in testing because you can't imagine them, only encounter them.

So I treated this week as a soak test: leave the new pipeline running under real traffic, watch what breaks, fix it, and don't move on to anything shiny until the new thing has earned the trust the old thing spent months building.

Five bugs, all the kind nobody screenshots

The coach link said "video not found." A coach would open your share link, the feature I'd just shipped as the headline of week 8, and get an error, because the viewer couldn't yet read videos stored by the new pipeline. The most visible thing I'd built was sitting on top of the thing I'd just swapped, and the seam between them leaked. Fixed.
One failed download locked you out for seven days. A single auth blip while fetching a clip would trip a cooldown, and that clip was then unavailable until the cooldown expired — a week later. A momentary network hiccup got treated like a permanent failure. Fixed.
The summary showed a black screen and told no one. When a video failed to load on the workout summary, you got a black rectangle where your set should be, and the app said nothing — to you or to me. Silent failure is the worst kind, because it doesn't even generate a complaint you can act on. Now it surfaces the failure with a retry button, and logs it, so I see it the moment it happens instead of never.
A big upload could run the server out of memory. The upload server was loading whole video files into RAM, so one large upload could exhaust a worker and leave an orphaned video behind — the exact failure mode the rebuild was meant to kill, still reachable through the old path. I capped the upload size at 300 MB and bounded the timeouts so a hung request can't pin a worker indefinitely.
A webhook was silently dropping events. This is the subtle one. The pipeline relies on webhooks from the video provider to know when an asset is ready. The handler had a deduplication step, and it was catching every error as if it meant "duplicate event, safely ignore" — including real database errors. So when a write genuinely failed, the webhook told the provider "got it, thanks," and the provider never retried. Events vanished. The fix was to distinguish a true duplicate (a unique-key collision, which is safe to ignore) from a real failure (which must be re-raised so the provider retries). Swallow the one that's safe; surface the one that isn't.
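
The webhook fix can be sketched in a few lines of Python. This is an illustration under assumptions, not Powr's actual handler: the table name `webhook_events`, the functions `record_event` and `handle_webhook`, and the use of SQLite are all hypothetical stand-ins, and it assumes the provider retries on any non-2xx response. The idea is to let the database report a true duplicate as a normal result (zero rows inserted) while every other error stays an error.

```python
import sqlite3

# Hypothetical schema: the unique key on event_id is what makes dedup safe.
SCHEMA = """
CREATE TABLE IF NOT EXISTS webhook_events (
    event_id TEXT PRIMARY KEY,
    payload  TEXT NOT NULL
)
"""


def record_event(conn: sqlite3.Connection, event_id: str, payload: str) -> bool:
    """Store one event. Returns True if stored, False if it was a true duplicate.

    Only a collision on event_id is absorbed. Any other database error
    (a different constraint, a locked or closed database) propagates.
    """
    with conn:  # commits on success, rolls back if the insert raises
        cur = conn.execute(
            "INSERT INTO webhook_events (event_id, payload) VALUES (?, ?) "
            "ON CONFLICT(event_id) DO NOTHING",
            (event_id, payload),
        )
    return cur.rowcount == 1


def handle_webhook(conn: sqlite3.Connection, event_id: str, payload: str):
    """Return (status, body) for the provider."""
    try:
        stored = record_event(conn, event_id, payload)
    except sqlite3.Error:
        # Real failure: do not acknowledge, so the provider sends it again.
        return 500, "retry"
    # Both outcomes are safe to acknowledge.
    return 200, "stored" if stored else "duplicate"
```

The buggy version described above is what you get if the `except` branch returns 200: the duplicate case and the failed-write case become indistinguishable to the provider.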

Look at that list and you'll notice none of them are features. They're all the same category of work: a swap I made last week leaking through a seam I didn't know was there. That's what a soak week is for — not building the next thing, but finding the holes in the last thing while the stakes are still low.
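
Of the five, the upload cap is the easiest to show in code. Here is a minimal sketch, assuming the server can read the request body as a stream. The post only says the size was capped at 300 MB and the timeouts bounded; reading in chunks is an assumption added here, and `copy_with_cap`, `UploadTooLarge`, and the 1 MiB chunk size are hypothetical names and values. Timeouts are not covered by this sketch.

```python
import io

MAX_UPLOAD_BYTES = 300 * 1024 * 1024  # the 300 MB cap, read here as MiB (assumption)
CHUNK_BYTES = 1024 * 1024             # hypothetical chunk size: at most ~1 MiB in RAM


class UploadTooLarge(Exception):
    """Raised as soon as the stream exceeds the cap."""


def copy_with_cap(src, dst, max_bytes=MAX_UPLOAD_BYTES, chunk_bytes=CHUNK_BYTES):
    """Copy src to dst chunk by chunk, refusing anything over max_bytes.

    Returns the number of bytes written. Memory use is bounded by
    chunk_bytes no matter how large the upload claims to be.
    """
    written = 0
    while True:
        chunk = src.read(chunk_bytes)
        if not chunk:
            return written
        written += len(chunk)
        if written > max_bytes:
            # Stop before writing the chunk that crosses the limit.
            raise UploadTooLarge(f"upload exceeds {max_bytes} bytes")
        dst.write(chunk)
```

When `UploadTooLarge` is raised, `dst` holds a partial file. The caller would need to delete it, otherwise the cap just trades one kind of orphaned video for another.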

A test for the slowest path

I also wrote a test that drives the entire program-import flow against a live server: upload a real program file, wait for it to actually parse, and check that the result comes back shaped correctly. Importing a 12–16 week program is the slowest and most fragile path in the app — it involves a long-running parse, polling, and a lot of structure to get right. Now I find out the moment it breaks, from a test, instead of from a user telling me their program never imported.
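
The core of a test like that is a polling loop with a hard deadline. Below is a rough sketch, not the actual test: `wait_for_import`, `assert_program_shape`, the `state` values, and the `weeks`/`days` fields are all hypothetical, since the post doesn't describe the import API. `get_status` stands in for whatever call asks the live server how the parse is going.

```python
import time


class ImportTimedOut(Exception):
    """The parse did not finish before the deadline."""


def wait_for_import(get_status, timeout_s=300.0, interval_s=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until the import finishes, fails, or times out.

    get_status returns a dict with a "state" key; sleep and clock are
    injectable so the loop itself can be checked without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status["state"] == "done":
            return status["program"]
        if status["state"] == "failed":
            raise RuntimeError(f"import failed: {status.get('error')}")
        if clock() >= deadline:
            raise ImportTimedOut(f"still {status['state']} after {timeout_s}s")
        sleep(interval_s)


def assert_program_shape(program):
    """Check the parsed result has the structure the app expects (assumed fields)."""
    weeks = program["weeks"]
    assert isinstance(weeks, list) and weeks, "program has no weeks"
    for week in weeks:
        assert isinstance(week["days"], list), "week is missing its days"
```

A failed parse and a parse that never finishes are reported as different errors on purpose: the first is a bug in the parser, the second is usually a stuck job.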

And the boring infrastructure win that makes all of this sustainable: development, preview, and production Powr can now live on one iPhone at the same time, each as a separate app with its own identity. That means I can test a genuinely risky build without uninstalling the real app I actually train with every day. The cost of trying something dangerous just dropped to near zero, which means I'll try more dangerous things, which is the point.

The switch I haven't flipped

The force-update switch — the one that can require every user to update before the old pipeline gets retired for good — is built, staged, and ready. I have not flipped it.

It stays off until the soak data says the new pipeline has carried real production traffic cleanly for a week straight. Building the switch was the easy part. The discipline is in not pulling it early just because I'm impatient to call the migration done. "Done" isn't when the new code ships. It's when the new code has proven, under real load, that it deserves to be the only code.

None of this is a demo. It's the unglamorous half of shipping: the week you spend earning the right to leave last week's work turned on.

What did you ship this week?
