On April 22nd 2025, for 7 minutes between 15:57 and 16:04 UTC, we received reports of stylesheets failing to load for customers using our web dashboard.
We quickly resolved the issue and, on investigation, narrowed the cause down to a bug introduced while upgrading our deployment pipelines earlier that day.
What happened?
Earlier on April 22nd, we began testing container image builds using a new pipeline that we plan to migrate to.
Whilst testing this, we continued to build and deploy containers using our old pipeline. As the old and new pipelines were running in parallel, we were inadvertently generating two copies of a container image for a single commit hash in our Google Cloud container registry.
This alone shouldn't have been a problem, given that builds should be deterministically reproducible and the duplicate build should have been a no-op.
However, we also had some (very old) CSS code which generates random numbers for animations. The intention of this code was to get some random numbers at runtime, which were used very specifically to generate a “snowstorm” at Christmas time using a particle effect.
As we’re using SCSS, our styles are compiled in a CI build step, so the random numbers were in fact being generated at build time (not runtime). As a result, it was possible for two builds of the same commit hash to generate containers with different content.
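The effect can be reproduced in miniature. The following is a minimal sketch in Python rather than SCSS, not our actual build code: `build_css`, `content_hash`, the commit hash, and the seeds are all hypothetical. It shows that baking random values into a stylesheet at build time gives two builds of the same commit different content, while a fixed seed keeps them identical.

```python
import hashlib
import random


def build_css(commit: str, rng: random.Random) -> str:
    """Simulate a stylesheet build that bakes random values into its output."""
    # Each "snowflake" rule gets a random offset, fixed at build time.
    rules = [
        f".flake-{i} {{ left: {rng.randint(0, 100)}%; }}"
        for i in range(5)
    ]
    return f"/* commit {commit} */\n" + "\n".join(rules)


def content_hash(css: str) -> str:
    """Short digest of the built stylesheet's content."""
    return hashlib.sha256(css.encode("utf-8")).hexdigest()[:12]


commit = "abc1234"  # hypothetical commit hash

# Two pipelines build the same commit, each with its own random state.
build_a = build_css(commit, random.Random(1))
build_b = build_css(commit, random.Random(2))

# The same two builds with a fixed seed, i.e. a reproducible build.
repro_a = build_css(commit, random.Random(0))
repro_b = build_css(commit, random.Random(0))

print(content_hash(build_a), content_hash(build_b))
print(content_hash(repro_a), content_hash(repro_b))
```

The first pair of digests differs even though the commit is the same; the second pair matches. If asset filenames are derived from content (an assumption in this sketch), the two unseeded builds would also disagree about what the stylesheet is called.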
The combination of CSS and dual-image building issues would have been manageable, if not for a separate problem with our index.html file. We had one process uploading static assets (CSS, JS, etc.) and a separate process uploading our index.html file, stemming from historical cache-busting needs.
Since we relied on the commit hash to find the container image to upload from, and had two different images, one process uploaded assets from image A while another process uploaded the index.html from image B.
This resulted in an index.html file that referenced a CSS file that hadn't yet been uploaded. Requests for that file returned a 404 error, so no styles were loaded.
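A mismatch like this can be detected before index.html goes live. The following is a minimal sketch of such a check, not something our pipeline is confirmed to run: `AssetRefParser`, `missing_assets`, and the file paths are hypothetical. It extracts stylesheet and script references from an index.html and reports any that are absent from the set of uploaded assets.

```python
from html.parser import HTMLParser


class AssetRefParser(HTMLParser):
    """Collects stylesheet and script URLs referenced by an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.refs: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "link" and attr_map.get("rel") == "stylesheet":
            if attr_map.get("href"):
                self.refs.append(attr_map["href"])
        elif tag == "script" and attr_map.get("src"):
            self.refs.append(attr_map["src"])


def missing_assets(index_html: str, uploaded: set[str]) -> list[str]:
    """Return referenced assets that are not in the uploaded set."""
    parser = AssetRefParser()
    parser.feed(index_html)
    return [ref for ref in parser.refs if ref not in uploaded]


# Hypothetical scenario: index.html comes from image B,
# while the static assets were uploaded from image A.
index_from_image_b = (
    '<html><head>'
    '<link rel="stylesheet" href="/assets/app.b2b2b2.css">'
    '</head><body><script src="/assets/app.js"></script></body></html>'
)
uploaded_from_image_a = {"/assets/app.a1a1a1.css", "/assets/app.js"}

missing = missing_assets(index_from_image_b, uploaded_from_image_a)
print(missing)
```

In this scenario the stylesheet path from image B is reported as missing, which is the condition that surfaced to customers as a 404.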
As soon as we realised this was a container image problem, we re-triggered a deployment from the old pipeline, which quickly fixed the issue.
We then swiftly followed up by preventing concurrent pipeline runs and removing the code that made builds non-reproducible. We're also improving our build process and asset upload pipeline to further reduce risks here.