Summary
Our web dashboard was unavailable between ~07:05 and ~07:25 UTC on May 7th 2026. All other surfaces remained fully functional: the public API (including alert ingestion), escalations, messaging integrations such as Slack and Teams, and the mobile app.
The outage was caused by a bug in our deployment pipeline.
Root cause analysis
Our web dashboard is deployed in two stages:
1. An `index.html` file that's fetched by the browser and boots the app. This is written to a Docker image and deployed to a Kubernetes cluster.
2. Multiple JavaScript bundles that implement all the behaviour in the app and are referenced by `index.html`. These are deployed to a Google Cloud Storage bucket.
The outage was caused by us publishing a Docker image whose `index.html` pointed to JS bundles that had not been uploaded to GCS. The dashboard would not load, and the browser showed an empty screen.
We tag our Docker images with the commit hash of the content within them, e.g. `core:a1b2c3`. We also include a hash of the content in static asset file names, e.g. `index.d4e5f6.js`. This makes our deployment artefacts effectively immutable: if we build the same commit multiple times, we generate the same image tag and the same static asset hash, so uploads are idempotent.
On April 24th 2026 we introduced a new feature that allowed us to detect stale JS bundles in the browser, and prompt users to refresh to use the latest version of the app. As part of this work we added the time the assets were built into the bundle, which changed the hash in the file name on every build. This broke the immutability of our deployment artefacts: if we build the same commit multiple times, the image tag remains the same, but the static asset hash changes.
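The effect can be illustrated with a minimal Python sketch. It is not the actual build tooling: the `asset_name` function, the `__BUILD_TIME__` variable and the timestamps are hypothetical, and SHA-256 truncated to six characters is only an assumed stand-in for whatever hashing the bundler uses.

```python
import hashlib
from typing import Optional


def asset_name(source: bytes, build_time: Optional[str] = None) -> str:
    """Return a content-hashed file name in the style of index.d4e5f6.js."""
    content = source
    if build_time is not None:
        # Hypothetical: the build embeds its own wall-clock time in the bundle.
        content += f'\nwindow.__BUILD_TIME__ = "{build_time}";'.encode()
    digest = hashlib.sha256(content).hexdigest()[:6]
    return f"index.{digest}.js"


source = b"console.log('dashboard');"

# Without an embedded build time, two builds of the same commit agree.
stable_a = asset_name(source)
stable_b = asset_name(source)

# With an embedded build time, the same commit yields different file names.
# The timestamps are illustrative only.
first = asset_name(source, "2026-01-01T00:00:00Z")
second = asset_name(source, "2026-01-01T00:05:00Z")

print(stable_a == stable_b)
print(first == second)
```

Any input that varies between builds of the same commit has the same effect, which is why the file name stops being a function of the commit alone.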
In this case, a scheduled CI job that runs a test suite raced with a job triggered by a change on the master branch. They built the same commit hash, and so pushed to the same Docker image tag, but each pointed to different static asset bundles. The scheduled job completed second, overwriting the image pushed by the change. We don't publish static assets for scheduled jobs, and so the most recent image pointed to static assets that were not in the GCS bucket. When Kubernetes rolled out this latest image, the dashboard app couldn't be booted, causing the outage.
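The sequence of events can be modelled with a small Python sketch. The `registry` and `bucket` structures, the `run_job` and `deployable` helpers and the asset names `index.aaaaaa.js` and `index.bbbbbb.js` are all hypothetical stand-ins for the image registry, the GCS bucket and the two CI jobs.

```python
from typing import Dict, Set

# Hypothetical model: image tag -> asset file name referenced by its index.html.
registry: Dict[str, str] = {}
# Hypothetical model: asset file names present in the bucket.
bucket: Set[str] = set()


def run_job(tag: str, asset: str, publish_assets: bool) -> None:
    """Simulate a CI job that builds an image and optionally uploads assets."""
    if publish_assets:
        bucket.add(asset)
    # Pushing to an existing tag overwrites whatever was there before.
    registry[tag] = asset


def deployable(tag: str) -> bool:
    """An image can boot the app only if its referenced asset was uploaded."""
    return registry[tag] in bucket


# The job triggered by the change on master finishes first and uploads assets.
run_job("core:a1b2c3", "index.aaaaaa.js", publish_assets=True)
ok_before = deployable("core:a1b2c3")

# The scheduled job finishes second: same tag, different asset, no upload.
run_job("core:a1b2c3", "index.bbbbbb.js", publish_assets=False)
ok_after = deployable("core:a1b2c3")

print(ok_before, ok_after)
```

The tag is the only shared key between the two jobs, so whichever job pushes last decides which asset names the deployed `index.html` refers to.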
Incident response
We were alerted to this issue by our customers and we paged our on-call responders, who quickly identified the issue and rolled back to last-known good state.
07:05: The faulty deployment completed and the dashboard became unavailable.
07:12: We raised an internal incident following reports from customers, and on-call responders were paged.
07:21: Responders established the root cause.
07:24: Responders executed a rollback to a previous known good state.
07:26: Service restored.
Learnings and follow-ups
We now tag all our Docker images by both commit hash and digest to ensure that only the exact image built alongside the uploaded static assets is deployed to our clusters.
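As a rough sketch of why a digest helps, assume that the job which uploads static assets records the digest of the image it pushed, and that deployments reference the image as `tag@digest`. The helpers below are hypothetical, and `fake_digest` only imitates the `sha256:` digest that a registry reports for a pushed image.

```python
import hashlib
import re

# Shape of a SHA-256 image digest: "sha256:" followed by 64 hex characters.
DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


def fake_digest(image_bytes: bytes) -> str:
    """Stand-in for the digest a registry would report for a pushed image."""
    return "sha256:" + hashlib.sha256(image_bytes).hexdigest()


def pinned_reference(repo_and_tag: str, digest: str) -> str:
    """Build an image reference that names one exact image, not just a tag."""
    if not DIGEST_RE.match(digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    return f"{repo_and_tag}@{digest}"


# Two builds of the same commit whose index.html points at different bundles.
published = pinned_reference(
    "core:a1b2c3", fake_digest(b"index.html -> index.aaaaaa.js")
)
scheduled = pinned_reference(
    "core:a1b2c3", fake_digest(b"index.html -> index.bbbbbb.js")
)

print(published == scheduled)
```

Both references share the tag `core:a1b2c3`, but the digests differ, so a later push to the same tag cannot silently change which image a pinned deployment runs.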
We've re-implemented the feature that allows us to detect stale JS bundles in the browser using the commit timestamp. This restores the immutability of our deployment artefacts.
Our end-to-end testing didn't catch this issue, because it only checks that `index.html` is accessible, not that the dashboard fully loads. We've updated our tests to check the static assets are accessible too, and introduced synthetic tests that run the JS to boot the app.
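A check of this kind might look like the following Python sketch. It is an assumption about the shape of such a test, not the actual test suite: the `/assets/` path is hypothetical, and the `exists` callback stands in for an HTTP request against the bucket.

```python
from html.parser import HTMLParser
from typing import Callable, List


class ScriptCollector(HTMLParser):
    """Collect the src attribute of every <script> tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def missing_assets(index_html: str, exists: Callable[[str], bool]) -> List[str]:
    """Return the script URLs referenced by index_html that are not reachable."""
    collector = ScriptCollector()
    collector.feed(index_html)
    return [src for src in collector.sources if not exists(src)]


index_html = (
    '<html><body><script src="/assets/index.d4e5f6.js"></script></body></html>'
)

# In a real test, `exists` would request each URL; here a set models the bucket.
uploaded = {"/assets/index.d4e5f6.js"}
result_ok = missing_assets(index_html, lambda src: src in uploaded)
result_bad = missing_assets(index_html, lambda src: False)

print(result_ok, result_bad)
```

A check like this only proves that the bundles can be fetched. Confirming that the app actually boots still needs a test that executes the JavaScript.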
We've updated our scheduled CI jobs to never push Docker images.
We've updated the script that generates our deployment pipelines, so that whenever we build a Docker image, we always push static assets too.