incident.io

Write-up
Intermittent issues with Google services

On Thursday 12th June, between 17:49 and 18:48 UTC, some of our customers experienced issues with Android push notifications, Google Meet integrations, and Insights. These issues were caused by widespread problems affecting Google Cloud Platform services. GCP have published a detailed incident report if you want to learn more.

What happened

Google Cloud Platform experienced elevated error rates in multiple services across their product suite. We were primarily impacted by errors from BigQuery, Cloud Tasks, and Firebase.

This impacted a subset of our product features, including:

  • Insights: customers couldn’t view insights dashboards as we couldn’t pull data from BigQuery

  • Android push notifications (Firebase): most notifications weren’t delivered (some arrived late, but the majority were lost entirely)

  • Google Meet integration: none of our incident call features worked with Google Meet, as Google Meet itself was unavailable

  • Transient errors: we observed a small number of temporary errors when customers took actions in Slack and the dashboard (these actions succeeded on a retry)

Throughout this time, the rest of our core platform remained fully available, despite significant load. To be explicit, that includes:

  • Alert ingestion

  • Paging (with the exception of push notifications for Android)

  • Status pages (publishing, viewing and notifications)

  • Declaring and responding to incidents

  • Our web dashboard and Slack application

How we responded

We quickly identified that the issues stemmed from Google’s infrastructure rather than our own systems. Our monitoring helped us understand the scope of the impact, and once we’d confirmed that all critical product services were fully operational, we focused on communicating with affected customers while waiting for Google to resolve the underlying problems.

All production services quickly returned to normal operation once Google resolved their infrastructure issues. We’ve been stable since, and there were no delayed-impact events from the outage.