We’ve now debriefed on this incident internally and have confirmed the exact impact. At 15:21 we made a change that removed the usage of an old permission from our role-based access control (RBAC) system. We believed this permission was unused, and expected the change to be a routine housekeeping task.
Having made the change, we quickly realised it was affecting a subset of customer accounts. More specifically, any admins or owners within an account that was created before 22nd February this year were unable to interact with our dashboard, mobile app or Slack app.
All customers were still able to create incidents. Status pages, alert ingestion and on-call escalations were unaffected.
We rolled back the broken change at 15:28, which resolved the issue. However, at 15:56 we inadvertently reintroduced the change due to an unrelated bug in our deployment pipeline. This was resolved and service resumed as normal at 16:00.
We’ve since implemented follow-ups around both our deployment pipeline and permission handling to prevent issues like this happening again. Additionally, we’re introducing further internal drills to improve our response time to similar issues in the future.