incident.io

Issue affecting our dashboard and Slack app
Resolved · Partial outage

We’ve now debriefed on this incident internally and have confirmed the exact impact. At 15:21 we made a change that removed the usage of an old permission from our role-based access control (RBAC) system. We believed this permission was unused, and expected the change to be a routine housekeeping task.

Having made the change, we quickly realised it was affecting a subset of customer accounts. More specifically, any admin or owner within an account created before the 22nd of February this year was unable to interact with our dashboard, mobile app or Slack app.

All customers were still able to create incidents. Status pages, alert ingestion and on-call escalations were unaffected.

We rolled back the broken change at 15:28, which resolved the issue. However, at 15:56 we inadvertently reintroduced the change due to an unrelated bug in our deployment pipeline. This was resolved and service resumed as normal at 16:00.

We’ve since implemented follow-ups around both our deployment pipeline and permission handling to prevent issues like this from happening again. Additionally, we’re running further internal drills to improve our response time to similar issues in the future.
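
As an illustration of the kind of permission-handling check that can guard a clean-up like this, here is a minimal sketch in Python. It is not our actual implementation: the `Role` type, the `roles_referencing` and `safe_to_remove` helpers, and the permission names are all hypothetical, and it assumes roles are stored as plain sets of permission strings. The idea is simply to confirm that no stored role still grants a permission before code that depends on it is removed.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    """Hypothetical stored role: one per (account, role name)."""
    account_id: str
    name: str
    permissions: set[str] = field(default_factory=set)


def roles_referencing(roles: list[Role], permission: str) -> list[Role]:
    """Return every role that still grants the given permission."""
    return [role for role in roles if permission in role.permissions]


def safe_to_remove(roles: list[Role], permission: str) -> bool:
    """A permission is only safe to remove if no role still grants it."""
    return not roles_referencing(roles, permission)


# Example data (made up): an older account still carries a legacy permission.
roles = [
    Role("acct_old", "owner", {"legacy_manage", "incidents_create"}),
    Role("acct_new", "owner", {"incidents_create"}),
]

if not safe_to_remove(roles, "legacy_manage"):
    affected = sorted({r.account_id for r in roles_referencing(roles, "legacy_manage")})
    print("Still in use by accounts:", affected)
```

Running a check like this against production data before a removal turns "we believed this was unused" into something that can be verified.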

Thu, Jul 11, 2024, 11:40 AM
Affected components
Dashboard
Mobile app
Slack app
Updates

Resolved

We're back up and running now. We'll spend some time digging into the details before providing further updates, but we're confident we understand enough of what happened to say you shouldn't expect any further interruptions.

Tue, Jul 9, 2024, 03:31 PM

Monitoring

We've identified and fixed the underlying issue, but we're continuing to monitor the situation.

Tue, Jul 9, 2024, 03:00 PM

Investigating

We're investigating an issue affecting our web dashboard and Slack application.

Status pages, alert ingestion, and on-call notifications are unaffected.

We'll provide an update in the next 15 minutes.

Tue, Jul 9, 2024, 02:52 PM