Service degradation
Incident Report for Atlassian Bitbucket
Postmortem

Summary

On Tuesday, January 9th, 2018, we experienced an incident with Bitbucket Cloud that resulted in service degradation for our users, including two hours of repository unavailability. The incident was triggered by a disk failure in our storage layer, followed by a resource-intensive, automatic repair process during the peak traffic window. No data was lost, but the repair operation slowed responses from our storage system, so client connections were queued and ultimately dropped once the connection queue limit was exceeded. Customers experienced intermittent request failures during the height of the incident, which lasted for several hours on January 9th.

Fault & detection

Disk failure is a regular occurrence in any large storage system, and a single failing disk is normally a non-event with no visible impact. In this case, it was not.

The incident began at 10:55 UTC on January 9th. Bitbucket Cloud started seeing increased response times across all services, but no alerts were triggered because the service was operating within predefined limits and without backlogs. 12:00 UTC is one of our highest-traffic periods of the day, and the increase in traffic put more stress on the storage system. We began to receive alerts for dropped requests, and users began to experience slower response times and backlogs.
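The detection gap described above, where latency climbs steadily but stays under static alert limits, can be illustrated with a minimal sketch. The thresholds, baseline, and latency samples below are invented for illustration and are not Bitbucket's actual monitoring configuration.

```python
# Hypothetical sketch: why rising latency can stay under a static alert
# threshold. All numbers here are illustrative, not Bitbucket's real config.

STATIC_THRESHOLD_MS = 2000          # alert only above 2 s

def static_alert(p99_ms):
    """Fires only when latency crosses a fixed limit."""
    return p99_ms > STATIC_THRESHOLD_MS

def relative_alert(p99_ms, baseline_ms, factor=3.0):
    """Fires when latency exceeds a multiple of its recent baseline."""
    return p99_ms > baseline_ms * factor

baseline = 150  # ms, assumed typical p99 before the fault
samples = [160, 300, 600, 900, 1200]  # latency climbing after the disk fault

static_fired = [static_alert(s) for s in samples]
relative_fired = [relative_alert(s, baseline) for s in samples]

print(static_fired)    # the static check never fires: all samples under 2000 ms
print(relative_fired)  # the baseline-relative check fires once latency triples
```

A baseline-relative check of this kind is one way monitoring can surface a slow degradation before hard limits or backlogs are hit.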

During the initial triage, common candidates such as recent deployments, database latency, and unexpected usage patterns were eliminated, leaving filesystem latency as the culprit. The investigation into our filesystem nodes revealed that a disk array rebuild process had started and was impacting service response times, leading to backlogs and load shedding.

Impact & recovery

During this incident, the bitbucket.org website and our Git and Mercurial services were largely unavailable for a period of 180 minutes on January 9th. A large percentage of client connections were queued and/or dropped during additional 90-minute periods of service degradation before and after the interruption window. The storage system repair operation consumed a large amount of I/O resources and could not be paused, as it was a system-critical operation. The net effect was that high response times from the affected part of our storage system created a pile-up of connections from all clients, leading to overall service degradation.
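The pile-up mechanism described above can be sketched with a bounded queue: when the backend slows, connections arrive faster than they drain, the queue hits its limit, and new connections are dropped. The queue limit and arrival/drain rates below are illustrative assumptions, not measured values.

```python
from collections import deque

# Minimal sketch of the connection pile-up: a bounded accept queue that
# starts dropping new connections once slow backend responses stop
# draining it. QUEUE_LIMIT and the per-tick rates are assumptions.

QUEUE_LIMIT = 4

def simulate(arrivals_per_tick, drains_per_tick, ticks):
    """Return (served, dropped) after running the queue for `ticks` steps."""
    queue = deque()
    served = dropped = 0
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            if len(queue) < QUEUE_LIMIT:
                queue.append("conn")
            else:
                dropped += 1          # queue full: connection is dropped
        for _ in range(min(drains_per_tick, len(queue))):
            queue.popleft()           # backend answers one queued connection
            served += 1
    return served, dropped

# Healthy storage: drains keep up with arrivals, nothing is dropped.
print(simulate(arrivals_per_tick=3, drains_per_tick=3, ticks=10))
# Degraded storage: slow drains fill the queue and connections are dropped.
print(simulate(arrivals_per_tick=3, drains_per_tick=1, ticks=10))
```

The same arrival rate produces zero drops or mostly drops depending only on how fast the storage layer answers, which matches the behavior seen during the incident.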

To alleviate the impact, we moved storage off the affected storage node and onto well-performing alternative nodes. As we started the migration, we prioritized moving the volumes that were used most heavily by multiple services. To improve migration throughput and reduce the number of stalled client connections piling up in the connection queue, our team temporarily disabled access to repositories that resided on the storage segment under repair. To restore service as quickly as possible to customers whose repositories were blocked, we moved those repositories to storage segments that were not in self-repair mode. Furthermore, we shed additional load off our systems by disabling non-critical services and operations.
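The prioritization step above is, at its core, a simple ordering problem: migrate the busiest volumes first so the largest share of traffic recovers earliest. The volume names and request rates below are hypothetical.

```python
# Hypothetical sketch of the migration ordering: move the most heavily
# used volumes off the degraded node first. Names and request counts
# are invented for illustration.

volumes = [
    {"name": "vol-a", "requests_per_min": 120},
    {"name": "vol-b", "requests_per_min": 9500},
    {"name": "vol-c", "requests_per_min": 480},
]

def migration_order(vols):
    """Order volumes so the busiest (highest-impact) ones migrate first."""
    return sorted(vols, key=lambda v: v["requests_per_min"], reverse=True)

plan = [v["name"] for v in migration_order(volumes)]
print(plan)  # busiest volume first
```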

The background repair operation took 5 days to complete due to the large size of the volumes and the concurrent load of client requests during this time. Ultimately, service to all customers was brought back to normal performance by moving hundreds of thousands of repositories to other storage nodes.

In parallel, we opened a support issue with our storage vendor and began making system and network configuration changes to further reduce the impact of the repair operation on our customers. Our team continued to work around the clock for several days after the initial outage to mitigate the effects of the storage system repair process and to ensure that, during peak traffic windows, the majority of customers could continue to work in Bitbucket Cloud with steadily improving performance.

In line with our company values and a spirit of transparency, we kept our users up to date with regular posts to our Statuspage and by responding to support requests that were opened via our support service desk.

What is being done to prevent this in the future?

  • While the overhead of disk cluster rebuilds has never presented a problem in the past, we have decreased the aggressiveness of these rebuilds for future faults.
  • We have moved customer data off the storage node that reacted poorly to the fault.
  • We are working with our vendor to analyze the state of the node to prevent similar scenarios in the future.
  • As part of an ongoing build-out plan, we have expanded the number of nodes in service to include newer, more powerful hardware.
  • We are implementing more fine-grained load shedding capabilities to better isolate future interruptions if other remediations fail to reduce or contain impact adequately.
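The fine-grained load shedding item above can be illustrated as a per-segment circuit breaker: requests for repositories on a degraded storage segment are rejected quickly instead of stalling in a shared queue, so healthy segments stay unaffected. The class, segment names, and error threshold below are hypothetical, not a description of Bitbucket's implementation.

```python
# Hypothetical sketch of fine-grained load shedding: fail requests for a
# degraded storage segment fast instead of letting them pile up in a
# shared connection queue. Threshold and segment names are illustrative.

class SegmentShedder:
    def __init__(self, error_threshold=5):
        self.error_threshold = error_threshold
        self.errors = {}   # segment -> consecutive slow/failed requests

    def record_error(self, segment):
        self.errors[segment] = self.errors.get(segment, 0) + 1

    def record_success(self, segment):
        self.errors[segment] = 0   # a healthy response resets the count

    def allow(self, segment):
        """Shed load only for segments at or over the error threshold."""
        return self.errors.get(segment, 0) < self.error_threshold

shedder = SegmentShedder()
for _ in range(5):
    shedder.record_error("segment-7")   # the segment under repair

print(shedder.allow("segment-7"))  # requests for the degraded segment are shed
print(shedder.allow("segment-2"))  # healthy segments are unaffected
```

Isolating failures per segment this way keeps a fault on one storage node from consuming the connection capacity of the whole service.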

Was any data lost during the incident?

No. The actions performed and the repair job itself were non-destructive in nature.

Apology

To the people affected by the outage, we apologize. We know that you rely on Bitbucket Cloud to keep your teams working and your businesses running and this incident was disruptive. Our systems and processes are designed to balance customer traffic and behind-the-scenes, automatic recovery from faults. However, in this case, we encountered a new performance impact from what would normally be a common storage system fault. We will continue to incorporate what we've learned from this event in the design and implementation of our systems and processes in the future.

Posted Jan 19, 2018 - 23:37 UTC

Resolved
We have successfully returned Bitbucket Cloud to normal response times and will no longer be providing status updates about the recent performance degradation. We want to thank all customers for your patience as we worked on resolving the issue. We know that you rely on Bitbucket to keep your teams working and your businesses running, and apologize for the disruption this has caused.

We will be publishing a post-mortem next week on this Statuspage.

If you have a specific question about your repository, you can raise an issue with our support team here: https://support.atlassian.com/contact
Posted Jan 12, 2018 - 19:13 UTC
Update
The effort to repair Bitbucket Cloud's affected storage volumes is approximately 50% complete and we are continuing to target Saturday, Jan 13 for completion of this process. We will provide status updates every 12 hours to communicate our progress.
Posted Jan 12, 2018 - 06:55 UTC
Update
The effort to repair Bitbucket Cloud's affected storage volumes is approximately 40% complete and we are continuing to target Saturday, Jan 13 for completion of this process. We will provide status updates every 12 hours to communicate our progress.
Posted Jan 11, 2018 - 18:48 UTC
Update
As communicated previously, we are continuing the maintenance process to return Bitbucket Cloud to normal performance after an issue with our storage system. There was no data loss associated with this incident.

The repair effort for the affected storage volumes is approximately 30% complete and we are continuing to target Saturday, Jan 13 for completion of this process. We will provide status updates every 12 hours to communicate our progress.

If you have a specific question about your repository, you can raise an issue with our support team here: https://support.atlassian.com/contact
Posted Jan 11, 2018 - 06:57 UTC
Update
We are running a multi-day maintenance process to rebuild our storage volumes and return to normal performance. Bitbucket Cloud services and repositories are online and available for all users, however some will experience slower than normal response times until the maintenance completes. We are continuing to make improvements throughout the day to enhance performance in the interim.

We expect Bitbucket Cloud to return to normal response times by Saturday, January 13th, 2018, and will continue to provide status updates every 12 hours to communicate our progress.

If you have a specific question about your repository, you can raise an issue with our support team here: https://support.atlassian.com/contact/#/
Posted Jan 10, 2018 - 18:54 UTC
Monitoring
Our team continues to work around the clock to recover from an incident that involved a failure in Bitbucket's storage layer. Most services and repositories are functioning normally, albeit slower than usual while the storage layer repairs itself. No data has been lost, but we are continuing to run a maintenance process to bring everything back online at full capacity.
Posted Jan 10, 2018 - 02:21 UTC
Identified
All services and all repositories are back, but overall performance of the website and SSH/HTTPS transactions will continue to be slow.
We are still in the process of mitigating the performance issue and will be able to share additional information soon.
Thank you for your patience. Next update in 1 hour.
Posted Jan 09, 2018 - 23:47 UTC
Update
All services and all repositories are back, but overall performance of the website and SSH/HTTPS transactions will continue to be slow.

We are in the process of mitigating the performance issue, but cannot currently give an accurate ETA for when this will be completed.

Thank you for your patience. Next update in three hours.
Posted Jan 09, 2018 - 20:22 UTC
Update
All services and all repositories are back, but overall performance of the website and SSH/HTTPS transactions will continue to be slow.
We are in the process of fixing the overall issue with our vendor, but cannot currently give an accurate ETA for when this will be completed.

Thank you for your patience. Next update in an hour.
Posted Jan 09, 2018 - 19:16 UTC
Update
The team is working with our storage vendor to address performance issues. Thank you for your patience. Next update in an hour.
Posted Jan 09, 2018 - 18:14 UTC
Update
The team is continuing to investigate possible network and/or storage layer issues. Thank you for your patience. Next update in an hour.
Posted Jan 09, 2018 - 16:58 UTC
Update
We are continuing to investigate and test potential fixes for this service degradation. Thank you for your patience. Next update in 60 minutes, or sooner if there is a notable change in status.
Posted Jan 09, 2018 - 15:51 UTC
Update
We are continuing to investigate the service degradation and are testing out potential fixes. Thank you for your patience; the next update will be in 60 minutes.
Posted Jan 09, 2018 - 14:50 UTC
Investigating
We are presently investigating service issues with Bitbucket resulting in degraded performance on the website, over SSH, and via Git over HTTPS. Next update in 60 minutes.
Posted Jan 09, 2018 - 13:47 UTC
This incident affected: Website, SSH, Git via HTTPS, and Mercurial via HTTPS.