Performance problems

Incident Report for Atlassian Bitbucket

Postmortem

After more than a day of monitoring we can now confirm that this issue has been resolved.

We found that a bug in the software that runs one controller of our storage infrastructure began to cause some network slowdowns for all file operations issued to that controller. We believe the bug began early in the week, but only manifested itself under high load.

On Thursday, the site was much busier than normal during our peak traffic time. We discovered some applications causing significantly higher traffic to Bitbucket's API endpoints and temporarily blocked them.

This appears to have exacerbated the problem, which then caused pages across the entire site to time out and some operations to fail.

Our storage infrastructure runs multiple controllers so we can remain highly available in the face of a failure. At approximately 1:15pm PST (21:15 UTC) we chose to perform a manual failover and take-back to allow the affected controller to reset. Once the controller was reset, the site began to return to normal.

As a result of this incident, we are working with our storage vendor to patch the bug that caused the slowdown. We believe we know how to keep our controllers from entering this state in the future.

We appreciate your support and patience while we worked out the problem.

Posted Jan 11, 2014 - 20:08 UTC

Resolved

This incident has been resolved.

Posted Jan 10, 2014 - 01:28 UTC

Monitoring

We've identified the major contributing factors to the slowdowns we've been experiencing today and have taken steps to mitigate these issues.

At this time, we do not anticipate further performance problems as a result of these events. However, we will continue to monitor the situation.

Posted Jan 09, 2014 - 23:39 UTC

Update

We are continuing to investigate and narrow down causes of the performance problems across our infrastructure. You may continue to see slowness and occasional timeouts for various pages across the site while we work on the problem.

Posted Jan 09, 2014 - 17:56 UTC

Investigating

We are experiencing performance problems with our site and Git backends. We're working on the root cause and will keep you posted on our progress.

Posted Jan 09, 2014 - 13:34 UTC

This incident affected: Website, Git via SSH, and Git via HTTPS.