Yesterday, SSH connections to Bitbucket were slow, to the point of being unusable, for about two hours. In line with our company values, we're committed to transparency, and we want to share more details with you about this incident. Reliability of the service is always a top priority for us, and we will continue to learn from this incident and make the necessary improvements over the coming weeks. We know that you and your team depend on Bitbucket being available all the time, and when it's not, it affects your ability to ship software to your customers. We apologize for any disruption this may have caused for your team.
What caused the incident?
At 14:05 UTC on Monday, Oct 17th, 2016, we made a network change to our global edge that was thought to be low-risk. The intention of the change was to set a lower MTU (maximum transmission unit) across our public-facing switches; however, the MTU on the load balancers behind those switches had not been lowered to match. This mismatch increased packet fragmentation rates, which unintentionally disrupted most SSH sessions to Bitbucket for customers pushing or pulling with Git or Mercurial. The Bitbucket website, API, and Git and Mercurial transactions over HTTPS were unaffected.
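To illustrate why an MTU mismatch matters, here is a minimal sketch of standard IPv4 fragmentation arithmetic. The MTU values below are hypothetical examples for illustration only, not the actual values involved in the incident: a load balancer still emitting full 1500-byte packets toward an edge link whose MTU has been lowered to 1400 forces every full-sized packet to be split in two.

```python
# Sketch of IPv4 fragmentation arithmetic (RFC 791 rules).
# MTU values below are hypothetical, not the actual incident values.
import math

IP_HEADER = 20  # bytes, IPv4 header without options


def fragment_count(packet_size: int, mtu: int) -> int:
    """Number of IPv4 fragments needed to carry a packet of
    `packet_size` bytes across a link with the given `mtu`."""
    if packet_size <= mtu:
        return 1  # fits in a single frame, no fragmentation
    data = packet_size - IP_HEADER
    # Fragment offsets are measured in 8-byte units, so each
    # non-final fragment's payload is rounded down to a multiple of 8.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil(data / per_fragment)


# A full-sized 1500-byte packet crosses a 1500-byte MTU link intact...
print(fragment_count(1500, 1500))  # 1
# ...but splits in two when the downstream MTU drops to 1400, doubling
# the packet rate and the chance a lost fragment stalls an SSH session.
print(fragment_count(1500, 1400))  # 2
```

Roughly doubling the number of packets per full-sized segment, plus the cost of reassembly and the fact that losing any one fragment loses the whole packet, is enough to make long-lived SSH sessions crawl.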
Our monitoring tools alerted us to the problem within minutes, and our team responded by triggering our incident management process. As always, we updated our status page, automatically sending notifications to subscribers. Given the distributed nature of our team, diagnosing the problem took longer than expected as our engineers tried a number of steps to restore service as quickly as possible.
At 15:58 UTC, the networking change was rolled back, and service was restored. We confirmed the fix and then returned our status page to normal, thus sending the "all clear" notification to our subscribers.
What are we doing to avoid any such incidents in the future?
As I'm sure many of you reading this can understand, there was some confusion in isolating the particular networking change that caused the incident, and we took an unacceptably long time to identify the root cause and recover. This will be an area of investigation moving forward. We are also conducting a thorough internal review to understand how this incident occurred and what we can do to prevent it from happening again. We're focusing primarily on strengthening our cross-team communication and on identifying better testing mechanisms, so that similar changes are more thoroughly tested before being made to our infrastructure.