Facebook Explains How the Network Crashed and How It Was Brought Back Online

FILE PHOTO: A 3D-printed Facebook logo is seen placed on a keyboard in this illustration taken March 25, 2020. REUTERS/Dado Ruvic/Illustration/File Photo

Thunder Bay – TECHNOLOGY – When Facebook, WhatsApp and Instagram went down, there was a major interruption not only to site users, but also to businesses advertising on Facebook.

Facebook was down for six hours.

Facebook has now released its Root Cause Analysis explaining exactly what happened and how the network was repaired.

Root Cause Analysis

Summary of issue:

This incident, on October 4, 2021, impacted Facebook’s backbone network. This resulted in disruption across all Facebook systems and products globally, including Workplace from Facebook.  This incident was an internal issue and there were no malicious third parties or bad actors involved in causing the incident. Our investigation shows no impact to user data confidentiality or integrity.

The underlying cause of the outage also impacted many internal systems, making it harder to diagnose and resolve the issue quickly.

Cause of issue:

This outage was triggered by the system that manages our global backbone network capacity. The backbone is the network Facebook has built to connect all our computing facilities together, which consists of tens of thousands of miles of fiber-optic cables crossing the globe and linking all our data centers.

During a routine maintenance job, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.
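The RCA does not describe how the audit tool works or what the bug was, but the idea of vetting maintenance commands before they run can be sketched as follows. Everything here — the function name, the pattern list, the commands — is a hypothetical illustration, not Facebook's actual tooling.

```python
import re

# Hypothetical sketch of a pre-execution command audit. The patterns and
# command strings are illustrative assumptions, not Facebook's real tooling.

# Commands that should never run against the entire global backbone at once.
DANGEROUS_PATTERNS = [
    r"\bwithdraw\b.*\ball\b",       # withdrawing all routes
    r"\bshutdown\b.*\bbackbone\b",  # taking the whole backbone down
]

def audit_command(command: str) -> bool:
    """Return True if the command is judged safe to execute.

    In the incident, a bug in logic like this let a capacity-assessment
    command through that in fact tore down every backbone connection.
    """
    for pattern in DANGEROUS_PATTERNS:
        if re.search(pattern, command, flags=re.IGNORECASE):
            return False  # block: would affect global capacity
    return True

# A read-only capacity check passes the audit...
assert audit_command("show backbone capacity summary")
# ...while a globally destructive command is blocked.
assert not audit_command("withdraw all backbone routes")
```

The failure mode described in the RCA is precisely that such a gate existed but, due to a bug, did not stop the fatal command.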

This change completely severed the connections between our data centers and the internet. And that total loss of connection caused a second issue that made things worse.

One of the jobs performed by our smaller facilities is to respond to DNS queries. Those queries are answered by our authoritative name servers that occupy well known IP addresses themselves, which in turn are advertised to the rest of the internet via another protocol called the Border Gateway Protocol (BGP).

To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves cannot speak to our data centers, since this is an indication of an unhealthy network connection. In the recent outage the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers.
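The health-check logic described above can be sketched in a few lines. This is a simplified assumption of how such a mechanism might look — the function names are invented, and the prefix shown is a documentation placeholder (198.51.100.0/24), not a real Facebook prefix.

```python
# Hypothetical sketch of the DNS edge-site health check described in the RCA.
# Function names and the example prefix are assumptions for illustration.

def datacenter_reachable(backbone_links_up: int) -> bool:
    """An edge site considers itself healthy only if it can still reach
    at least one data center over the backbone."""
    return backbone_links_up > 0

def bgp_advertisements(backbone_links_up: int) -> list[str]:
    """Advertise the DNS server prefix only while healthy; otherwise
    withdraw it so clients are steered toward healthier sites."""
    if datacenter_reachable(backbone_links_up):
        return ["advertise 198.51.100.0/24"]  # placeholder DNS prefix
    # Withdrawal: the DNS servers keep running, but no route to them
    # is announced, so the rest of the internet cannot find them.
    return []

# Normal operation: the prefix is advertised.
assert bgp_advertisements(backbone_links_up=8) == ["advertise 198.51.100.0/24"]
# Backbone fully down, as in the outage: every site withdraws at once.
assert bgp_advertisements(backbone_links_up=0) == []
```

The design is sound for a partial failure (one unhealthy site withdraws, others take over), but when the entire backbone went down, every site withdrew simultaneously and DNS vanished globally.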

Workplace timeline:

This incident related to a network outage that was experienced globally across Facebook services and included Workplace. The outage was live for around 6 hours, from approximately 16:40 – 23:30 BST.

Steps to mitigate:

The nature of the outage meant it was not possible to access our data centers through our normal means because the networks were down, and the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this.

Our primary and out-of-band network access was down, so we sent engineers onsite to the data centers to have them debug the issue and restart the systems. But this took time, because these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online.

Once our backbone network connectivity was restored across our data center regions, everything came back up with it. But the problem was not over — we knew that flipping our services back on all at once could potentially cause a new round of crashes due to a surge in traffic. Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk.
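Restoring tens of megawatts of load in one jump is what the team wanted to avoid; a staged ramp-up is the usual mitigation. The sketch below is a hypothetical illustration of that idea — the schedule shape and the 40 MW figure are assumptions, not details from the RCA.

```python
# Hypothetical sketch of a staged load-restoration schedule. The step
# count and megawatt figures are illustrative, not from Facebook's RCA.

def ramp_schedule(total_load_mw: float, steps: int) -> list[float]:
    """Split load restoration into equal increments so electrical
    systems and cold caches warm up gradually, not all at once."""
    increment = total_load_mw / steps
    return [round(increment * i, 2) for i in range(1, steps + 1)]

# Bring back 40 MW of data-center load in five stages of 8 MW each,
# rather than reversing the entire power dip instantly.
print(ramp_schedule(total_load_mw=40.0, steps=5))
# → [8.0, 16.0, 24.0, 32.0, 40.0]
```

This mirrors the "storm" drills mentioned later in the RCA: practicing how to reintroduce traffic without triggering a second, load-induced failure.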

In the end, our services came back up relatively quickly without any further systemwide failures.

Prevention of recurrence:

We’ve done extensive work hardening our systems to prevent unauthorized access, and ultimately it was this hardening that slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making. It is our belief that a tradeoff like this is worth it — greatly increased day-to-day security vs. a slower recovery from a rare event like this.

However, we’ll also be looking for ways to simulate events like this moving forward to ensure better preparedness, and we will take every measure to strengthen our testing, drills, and overall resilience so that events like this happen as rarely as possible.