At the start of October 2021 there was a disturbance in the Force, as if millions of voices suddenly cried out in terror and were suddenly silenced. This partial quote from Star Wars obviously refers to the massive outage that hit Facebook, WhatsApp and Instagram.
As many users of the affected services sought other means to communicate and alleviate their social media addiction, competitors were quick to jump on the troubles of the internet giant. Twitter welcomed a surge in traffic by simply tweeting: “hello literally everyone”. Other brands and celebrities, ranging from McDonald’s to British singer Adele, were quick to reply to the thread. You can find the simple but very popular tweet here.
While regular users had no means to understand what was going on, the more tech-savvy among us were left wondering how this could have happened. Seeing a company that relies entirely on the internet suffer such an extreme outage made many technologists suspect that Facebook had been the target of a very successful cyber attack.
While Facebook engineers were working hard to restore their services, they also posted an article on their engineering blog. In the article, they state that the outage was not due to malicious intent, but due to a faulty configuration change. This configuration change caused Facebook, for all intents and purposes, to simply disappear from the internet. Much like it had been hit by the laser of the Death Star.
Fortunately, unlike in the Star Wars analogy, Facebook's services could be restored. Still, it leaves many wondering how such a big platform could go up in smoke for several hours.
How the internet works
To understand what happened from a technical point of view, we need to dive a bit into how the internet works. In networking, if you want to go from computer A to computer B, the traffic needs to be routed along the right path. Within your home network this is relatively simple: my laptop is in the same network as, for example, my printer, so no real routing needs to be done. Even if I add an extra network, maintaining the routing information manually is a trivial task. There are usually just a few rules and this information hardly changes.
My laptop also has a default route. Everything it does not know will be routed to my, can you guess it, router. And from my router the traffic will probably go to a bigger router at my ISP.
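To make the idea of a default route concrete, here is a minimal sketch of how a laptop might pick the next hop for a destination. The addresses, the `ROUTES` table and the `next_hop` function are assumptions made up for illustration; a real operating system keeps a richer routing table, but the principle of "most specific match wins, otherwise use the default" is the same.

```python
import ipaddress

# Hypothetical routing table for a laptop on a home network.
# Each entry maps a destination network to a next hop;
# None means the destination is directly reachable.
ROUTES = [
    (ipaddress.ip_network("192.168.1.0/24"), None),      # local network (laptop, printer)
    (ipaddress.ip_network("0.0.0.0/0"), "192.168.1.1"),  # default route: the home router
]

def next_hop(destination: str) -> str:
    """Return the next hop for a destination using longest-prefix match."""
    address = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in ROUTES if address in net]
    # The most specific (longest prefix) matching route wins.
    net, hop = max(matches, key=lambda entry: entry[0].prefixlen)
    return "direct" if hop is None else hop

print(next_hop("192.168.1.50"))  # the printer on the same network -> direct
print(next_hop("203.0.113.10"))  # anything unknown -> 192.168.1.1
```

Because `0.0.0.0/0` matches every IPv4 address, the lookup always finds at least the default route, which is exactly why everything the laptop does not know ends up at the router.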
So far the routing information is relatively simple. But if we go onto the wider internet things become more complicated. The internet is a big bunch of connected networks. Such networks are called autonomous systems (AS) and to be able to reach one network from the other they need to tell each other how to do so.
So in a nutshell, if my computer wants to reach Facebook, the network of my ISP needs to know how to reach the network of Facebook. This can be direct, or through another network.
To keep this routing information between networks up to date, an AS uses the Border Gateway Protocol (BGP). Simply put, this protocol advertises routing information for the addresses within that AS. Each AS connected to the originating AS propagates that information to other networks, thus updating the routing information across the wider internet.
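The propagation described above can be sketched as a toy simulation. This is not how BGP is implemented; it only illustrates the idea that an advertisement travels from neighbour to neighbour and that each network remembers a path back to the origin. The three networks, their AS numbers (taken from the range reserved for documentation) and the prefix are all hypothetical.

```python
from collections import deque

# Hypothetical topology: which AS is directly connected to which.
LINKS = {
    "AS64500": ["AS64501"],             # the originating network
    "AS64501": ["AS64500", "AS64502"],  # a network in between
    "AS64502": ["AS64501"],             # for example the ISP of a home user
}

def propagate(origin: str, prefix: str) -> dict:
    """Spread an advertisement for `prefix` from `origin` to all connected networks.

    Returns, per AS, the path of networks to traverse to reach the prefix.
    """
    tables = {origin: {prefix: [origin]}}
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        path = tables[current][prefix]
        for neighbour in LINKS[current]:
            # A neighbour that already knows a route keeps it; this also prevents loops.
            if prefix in tables.get(neighbour, {}):
                continue
            tables.setdefault(neighbour, {})[prefix] = [neighbour] + path
            queue.append(neighbour)
    return tables

tables = propagate("AS64500", "198.51.100.0/24")
print(tables["AS64502"]["198.51.100.0/24"])  # ['AS64502', 'AS64501', 'AS64500']
```

In this sketch the ISP `AS64502` is not directly connected to the origin, yet it still learns that the prefix can be reached through `AS64501`.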
What broke the Internet?
As Facebook wrote in their article, a misconfiguration was the cause of their outage. Again simply put, they stopped advertising how to reach the IP-addresses within their Autonomous System.
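A minimal sketch of what this looks like from the outside, assuming a simplified routing table at some other network. The table, the prefix and the AS numbers are invented for illustration and are not Facebook's actual values. Once the route is withdrawn and no other route matches, there is simply nowhere to send the traffic.

```python
import ipaddress

# Hypothetical view of another network's routing table:
# prefix -> path of networks towards the origin.
isp_routes = {
    ipaddress.ip_network("198.51.100.0/24"): ["AS64501", "AS64500"],
}

def lookup(routes: dict, destination: str):
    """Return the path for a destination, or None when no route matches."""
    address = ipaddress.ip_address(destination)
    candidates = [net for net in routes if address in net]
    if not candidates:
        return None  # the destination is unreachable from here
    best = max(candidates, key=lambda net: net.prefixlen)
    return routes[best]

def withdraw(routes: dict, prefix: str) -> None:
    """Remove a route, as happens when the origin stops advertising it."""
    routes.pop(ipaddress.ip_network(prefix), None)

print(lookup(isp_routes, "198.51.100.7"))  # ['AS64501', 'AS64500']
withdraw(isp_routes, "198.51.100.0/24")
print(lookup(isp_routes, "198.51.100.7"))  # None
```

Nothing is wrong with the servers behind the address in this sketch. They are still running, but the rest of the internet no longer knows how to get there.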
Issues like this have happened before. Some of them were accidents, but there have also been malicious attempts. The reason this is possible is that the connections between Autonomous Systems, and the propagation of routing information across them, are largely based on trust. Any owner of an AS can simply start advertising routing information for addresses they do not own. Now most ISPs and large companies have no malicious intent, and being a bad actor on that level could quickly get your entire network disconnected. However, for short periods of time this could cause major outages or security breaches.
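To illustrate why trust matters, here is a small sketch of a router that accepts every announcement it hears without checking who owns the addresses. The prefixes and AS numbers are hypothetical, and the `accept` and `origin_for` helpers are made up for this example. It relies on the general rule that the most specific matching prefix wins.

```python
import ipaddress

routes = {}  # prefix -> originating AS, as believed by one hypothetical router

def accept(prefix: str, origin: str) -> None:
    """Accept an announcement on trust: no check that `origin` owns `prefix`."""
    routes[ipaddress.ip_network(prefix)] = origin

def origin_for(destination: str):
    """Return the AS this router would send traffic to, or None if unknown."""
    address = ipaddress.ip_address(destination)
    candidates = [net for net in routes if address in net]
    if not candidates:
        return None
    return routes[max(candidates, key=lambda net: net.prefixlen)]

accept("198.51.100.0/24", "AS64500")  # the legitimate owner
before = origin_for("198.51.100.7")

accept("198.51.100.0/25", "AS64510")  # another AS announces a more specific prefix
after = origin_for("198.51.100.7")

print(before, after)  # AS64500 AS64510
```

After the second announcement, traffic for `198.51.100.7` flows to a network that does not own the address, whether that announcement was a typo or an attack.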
Are we doomed?
If trust rules the internet, are we doomed to see accidents and malicious actors cause such problems in the future as well? From a security perspective, there are initiatives to prevent bad actors from advertising malicious routing information. These measures, however, have not been fully rolled out across the internet, and with over 64,000 Autonomous Systems this could still take a while.
As for the accidents such as the Facebook outage? We can only hope that better processes for configuration changes are put in place.
About Mark van der Walle
Mark is an experienced software architect with more than 15 years of professional experience in software development and operations. Throughout those years Mark has always had a drive for building and designing reliable and simple solutions to complex problems. To realize this, Mark has a strong focus on quality backed by solid engineering, CI/CD pipelines, DevOps principles and craftsmanship. At his customers, Mark guides development teams and supports business stakeholders in building cloud native applications and going through cloud native transitions.
More on Mark van der Walle.