Facebook completely disappeared from the net for several hours. At least, that is how Internet users saw it, and they weren’t entirely wrong. The technical details explain how the social network and its subsidiaries WhatsApp, Messenger and Instagram seemed to leave the Internet before returning.
It was in a very short message, without frills or explanations, that Mark Zuckerberg announced what many had already noticed on the night of October 4 to 5: “Facebook, Instagram, WhatsApp and Messenger are back online,” declared the founder and chief executive of the social network, after a historic failure of the entire Facebook ecosystem that lasted several hours.
The message came with an apology, at a time when the general context is unfavorable for the company (a few days earlier, a whistleblower had accused it of putting profit ahead of the safety of its members): “We apologize for today’s disruption. I know how much you rely on our services to stay in touch with the people who are dear to you.”
Those who were on Twitter on the evening of October 4 could see the first leads circulating on the technical origin of the failure. The event was considerable: for several hours, Facebook had completely disappeared from the Internet. Admittedly, it is only one site among many on the web, but it is hardly comparable to a neighbor’s small personal page.
Several explanatory threads, some rather technical, others more accessible, flourished in the hours that followed, in both English and French. Among them were the very educational messages from journalist Alex Hern of the Guardian, as well as detailed explanations from Renaud Guerin and Cecile Morange, two network specialists.
An internal failure at Facebook, which had external repercussions
The theory of a large-scale computer attack carried out by state-sponsored hackers, raised a few times on the microblogging site either to sow discord or out of ignorance, was quickly dismissed. The source of the incident is actually much more mundane, even if, at this scale, its effects on the daily lives of hundreds of millions of people around the world were anything but.
An October 4 post on the engineering blog by Santosh Janardhan, Facebook’s vice president for infrastructure, provides several crucial insights into the situation. It appears that “the root cause of this failure is a faulty configuration change,” which then spread throughout the entire Facebook ecosystem.
Facebook can nevertheless find one reason for satisfaction in its misfortune: as far as is known, no personal data belonging to members of the social network was compromised. (By coincidence, an alleged leak said to affect 1.5 billion profiles surfaced at the same time, but it is highly dubious and appears to be nothing more than a scam attempt.)
According to Santosh Janardhan, “configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication.” It was this disruption in traffic that then led to the chain reaction the web witnessed on October 4, with the shutdown of all the American company’s services.
Internet users saw this cascading collapse of the Facebook ecosystem, but its scale also had unintended consequences inside the company, to the point of slowing the resolution of the problem. Santosh Janardhan admits as much: “many internal tools and systems” used on a daily basis were also affected, “which complicated our attempts to quickly diagnose and resolve the problem.”
New York Times reporter Sheera Frenkel gave an illustration of the extent of the disaster: “I was on the phone with someone who works for Facebook who described employees unable to enter buildings this morning to begin assessing the extent of the outage because their badges were not working to access the doors.”
For the teams in charge of infrastructure management, this exceptional situation meant that certain operations could not be carried out remotely. According to the New York Times, quoted by BNO News, the social network had to dispatch a team to its data centers in California to try to manually reset its servers.
Facebook was temporarily unreachable via the Internet
Santosh Janardhan’s blog post shed some light on what happened internally at Facebook. Another way to understand Facebook’s disappearance from the net is to look at a visualization by Steve Weis, a computer security engineer, which shows Facebook’s ASN (Autonomous System Number) gradually disappearing.
“Visualization of Facebook withdrawing its ASN, made with https://t.co/REvbPepOHK and Yakety Sax.” (Steve Weis, @sweis, October 4, 2021)
It must be remembered that the Internet is a network of networks. That is its nickname, and the word Internet itself is a contraction of internetwork, that is, a system of interconnected networks. In short, the Internet links various networks together all over the world. And every network is made up of computers that “serve” information, hence the name servers.
Each of these networks, known as an autonomous system (AS), has its own number (ASN). Facebook’s is AS 32934. As explained by Cloudflare, a company that provides content delivery services, each autonomous system follows its own unified internal routing policy. The question then remains: how do all these ASes talk to each other and exchange data? In short, how do they connect?
That takes a special mechanism called BGP (Border Gateway Protocol). This protocol is used to exchange routing information (in a way, the path to take across the network) between autonomous systems. BGP, Cloudflare points out, allows a network like Facebook to advertise its presence to other networks. In short, BGP is the glue that holds the Internet together.
“Each ASN must advertise its routes to the Internet using BGP, otherwise no one will know how to connect and where to find [this or that network],” and therefore this or that site, writes Cloudflare. Put plainly, “if Facebook does not announce its presence, ISPs and other networks cannot find Facebook’s network and it is therefore unavailable,” as happened on October 4.
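To make the announce-and-withdraw idea concrete, here is a minimal sketch in Python. It is a toy model, not real BGP: the `RoutingTable` class and its method names are invented for illustration, and the prefix `203.0.113.0/24` is a range reserved for documentation, not an actual Facebook prefix. Only the AS number 32934 comes from the article.

```python
import ipaddress


class RoutingTable:
    """Toy view of the routes one network has learned from its neighbors."""

    def __init__(self):
        # Maps an announced prefix to the AS number that originated it.
        self.routes = {}

    def announce(self, prefix, origin_as):
        """Record that origin_as says it can be reached through prefix."""
        self.routes[ipaddress.ip_network(prefix)] = origin_as

    def withdraw(self, prefix):
        """Forget a prefix, as neighbors do when a route is withdrawn."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin AS of the most specific matching prefix, or None."""
        ip = ipaddress.ip_address(address)
        matches = [net for net in self.routes if ip in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


FACEBOOK_AS = 32934                # AS number cited in the article
EXAMPLE_PREFIX = "203.0.113.0/24"  # documentation range, purely illustrative

table = RoutingTable()
table.announce(EXAMPLE_PREFIX, FACEBOOK_AS)
before = table.lookup("203.0.113.10")  # a route exists: traffic knows where to go

table.withdraw(EXAMPLE_PREFIX)
after = table.lookup("203.0.113.10")   # no route left: the address is unreachable

print(before, after)
```

Once the prefix is withdrawn, the lookup returns nothing: from the point of view of the other networks, the servers still exist but there is no longer any known path to them.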
The cascade of effects does not end there.
An internal configuration problem, then BGP, then DNS, then access for Internet users
The internal problem was caused by the configuration error in the Facebook backbone (the interconnection of fiber optic networks that carries traffic, as Networks & Telecoms points out), whose routers, mentioned by Mr. Janardhan, coordinate data and exchanges between all of the social network’s infrastructure. After that problem and its impact on BGP came a DNS issue.
This is what Stéphane Bortzmeyer, R&D engineer at AFNIC, the organization that manages France’s top-level domain (“.fr”), points out on his blog. The BGP failure caused the DNS (Domain Name System) to malfunction. This mechanism maps domain names to IP addresses.
DNS is essential because IP addresses cannot be dispensed with: they serve as license plates that distinguish servers from one another, and each server must have, to put it simply, a unique number. But it is easier for a person to remember an address made up of characters, like www.numerama.com, than the IP address of the site’s server.
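As a rough illustration of this name-to-address mapping, the sketch below asks the operating system’s resolver for the addresses behind a name, using Python’s standard `socket.getaddrinfo`. The helper name `resolve` is an invention for this example, and `localhost` is used only so that the snippet runs without network access; a public name such as www.numerama.com could be substituted.

```python
import socket


def resolve(hostname):
    """Return the IP addresses the local resolver reports for hostname.

    An empty list means the name could not be resolved.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Resolution failed: unknown name, or no DNS server answered.
        return []
    # Each entry ends with a socket address whose first field is the IP.
    return sorted({info[4][0] for info in infos})


addresses = resolve("localhost")
print(addresses)
```

When the servers responsible for a domain cannot be reached, this kind of lookup fails and the function returns an empty list, which is roughly what applications experienced for Facebook’s domains during the outage.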
“It should be remembered that domain names and the DNS protocol are critical for the functioning of the Internet. Most activity on the Internet begins with a DNS query. If it doesn’t work, next to nothing is possible,” recalls this network specialist. Yet the authoritative DNS servers for Facebook were also unreachable during the outage.
These authoritative servers store the data that indicates which IP address a given server name is attached to, and thus make DNS resolution possible (you type an address into your web browser and, after a succession of steps, you reach the site hosted on a server). And since these servers were also in the same Facebook AS…
“A routing problem in the AS […] can also prevent DNS servers from responding. Putting all your eggs in one basket is definitely a bad practice,” observes Stéphane Bortzmeyer. Facebook has thus experienced, despite itself, the consequences of a single point of failure, that is, a component whose failure can bring down an entire network.
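The “eggs in one basket” remark can be sketched in a few lines of Python. This is a deliberately simplified model under stated assumptions: the function `can_resolve` is hypothetical, each authoritative server is reduced to the AS that hosts it, and `OTHER_AS` is a made-up second network using an AS number reserved for documentation. It describes neither Facebook’s actual setup nor any planned change.

```python
def can_resolve(nameserver_asns, reachable_asns):
    """A domain resolves if at least one authoritative server sits in a reachable AS."""
    return any(asn in reachable_asns for asn in nameserver_asns)


FACEBOOK_AS = 32934  # AS number cited in the article
OTHER_AS = 64500     # hypothetical second network (documentation AS number)

# During the outage, Facebook's AS was no longer announced to the Internet.
reachable = {OTHER_AS}

# All authoritative servers hosted inside the same AS: one failure takes them all out.
same_basket = can_resolve([FACEBOOK_AS, FACEBOOK_AS], reachable)

# Servers spread over two independent networks: names can still be resolved.
spread_out = can_resolve([FACEBOOK_AS, OTHER_AS], reachable)

print(same_basket, spread_out)  # False True
```

The model only covers name resolution. Even with DNS servers hosted elsewhere, the sites themselves would have stayed unreachable while the routes were withdrawn, but working DNS would at least have allowed resolvers to get a clean answer.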
Facebook, single point of failure for Facebook
There is no doubt that Facebook will learn lessons from this gigantic blackout. It remains to be seen which ones, but not putting all your eggs in one basket is still wise advice. “We are working to better understand what happened today in order to continue to make our infrastructure more resilient,” promised Santosh Janardhan.
Facebook could thus be encouraged to review its organization somewhat. This is what journalist Alex Hern points out, not without humor: “Facebook does everything through Facebook. So when its servers were cut off from the Internet, the option to send the follow-up message [to make up for the mistake, editor’s note] was also cut. And the ability to log into the system that would send the follow-up message.”
“And the ability to use the smart card lock on the front door of the building that contains the servers that control the system that sends the follow-up message and the messaging service you use to contact the head of physical security to tell him that he has to go to the data center in the east with a physical key to bypass the smart card lock on the front door…”
For websites, this case may also prompt a rethink of Facebook Login, a feature that lets users sign in to a website or an application with their Facebook account. Admittedly, the tool has generally worked well so far, but it too went down temporarily during the social network’s global outage, illustrating what such dependence can entail.
The case also indirectly highlighted the considerable weight that Facebook has taken on the web, and in the collective imagination. We saw a flurry of humorous tweets claiming that the web had gone down because Facebook was offline, or even that the entire Internet was down. As if the web were all about Facebook, or Facebook were the whole Internet.