An outage took down Facebook, along with its messaging platform WhatsApp and its photo-sharing app Instagram, knocking out every corner of Mark Zuckerberg’s empire on Monday.
The social media blackout started just before noon ET (9:30 pm IST) and lasted nearly six hours before it was resolved. This is the worst outage for Facebook since a 2019 downtime that took its site offline for more than 24 hours, and it hit hardest the small businesses and creators who rely on these services for their income.
As per reports, this time not only were Facebook’s primary platforms down, but so were some of its internal applications, including the company’s own email system. Users on Twitter and Reddit have also said that employees at the company’s Menlo Park, California, campus were unable to access offices and conference rooms that required a security badge.
The problem is apparently DNS
Facebook itself has not confirmed the root cause of its woes, but according to reports the company’s family of apps effectively fell off the face of the internet when its Domain Name System (DNS) records became unreachable. Hence the problem is apparently DNS. Often referred to as the internet’s phone book, DNS is what translates the host names you type into a browser’s address bar (like facebook.com) into the IP addresses where those sites live.
DNS mishaps are common, though. They can happen for all kinds of technical reasons, often related to configuration issues, and can be relatively straightforward to resolve. In this case, something more serious appears to be afoot. Cloudflare senior vice president Dane Knecht notes that Facebook’s Border Gateway Protocol (BGP) routes, which help networks pick the best path to deliver internet traffic, were suddenly “withdrawn from the internet.” While some have speculated about hackers, or an internal protest over last night’s whistleblower report, there isn’t any information yet to suggest anything malicious is to blame.
If DNS is the internet’s phone book, BGP is its navigation system; it decides what route data takes as it travels the information superhighway.
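To make the phone-book lookup concrete, here is a minimal Python sketch of resolving a host name to its IP addresses. It is an illustration only, not a description of Facebook’s systems: the helper name `resolve` and its injectable `resolver` parameter are assumptions made for this example, while `socket.getaddrinfo` is the standard-library call that asks the operating system’s resolver.

```python
import socket


def resolve(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses for hostname.

    Returns an empty list when the name cannot be resolved.
    `resolver` is a hypothetical hook so the lookup can be swapped out in tests.
    """
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        # The resolver could not find an answer for this name.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the IP address.
    return sorted({entry[4][0] for entry in results})
```

Calling `resolve("facebook.com")` on a healthy network returns a list of addresses. When a domain’s DNS records are unreachable, as reported in this outage, a lookup like this would be expected to come back empty even though the servers behind those addresses may still be running.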
Nonetheless, as in any security threat scenario, there are takeaways for technology leaders from the Facebook glitches and other recent outages.
Go for regular disaster checkup, planning
While system failures are common and understandable, as the head of technology, it is your responsibility to be proactive about disaster planning, checkup and evaluation. If you’re a CIO or CTO responsible for maintaining email service for 1,000 employees, your disaster plan will look different from that of a technical team that services 500,000 external customers. Therefore, it is important to understand how outages will impact different areas of your business. Knowing the mitigation costs, as well as the costs of backups and standby systems, is an essential part of disaster planning.
As a tech leader, you should also mark “mock failures” on your calendar and tell everyone involved what their responsibilities are during the simulated outage. Take the opportunity to engage all stakeholders without the pressure of a real outage.
Paying attention to incident response planning
Any company can get compromised, even one with a huge security team working to protect it.
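The cost comparison can be sketched as simple arithmetic. The Python example below is a rough illustration under stated assumptions: the function names, the idea of a standby system avoiding a fixed fraction of downtime, and every figure in the usage note are hypothetical and do not come from any company mentioned here.

```python
def expected_outage_cost(hours_down_per_year, cost_per_hour):
    """Estimated yearly cost of downtime: hours lost times cost per hour."""
    return hours_down_per_year * cost_per_hour


def standby_pays_off(hours_down_per_year, cost_per_hour,
                     standby_cost_per_year, fraction_avoided):
    """True if the downtime cost a standby system avoids exceeds what it costs.

    `fraction_avoided` is the assumed share of downtime (0.0 to 1.0)
    that the standby system would prevent.
    """
    avoided = expected_outage_cost(hours_down_per_year, cost_per_hour) * fraction_avoided
    return avoided > standby_cost_per_year
```

For instance, with a hypothetical 6 hours of downtime a year at 10,000 per hour, a standby system that avoids 80 percent of that downtime is worth paying for at 40,000 a year but not at 50,000. Real planning would use measured figures for each business area.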
Partha Sengupta, Vice President-IT Shared Services at ITC, mentions that incident response planning will define a company’s survival after a breach and is therefore of prime importance. “It is vital how fast an organization recovers from an attack,” he says, adding that the CIO (in some firms the CISO) is accountable for responding from a technology perspective. “Therefore, they are going to be strong constituents and strong collaborative partners with others in the C-suite before a disaster strikes and also when an incident occurs.”
Communication is the key
“When in doubt, ‘communicate’ it out is the mantra for CIO/CISOs during an outage. Instead of simply fixing the issue during an outage, it is advisable to communicate the matter to the other stakeholders. Don’t forget there are other stakeholders in the issue, depending on whether your outage is internal, external or both,” believes Fernando Castanheira, Chief Information Officer at Aternity.
“If you run a service for customers, they deserve to know what’s going on and to receive an estimated time to service restoration,” Anil Kuril, GM-IT at Union Bank of India, opines. In such cases, he believes that communication can’t be an afterthought. It must be a high priority, next only to resolving the outage.
Run your backups more frequently
While most businesses understand the importance of backing up their important documents and files, many don’t create a backup of their entire server, believes Shyamol B Das, Chief Information and Digital Officer at Mutual Trust Bank Ltd., Bangladesh. “What they don’t realize is that having a backup of your vital data won’t help much if you need to rebuild your server from scratch. Without a complete image of your server, the entire server settings can be lost in the event of a server crash,” he says.
Sometimes, it could take more than a week to restore your server to working order, given the work of installing the operating system, applying patches and updates, recreating file permissions, and setting up the email server, to name a few tasks. In other words, it disrupts the regular workflow of the organization.
One way CIOs can prevent this is by regularly using their backup systems as production systems. They can schedule times to move regular load to the backup systems. Das advises that while a system outage occurring in front of your eyes can be the worst thing for you and your company, you can at least be assured that when outages strike, you’re prepared, confident and responsive so as to avoid making a bad situation worse.
This outage demonstrates the risks of the whole Internet being dependent on one company, and experts believe its impact could have been minimized. Within a few hours of the incident, a Twitter user posted a question: “If WhatsApp isn’t back by tomorrow, is it a holiday?”
Nanjunda Prasad Ramesh, CEO, Multi-Verse Technologies, says, “We don’t want this situation to arise again. A possible solution to this is to segregate infrastructure by country/region so that the impact in case of failure is minimized/localized. It also helps respect the local data laws of that particular country/region. From a user’s perspective, we need to have alternatives available. It is not good for the entire world to be dependent on apps from one single company. Especially when communication and business transactions are dependent on them.”
Tim Mackey, Principal Security Strategist at Synopsys, believes that CIOs should be looking at the implications of these outages impacting Facebook, Instagram and WhatsApp and apply the best practices of security and privacy in their organizations.
“When an outage like this occurs, C-suite shouldn’t take for granted that the security of its information is protected and should take the opportunity to both reset our passwords used on social media platforms and to revoke and reauthorize our access tokens issued by those same platforms,” he said, adding that doing both of these items will minimize the chances of a malicious group benefiting from any service outage and gaining access to one’s personal data.