The paradox of technology is that it is constantly refined to prevent failure, yet failure remains almost inevitable when it comes to digital technology and online assets. For instance, you might be using the most advanced technologies and commit to 99% or even 99.99% uptime for your clients' websites, but you can never pledge 100% uptime. There is no mathematical model for predicting outages, and all the engineering work is built from the ground up to avoid them; even so, the precedent of many past incidents makes one thing a near certainty: someday an incident will cause downtime to your service, it will happen more than once, and it will recur no matter how low the frequency.
No service provider can completely eliminate this risk of failure, and Cloudflare is no exception. Unfortunately for them, the downtime struck a couple of days back. Because Cloudflare is the gateway to numerous websites all over the world, when its servers went offline, all the clients who route their websites through its platform lost access too. During this outage, which saw 19 Cloudflare data centers go offline, users failed to access apps and websites such as Zerodha, Upstox, Groww, Shopify, Canva, Twitter, and even Amazon Web Services. Even though the problem was solved within hours, and the data centers and websites resumed functioning one by one, the incident brought a critical issue to light.
More often than not, when Service Level Agreements are discussed, vendors and buyers focus on the features, capabilities and pricing on offer. All of that matters when the services function smoothly, i.e., for 99.99% of the time. However, as this incident shows, customers also need to know what will happen when that 0.01% downtime occurrence hits.
All major service providers, Cloudflare included, spend huge amounts building infrastructure and redundancy to cope with outages better, such as the network of data centers that went down in this incident. No ISP or hosting services provider ever wants servers to go offline, but it is essential to prepare for the occasions when they do.
It is much like the nuclear bunkers and vast networks of tunnels that both the Soviets and the Americans built during the Cold War era. With thousands of nuclear devices on each side, neither ever wanted to face the day when they had to run to those bunkers and tunnels, and thankfully they never have. Yet they prepared for it, because that is what prudence necessitated.
Similarly, adopting a ‘Design for Failure’ approach while building web infrastructure is critical to keeping the digital world up and running. Designing for failure means putting automated processes in place that save the day when the main systems fail. Many providers invest in redundancy, but a key question remains: can we also isolate the outage and reduce its surface area of impact? Ideally, an outage should disrupt only the service provider's own capability, and nothing else, for the window of time it lasts. The bigger failure in this case is that the disruption spread beyond that: it took down the availability of the businesses themselves, not just the security and acceleration features that Cloudflare provides.
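As a rough illustration of this failover idea, a 'Design for Failure' switch can be as simple as a health check that routes traffic to a direct-to-origin bypass path when the primary path stops responding. This is a minimal sketch, not Cloudflare's or any vendor's actual mechanism; the function names and endpoint arguments are hypothetical:

```python
import urllib.request
from typing import Callable

def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """True if the endpoint answers with an HTTP 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        # Timeouts, DNS failures, and HTTP errors all count as "down".
        return False

def pick_endpoint(primary: str, bypass: str,
                  check: Callable[[str], bool] = is_healthy) -> str:
    """Route through the provider while it is up; fall back to the
    direct-to-origin path the moment the health check fails."""
    return primary if check(primary) else bypass
```

In practice the switch would live in DNS or a load balancer rather than application code, but the principle is the same: detect the failure automatically and shrink its blast radius to the provider's own features.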
To take a simpler example, 'Design for Failure' infrastructure works much like the UPS devices attached to computers. A power outage is inevitable sometime or the other; the UPS, though not a permanent or ideal state to run in, keeps your computer working without any disruption until mains power returns.
Automatic service bypass features ensure that websites remain available even when the service provider's own servers go down. Building such capabilities takes a lot of resources, time and money, but it speaks volumes about the service provider's resiliency and its ability to foresee the risks customers might face. Had Cloudflare's clients been with a service provider that had a 'Design for Failure' system in place, they would not have experienced any outage: the provider's engineering framework would have switched over automatically and kept everything up and running, without the customer's involvement.
Going forward, as we increasingly move to a completely digital infrastructure for utilities, financial services, daily chores, professional tasks, communication and healthcare, even that 0.01% of outage will be too big a risk to accept.
There is another side to such incidents that makes them riskier still. When a website goes offline, it also becomes inaccessible to the security and monitoring tools its service provider may have deployed. To avoid being blindsided like that, cybersecurity and solutions providers need to integrate 'Monitoring for Failure' automation as well: tooling that continues to monitor and protect the websites kept up and running through the bypass, even when the main systems go down.
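A minimal sketch of such 'Monitoring for Failure' automation is an external watchdog that probes sites independently of the main stack, so visibility survives when the primary systems do not. The URLs, probe, and polling interval here are hypothetical placeholders, not any vendor's real tooling:

```python
import time
import urllib.request

def probe_once(url: str, timeout: float = 3.0) -> bool:
    """One independent availability probe, run from outside the main stack."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def watch(urls, probe=probe_once, interval: float = 30.0, rounds: int = 1):
    """Poll each site and return the ones that look down on the last round.
    A real watchdog would alert on-call staff or trigger the bypass path
    instead of merely returning a list."""
    down = []
    for i in range(rounds):
        down = [u for u in urls if not probe(u)]
        if i < rounds - 1:
            time.sleep(interval)
    return down
```

Because the probe runs from infrastructure outside the provider's network, it keeps reporting even during an outage like this one, which is exactly the blind spot the article describes.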
As a digital business, you must ask yourself whether you are willing to settle for a complete outage of your business, or whether you will demand capabilities that confine an outage to the service provider's own scope of services. It is not a tough choice to make. The next time you choose a vendor, make sure to ask whether the provider has 'Design for Failure' and 'Monitoring for Failure' frameworks in place, and, most importantly, whether it has a process to minimize the surface area of an outage's impact to its own services and nothing more.
(The author is Mr. Venkatesh Sundar – Co-founder and CMO, Indusface and the views expressed in this article are his own)