On June 8, I experienced intermittent connectivity to several websites, including Amazon, Twitter, and Hulu. Frustrated, I refreshed my browser several times, but nothing improved. The outage was due to a failure at Fastly, a content delivery network provider that supports many of the web’s most popular sites. The incident highlighted the fragility of modern digital infrastructure, regardless of its complexity or design. Specifically, it re-emphasized the importance of a highly available digital architecture for avoiding costly downtime.
How Can Your Business Mitigate The Risk Of Downtime?
Your business should identify and implement proven principles for establishing a high-availability architecture. For example, if your business manages a private cloud, you’ll need to ensure that your resources have the following four attributes of a highly available system to mitigate downtime:
- Scalable — A highly available system must come with the ability to scale vertically by adding memory and storage resources to a virtual machine, or to scale horizontally by adding instances of resources to your configuration.
- Agile — A highly available system must be able to deploy resources quickly to meet changing conditions and requirements. The ability to automatically scale “elastic” resources is even better. Elastic resources are on-demand resources that can be dynamically scaled to meet changing workloads.
- Geo-Distributed — A highly available system must give you the ability to distribute computational clusters across regions and countries. This supports the ability to distribute workloads in a balanced way and provides opportunities for data redundancies to protect your information in case of failure. AWS, Azure, and other modern cloud providers can deploy your apps to regional data centers around the world.
- Power, Network & Computing Resource Redundancies — A highly available system must have a backup plan for an unexpected event so that it can recover quickly and resume normal operations. This means the system must have redundant power, network, and computing resources, along with redundant copies of its data, to fall back on when needed.
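To illustrate why redundancy matters, here is a minimal sketch of a common back-of-the-envelope estimate: if a service stays up as long as at least one of several redundant copies is up, and the copies fail independently, the combined availability is one minus the probability that all copies are down at once. The function name `redundant_availability` and the sample figures are hypothetical, and the independence assumption is a strong one. Real failures are often correlated, so treat the result as an optimistic upper bound, not a guarantee.

```python
def redundant_availability(single: float, copies: int) -> float:
    """Estimate availability of `copies` redundant components in parallel.

    Assumes each copy is available with probability `single` (0.0 to 1.0)
    and that copies fail independently of one another. This is an
    illustrative assumption, not a property of any specific provider.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # Hypothetical example: each copy is available 99% of the time.
    for n in (1, 2, 3):
        print(f"{n} copies: {redundant_availability(0.99, n):.6f}")
```

Under these assumptions, two copies that are each 99% available combine to roughly 99.99%, which is why duplicating power, network, and computing resources is a standard way to raise availability.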
In many cases, your business might already use a popular platform like Salesforce or Azure for your cloud computing services. It’s up to you to decide which service level to codify in a service-level agreement (SLA). An SLA is a performance commitment that usually focuses on uptime, the percentage of time that a service is operational.
As the table below shows, an SLA percentage of 99.999 allows for only 5.26 minutes of expected downtime per year. Ultimately, it’s up to the business to determine the value of its time and then decide which SLA is most cost-effective.
| SLA percentage | Downtime per week | Downtime per month | Downtime per year |
|---|---|---|---|
| 99 | 1.68 hours | 7.2 hours | 3.65 days |
| 99.9 | 10.1 minutes | 43.2 minutes | 8.76 hours |
| 99.95 | 5 minutes | 21.6 minutes | 4.38 hours |
| 99.99 | 1.01 minutes | 4.32 minutes | 52.56 minutes |
| 99.999 | 6 seconds | 25.9 seconds | 5.26 minutes |
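The figures in the table above follow from simple arithmetic: allowed downtime is the length of the period multiplied by the fraction of time the service is permitted to be down. The sketch below shows the calculation. The function name `allowed_downtime_seconds` is hypothetical, and the 30-day month and 365-day year are assumptions chosen because they reproduce the table’s values.

```python
# Period lengths in seconds. A 30-day month and a 365-day year are assumed
# here because they reproduce the figures in the table above.
SECONDS_PER_PERIOD = {
    "week": 7 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "year": 365 * 24 * 3600,
}


def allowed_downtime_seconds(sla_percent: float, period: str) -> float:
    """Return the downtime, in seconds, that an uptime SLA permits per period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    # Fraction of time allowed down = (100 - uptime percentage) / 100.
    return SECONDS_PER_PERIOD[period] * (100 - sla_percent) / 100


if __name__ == "__main__":
    for sla in (99, 99.9, 99.95, 99.99, 99.999):
        minutes_per_year = allowed_downtime_seconds(sla, "year") / 60
        print(f"{sla}% uptime allows about {minutes_per_year:.2f} "
              "minutes of downtime per year")
```

Multiplying the result by an estimate of what a minute of downtime costs your business gives a rough figure to weigh against the price of each SLA tier.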
It’s important to be aware that SLAs can quickly become expensive. Therefore, your business should first decide how much downtime it can afford. Then, you can determine which SLA and supporting architecture your business will need.
These attributes of a highly available system architecture can help mitigate the risk of a disruption to your business. The Fastly incident showed that even a short disruption to your business’s mission-critical systems can have meaningful financial and operational consequences. Therefore, take the time to evaluate the most cost-effective way to mitigate downtime for your business, whether that’s renegotiating your SLA or developing your own strategy using these principles to maximize the availability of your system. It’ll be well worth your time.
(The author is a Consultant in Opportune LLP’s Process & Technology practice and an expert in data science)