CIOs Can Learn From AWS Outage And Move On

by Sohini Bagchi    Mar 02, 2017


Every business is becoming a software business, the adage goes, which means cloud outages will ultimately result in business outages. It is hardly surprising, then, that a massive outage at Amazon Web Services (AWS) on Wednesday knocked down countless websites and impacted many businesses across the world.

AWS, Beware of rivals

The outage could also give rival firms an opening to pitch their services to Amazon’s customers and dissuade them from continuing with AWS, some analysts believe.

Trip Chowdhry of Global Equities Research commented in a report that the outage will have around a 2 percent negative impact on Amazon’s first-quarter revenue, but perhaps more concerning is that the outage occurred at the worst possible time. For example, it gives Oracle and others an opportunity to exploit Amazon’s shortcomings as a “single point of failure.”

“Competition is talking to customers and over-amplifying the problems of its Single point of Failure and you go Burst,” writes Chowdhry.

The list of companies impacted by the outage was indeed vast, and even industry giants like Apple were not immune. Apple’s App Store, music, TV, iBooks Store, iCloud, iTunes, Mac App Store and Apple Photos were all “dead,” and Twilio and Twitter were affected as well.

The outage illustrated just how big AWS has become. No wonder it is the ‘golden goose’ for Amazon, topping $12 billion in sales in 2016, up 55 percent from 2015 and blowing past its goal of reaching $10 billion in sales for the year. It has also captured more than 40 percent of the cloud computing market, according to a recent report.

Outages are inevitable

There have been many high-profile accounts in the news recently of big-name websites being attacked and crashing as a result. The New York Times has gone down twice in recent weeks, and Microsoft Outlook and Apple both experienced outages. Even Google wasn’t immune and was offline for almost five minutes.

AWS suffered its first cloud outage in 2011, and a series of outages has followed since then. Each time it was only a matter of a few hours, but the losses were tremendous.

As enterprises migrate more mission-critical workloads into production cloud environments, mere minutes of downtime from a provider can significantly impact profits, damage relations with customers, and cause IT administrators to prematurely age.

But while the global economy increasingly hinges on the ability of cloud services providers — especially those operating at hyper-scale proportions — to guarantee uptime and maintain service, outages are still common.

The causes can range from power outages to faulty software updates to overloaded servers to database errors. And far too often, we never learn the true nature and scope of the service failure.

A pertinent question is what a business can do when faced with the disaster of its website going down. Nati Shalom, CTO and founder at Cloudify, outlined on his blog some effective ways in which CIOs can avoid such failures.

- Have an expert disaster recovery manager in place.

- Find and design a solution that will take care of the disaster.

- Avoid single points of failure, for example by making parts of apps available in different regions or clouds.

- Put alerts in place.

- Document disaster recovery (DR) operational processes and automations, and then try breaking different parts of applications.

- Set up a redundant software application that can survive failure.
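
To make the advice on redundancy and alerts concrete, here is a minimal sketch in Python of how an application might fail over between regions or providers while raising an alert. It is an illustration only, not taken from Shalom's blog: the function name, the endpoint URLs and the alert callback are all hypothetical, and a real setup would more likely rely on a load balancer or DNS failover than on application code.

```python
from typing import Callable, Iterable, List, Optional


def first_healthy(
    endpoints: Iterable[str],
    is_healthy: Callable[[str], bool],
    alert: Callable[[str], None] = print,
) -> Optional[str]:
    """Return the first endpoint that passes its health check, in priority order.

    Every endpoint that fails the check triggers an alert, so operators learn
    about a degraded region even when failover succeeds.
    """
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception as exc:  # a probe that raises counts as unhealthy
            alert(f"health check error for {endpoint}: {exc}")
            continue
        alert(f"endpoint unhealthy, failing over: {endpoint}")
    alert("no healthy endpoint available")
    return None


# Hypothetical endpoints in two regions plus a second provider (illustrative only).
ENDPOINTS = [
    "https://us-east.example.com",
    "https://us-west.example.com",
    "https://other-cloud.example.com",
]

if __name__ == "__main__":
    # Simulate a regional outage: the primary is down, the others are up.
    down = {"https://us-east.example.com"}
    alerts: List[str] = []
    chosen = first_healthy(ENDPOINTS, lambda e: e not in down, alerts.append)
    print(chosen)
    print(alerts)
```

The health check is passed in as a function so the same logic can be exercised in a drill, by pretending a region is down, without touching production. That is one simple way to "break" part of an application on purpose and confirm the documented recovery path actually works.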

Experts agree that any IT system, whether on-premises or in the cloud, will always be susceptible to service outages, and organizations need to be prepared for that. As veteran entrepreneur and tech podcaster Marco Arment pointed out, “Nobody can gloat during the S3 outage. EVERY hosting platform goes down. It’s just a lesson for the young people who thought AWS didn’t.”

The most proactive CIOs, however, should take failure (their own or others’) into account to build a robust and dynamic infrastructure that can withstand any cloud failure.