AIOps – the combination of big data and machine learning to automate IT operations processes – allows businesses to manage large volumes of data across distributed IT systems and reduces the time to perform and respond to daily operational tasks.
By: Anand Patil
A few weeks ago, I had the opportunity to meet up with the CIO of a large Indian enterprise. This CIO is well known for driving the digital agenda for the company, which has helped it get much closer to its customers. This time, however, there was concern in his voice. A critical, customer-facing application had been reporting performance issues for a few days, and his Operations team had spent that time trying to get to the bottom of the problem, without much success. This was causing dissatisfaction among customers, not to mention escalations from business stakeholders who wanted the issue resolved immediately.
If this scenario sounds familiar, you are not alone. Virtually every business today depends on the continuous performance and innovation of its digital services. And today’s application environments have become increasingly complex – IT professionals have to ensure the performance and reliability of distributed systems across virtualized and multi-cloud environments, interspersed with multi-layer security and controls. When an issue hits, the atmosphere in the IT Operations room can range from tense and frantic to resembling a war zone, depending on the nature of the issue and the service impacted.
Downtime is Expensive
IT outages can be expensive. In ITIC’s 11th annual Hourly Cost of Downtime survey of 1,000 businesses, conducted from March through June 2020, 40% of those surveyed indicated that a single hour of downtime can cost their firms from $1 million to over $5 million! LogicMonitor’s 2019 IT Outage Impact Study found that companies that frequently experience brownouts and outages often incur costs 16 times higher than companies with fewer downtime instances. The study also found that companies with frequent downtime require almost twice as many team members tasked with troubleshooting problems, and that troubleshooting in these companies takes almost twice as long.
Needless to say, reducing downtime can have a significant impact on a company’s bottom line. But how do the digital leaders of today take a proactive approach to managing resources, capacity, storage and security, while ensuring business continuity? The answer lies in leveraging the very capabilities that are being deployed through their digital initiatives.
Let’s try and break this down. The biggest challenge for operations teams today is the large volumes of data being generated from every corner of the IT environment. It is almost impossible to think that this data can be processed through manual, human-dependent processes. Yet, the answers to improving SLAs, reducing Mean Time to Repair (MTTR) and identifying trends lie in this very data. The second challenge is that IT teams often work in disconnected silos, with very little cross-domain visibility or understanding. This is where AIOps can help.
Gartner defines AIOps as combining big data and machine learning to automate IT operations processes, including event correlation, anomaly detection and causality determination. Simply put, it is the technology that allows businesses to manage large volumes of data across distributed IT systems and reduces the time it takes to perform and respond to daily operational tasks.
Artificial Intelligence (AI) and Machine Learning (ML) can ingest this data from the IT infrastructure and perform analytics to filter the data, identify patterns and arrange them into meaningful information that teams can act on efficiently. They can help automate routine practices, identify issues faster and more accurately than humans, and also help break down the silos in IT Operations by providing a unified view of the digital infrastructure.
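To make the idea of anomaly detection concrete, here is a minimal sketch, assuming a single metric (say, response time) sampled at regular intervals. It flags a reading when it sits far outside the range of the readings just before it. The function name, window size and threshold are illustrative assumptions, not part of any particular AIOps product, and real platforms use considerably more sophisticated models.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices of readings that deviate strongly from recent history.

    A reading is flagged when it lies more than `threshold` standard
    deviations away from the mean of the `window` readings before it.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: a z-score cannot be computed, so skip.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds, with one obvious spike.
response_times = [100, 102, 98, 101, 99, 100, 250, 101]
spikes = find_anomalies(response_times)
```

Here `spikes` holds the position of the 250 ms reading. A simple rule like this is noisy on its own, which is why the correlation and prioritization steps described in the use cases matter.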
Use Cases for AIOps
The most basic use case of AIOps is Intelligent Alerting. This uses event data from various sources – such as log files, monitoring tools, ticketing systems and others – and correlates it to identify significant incidents and generate intelligent alerts, which can be prioritized based on user and business impact. This helps in minimizing noise or spurious alerts and reduces “alert fatigue”.
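As a rough illustration of Intelligent Alerting, the sketch below groups raw events for the same service into incidents when they arrive close together in time, then ranks the incidents by an assumed business-impact weight. The `Event` fields, the `IMPACT` table and the five-minute window are all hypothetical choices made for this example; an actual platform would learn correlations rather than rely on fixed rules.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds since some reference point
    source: str       # e.g. "log", "monitor", "ticket"
    service: str      # service the event relates to
    message: str


# Hypothetical business-impact weights; higher means more critical.
IMPACT = {"checkout": 3, "search": 2, "internal-wiki": 1}


def correlate(events, window=300):
    """Group events per service; an event joins the current group when it
    arrives within `window` seconds of that group's most recent event."""
    by_service = defaultdict(list)
    for event in sorted(events, key=lambda e: e.timestamp):
        groups = by_service[event.service]
        if groups and event.timestamp - groups[-1][-1].timestamp <= window:
            groups[-1].append(event)
        else:
            groups.append([event])
    return [group for groups in by_service.values() for group in groups]


def prioritize(incidents):
    """Rank incidents by impact weight, then by how many distinct
    sources reported them."""
    def score(incident):
        impact = IMPACT.get(incident[0].service, 1)
        sources = len({e.source for e in incident})
        return (impact, sources)
    return sorted(incidents, key=score, reverse=True)


events = [
    Event(0, "log", "checkout", "timeout calling payment gateway"),
    Event(10, "log", "internal-wiki", "slow page render"),
    Event(60, "monitor", "checkout", "latency above threshold"),
    Event(120, "ticket", "checkout", "customer reports failed payment"),
    Event(1000, "log", "checkout", "timeout calling payment gateway"),
]
alerts = prioritize(correlate(events))
```

Five raw events collapse into three incidents, and the checkout incident confirmed by three independent sources comes first. That reduction is the "noise" the operations team no longer has to triage by hand.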
As we move along the maturity curve of adoption, AIOps can enable many practical use cases that can be adopted by Enterprises to transform their IT Operations:
- Cross-Domain Situational Understanding: The basic ability of an AIOps platform to collect data from various sources allows it to provide an aggregated, cross-domain view and create causality between silos of systems.
- Identification of Probable Root Causes: Combining Intelligent Alerting and Cross-Domain causality allows IT teams to quickly narrow down the top suspected causes of an issue. It also allows for feedback, where the AI engine can learn from human expertise.
- Cohort Analysis: The ability of the AI engine to scale across thousands of instances makes it relatively easy to identify any differences in deployed versions or configurations – something that is extremely hard to do for humans.
- Automated Remediation: This is probably the most sought-after goal of all IT teams, and AIOps has the potential to make it real. Once problems are reliably identified, closed-loop feedback can be implemented to remediate the issue based on past actions and historical data.
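Of the use cases above, Cohort Analysis is perhaps the easiest to picture in code. The sketch below, a simplified illustration rather than how any specific product works, compares the configuration of many instances and reports those that differ from the majority. The instance names and configuration keys are made up for the example.

```python
from collections import Counter


def find_outliers(instances):
    """Compare each instance's settings against the most common value.

    `instances` maps an instance name to a dict of setting -> value.
    Returns a dict of instance name -> {setting: (actual, majority)}
    for every instance that deviates from the majority.
    """
    keys = {key for config in instances.values() for key in config}
    majority = {}
    for key in keys:
        counts = Counter(config.get(key) for config in instances.values())
        majority[key] = counts.most_common(1)[0][0]

    outliers = {}
    for name, config in instances.items():
        diff = {
            key: (config.get(key), majority[key])
            for key in keys
            if config.get(key) != majority[key]
        }
        if diff:
            outliers[name] = diff
    return outliers


# Hypothetical fleet: one instance is running an older version.
fleet = {
    "web-1": {"version": "2.1", "heap": "4g"},
    "web-2": {"version": "2.1", "heap": "4g"},
    "web-3": {"version": "2.1", "heap": "4g"},
    "web-4": {"version": "2.0", "heap": "4g"},
}
drift = find_outliers(fleet)
```

With four instances the odd one out is easy to spot by eye. The same comparison across thousands of instances is where automation earns its keep.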
As a simple example of how AIOps can help in automatic remediation, take the case of an application running in a virtualized environment. A spike in demand depletes the resources available to the app, and thereby slows down its performance. On detecting this symptom, the AIOps engine is able to correlate the slow-down to low resources, and also trigger the relevant system to either spin up new instances or increase the allocation of resources (like memory) to correct the issue on a continuous basis – all without human intervention!
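The scenario above can be sketched as a simple decision rule. This is a minimal illustration under stated assumptions: the thresholds, the `Metrics` fields and the action names (`scale_out`, `increase_memory`, `escalate`) are hypothetical, and the handlers stand in for calls to whatever orchestration system an organization actually uses.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    latency_ms: float       # observed application response time
    memory_used_pct: float  # share of allocated memory in use
    instances: int          # number of running instances


def decide_action(m, latency_slo_ms=500.0, memory_limit_pct=85.0,
                  max_instances=10):
    """Map observed symptoms to a remediation action (illustrative rules)."""
    if m.latency_ms <= latency_slo_ms:
        return "none"             # performance is within target
    if m.memory_used_pct < memory_limit_pct:
        return "escalate"         # slow, but not for lack of resources
    if m.instances < max_instances:
        return "scale_out"        # spin up another instance
    return "increase_memory"      # instance cap reached, grow allocation


def remediation_step(m, handlers):
    """Run one pass of the loop: decide, then invoke the matching handler."""
    action = decide_action(m)
    handler = handlers.get(action)
    if handler is not None:
        handler()
    return action


# Stand-in handlers that only record what would have been triggered.
triggered = []
handlers = {
    "scale_out": lambda: triggered.append("scale_out"),
    "increase_memory": lambda: triggered.append("increase_memory"),
    "escalate": lambda: triggered.append("escalate"),
}
result = remediation_step(Metrics(latency_ms=900, memory_used_pct=95,
                                  instances=2), handlers)
```

In this run the slow-down coincides with memory pressure and there is room to grow, so the step triggers a scale-out. Note the `escalate` branch: when the symptom does not match a known cause, handing over to a human is safer than guessing.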
AIOps uses a collection of AI strategies – including aggregation, analytics, algorithms, automation and orchestration, machine learning and visualization. Most of these technologies are well defined and mature today and are ready to be deployed by businesses that want to transform their IT Operations.
However, deploying AIOps takes careful planning and a longer-term strategy to make it successful. It is important to first have a well-thought-through technological framework for gathering data from the various sources in your environment.
As with most large transformational projects, it is important to start with a short-term goal, in an area where results can be seen immediately. This can help your teams gain valuable experience while helping you evangelize the results when expanding the scope of the project. As with every AI project, AIOps also needs training and adjustment.
It is important to revisit your original goals from time to time and check that the platform is working as intended.
As the world goes digital, application performance is a key business metric and one that has a direct impact on brand loyalty, and often, the bottom line. However, in this increasingly complex IT environment, delivering application performance demands working across a myriad of expansive environments and trying to make sense of all the data that is generated by these systems.
This is where AIOps can help. By leveraging Artificial Intelligence and Machine Learning techniques, AIOps can help identify problems before they start, helping IT avert expensive outages. AIOps also enables IT teams to get cross-domain visibility and causality, thereby fostering collaboration and breaking down the silos that typically get in the way of efficient operations.
(The author is Director, Systems Engineering, Cisco India and SAARC and the views expressed in this article are his own)