Chaos Engineering: Prepare Well Now to Win Wars Later

Fiserv, Inc. (NASDAQ: FISV) aspires to move money and information in a way that moves the world. As a global leader in payments and financial technology, the company helps clients achieve best-in-class results through a commitment to innovation and excellence in areas including account processing and digital banking solutions; card issuer processing and network services; payments; e-commerce; merchant acquiring and processing; and the Clover® cloud-based point-of-sale and business management platform. Mr. Girish Narasimha Raghavan, Vice President, Fiserv Global Services, shares his insights on chaos engineering.

 

Why is there a strong scope for high return on investment on chaos engineering?

Chaos engineering provides a mechanism for identifying and addressing potential abnormalities and failures that can occur within the ecosystem in which a software system operates. As part of the software development process, it can provide insights into mechanisms that enhance reliability, build predictability, and improve resilience. It essentially helps build confidence in the system’s ability to withstand the vagaries of a production environment.

Because these tests are run under controlled, simulated conditions, chaos engineering offers an opportunity to test, uncover, and fix critical issues safely and up front. This can save time otherwise spent troubleshooting during triage, reduce outages, and lower costs in the production phase.

 

What are some of the key principles of chaos engineering?

In an increasingly digital world, delivering a compelling customer experience requires providing continuously reliable services to a wide spectrum of customers.

However, as systems become more complex, distributed, and multi-cloud based, the variables that can cause abnormal system behavior continue to increase, and potential issues become more difficult to foresee. System abnormalities can lead to a bad customer experience and, at their worst, critical system failures. The best way to avoid these failures is to be proactive in learning about them and what can be done to minimize them. That is where chaos engineering comes in.

Chaos engineering entails conducting well-thought-out experiments that can showcase the strength of systems in the face of failure. It is a vital discipline that influences how software is designed and developed and addresses systemic unpredictability in distributed applications. The principles of chaos engineering offer us the ability to innovate swiftly at scale and deliver the high-caliber experiences clients expect.

The key principles of chaos engineering include:

  • Defining the steady state: The most important aspect is to understand what constitutes normal behaviour of the system. The principle is to define the ‘steady state’ under normal conditions and observe whether this state is retained when chaos is introduced into the system.
  • Real-world simulations: The most accurate outcomes in testing can come from using real-world circumstances and conditions. Through chaos engineering, testers insert bugs and possible failures that have previously manifested themselves in the system. These experiments are frequently run in a controlled environment using production setups. While it would be best to run the tests by creating a robust test environment that mimics the production phase, it can become difficult or expensive to replicate a sizable, dispersed system for testing.
  • In addition, experimentation can predict how a system ought to behave when faced with issues that affect its steady state. This involves testing the system, defining the parameters of the problem, and calibrating the real-world events and variables that can be abnormally affected. Examples include scenarios such as hard disk crashes and network unavailability.
  • Define and contain the impact: Chaos experiments often end up being run in production setups, so it is imperative to contain any potential impact of the test. A good understanding of the potential impact helps a chaos engineer make sure experiments do not adversely affect the customers using the system.
  • Automating tests: Manually running experiments can become time-consuming and effort-intensive. As the number of tests grows over time, manual models become unsustainable, and their use can discourage regular testing. It becomes imperative to automate and orchestrate these tests to run with minimal human intervention at regular intervals.
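
To make these principles concrete, here is a minimal sketch in Python of how a single experiment might be structured. It is purely illustrative and was not part of the interview: `ToyService`, `error_rate`, `run_experiment`, and the thresholds `max_error_rate` and `max_blast_radius` are hypothetical names and values, and the toy service simply stands in for a real system with replicas behind failover.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyService:
    """Hypothetical stand-in for a real system: replicas behind failover."""
    replicas_up: List[bool] = field(default_factory=lambda: [True, True, True])

    def handle_request(self) -> bool:
        # A request succeeds as long as at least one replica is still up.
        return any(self.replicas_up)


def error_rate(service: ToyService, probes: int = 100) -> float:
    """Fraction of probe requests that fail."""
    failures = sum(1 for _ in range(probes) if not service.handle_request())
    return failures / probes


def run_experiment(
    service: ToyService,
    replicas_to_kill: int,
    max_error_rate: float = 0.01,   # assumed steady-state threshold
    max_blast_radius: int = 1,      # assumed limit on injected failures
) -> str:
    # Contain the impact: refuse experiments beyond the agreed blast radius.
    if replicas_to_kill > max_blast_radius:
        return "rejected"
    # Steady state: confirm the system is healthy before injecting anything.
    if error_rate(service) > max_error_rate:
        return "aborted"
    original = list(service.replicas_up)
    try:
        # Real-world style fault: take some replicas offline.
        for i in range(min(replicas_to_kill, len(service.replicas_up))):
            service.replicas_up[i] = False
        # Check whether the steady state survives the fault.
        return "passed" if error_rate(service) <= max_error_rate else "failed"
    finally:
        # Always roll back so the experiment leaves no lasting damage.
        service.replicas_up[:] = original


if __name__ == "__main__":
    print(run_experiment(ToyService(), replicas_to_kill=1))
    print(run_experiment(ToyService(replicas_up=[True]), replicas_to_kill=1))
```

Because `run_experiment` needs no human input and returns a simple verdict, a scheduler or delivery pipeline could call it at regular intervals, which is one way to read the automation principle above. In a real setup the fault injection and the health probe would be supplied by whatever tooling the team already uses.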

 

How can Chaos Engineering facilitate the building of reliable financial systems?

Chaos engineering is fast becoming one of the strongest techniques to evaluate system reliability and promote financial system modernization. Running chaos experiments helps develop upfront insights into system behavior.

Through chaos engineering, organizations can proactively determine and institute safety buffers that prevent probable catastrophic failures and enhance the resiliency of the system. This facilitates the development and operation of dependable financial systems that adhere to internal and external regulatory standards, and it helps modernize the user experience.

 

What do you think is the future of Chaos Engineering?

Chaos engineering can be seen as a practical approach to identifying system vulnerabilities within the operating environment that are unlikely to be anticipated by the human mind. The future lies in developing a holistic understanding of the need to ensure a high degree of resilience and reliability in modern distributed software systems.

This can be achieved by making chaos engineering a regular practice that is institutionalized within the software engineering cycle, by training personnel in this discipline, by sharing chaos plans with other business divisions, and by collectively evolving the practice over time.

Another step to ensure a higher degree of resilience and reliability of modern systems is ‘shifting left’ on site reliability engineering (SRE) practices and integrating chaos engineering as part of ‘DevOps’ during the continuous delivery, deployment, and validation stages.

I also see a trend towards increasing adoption of chaos engineering best practices across all industries and software stacks. This would logically lead to the development of more tools and frameworks that help automate and execute chaos experiments with minimal friction.

 

What is the impact of software failures?

We are living in a digital world in which consumers and businesses expect 24×7 availability. There is little tolerance for sub-optimal experiences, and an inability to meet customer expectations can lead to significant customer attrition and negative business impact.

Preventing software failures before they have an impact on customers is therefore a business necessity.

 

How beneficial is Chaos Engineering for fintech?

The use of chaos engineering facilitates the delivery of more reliable and more robust services.

Chaos experiments aid in the reduction of technical mishaps and enable engineering teams to gain a better understanding of systems and their interdependencies. This can result in early exposure of technical weaknesses that can be addressed by the creation of more durable, resilient designs. End users benefit from fewer outages and less disruption to key services.

By preventing extended outages, chaos engineering also enables fintechs to avoid the potentially severe financial losses such outages can cause. Empowering fintechs with real-world risk analysis, enhanced performance, and increased resiliency for customers is a compelling reason to turn to chaos engineering.
