Managing the Inevitable – A Primer on Operational Resilience
As recent events show, the phrase ‘expect the unexpected’ is good advice. Over the past few years, there have been many events that we could not easily predict, yet in hindsight seemed inevitable: large-scale cyber attacks, data outages and supplier process failures, as well as disrupted commutes, terror incidents and extreme natural disasters. While there is no link between these occurrences, they typically have one thing in common. They have all, to varying degrees, disrupted financial services firms’ ability to provide their customers with an expected level of service.
The financial ecosystem is becoming more complex, with greater outsourcing, more use of cloud computing and fintechs operating at more points within the value chain. All of these increase the opportunities for disruption to services at the same time as customers and regulators alike expect more.
Operational resilience is the response to this. The underlying assumption behind operational resilience is that events will occur and that the response needs to be planned and managed once the inevitable happens. It is no longer a question of ‘if’ but ‘when’. It differs from operational risk in assuming that events will happen and require a response rather than measuring control effectiveness and quantifying potential losses. It is more focused on the broad impact on customers and financial stability than continuity of business (CoB) and operational continuity in resolution (OCIR).
Operational resilience should not be seen as a one-off exercise but rather as a consideration that should be embedded in how a firm operates, and in all decision-making. The analysis needs to be refreshed regularly and the response to potential events rehearsed on a frequent basis. It would also be wrong to assume that operational resilience revolves primarily around cyber threats. Most service disruptions are caused by internal problems, such as the issues around the TSB data migration in 2018 following its sale from Lloyds to Sabadell. The chart below shows the number of disruptions reported to the FCA in 2017 and 2018 by type.
While the responsibility for a firm’s resilience framework rests, under the UK Senior Managers and Certification Regime (SM&CR), with the SMF24 Chief Operations Function, the ultimate responsibility lies with boards and senior managers and they should be familiar with and sign off the approach as part of the annual self-certification process.
For most firms, many of the elements required to ensure operational resilience already exist to some degree. What has changed is that UK financial regulators have, following an extensive consultation exercise, defined the steps that they expect firms they regulate to undertake to ensure that their operations are resilient.
Both the UK FCA and PRA published final consultation papers on the topic in December 2019, stating that they expect to publish formal regulations mandating these steps in late 2020 for implementation over the following three years. While the principle of proportionality will apply, it will show more in the sense of urgency expected and in the impact tolerances placed on a firm’s business operations than in the omission of any element.
There are three distinct phases to operational resilience:
- Preparing for the inevitable.
This drives the bulk of the task and involves really understanding the underlying dynamics of key business processes and their vulnerabilities, and then testing how they respond to simulated events. The step-by-step approach taken by the regulators breaks this down into a logical series of actions.
- Managing the response.
The success or otherwise in responding to an event will be determined by the thoroughness of the steps taken beforehand around training, governance, communications and setting up the physical/informational arrangements to manage the response.
- Learning lessons.
Processes and procedures need to be reviewed in light of events that have impacted the firm or other organizations to ensure that the approach to resilience is still sound and that the firm can stay within its impact tolerances.
Preparing For the Inevitable
Identification of important business services
The starting point is identifying the important business services which a firm delivers to its customers and, by extension in some circumstances, the market. These are specific, viewed from a customer perspective, and include items such as making an annuity payment, making correspondent banking payments, providing account balances, selling or buying equities, customer authentication (that is, the gateway to providing further services) and the like. They are not internal systems such as a general ledger, HR database or HR salary system, nor are they business lines such as mortgages or foreign exchange.
Definition of impact tolerances
For each important business service, firms should define a measurable level of disruption that is the maximum tolerable to the customers who use the service. Impact tolerances should be expressed in terms of the impact on clients and also on the broader market, if relevant. They should take into account relevant factors such as the number of customers affected and the financial loss to them, the duration of the disruption, data integrity, substitutability, time of day and the impact on market stability.
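As an illustration only, an impact tolerance can be recorded as a small set of measurable limits and compared against an observed or simulated disruption. The sketch below assumes just two of the factors listed above (duration and number of customers affected). The class names, field names and figures are hypothetical and are not regulatory thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactTolerance:
    """Maximum tolerable disruption for one important business service."""
    service: str
    max_outage_minutes: int      # limit on the duration of the disruption
    max_customers_affected: int  # limit on the number of customers affected


@dataclass(frozen=True)
class Disruption:
    """An observed or simulated disruption to a service."""
    service: str
    outage_minutes: int
    customers_affected: int


def breaches(tolerance: ImpactTolerance, event: Disruption) -> list[str]:
    """Return the names of the tolerance limits that the event exceeds."""
    if tolerance.service != event.service:
        raise ValueError("tolerance and event refer to different services")
    exceeded = []
    if event.outage_minutes > tolerance.max_outage_minutes:
        exceeded.append("duration")
    if event.customers_affected > tolerance.max_customers_affected:
        exceeded.append("customers")
    return exceeded


# Illustrative figures only (assumptions, not taken from any regulation).
tolerance = ImpactTolerance("annuity payment", max_outage_minutes=240,
                            max_customers_affected=5000)
event = Disruption("annuity payment", outage_minutes=300,
                   customers_affected=1200)
print(breaches(tolerance, event))  # prints ['duration']
```

In practice a firm would probably add further dimensions, such as financial loss or data integrity, in the same way.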
To understand the potential points of disruption to a process, it first needs to be mapped. Parameters such as timing should be added to the information flows between the components of the processes that deliver all of a firm’s important business services. There is a careful balance to be drawn between going into too much detail and capturing the points in a process that could potentially cause disruption. There is no need to include the mapping documentation in the annual self-certification.
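One possible way to hold such a map, sketched here under simplifying assumptions, is as a set of steps that each carry a duration and a list of upstream steps. The end-to-end time of the service is then the longest path through the map. The step names and timings are invented for illustration, and the sketch assumes the map contains no circular dependencies.

```python
from functools import lru_cache

# Hypothetical process map for one service:
# step -> (duration in minutes, steps that must finish first).
PROCESS_MAP = {
    "receive instruction": (1, []),
    "authenticate customer": (2, ["receive instruction"]),
    "sanctions screening": (5, ["authenticate customer"]),
    "check balance": (1, ["authenticate customer"]),
    "release payment": (3, ["sanctions screening", "check balance"]),
}


def end_to_end_minutes(process_map: dict, final_step: str) -> int:
    """Longest elapsed time from the start of the process to the end of final_step."""

    @lru_cache(maxsize=None)
    def finish_time(step: str) -> int:
        minutes, upstream = process_map[step]
        # A step can only start once its slowest upstream step has finished.
        return minutes + max((finish_time(u) for u in upstream), default=0)

    return finish_time(final_step)


print(end_to_end_minutes(PROCESS_MAP, "release payment"))  # prints 11
```

Comparing this figure with the relevant impact tolerance gives a first indication of how much slack the process has before a delay at any one step becomes intolerable.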
Third-party service providers
As the provision of financial services and the structure of financial companies evolve, firms are making increasing use of suppliers in key processes, including FMIs, fintechs and cloud computing providers. As a result, there is a growing recognition that resilience extends beyond the traditional boundaries of a firm. There is also an awareness that the provision of services to the market by a limited number of suppliers may itself create concentration risk.
With the key elements of a process mapped and third parties identified, a thorough review of the vulnerabilities at each step of the process can be undertaken. Some of these will be more general than others (terrorist attack or natural disaster versus EUC host platform failure). All aspects should be taken into account, not just technology but also factors such as dependency on key individuals and single points of failure. A good example is the crypto exchange CEO who died along with his passwords.
Once vulnerabilities have been identified, the necessary resources should be committed to reducing each one to the point at which the process can be managed so as not to exceed the defined impact tolerance if an event occurs. This may lead to the acceleration of system replacement where the cost of addressing specific vulnerabilities is uneconomic compared with that of simply replacing legacy systems. This is also an opportunity to simplify systems into a more customer-centric model that is better suited to a world of APIs and increasing integration with suppliers.
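A simple first pass at finding single points of failure, assuming the process map records who or what is able to perform each step, is to flag every step that has only one such resource. The step and resource names below are hypothetical.

```python
# Hypothetical mapping: process step -> people or systems able to perform it.
STEP_RESOURCES = {
    "approve payment release": ["ops manager A", "ops manager B"],
    "hold exchange passwords": ["CEO"],
    "run end-of-day batch": ["legacy mainframe"],
}


def single_points_of_failure(step_resources: dict) -> dict:
    """Return each step that depends on exactly one person or system."""
    return {
        step: resources[0]
        for step, resources in step_resources.items()
        if len(resources) == 1
    }


for step, resource in single_points_of_failure(STEP_RESOURCES).items():
    print(f"{step}: depends solely on {resource}")
```

A check of this kind covers only the dependencies that have been written down, so it complements rather than replaces the wider review described above.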
The opportunity should also be taken to improve the MIS that is generated on a firm’s process performance and status. Once the approach to resilience is embedded it can be designed into systems from inception at little additional cost. Embedding resilience will also involve training individuals responsible for process design and implementation to ensure that good practice is followed and reduce the need for subsequent remediation.
To ensure that impact tolerances are not exceeded when an event occurs, a number of plausible scenarios should be run through rigorously to identify each one’s impact on the important business services (IBS). While the regulators will not specify the scenarios that should be considered, their emphasis is on events that are both plausible and severe; they fully realize that there are some theoretically possible scenarios, or combinations of scenarios, that would cause an impact tolerance to be breached. Scenario testing should be a part of the annual self-certification process.
(The authors: Will Packard is Managing Principal, James Arnett is Partner and Owen Jelf is Partner at CAPCO)
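A desk-based version of such a test might look like the sketch below. It assumes that each service depends on a set of components, that each scenario takes some components out for an estimated number of hours, and that a service is unavailable until its slowest failed dependency recovers. All service names, components, recovery times and tolerances are invented for illustration.

```python
# Hypothetical data: the components each important business service depends on.
DEPENDENCIES = {
    "provide account balances": {"core banking", "mobile app"},
    "make annuity payment": {"core banking", "payments gateway"},
}

# Hypothetical impact tolerances, expressed as maximum outage in hours.
TOLERANCE_HOURS = {
    "provide account balances": 4,
    "make annuity payment": 24,
}

# Each scenario maps a failed component to its assumed recovery time in hours.
SCENARIOS = {
    "data centre power loss": {"core banking": 8},
    "cloud provider outage": {"mobile app": 2, "payments gateway": 30},
}


def run_scenario(failures: dict) -> dict:
    """Return, per service, (outage in hours, whether the tolerance is breached)."""
    results = {}
    for service, components in DEPENDENCIES.items():
        # The service stays down until its slowest failed dependency recovers.
        outage = max(
            (hours for comp, hours in failures.items() if comp in components),
            default=0,
        )
        results[service] = (outage, outage > TOLERANCE_HOURS[service])
    return results


for name, failures in SCENARIOS.items():
    for service, (outage, breached) in run_scenario(failures).items():
        status = "BREACH" if breached else "within tolerance"
        print(f"{name} | {service}: {outage}h outage, {status}")
```

With these assumed figures, the power loss breaches the tolerance for account balances but not for annuity payments, while the cloud outage does the reverse. A real exercise would also test workarounds and substitutes, which this sketch ignores.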