When you’re building critical market infrastructure used by the world’s largest financial institutions to process and settle trillions of dollars of assets every day, as we do at Baton Systems, resiliency needs to be a fundamental cornerstone of the entire architecture.

As an engineer, I used to think that resiliency was all about technology. However, I've come to understand, and really appreciate, that ensuring true resiliency requires a fully interconnected, three-pronged strategy:

  • Firstly, you need to build a resilient technology stack
  • Secondly, that technology stack needs to be supported by an operational process, which in turn needs resiliency built into it
  • Finally, a strong governance process needs to be in place that stands behind both the technology and the operational process

Let me explain further.

The Technology Stack
The type of critical market infrastructure Baton deploys needs to be able to support hundreds of millions – if not more than a billion – events a day. These events don't arrive at a steady rate: we're dealing with very bursty traffic and, at times, very high spikes – often at the beginning and end of the trading day. These spikes have to be handled efficiently, so the software needs to be able to run on commodity hardware and it needs to be able to scale (1). These are all factors that need to be considered in the way technology providers design multiple aspects of the solution, including the data pipes, storage, compute infrastructure, and monitoring and alerting processes.

At Baton, we’ve focused on this and built resiliency and redundancy into the technology itself, so that our clients benefit from a platform that can recover quickly. We believe this is incredibly important because there will always be elements outside of a technology provider’s control, and allowing for them needs to be built into the infrastructure. So we designed Baton’s technology to include real-time stream processing features such as MQ Series and Kafka-based data pipes that offer guaranteed delivery and large queue depths. The queues are monitored for latency, throughput and queue depth, for example. We also architected stream processing using serverless architectures and asynchronous processing to reduce latency and increase parallelism. Additionally, the monitoring systems feed into notification systems and case management tools for alerting and calling in different tiers of support as needed.
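To make the queue-monitoring idea concrete, here is a minimal in-process sketch. A real deployment would use MQ Series or Kafka with external monitoring; the class, method names and thresholds below are illustrative assumptions, not Baton's actual configuration.

```python
import time
from collections import deque


class MonitoredQueue:
    """Toy data pipe that checks queue depth on publish and
    delivery latency on consume, firing an alert hook when a
    threshold is exceeded. Thresholds are illustrative only."""

    def __init__(self, max_depth=1000, max_latency_s=5.0, alert=print):
        self.items = deque()              # (enqueue_time, payload) pairs
        self.max_depth = max_depth
        self.max_latency_s = max_latency_s
        self.alert = alert                # e.g. a notification/case-management hook

    def publish(self, payload):
        self.items.append((time.monotonic(), payload))
        if len(self.items) > self.max_depth:
            self.alert(f"queue depth {len(self.items)} exceeds {self.max_depth}")

    def consume(self):
        enqueued_at, payload = self.items.popleft()
        latency = time.monotonic() - enqueued_at
        if latency > self.max_latency_s:
            self.alert(f"delivery latency {latency:.2f}s exceeds {self.max_latency_s}s")
        return payload
```

In production the `alert` hook would feed the notification and case management tools mentioned above, rather than printing.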

Our software also autoscales using stateless microservices deployed on a Kubernetes cluster. Our services use an asynchronous, event-driven design pattern – they’re designed to be idempotent, so replaying data in any order will result in the same terminal state.
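The idempotent, order-independent property can be sketched as follows. Here each event carries an entity id and a version, and the fold keeps only the highest version per entity, so duplicates and reordering cannot change the terminal state. The event shape is an illustrative assumption, not Baton's actual schema.

```python
def apply_events(events):
    """Fold (entity_id, version, value) events into a terminal state.

    Keeping only the highest version per entity makes the fold
    idempotent and commutative: replaying the same events, in any
    order and any number of times, yields the same result.
    """
    state = {}
    for entity_id, version, value in events:
        current = state.get(entity_id)
        if current is None or version > current[0]:
            state[entity_id] = (version, value)
    # Strip the bookkeeping versions for the final view
    return {entity_id: value for entity_id, (_, value) in state.items()}
```

Because the fold is a pure function of the event set, a stateless service replica can crash, restart and replay its input without risk of double-applying an update.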

The Operational Process
Let’s talk about SLAs. As you’re probably aware, for this type of software the SLAs will include detailed throughput and latency numbers. As a technology provider, we need to measure our system’s performance against these SLAs in real time at all levels of our platform and application stack, including the compute, network, transport, data access and storage layers. We need to know when an SLA is likely to be breached and take corrective action. We tie this into a case management tool so that our operations teams are informed, and a chain of command kicks in for the business continuity and disaster recovery process. These plans need to be not only documented, but actually tested on a frequent basis.
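One common way to know an SLA is *likely* to be breached, rather than already breached, is to track a rolling latency percentile and warn when it approaches the limit. The sketch below assumes a p99 latency SLA; the window size and the 80% warning margin are illustrative assumptions, not Baton's real thresholds.

```python
from collections import deque


class SlaMonitor:
    """Tracks latency samples over a sliding window and reports
    whether the observed p99 is ok, approaching the SLA (warning),
    or over it (breach). Parameters are illustrative only."""

    def __init__(self, sla_ms=100.0, window=1000, warn_fraction=0.8):
        self.sla_ms = sla_ms
        self.warn_ms = sla_ms * warn_fraction
        self.samples = deque(maxlen=window)   # oldest samples age out

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def p99(self):
        ordered = sorted(self.samples)
        return ordered[int(0.99 * (len(ordered) - 1))]

    def status(self):
        p = self.p99()
        if p > self.sla_ms:
            return "breach"
        if p > self.warn_ms:
            return "warning"   # time to escalate via the case management tool
        return "ok"
```

The "warning" state is what gives the operations team room to act before the SLA is actually broken.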

Additionally, the level of operational support offered needs to align with client needs. To effectively support our clients, we provide support 24 hours a day, six days a week. This allows us to use the seventh day to deploy any patches or updates to the software. In anticipation of expanding client needs, we’re already preparing to extend our support to 24/7 – so once required, we’ll be ready to deliver.

The Governance Process
The governance structure ensures that the software and operational processes in place are backed by strong and effective controls. This means, for instance, ensuring there is no single point of failure – which, from a personnel perspective, means that if someone were to leave the business, neither the company, the client nor their data would be at risk. This is a cultural shift technology providers have to make as an organisation.

Accountability plays a huge role in any governance process. As part of Baton’s, we offer clients frequent governance reports and meetings where we review both our performance against SLAs and the system KPIs. We can also provide clients with automated reporting and offer a service portal so clients can log tickets and track an issue’s progress.

We believe true resiliency needs to be factored in at many levels when you’re deploying and supporting a bank’s critical market infrastructure, and we’re grateful to have been able to work so collaboratively with our clients as we’ve further developed and enhanced our approach to resiliency.

I hope this blog has provided you with a better understanding of how we work with our clients to manage resiliency. If you have any further questions please do not hesitate to reach out and email [email protected].

1. The ability to scale linearly on commodity hardware is important to keep costs low. For example, hardware that is three times as fast as commodity hardware can be more than six times as expensive, making the cost per transaction more than twice as high.