Werner Vogels, CTO of Amazon, said it best: “Everything fails, all the time.” The statement is simple and obvious, yet also quite thought-provoking. Infrastructure can and does fail for a myriad of reasons, e.g., natural failure rates of hardware, natural disasters, and problems with power, network, or cooling. This means the applications that run on that infrastructure must be built to tolerate these failures and maintain service level agreements (SLAs).

For the most part, Cloud Service Providers (CSPs) are able to provide incredibly high levels of availability while largely avoiding outages. However, when outages do happen, they tend to hit hard and affect a large number of customers. Customers who have designed their application stacks to tolerate transient failure often see little impact.

That brings us to the topic of architecting for failure. It might seem a bit ominous, but if you know failures can and will happen, you have every opportunity to plan for them. In this post I’ll dig into some perspectives and thoughts around architecting for failure.

From shared on-premises infrastructure to cloud

Before we dive into what this means specifically for cloud workloads, let’s take a step back and look at the progression of infrastructure that got us to where we are today. Shared infrastructure is not a new concept; in fact, it’s been around since the early days of mainframes. Server virtualization was the ultimate manifestation of shared infrastructure, and VMware capitalized on it for nearly 20 years. Growing server capacities have steadily enlarged the blast radius of a single server, which can now host hundreds of virtual machines.

Cloud greatly changed the game by moving this shared infrastructure to a place you can neither see nor control. Hyperscalers have economies of both scale and scope. Even the least mature providers can run infrastructure quite well, and at pennies on the dollar compared to what a typical on-premises shop can do on its own.

The real game-changing capability that cloud brings is that applications can achieve regional, and even global, availability without the tremendous expense and complexity of building and operating data centers and colocation footprints. That complexity has shifted from the infrastructure layer to the application layer. Application architects must understand best practices for availability and durability. In the cloud you can’t simply rely on shared infrastructure to fail applications over. Application architectures must be rooted in clearly defined requirements and designed to meet or exceed them.

Understanding requirements

A common theme among many workloads is a lack of business requirements around availability and resiliency. Such requirements are critical: they drive the application architecture and ensure the business’s needs are actually met. Clearly defined requirements also greatly aid in cost-optimizing an application architecture.

The two main measures of application availability and resiliency that drive architecture are known as Recovery Point Objective (RPO) and Recovery Time Objective (RTO). Collectively, they express how critical an application is, especially in the event of infrastructure failure (because we know failure will happen). Let’s think of these in their extremes. RPO is how much data an application can afford to lose, and RTO is how long an application can afford to be down. It’s tempting to say that both values are zero, but that describes a real-time system, which is one of the most expensive and complex architectures to maintain and operate.

There’s a lot that goes into systems that never lose data and never go down. The reality is that the vast majority of applications have non-zero values for RPO, RTO, or both. When the RTO and RPO metrics for an application reflect its value to the business, the result is, at the very least, a clearer understanding of an appropriate architecture.
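
To make those two metrics concrete, here is a minimal sketch in Python of checking hypothetical RPO and RTO targets against values measured in a failover exercise. All of the numbers are invented for illustration; the point is simply that targets only mean something when compared against what the architecture actually delivers.

```python
from datetime import timedelta

# Hypothetical targets agreed upon with the business.
rpo_target = timedelta(minutes=15)  # maximum acceptable data loss
rto_target = timedelta(hours=1)     # maximum acceptable downtime

# Hypothetical measurements from a recent failover exercise.
backup_interval = timedelta(minutes=30)    # worst-case data loss window
measured_recovery = timedelta(minutes=45)  # time to restore service

print("RPO met" if backup_interval <= rpo_target else "RPO missed")
print("RTO met" if measured_recovery <= rto_target else "RTO missed")
```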

The absolute worst case I’ve personally seen was an environment where the infrastructure team claimed a “blanket 24-hour RTO and RPO.” The application owners had quite different actual requirements for their most critical applications. In reality, there was no ability to meet even a 24-hour objective. Enter a failure that lasted over 24 hours, and there was no runbook for how to even execute a manual failover. Needless to say, there was an outage, critical business data was lost, and there was no failover to the disaster recovery site.

Failure testing

This naturally raises the question: “How can I ensure my application is available and resilient?” Regardless of architecture, failure testing is key. It may go by the name chaos engineering in some places, but whatever it’s called, it’s essentially failure testing, and it is critical to ensuring application availability and resiliency.

In a data center this can mean unplugging power and network cables. This is actually something I did quite often before deploying systems into production on-premises. If you understand the fault domains and how hardware fails, you can design appropriately to meet requirements. In a cloud environment it takes on a different meaning, but it’s easy enough to find the virtual power button and break even managed cloud services. 

Failure testing falls largely into two camps: manual and automated.

Manual

Testing should happen with any architecture before it is deployed into production. Doing so manually is one of the easiest ways to test for common failure patterns. Practically, this means shutting down VMs and instances, hand-editing hostnames and routes to make traffic fail, and anything else you can do to inject failure, whether real or simulated.
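
As a concrete illustration, here is a minimal manual fault-injection sketch using boto3 against an AWS EC2 instance. The instance ID is hypothetical and credentials are assumed to be configured already; the idea is simply to press the “virtual power button” on a node and watch how the application behaves.

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical ID of a non-critical node in the application tier.
target_instance = "i-0123456789abcdef0"

# Press the "virtual power button": stop the instance and observe
# whether traffic fails over the way the design says it should.
ec2.stop_instances(InstanceIds=[target_instance])
print(f"Stopped {target_instance}; observe the application, then restart it.")

# Once the behavior is understood, bring the node back:
# ec2.start_instances(InstanceIds=[target_instance])
```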

This kind of testing should certainly be done before an application goes into production. After that, it may become part of annual disaster recovery testing, but it should still be executed with some frequency to ensure the components of an application fail in ways that are understood and predictable.

Automated

Kicking the manual approach up a notch, application teams are increasingly looking to chaos engineering. This fully embraces the notion of failure by injecting unpredictable failures into a production application. Netflix pioneered this in the cloud with its open source project Chaos Monkey.

A word of caution: don’t just unleash tools that break infrastructure into the wild without thoughtful testing and consideration. It’s not for every application, but for Netflix it has certainly helped prepare for the kind of unpredictable failures that happen at scale, up to and including getting through some fairly public outages.
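
To show the flavor of the automated approach, here is a toy chaos-style script in the spirit of Chaos Monkey (not the actual Netflix tool). It assumes instances have opted in via a hypothetical chaos-opt-in tag and randomly terminates one of them; in practice you would wrap this in schedules, guardrails, and notifications.

```python
import random

import boto3

ec2 = boto3.client("ec2")

# Find running instances that have opted in via a hypothetical tag.
resp = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos-opt-in", "Values": ["true"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

candidates = [
    inst["InstanceId"]
    for reservation in resp["Reservations"]
    for inst in reservation["Instances"]
]

if candidates:
    victim = random.choice(candidates)
    print(f"Terminating {victim} to verify the application tolerates its loss")
    ec2.terminate_instances(InstanceIds=[victim])
else:
    print("No opted-in instances found; nothing to break today")
```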

Testing under load

Another point that’s easy to gloss over is testing failover under load. Failures tend to happen when you least expect them, and a system may already be under stress when they do. The additional stress of a failure can topple an application. To observe and understand how a system will react to failures, it’s important to test under load.

There are a variety of tools that can assist with load testing. Bees with Machine Guns, Selenium, and JMeter are just a few of the open source options out there. One of the benefits of cloud is that you can build an incredibly large test bed and really exercise the resiliency of your application under load.
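
For a sense of what a simple load generator looks like, here is a minimal sketch that hammers a hypothetical endpoint with concurrent requests using Python’s requests library. Run something like this while you inject a failure, and the summary at the end shows whether errors spike or the application rides it out.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://app.example.com/health"  # hypothetical endpoint
TOTAL_REQUESTS = 1000
CONCURRENCY = 50

def hit(_):
    """Issue one request and return its status code, or 'error' on failure."""
    try:
        return requests.get(URL, timeout=5).status_code
    except requests.RequestException:
        return "error"

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(hit, range(TOTAL_REQUESTS)))

# Summarize outcomes so error spikes during fault injection are visible.
for outcome in sorted(set(results), key=str):
    print(outcome, results.count(outcome))
```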

Conclusion

As we have seen, failure in any infrastructure is a given. Embracing this fact and architecting for failure ensures that applications are both available and resilient. Failure testing takes this a step further by revealing the points in a system that can break and how the application responds when they do. I hope this post provides some valuable insights. I’m curious about your thoughts! Feel free to leave comments, and happy architecting for failure.