{"id":103396,"date":"2022-02-14T17:23:21","date_gmt":"2022-02-14T22:23:21","guid":{"rendered":"https:\/\/jamesdevine.info\/?p=103396"},"modified":"2022-02-14T17:32:40","modified_gmt":"2022-02-14T22:32:40","slug":"architecting-for-failure-how-to-ensure-application-availability-and-resiliency","status":"publish","type":"post","link":"https:\/\/jamesdevine.info\/index.php\/2022\/02\/architecting-for-failure-how-to-ensure-application-availability-and-resiliency\/","title":{"rendered":"Architecting for failure: how to ensure application availability and resiliency"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">Werner Vogels, CTO of Amazon, said it best &#8220;Everything fails, all the time.&#8221; The statement is of course simple and obvious, yet also quite thought provoking. Infrastructure can and does fail for a myriad of reasons, e.g., natural failure rates of hardware, natural disasters, power, network, cooling. This means the applications that run the infrastructure must be set up in a way to tolerate these failures to maintain service level agreements (SLAs).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For the most part, Cloud Service Providers (CSPs) are able to provide incredibly high levels of availability, while largely avoiding outages. However when there are outages, they tend to hit hard and affect a large amount of customers. Those that have planned for failure often see little impact when they\u2019ve designed their application stack to tolerate transient failure.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">That brings us to the topic of architecting for failure. It might seem a bit ominous, but if you know failures can and will happen, you have all of the control to plan for them. 
In this post I\u2019ll dig into some perspectives and thoughts on architecting for failure.\u00a0<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">From shared on-premises infrastructure to cloud<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Before we dive into what this means specifically for cloud workloads, let\u2019s take a step back and look at the progression of infrastructure that got us to where we are today. Shared infrastructure is not a new concept; in fact, it\u2019s been around since the early days of mainframes. Server virtualization was the ultimate manifestation of shared infrastructure, one that VMware capitalized on for nearly 20 years. Growing server capacities have steadily enlarged the blast radius: a single server can now host hundreds of virtual machines.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Cloud greatly changed the game by moving this shared infrastructure to a place you can neither see nor control. Hyperscalers have economies of both scale and scope. Even the least mature providers can run infrastructure quite well, at pennies on the dollar compared with what a typical on-premises shop can achieve on its own.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The real game-changing capability of cloud is that applications can achieve regionally or even globally available infrastructure without the tremendous expense and complexity of building and operating data centers and colocation footprints. That complexity has shifted from the infrastructure layer to the application layer. Application architects must understand best practices for availability and durability, because in the cloud you can&#8217;t simply rely on shared infrastructure to fail applications over.
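<\/span><\/p>\n<p><span style=\"font-weight: 400;\">As a small illustration of application-layer resilience, the sketch below retries a flaky operation with exponential backoff and jitter, so transient infrastructure failures surface as brief delays rather than errors. This is a minimal sketch; the function and operation names are purely illustrative.<\/span><\/p>\n

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.1):
    # Retry a transiently failing operation with exponential backoff.
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            # Back off base_delay * 2^(attempt-1), with jitter so many
            # clients do not retry in lockstep after a shared outage.
            delay = base_delay * (2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.0))

# Illustrative stand-in: an operation that fails twice, then succeeds.
state = {'calls': 0}

def flaky_operation():
    state['calls'] += 1
    if state['calls'] < 3:
        raise ConnectionError('transient failure')
    return 'ok'

print(call_with_retries(flaky_operation))  # prints: ok
```

\n<p><span style=\"font-weight: 400;\">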
Application architectures must be rooted in clearly defined requirements and designed to meet or exceed them.\u00a0<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Understanding requirements<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">A common theme across many workloads is a lack of business requirements around availability and resiliency. Yet such requirements are exactly what should drive an application architecture. Clearly defined requirements help ensure that an appropriate architecture is chosen, and they greatly aid in cost optimizing it.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The two main metrics of application availability and resiliency that drive architecture are the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). Together they quantify the criticality of an application, especially in the event of infrastructure failures (because we know failure will happen). Let\u2019s think of these in their extremes. RPO is how much data an application can lose; RTO is how long an application can be down. It\u2019s tempting to say that both values are zero, but that describes a real-time system, one of the most expensive and complex architectures to build and operate.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">There\u2019s a lot that goes into systems that never lose data and never go down. The reality is that the vast majority of applications have non-zero values for one or both of RTO and RPO.
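<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To make the two metrics concrete, the toy checks below compare a backup schedule and a measured failover time against RPO and RTO targets. This is an illustrative sketch; the function names and the numbers are invented for the example.<\/span><\/p>\n

```python
def meets_rpo(backup_interval_minutes, rpo_minutes):
    # Worst case, a failure strikes just before the next backup runs,
    # so potential data loss equals the full backup interval.
    return backup_interval_minutes <= rpo_minutes

def meets_rto(measured_failover_minutes, rto_minutes):
    # The failover time you have actually tested must fit within the RTO.
    return measured_failover_minutes <= rto_minutes

print(meets_rpo(60, 15))   # hourly snapshots vs a 15-minute RPO: False
print(meets_rpo(5, 15))    # 5-minute log shipping vs the same RPO: True
print(meets_rto(20, 240))  # 20-minute tested failover vs a 4-hour RTO: True
```

\n<p><span style=\"font-weight: 400;\">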
When the RTO and RPO metrics for an application reflect its value to the business, the result is, at the very least, a clearer understanding of an appropriate architecture.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The absolute worst case of this I\u2019ve personally seen was an environment where the infrastructure team claimed a \u201cblanket 24-hour RTO and RPO.\u201d The application owners had quite different actual requirements for their most critical applications, and in reality there was no ability to meet even a 24-hour objective. When a failure lasting over 24 hours struck, there was not even a run book for executing a manual failover. Needless to say, there was an outage, critical business data was lost, and there was no failover to the disaster recovery site.\u00a0<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Failure testing<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">This naturally raises the question: \u201chow can I ensure my application is available and resilient?\u201d Regardless of architecture, failure testing is key. Some places call it chaos engineering, but whatever the name, it\u2019s essentially failure testing, and it is critical to ensuring application availability and resiliency.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In a data center this can mean unplugging power and network cables. This is something I did quite often before deploying systems into production on-premises. If you understand the fault domains and how hardware fails, you can design appropriately to meet requirements.
In a cloud environment it takes on a different meaning, but it\u2019s easy enough to find the virtual power button and break even managed cloud services.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Failure testing falls largely into two camps: manual and automated.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Manual<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Testing should happen with any architecture before deploying into production, and doing so manually is one of the easiest ways to test for common failure patterns. Practically, this means shutting VMs\/instances down, hand-editing hostnames and routes to make traffic fail, and anything else you can do to inject failure, real or simulated.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This kind of testing should certainly be done before an application goes into production. After that, it may become part of annual disaster recovery testing, but it should still be executed with some frequency to ensure components of an application fail in ways that are understood and predictable.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Automated<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Kicking the manual approach up a notch, application teams are increasingly looking to chaos engineering. This fully embraces the notion of failure and injects unpredictable failure into a production application. Netflix pioneered this approach in the cloud with its open source project<\/span><a href=\"https:\/\/netflix.github.io\/chaosmonkey\/\"><span style=\"font-weight: 400;\"> Chaos Monkey<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">A word of caution: don\u2019t deploy tools that break infrastructure into the wild without thoughtful testing and consideration.
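<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To give a flavor of what Chaos-Monkey-style tooling does at its core, the toy sketch below randomly selects a termination victim from a fleet while honoring an exclusion list. Everything here is illustrative, not taken from any real tool; an actual implementation would call the cloud provider\u2019s API to stop the chosen instance.<\/span><\/p>\n

```python
import random

def pick_chaos_victim(instances, excluded, rng=random):
    # Randomly select one instance to kill, never touching instances
    # on the exclusion list (e.g. a database primary you are not yet
    # prepared to lose). Returns None if nothing is safe to kill.
    candidates = [name for name in instances if name not in excluded]
    if not candidates:
        return None
    return rng.choice(candidates)

fleet = ['web-1', 'web-2', 'web-3', 'db-primary']
victim = pick_chaos_victim(fleet, excluded={'db-primary'})
print(victim)  # prints one of web-1, web-2, web-3
```

\n<p><span style=\"font-weight: 400;\">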
It\u2019s not for every application, but it has certainly helped Netflix prepare for the kind of unpredictable failures that happen at scale, up to and including weathering some fairly public outages.\u00a0<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Testing under load<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Another point that\u2019s easy to gloss over is ensuring that you test failover under load. Failures tend to happen when you least expect them, often when a system is already under stress, and the additional stress of a failure can topple an application. To observe and understand how a system will react to failures, it is important to test under load.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">There are a variety of tools that can assist with load testing. <\/span><a href=\"https:\/\/github.com\/newsapps\/beeswithmachineguns\"><span style=\"font-weight: 400;\">Bees with machine guns<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/www.selenium.dev\/\"><span style=\"font-weight: 400;\">Selenium<\/span><\/a><span style=\"font-weight: 400;\">, and <\/span><a href=\"https:\/\/jmeter.apache.org\/\"><span style=\"font-weight: 400;\">JMeter<\/span><\/a><span style=\"font-weight: 400;\"> are just a few of the open source options. One of the benefits of cloud is that you can build an incredibly large test bed and really test the resiliency of your application under load.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Conclusion<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">As we have seen, failure in any infrastructure is a given. Embracing this concept and architecting for failure ensures that applications are both available and resilient. Failure testing takes this a step further, revealing the points where a system can break and how the application responds when it does. I hope this post provides some valuable insights. I\u2019m curious about your thoughts!
Feel free to leave comments and happy architecting for failure.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Werner Vogels, CTO of Amazon, said it best &#8220;Everything fails, all the time.&#8221; The statement is of course simple and [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":103397,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_uf_show_specific_survey":0,"_uf_disable_surveys":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1,22,23,26,138],"tags":[],"class_list":["post-103396","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-general-information","category-systems","category-topics-in-virtualization","category-server","category-aws"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2022\/02\/failure-scaled.jpeg?fit=2560%2C1707&ssl=1","jetpack-related-posts":[{"id":321,"url":"https:\/\/jamesdevine.info\/index.php\/2010\/05\/performance-report-in-the-virtual-infrastructure-client\/","url_meta":{"origin":103396,"position":0},"title":"Performance Report in the Virtual Infrastructure Client","author":"James Devine","date":"May 2, 2010","format":false,"excerpt":"VMware vCenter server reports a lot of performance information and displays tables in the Virtual\u00a0Infrastructure\u00a0client. They provide a nice at a glace view, but do not allow for anything more. While poking around the GUI I found a feature to export the\u00a0performance\u00a0data to Excel by going to file-reports-performance. 
This is\u2026","rel":"","context":"In &quot;General&quot;","block_context":{"text":"General","link":"https:\/\/jamesdevine.info\/index.php\/category\/general-information\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/performance.jpg?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":253,"url":"https:\/\/jamesdevine.info\/index.php\/2009\/10\/your-phone-google-and-the-cloud\/","url_meta":{"origin":103396,"position":1},"title":"Your Phone, Google, and the Cloud","author":"James Devine","date":"October 9, 2009","format":false,"excerpt":"Google has had sync available for quite some time, but up until recently it has only allowed for contacts and calendars to be synchronized between your phone and Google.The feature has been a great and allowed users to easily back their data up to the \"cloud\" where it will forever\u2026","rel":"","context":"In &quot;General&quot;","block_context":{"text":"General","link":"https:\/\/jamesdevine.info\/index.php\/category\/general-information\/"},"img":{"alt_text":"googlesync2","src":"https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2009\/10\/googlesync2-300x196.jpg?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":541,"url":"https:\/\/jamesdevine.info\/index.php\/2021\/12\/five-traits-of-highly-effective-solution-architects\/","url_meta":{"origin":103396,"position":2},"title":"Five Traits of Highly Effective Solution Architects","author":"James Devine","date":"December 29, 2021","format":false,"excerpt":"The role of Solutions Architect is one of the most versatile, challenging, and rewarding positions I've personally had. 
I've worked alongside hundreds of such folks across roles and companies and think the following five traits are at the heart of all those that I would consider highly effective.","rel":"","context":"In &quot;General&quot;","block_context":{"text":"General","link":"https:\/\/jamesdevine.info\/index.php\/category\/general-information\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2021\/12\/good-sa-illustrate-scaled.jpeg?fit=1200%2C674&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2021\/12\/good-sa-illustrate-scaled.jpeg?fit=1200%2C674&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2021\/12\/good-sa-illustrate-scaled.jpeg?fit=1200%2C674&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2021\/12\/good-sa-illustrate-scaled.jpeg?fit=1200%2C674&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2021\/12\/good-sa-illustrate-scaled.jpeg?fit=1200%2C674&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":134,"url":"https:\/\/jamesdevine.info\/index.php\/2009\/03\/exchange-2007-and-active-directory\/","url_meta":{"origin":103396,"position":3},"title":"Exchange 2007 and Active Directory","author":"James Devine","date":"March 20, 2009","format":false,"excerpt":"As part of a project I am working on for my internship with MITRE I was tasked with building a Domain containing a Server 2003 Domain Controller, exchange 2007 Server, Microsoft Office Sharepoint Services (MOSS) 2007 Server, and SQL Server 2005. 
Each service was installed in a server 2003 virtual\u2026","rel":"","context":"In &quot;Windows&quot;","block_context":{"text":"Windows","link":"https:\/\/jamesdevine.info\/index.php\/category\/windows\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":330,"url":"https:\/\/jamesdevine.info\/index.php\/2010\/05\/getting-hadoop-mapreduce-0-20-2-running-on-ubuntu\/","url_meta":{"origin":103396,"position":4},"title":"Getting Hadoop MapReduce 0.20.2 Running On Ubuntu","author":"James Devine","date":"May 9, 2010","format":false,"excerpt":"I decided to setup a Hadoop cluster and write a MapReduce job \u00a0for my distrbuted systems final project. I had done this before with an earlier release and it was fairly straight forward. It turns out it is still straight forward with Hadoop 0.20.2, but the process is not well\u2026","rel":"","context":"In &quot;General&quot;","block_context":{"text":"General","link":"https:\/\/jamesdevine.info\/index.php\/category\/general-information\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/network.jpg?resize=350%2C200","width":350,"height":200},"classes":[]}],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/posts\/103396","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/comments?post=103396"}],"version-history":[{"count":5,"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/posts\/103396\/revisions"}],"predecessor-version":[{"id":103450,"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/posts\/103396\/revisions\/103450"}
],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/media\/103397"}],"wp:attachment":[{"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/media?parent=103396"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/categories?post=103396"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/tags?post=103396"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}