You Should Plan to Fail

Registration is free. Login or register to view/download this content.

Author(s)

Business Relationship Manager - Product Lifecycle Management, Chevron Corporation

If you’ve ever listened to Amazon.com’s CTO Werner Vogels speak at a conference, one of his favorite subjects is talking about architecting for failure. In fact, he co-authored a paper that says “The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems.”

Message Queues as Transmorgifier

Many enterprise architectures make heavy use of message queues; however their potential is far greater than simple message delivery. In fact, this was an epiphany when I sat down to really understand one of our Web Services, known as Amazon Simple Queue, that software developers around the world use to build Web Scale applications. On the surface it sounded pedestrian: the product detail Web page described it as “a reliable, highly scalable hosted queue for storing messages as they travel between computers”.

Looking at the implementation details, I realized two things:

First, message queues are great “impedance matching devices”. In a high-scale Services-Oriented Architecture, you can’t make any assumptions about the speed of the other end of a connection. Your application doesn’t know the relative speed of another system component in any loosely-coupled architecture. In fact, the application doesn’t even know if the other end of a communication channel is available at all.

Using a service that guarantees message delivery enables applications to “fire and forget”. The other endpoint will handle the message asynchronously. Maybe in 2 milliseconds; maybe in 2 days. It shouldn’t matter to the sending application (assuming the task doesn’t require synchronous communication). Why, for example, should your customer be forced to wait for the (relatively slow) credit card back end system to acknowledge that their credit card is valid? If instead your application puts the validation in a queue, then the application can generate an email back to the customer when there is an exception such as an invalid expiration date.

However it was my second realization that really helped me understand what “everything fails” means in practice.

An ordinary message queue delivers a message and then immediately deletes it.

This particular message queue system delivers the message and marks it as invisible in the list of available messages. That’s because the queue assumes that the receiving application failed before processing the message! In fact if the receiving application does not return and explicitly delete the message out of the queue within a prescribed period of time, Amazon SQS makes the message available again.

My initial reaction was that the approach seemed inefficient. Then I realized that the implementation amounts to automatic rollbacks: if the receiving application fails to delete the message, it is available to try again! The approach eliminated all sorts of exception logic in program code, and is brilliant in its simplicity.

Virtual Failure

Servers are a bit different than components such as a message queue service.

Virtual servers are no more or less reliable that physical ones; assuming that all other variables are equal. However there is one major difference: when virtual servers fail, you often lose the virtual disk drive attached to it. Think of this as loosely analogous to “RAID 0 with no persistence”- that is, non-redundant disk drives.

At the most basic level this characteristic requires incremental backups to persistent storage. However it’s really an opportunity to think about what “everything fails” really means. Because a large pool of virtual servers is essentially “infinite capacity”, I prefer to think of them as transitory computing resources at my command to bring on and offline in minutes. Essentially that means that computing resources are a separate and decoupled pool from data resources: really the SOA model applied to what we traditionally considered to be hardware.

Academia has been thinking for some time about computing capacity as a resource instead of hardware. In the Grid Computing community, there is a project known as Condor that assigns “jobs” to resources, with a particular match between resource and job based on the characteristics of the job. For example, a large financial model might require a 64-bit environment that will be available to ten hours. Another job might be a low-priority inventory analysis program that can be pre-empted, and requires a 32-bit resource for three days. Condor seems reminiscent of mainframe batch jobs; yet in reality it is all about distributed computing.

Then there’s Hadoop (lucene.apache.org/hadoop/), which is an open source project framework intended to process large datasets on commodity-grade hardware, with the assumption that the hardware will fail. It slices computing tasks into smaller pieces, in order to distribute them across multiple computers—any one of which could fail and require that its assignment be re-run. Hadoop really shines when failure occurs, because any given task is small relative to the overall process; which in turn means that the “cost” of failure is low.

Hadoop isn’t just a research project. The New York Times used Hadoop in combination with virtual servers to generate 11 million PDF files in just 24 hours.

How Do You Plan to Fail?

There are as many approaches to application-level resiliency as there are architects who design systems in this manner; with only a few are presented here. The point is that designing for failure produces self-healing reliability at an affordable price.

How many components in your architecture are likely candidates for re-architecture using “everything fails” as guiding principal? Every organization has unique needs, and for that reason there are no patent solutions. Fortunately SOA is the set of building blocks needed to create a resilient and scalable architecture.

Similar Resources

Featured Certificate: BPM Specialist

Everyone starts here.

You're looking for a way to improve your process improvement skills, but you're not sure where to start.

Earning your Business Process Management Specialist (BPMS) Certificate will give you the competitive advantage you need in today's world. Our courses help you deliver faster and makes projects easier.

Your skills will include building hierarchical process models, using tools to analyze and assess process performance, defining critical process metrics, using best practice principles to redesign processes, developing process improvement project plans, building a center of excellence, and establishing process governance.

The BPMS Certificate is the perfect way to show employers that you are serious about business process management. With in-depth knowledge of process improvement and management, you'll be able to take your business career to the next level.

Learn more about the BPM Specialist Certificate

Courses

  •  

 

Certificates

  • Business Process Management Specialist
  • Earning your Business Process Management Specialist (BPMS) Certificate will provide you with a distinct competitive advantage in today’s rapidly evolving business landscape. With in-depth knowledge of process improvement and management, you’ll be able to take your business career to the next level.
  • BPM Professional Certificate
    Business Process Management Professional
  • Earning your Business Process Management Professional (BPMP) Certificate will elevate your expertise and professional standing in the field of business process management. Our BPMP Certificate is a tangible symbol of your achievement, demonstrating your in-depth knowledge of process improvement and management.

Certification

BPM Certification

  • Make the most of your hard-earned skills. Earn the respect of your peers and superiors with Business Process Management Certification from the industry's top BPM educational organization.

Courses

 

Certificates

  • Operational Excellence Specialist
  • Earning your Operational Excellence Specialist Certificate will provide you with a distinct advantage in driving organizational excellence and achieving sustainable improvements in performance.
 

 

OpEx Professional Certificate

  • Operational Excellence Professional
  • Earn your Operational Excellence Professional Certificate and gain a competitive edge in driving organizational excellence and achieving sustainable improvements in performance.

Courses

Certificate
  •  

  • Agile BPM Specialist
  • Earn your Agile BPM Specialist Certificate and gain a competitive edge in driving business process management (BPM) with agile methodologies. You’ll gain a strong understanding of how to apply agile principles and concepts to business process management initiatives.  
 

Business Architecture

 

Certificates

  • Business Architecture Specialist
  • The Business Architecture Specialist (BAIS) Certificate is proof that you’ve begun your business architecture journey by committing to the industry’s most meaningful and credible business architecture training program.

  • Business Architecture Professional
  • When you earn your Business Architecture Professional (BAIP) Certificate, you will be able to design and implement a governance structure for your organization, develop and optimize business processes, and manage business information effectively.

BA CertificationCertification

  • Make the most of your hard-earned skills. Earn the respect of your peers and superiors with Business Architecture Certification from the industry's top BPM educational organization.

Courses

 

Certificates

  • Digital Transformation Specialist
  • Earning your Digital Transformation Specialist Certificate will provide you with a distinct advantage in today’s rapidly evolving business landscape. 
 

 

  • Digital Transformation Professional
  • The Digital Transformation Professional Certificate is the first program in the industry to cover all the key pillars of Digital Transformation holistically with practical recommendations and exercises.

Courses

Certificate

  • Agile Business Analysis Specialist
  • Earning your Agile Business Analysis Specialist Certificate will provide you with a distinct advantage in the world of agile software development.

Courses

Certificate
  • DAS Certificate
  • Decision Automation Specialist
  • Earning your Decision Automation Certificate will empower you to excel in the dynamic field of automated decision-making, where data-driven insights are pivotal to driving business innovation and efficiency.