The AWS Well-Architected Framework and Operational Excellence

A well-architected software system is not only secure, reliable, and cost-optimized; it also gives companies insight into their own operations and supports the continuous improvement of the processes and procedures behind it in order to maximize value. This capability is known as operational excellence. In this article, we will discuss operational excellence, one of the foundational pillars of the AWS (Amazon Web Services) Well-Architected Framework, and why it matters.

Principles of Operational Excellence

As with the other foundational pillars, several design principles must be applied in order to achieve operational excellence in a software system. These principles include:

  • Using code to drive operations.
  • Frequently applying small, reversible changes.
  • Frequently refining procedures.
  • Anticipating and planning for failure.
  • Learning from failure.

Using code to drive operations, rather than relying on human intervention, allows companies to avoid issues stemming from human error. When operations procedures are written as code and their execution is automated through triggers, they can be thoroughly tested before being placed in production. The code also serves as a written record to refer back to when questions arise or changes are needed.
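As a minimal sketch of what "procedures as code, run by triggers" can look like, the Python below registers a procedure against an event name and records every run. The names used here (`Runbook`, `rotate_logs`, the `disk_usage_reported` event, and the 80% threshold) are hypothetical and chosen only for illustration; they are not part of any AWS service or API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Runbook:
    """Maps an event name to the automated procedure that handles it."""
    handlers: Dict[str, Callable[[dict], str]] = field(default_factory=dict)
    audit_log: List[str] = field(default_factory=list)

    def on(self, event_name: str, handler: Callable[[dict], str]) -> None:
        """Register the procedure to run when event_name fires."""
        self.handlers[event_name] = handler

    def trigger(self, event_name: str, detail: dict) -> str:
        """Run the registered procedure and keep a record of the outcome."""
        handler = self.handlers.get(event_name)
        result = handler(detail) if handler else "no procedure registered"
        self.audit_log.append(f"{event_name}: {result}")
        return result


def rotate_logs(detail: dict) -> str:
    # Hypothetical procedure: act only when disk usage crosses 80%.
    if detail["disk_used_pct"] >= 80:
        return "rotated logs"
    return "no action needed"


runbook = Runbook()
runbook.on("disk_usage_reported", rotate_logs)
outcome = runbook.trigger("disk_usage_reported", {"disk_used_pct": 91})
```

Because the procedure is an ordinary function, it can be exercised with test inputs before it ever runs against production, and the audit log provides the written record described above.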

Agile companies have long since learned that it is better to routinely apply small changes to independent blocks of software than to apply numerous changes to an entire system at once. When changes are small and frequent, an adverse event is easier to trace to its cause and easier to reverse; a large change set may require a top-to-bottom review of the system just to discover where the issue is occurring.
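One way to picture a small, reversible change is a single setting that is updated, checked, and put back if the check fails. The sketch below assumes an in-memory configuration dictionary and a caller-supplied health check; `apply_reversibly` and the `max_connections` setting are hypothetical names used only for illustration.

```python
from typing import Callable, Dict


def apply_reversibly(
    config: Dict[str, int],
    key: str,
    new_value: int,
    is_healthy: Callable[[Dict[str, int]], bool],
) -> bool:
    """Apply one setting change; restore the previous state if unhealthy."""
    had_key = key in config
    previous = config.get(key)

    config[key] = new_value
    if is_healthy(config):
        return True  # change kept

    # Roll back exactly the one thing that was changed.
    if had_key:
        config[key] = previous
    else:
        del config[key]
    return False


def within_capacity(cfg: Dict[str, int]) -> bool:
    # Hypothetical health check: the system tolerates at most 200 connections.
    return cfg["max_connections"] <= 200


config = {"max_connections": 100}
kept_small = apply_reversibly(config, "max_connections", 150, within_capacity)
kept_large = apply_reversibly(config, "max_connections", 500, within_capacity)
```

Since each call touches a single setting, a failed check points directly at the change that caused it, and the rollback does not disturb anything else.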

Companies that strive for operational excellence in their software systems should frequently review their procedures and look for ways to refine them. For example, a company that evaluates its procedures consistently won't be ambushed by a backlog of changes that must be completed before its busy season.

It's always best to be prepared for anything, including failure. Companies that routinely test their failure response procedures will generally outperform companies that, in essence, believed their ship couldn't sink. Learning from failure is another hallmark of operational excellence, as new ideas, streamlined procedures, and creative solutions are often born from experiencing failure.

Operational Excellence: Best Practices

As with the other foundational pillars of the AWS Well-Architected Framework, there are best practices to observe in order to fulfill the principles upon which operational excellence rests. These best practices fall into four areas:

  • Organization.
  • Preparation.
  • Operation.
  • Evolution.

Organizing for software system success goes far beyond gathering a well-qualified software team. The needs of customers must be continually evaluated, along with any guidelines or obligations defined by internal governance, external regulatory compliance requirements, and industry standards. Potential threats to the business, including those not directly related to software systems, must be managed as well. These threats may take the form of competing interests, business risks, or liabilities. Organizing risks goes beyond simply identifying them. It also means recording them in a risk registry so they can be properly monitored.
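A risk registry can be as simple as a list of recorded risks that can be ranked for review. The sketch below is one possible shape, not a prescribed format: the 1-to-5 likelihood and impact scales, the likelihood-times-impact score, and the example entries are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Risk:
    description: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)
    owner: str       # who is responsible for monitoring this risk

    @property
    def score(self) -> int:
        # A common simple heuristic: likelihood multiplied by impact.
        return self.likelihood * self.impact


class RiskRegistry:
    def __init__(self) -> None:
        self._risks: List[Risk] = []

    def add(self, risk: Risk) -> None:
        self._risks.append(risk)

    def by_priority(self) -> List[Risk]:
        """Return risks ordered from highest score to lowest."""
        return sorted(self._risks, key=lambda r: r.score, reverse=True)


registry = RiskRegistry()
registry.add(Risk("Key vendor contract lapses", 2, 4, "procurement"))
registry.add(Risk("New data-retention regulation", 3, 5, "compliance"))
registry.add(Risk("Single engineer owns the deploy process", 4, 3, "engineering"))

top_risk = registry.by_priority()[0]
```

Note that two of the three example entries are not software risks at all, which reflects the point above that threats outside the software system belong in the registry too.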

Preparing for operational excellence means designing a system's workload so that it provides all the information (e.g., metrics, traces, events, and logs) that teams will need to understand its internal state. Preparation also includes collecting measurements in order to monitor the health of the workload, identify potential issues, and draft solutions in response.
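To illustrate a workload that reports its own internal state, the sketch below emits each event as one JSON log line carrying a trace ID, so that related events can be correlated later. The function name `make_event` and the field names are hypothetical; a real workload would typically send such records to whatever logging or monitoring service the team uses.

```python
import json
import time
import uuid
from typing import Optional


def make_event(name: str, trace_id: Optional[str] = None, **fields) -> str:
    """Return one JSON log line describing something the workload did."""
    record = {
        "timestamp": time.time(),
        "event": name,
        # Reusing one trace ID across events lets them be tied together.
        "trace_id": trace_id or uuid.uuid4().hex,
    }
    record.update(fields)  # caller-supplied measurements, e.g. latency
    return json.dumps(record, sort_keys=True)


line = make_event("order_placed", trace_id="abc123", latency_ms=42, status="ok")
parsed = json.loads(line)
```

Structured records like this are easier to search and aggregate than free-form text, which is what makes them useful for monitoring workload health.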

How does one define and measure operational success? It can only be determined by defining expected outcomes and then using metrics to measure results against them. At a minimum, this means measuring the health of the workload and of its supporting activities. It should also cover areas such as customer satisfaction and the degree to which the system supports the business, as opposed to being a burden or a distraction from the company's mission.
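Defining expected outcomes and measuring against them can be sketched as a set of named targets compared with observed values. The metric names, thresholds, and observed figures below are invented for illustration; each company would substitute its own definition of success.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Target:
    name: str
    threshold: float
    higher_is_better: bool

    def is_met(self, observed: float) -> bool:
        if self.higher_is_better:
            return observed >= self.threshold
        return observed <= self.threshold


def evaluate(targets: List[Target], observed: Dict[str, float]) -> Dict[str, bool]:
    """Report, per target, whether the observed value meets it."""
    return {t.name: t.is_met(observed[t.name]) for t in targets}


# Hypothetical targets spanning workload health and customer outcomes.
TARGETS = [
    Target("availability_pct", 99.9, higher_is_better=True),
    Target("p95_latency_ms", 300.0, higher_is_better=False),
    Target("customer_satisfaction", 4.5, higher_is_better=True),
]

observed = {
    "availability_pct": 99.95,
    "p95_latency_ms": 340.0,
    "customer_satisfaction": 4.6,
}

report = evaluate(TARGETS, observed)
```

In this made-up example the latency target is missed while the other two are met, which shows why a single health metric is not enough to judge operational success.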

If there is one constant in software, it's that it is always changing and improving. Embracing this continual evolution means providing safe environments and time to experiment, develop, test, and learn from failure.

Want to know more about operational excellence and how it supports a well-architected framework? Please contact us.
