Establishing Feedback Loops Using the AWS Well-Architected Framework Review

The AWS Well-Architected Framework assists customers in creating a secure, high-performing, resilient, and efficient infrastructure for their applications and workloads. The Well-Architected (WA) Tool was launched in 2018 to enable customers to evaluate their workloads against the best practices outlined by the AWS Well-Architected Framework. The report generated from a self-assessment using the AWS Well-Architected Tool provides recommendations for enhancing your workloads.

These suggestions often stem from the need to establish and continually refine processes based on collected data, such as performance metrics and operational logs. A crucial aspect is the clear definition and documentation of these processes.

In this blog post, we’ll guide you on how to enhance your overall architecture by implementing Feedback Loops informed by the AWS Well-Architected Review.

Overview of the AWS Well-Architected Framework and Its Recommendations

The AWS Well-Architected Framework is structured around five pillars: Operational Excellence, Security, Reliability, Performance Efficiency, and Cost Optimization. Each pillar is accompanied by a set of General Design Principles and architectural best practices. When customers conduct an AWS Well-Architected Review utilizing the AWS WA Tool, they respond to a series of questions, prompting AWS to offer recommendations that align with best practices based on their answers.

These recommendations frequently pertain to the establishment of business processes. For instance, a recommendation under the Selection category of the Performance Efficiency pillar encourages customers to define a process for making architectural decisions, urging experimentation and benchmarking with the services applicable to their workload.

Understanding Feedback Loops in the Context of the AWS Well-Architected Framework

In essence, Feedback Loops serve as a means to measure and assess the achievement of outcomes against established baselines, enabling appropriate actions to be taken in response to the feedback received.

We introduce Feedback Loops, which comprise four steps:

  1. Documentation: Create playbooks that specify the requirements the workload must meet. This step also reviews the workflows—manual or automated—that should be incorporated into workload management.
  2. Information Gathering: Establish a system to collect relevant data. The documentation/playbooks specify what will be monitored and the types of metrics or logs needed.
  3. Thresholds and Events: Set threshold values that trigger events and alerts. These events should initiate automatic or manual processes and be utilized for reporting purposes.
  4. Response: This step involves documenting reactions in the playbooks. The outcomes of these actions are then used to refine the original documentation and playbooks, completing the Feedback Loop.
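The four steps above can be sketched as a simple loop. This is an illustrative sketch only; the names (`Playbook`, `run_feedback_loop`) and the threshold values are hypothetical and not part of any AWS API:

```python
from dataclasses import dataclass, field

@dataclass
class Playbook:
    """Step 1: documented requirements and responses (illustrative)."""
    thresholds: dict                               # metric name -> maximum acceptable value
    responses: dict = field(default_factory=dict)  # metric name -> action taken

def run_feedback_loop(playbook, metrics):
    """Steps 2-4: gather data, check thresholds, respond, and refine the playbook."""
    alerts = []
    for name, value in metrics.items():            # Step 2: information gathering
        limit = playbook.thresholds.get(name)
        if limit is not None and value > limit:    # Step 3: thresholds and events
            alerts.append(name)
            # Step 4: record the response so the playbook improves next iteration
            playbook.responses[name] = f"investigated {name}={value} (limit {limit})"
    return alerts

playbook = Playbook(thresholds={"p99_latency_ms": 250, "error_rate": 0.01})
alerts = run_feedback_loop(playbook, {"p99_latency_ms": 310, "error_rate": 0.002})
```

Each pass through the loop both reacts to the current data and feeds the outcome back into the documentation, which is what closes the loop.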

The Feedback Loop is driven by analyzing failed procedures as well as changes to infrastructure or code.

Illustrating the Feedback Loop for the Performance Efficiency Pillar of the AWS Well-Architected Framework

The Performance Efficiency pillar emphasizes the effective and efficient use of compute resources as demand fluctuates and technologies advance. We will explore best practices in this area and correlate them with the steps of the Feedback Loop.

Selection

The initial inquiries for the Performance Efficiency pillar focus on the selection process for optimal architecture and solutions:

  • PERF 1: How do you select the best performing architecture?
  • PERF 2: How do you select your compute solution?
  • PERF 3: How do you select your storage solution?
  • PERF 4: How do you select your database solution?
  • PERF 5: How do you select your networking solution?

The AWS Well-Architected Framework highlights the significance of a data-driven selection process. AWS solutions architects, reference architectures, and the AWS Partner Network can assist during the information gathering stage. For your compute solution, it’s crucial to consider workload performance and cost, while also adapting to changing demands. When selecting storage and database solutions, access patterns and data storage choices are vital. The network interconnects all components and significantly impacts performance. Key factors such as bandwidth, jitter, latency, and throughput must be taken into account to meet system requirements.

The selection process primarily involves researching options and selecting a solution to deploy your workload. Documenting the rationale behind these decisions is essential, and this process should be continually reviewed due to evolving requirements, new services, and feature launches.

Establish performance metrics for compute, storage, and database, as well as network performance, to verify that requirements are met. Additionally, develop a metrics dashboard for monitoring and set alerts for any threshold breaches or errors. Based on this data, the documentation should be updated in the subsequent iteration of the Feedback Loop. The Performance Efficiency pillar also advocates for benchmarking and load-testing to ensure outcomes meet expectations.
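One way to express such a threshold alert is as an Amazon CloudWatch alarm that notifies an SNS topic on breach. The sketch below only builds the alarm parameters; the alarm name, threshold, and SNS topic ARN are illustrative placeholders you would replace with values from your own playbook:

```python
# Build the parameters for a CloudWatch alarm on average EC2 CPU utilization.
# All names, the threshold, and the topic ARN are illustrative placeholders.
alarm_params = {
    "AlarmName": "workload-cpu-high",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Statistic": "Average",
    "Period": 300,                  # evaluate in 5-minute windows
    "EvaluationPeriods": 3,         # require three consecutive breaches
    "Threshold": 80.0,              # percent CPU, taken from the playbook
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:workload-alerts"],
}

# With AWS credentials configured, the alarm would be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```

Requiring several consecutive evaluation periods before alarming helps avoid alerting on transient spikes while still catching sustained breaches.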

Review

The relevant question here is:

  • PERF 6: How do you evolve your workload to take advantage of new releases?

This inquiry begins with research to identify when and where new releases are announced, followed by establishing a process in the playbook to address these updates. This should be complemented by setting up alerts for those new services and features.

Monitoring

The subsequent question in the Performance Efficiency pillar is:

  • PERF 7: How do you monitor your resources to ensure they are performing?

This aspect has already been addressed through the establishment of Feedback Loops within the Selection process. Additionally, implementing Feedback Loops in one area may also fulfill aspects of other related recommendations.

Tradeoffs

Finally, the last question is:

  • PERF 8: How do you use tradeoffs to improve performance?

This focuses on balancing consistency, durability, and latency against performance efficiency. After adjusting the workload, monitor the impact of those changes using the established metrics. Conduct load tests to ensure the workload meets the new requirements, leveraging the metrics set up in the selection process.
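A minimal way to check a tradeoff against the documented baseline is to compare load-test results with the requirements from the playbook. This helper and its metric names are hypothetical, shown only to illustrate the comparison:

```python
def meets_requirements(measured, required):
    """Compare load-test results against documented requirements.

    `measured` and `required` map metric names to values; a metric passes
    when the measured value does not exceed the documented limit.
    (Illustrative helper, not part of any AWS SDK.)
    """
    return {name: measured.get(name, float("inf")) <= limit
            for name, limit in required.items()}

# After a tradeoff such as adding a cache, re-run the load test and compare:
required = {"p99_latency_ms": 200, "error_rate": 0.01}
results = meets_requirements({"p99_latency_ms": 150, "error_rate": 0.02}, required)
```

Here the change improved latency but pushed the error rate past its limit, so the Feedback Loop would trigger a response and an update to the playbook.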

Example Architecture for Remediation of an Amazon EC2 Instance in a Faulty State

As an example, consider an architecture that includes an Amazon Elastic Compute Cloud (Amazon EC2) instance running an application. We will outline a Feedback Loop based on the Operational Excellence pillar’s recommendation, such as “OPS 10: How do you manage workload and operations events?”

We define a faulty state for that workload and the recovery process. In AWS, Amazon EC2 instances automatically report host metrics to Amazon CloudWatch. Additionally, we implement custom metrics for the application, which are used to trigger events when the workload enters a faulty state. We configure Amazon CloudWatch to forward these events to an Amazon Simple Notification Service (SNS) topic, notifying relevant stakeholders.
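Publishing such a custom application metric can look like the following. Only the parameter dictionary is built here; the namespace, metric name, and instance ID are illustrative placeholders:

```python
# Parameters for publishing a custom application health metric to CloudWatch.
# The namespace, metric name, and dimension values are illustrative.
metric_data = {
    "Namespace": "MyApplication",
    "MetricData": [{
        "MetricName": "HealthCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        "Value": 1.0,          # 1 = faulty state observed, 0 = healthy
        "Unit": "Count",
    }],
}

# With AWS credentials configured, the metric would be published with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(**metric_data)
# A CloudWatch alarm on HealthCheckFailed can then forward the event to an
# SNS topic, notifying stakeholders or triggering automated recovery.
```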
