How Amazon IXD – VGT2 Implemented a Data Mesh Architecture to Enhance Their Enterprise Data Platform

April 2024: This article has been reviewed for accuracy.

This blog post is a collaboration with Jordan Lee, Samantha Roberts, and Chanci Turner from Amazon IXD – VGT2.

Most contemporary organizations understand that their data serves the entire enterprise. While data is valuable to the individual business process that generates it, its true potential can be unlocked when shared and integrated with other data assets.

Unlike many resources, data does not lose value with use; the same data can serve many applications. The more combinations of data an organization creates, such as integrating reference data with operational data, the greater the value it derives from enterprise-wide visibility, real-time analytics, and more accurate AI and machine learning (ML) predictions. Companies that excel at sharing data internally, within legal parameters, can extract more value from their data than those that do not.

However, as with any resource, data management risks must be addressed, particularly in regulated industries. Strong controls mitigate risks, which means organizations with robust data governance frameworks are less exposed to potential pitfalls than those lacking them.

This creates a paradox: data that is freely shareable across the enterprise can provide immense value but poses greater risks if not managed properly. To harness the value of data, we must resolve this contradiction—enabling seamless sharing while enforcing appropriate controls.

Amazon IXD – VGT2 is adopting a two-part strategy to tackle this challenge. First, we define data products, curated by people who understand the data’s management requirements, permissible uses, and limitations. Second, we implement a data mesh architecture that aligns our data technology with those data products.

This integrated approach achieves several goals:

  • Empowers data product owners to make informed management decisions regarding their data.
  • Enforces these decisions by sharing data in place rather than duplicating it.
  • Ensures clear visibility into where data is being shared across the enterprise.

Let’s first explore what a data mesh is, followed by how this architecture supports our data product strategy and facilitates our business operations.

Aligning Data Architecture with Data Product Strategy

Amazon IXD – VGT2 comprises various lines of business (LOBs) and corporate functions (CFs). To streamline data access for consumers across these LOBs and CFs while maintaining necessary controls, we are adopting a data product strategy.

Data products are groups of related data sourced from the systems that support our operations. Each is a cohesive collection of data stored in a dedicated, product-specific data lake. Each lake is physically separated from the others and has its own cloud-based storage layer, and we use cloud services to catalog and structure the data within it. For example, Amazon Simple Storage Service (Amazon S3) can provide the storage layer, and a data integration service such as AWS Glue can catalog and structure the data.

Consumer application domains host services that utilize the data. These applications are kept physically distinct from one another and from the data lakes. When a data consumer requires information from one or more lakes, we employ cloud services to render lake data visible, coupled with other cloud services to enable direct querying from the lakes. Tools like the AWS Glue Data Catalog can enhance data visibility, while AWS Lake Formation ensures secure data sharing, and Amazon Athena allows interactive queries.

The data product-specific lakes and their corresponding application domains form the data mesh—a network of distributed data nodes designed for security, high availability, and easy discoverability.

Empowering the Right Individuals for Decision-Making

Our data mesh architecture lets each data product lake be managed by a team of data product owners who know their domain. They make risk-based decisions about how their data is managed.

When a consumer application seeks data from a product lake, the application team identifies the necessary information through our enterprise-wide data catalog. This catalog is consistently updated by the processes transferring data to the lakes, ensuring it accurately reflects current data availability.

The catalog enables the consumption team to discover and request the data they need. Because each lake is curated by experts who understand the data, requests can be decided quickly, which shortens the wait for the consumption team.
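
To make the discover-and-request flow concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not the catalog service described above: the class names, the lake and table names, and the consumer name are all hypothetical, and a real deployment would rely on a managed catalog rather than a Python dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one data set in a product lake (no data is stored here)."""
    lake: str
    table: str
    owner_team: str
    columns: list[str]


@dataclass
class AccessRequest:
    consumer: str
    lake: str
    table: str
    status: str = "pending"  # decided later by the owning team


@dataclass
class EnterpriseCatalog:
    entries: dict[tuple[str, str], CatalogEntry] = field(default_factory=dict)
    requests: list[AccessRequest] = field(default_factory=list)

    def register(self, entry: CatalogEntry) -> None:
        """Called by the ingestion process each time it loads data into a lake."""
        self.entries[(entry.lake, entry.table)] = entry

    def search(self, keyword: str) -> list[CatalogEntry]:
        """Find data sets whose table or column names mention the keyword."""
        kw = keyword.lower()
        return [
            e for e in self.entries.values()
            if kw in e.table.lower() or any(kw in c.lower() for c in e.columns)
        ]

    def request_access(self, consumer: str, entry: CatalogEntry) -> AccessRequest:
        """Record a request so the owning team can decide on it."""
        req = AccessRequest(consumer, entry.lake, entry.table)
        self.requests.append(req)
        return req


# Hypothetical example: ingestion registers a table, a consumer finds and requests it.
catalog = EnterpriseCatalog()
catalog.register(CatalogEntry(
    lake="reference-data-lake",
    table="counterparty",
    owner_team="reference-data-owners",
    columns=["counterparty_id", "legal_name", "lob"],
))
hits = catalog.search("legal_name")
request = catalog.request_access("risk-reporting-app", hits[0])
```

The point of the sketch is that registration happens as a side effect of loading data, so the catalog never lags behind what the lakes actually hold, and the request carries enough information to route it to the owning team.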

Enforcing Control Decisions through In-Place Consumption

The data mesh facilitates sharing data from product lakes rather than replicating it in consumer applications. This approach not only reduces storage costs but also minimizes data discrepancies between production and consumption systems, ensuring that analytics, AI/ML, and reporting draw from up-to-date and accurate data.

Moreover, by keeping data within the lake, it becomes easier to enforce the decisions made by data product owners. For instance, if owners opt to tokenize specific data types in their lake, consumers will only access tokenized values, eliminating control gaps created by copies of untokenized data outside the lake.
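
The following Python sketch illustrates why tokenizing once, inside the lake, closes that gap. The article does not say how tokenization is performed, so the keyed-hash (HMAC-SHA256) approach here is an assumption chosen for brevity; the key, the column names, and the function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would be held in a managed secret store.
TOKEN_KEY = b"example-key-held-by-the-lake-owners"

# Hypothetical owner decision: which columns must be tokenized in this lake.
TOKENIZED_COLUMNS = {"account_number", "tax_id"}


def tokenize(value: str, key: bytes = TOKEN_KEY) -> str:
    """Return a deterministic token for a sensitive value using a keyed hash."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


def load_into_lake(record: dict[str, str]) -> dict[str, str]:
    """Apply the owners' tokenization decision once, as data enters the lake."""
    return {
        col: tokenize(val) if col in TOKENIZED_COLUMNS else val
        for col, val in record.items()
    }


raw = {"account_number": "1234567890", "tax_id": "98-7654321", "branch": "NYC-01"}
stored = load_into_lake(raw)
```

Because every consumer reads `stored` in place, none of them ever handles the raw account number, yet equal inputs still produce equal tokens, so joins across tokenized columns keep working.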

However, in-place consumption requires more sophisticated access controls than copied data does. Visibility must be restricted at a fine grain: down to specific columns, records, or even individual values. For example, if a system from one of our LOBs queries a pool of firm-wide reference data shared through a lake, it may be granted access only to the data relevant to that line of business.
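
As a rough illustration of column- and record-level restriction, the Python sketch below applies a per-consumer grant at read time. It models the idea only; the `Grant` type, the table contents, and the consumer name are hypothetical, and in a managed service such as AWS Lake Formation comparable restrictions would be configured as data filters rather than written in application code.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, object]


@dataclass(frozen=True)
class Grant:
    """A hypothetical owner-approved grant for one consumer on one table."""
    consumer: str
    columns: frozenset[str]            # column-level restriction
    row_filter: Callable[[Row], bool]  # record-level restriction


def read_in_place(rows: list[Row], grant: Grant) -> list[Row]:
    """Return only the rows and columns that the grant allows."""
    return [
        {col: row[col] for col in row if col in grant.columns}
        for row in rows
        if grant.row_filter(row)
    ]


# Hypothetical firm-wide reference data shared through a lake.
reference_data = [
    {"entity_id": "E1", "lob": "retail", "legal_name": "Acme Ltd", "credit_limit": 5000},
    {"entity_id": "E2", "lob": "markets", "legal_name": "Borealis Inc", "credit_limit": 90000},
    {"entity_id": "E3", "lob": "retail", "legal_name": "Cobalt LLC", "credit_limit": 12000},
]

# A retail system sees retail records only, and never the credit_limit column.
retail_grant = Grant(
    consumer="retail-onboarding-app",
    columns=frozenset({"entity_id", "lob", "legal_name"}),
    row_filter=lambda row: row["lob"] == "retail",
)
visible = read_in_place(reference_data, retail_grant)
```

Two consumers with different grants can query the same physical table and receive different views of it, which is what makes sharing without copying workable.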

Providing Cross-Enterprise Visibility of Data Consumption

Traditionally, data exchanges between systems occurred either directly or through message queues. Without a central, automated repository of all data flows, data product owners struggled to see when their data was exchanged between systems.

Our data mesh architecture addresses this visibility challenge with a cloud-based mesh catalog that makes the relationships between lakes and data consumers visible. A service such as the AWS Glue Data Catalog can fill this role.

This catalog does not store data but offers insights into which lakes are sharing data with which consumers. It serves as a single visibility point for data flows across the enterprise, giving data product owners confidence in their data management.
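
A minimal Python sketch of such a registry follows. It holds only sharing relationships, never the data itself. The class, method, lake, and consumer names are hypothetical and are not taken from any AWS API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Share:
    """One sharing relationship: a table in a lake made visible to a consumer."""
    lake: str
    table: str
    consumer: str


class MeshCatalog:
    """Tracks which lakes share which tables with which consumers."""

    def __init__(self) -> None:
        self._shares: set[Share] = set()

    def record_share(self, lake: str, table: str, consumer: str) -> None:
        self._shares.add(Share(lake, table, consumer))

    def consumers_of(self, lake: str) -> dict[str, list[str]]:
        """Owner's view: map each table in the lake to the consumers reading it."""
        by_table: dict[str, list[str]] = defaultdict(list)
        for share in self._shares:
            if share.lake == lake:
                by_table[share.table].append(share.consumer)
        return {table: sorted(names) for table, names in sorted(by_table.items())}

    def lakes_read_by(self, consumer: str) -> list[str]:
        """Consumer's view: every lake this consumer draws data from."""
        return sorted({s.lake for s in self._shares if s.consumer == consumer})


mesh = MeshCatalog()
mesh.record_share("reference-data-lake", "counterparty", "risk-reporting-app")
mesh.record_share("reference-data-lake", "counterparty", "retail-onboarding-app")
mesh.record_share("reference-data-lake", "calendar", "risk-reporting-app")
mesh.record_share("trades-lake", "executions", "risk-reporting-app")

owner_view = mesh.consumers_of("reference-data-lake")
consumer_view = mesh.lakes_read_by("risk-reporting-app")
```

Because every share is recorded in one place, the owners of a lake can answer "who is reading my data?" with a single lookup instead of tracing point-to-point feeds.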
