Data lineage plays a crucial role in the governance strategy of data lakes. Understanding data lineage ensures that businesses base their decisions on accurate and reliable data. While data catalogs assist with metadata management and provide search functionality, data lineage reveals the relationships between data sources, traces the origins of the data, and illustrates how it is transformed and consolidated. Various stakeholders within the data lake ecosystem benefit from data lineage:
- Data scientists can monitor and trace the data journey from source to destination, simplifying the assessment of the quality and origin of specific metrics or datasets.
- Data platform engineers gain deeper insights into data pipelines and the interdependencies among datasets.
- Engineers can more readily implement and validate changes in data pipelines, as they can pinpoint a job’s upstream dependencies and downstream applications, allowing for proper evaluation of service impacts.
As the complexity of data environments increases, organizations encounter significant challenges in efficiently and consistently capturing lineage data. This article outlines a three-step process for establishing an automated data lineage solution for data lakes: capturing lineage, modeling and storage, and visualization.
Our solution captures both coarse-grained and fine-grained data lineage. Coarse-grained lineage, which is primarily geared towards business users, focuses on mapping high-level business processes and overall data workflows. This type of lineage typically clarifies and visualizes the relationships between datasets and how they traverse through various storage tiers, encompassing extract, transform, and load (ETL) jobs, as well as operational data. In contrast, fine-grained lineage provides insights at the column level, detailing the transformation steps in processing and analytical pipelines.
Solution Overview
Apache Spark has become a leading engine for large-scale data processing in data lakes. Our approach employs the Spline agent to capture runtime lineage data from Spark jobs, facilitated by AWS Glue. We utilize Amazon Neptune, a graph database specifically designed for managing and querying highly interconnected datasets, to model lineage data for analysis and visualization.
For those interested in tracking lineage for machine learning workloads, AWS announced Amazon SageMaker ML Lineage Tracking at re:Invent 2021. This capability integrates with SageMaker Pipelines, capturing and storing details about the steps of automated ML workflows, from data preparation to model deployment. With this tracking information, you can reproduce workflow steps, track lineage for models and datasets, and strengthen model governance and auditing standards.
The accompanying diagram outlines the solution architecture. We employ AWS Glue Spark ETL jobs for data ingestion, transformation, and loading. The Spline agent is integrated into each AWS Glue job to capture lineage and runtime metrics, transmitting this data to a lineage REST API. This backend features producer and consumer endpoints, supported by Amazon API Gateway and AWS Lambda functions. The producer endpoints handle incoming lineage objects before they are stored in the Neptune database. We utilize consumer endpoints to extract specific lineage graphs for various visualizations in the front-end application, enabling ad hoc interactive analysis through Neptune notebooks.
To assist in deploying this solution to the AWS Cloud, we provide sample code and Terraform deployment scripts on GitHub.
Data Lineage Capturing
The Spline agent is an open-source tool that automatically captures data lineage from Spark jobs at runtime, eliminating the need to modify existing ETL code. It monitors Spark’s query run events, extracts lineage objects from job run plans, and sends them to a preconfigured backend (such as HTTP endpoints). Additionally, the agent automatically gathers job run metrics, including the number of output rows. As of this writing, the Spline agent is compatible only with Spark SQL (DataSet/DataFrame APIs), but not with RDDs/DynamicFrames.
The integration of the Spline agent with AWS Glue Spark jobs is depicted in the following screenshot. The Spline agent, which is an uber JAR, must be included in the Java classpath. The setup requires specific configurations:
- The spark.sql.queryExecutionListeners property registers a Spline listener during Spark initialization.
- The spark.spline.producer.url property specifies the HTTP server address to which the Spline agent sends lineage data.
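The following sketch shows one way to wire these settings into an AWS Glue job with boto3. The JAR path, job name, role, and API URL are placeholders, and chaining two Spark properties through a single --conf job parameter is a commonly used Glue workaround rather than an official interface; the listener class name is the one documented by the Spline agent project.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names: replace the JAR path, job name, role, and API URL with your own.
SPLINE_JAR = "s3://my-artifacts-bucket/jars/spark-spline-agent-bundle.jar"
LINEAGE_API_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod"

spark_conf = (
    # Register the Spline listener at Spark initialization ...
    "spark.sql.queryExecutionListeners="
    "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener"
    # ... and point the agent at the lineage producer endpoints.
    f" --conf spark.spline.producer.url={LINEAGE_API_URL}"
)

glue.update_job(
    JobName="sales-etl-job",
    JobUpdate={
        "Role": "GlueJobRole",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://my-artifacts-bucket/scripts/sales_etl.py",
        },
        "GlueVersion": "3.0",
        "DefaultArguments": {
            "--extra-jars": SPLINE_JAR,  # puts the Spline uber JAR on the Java classpath
            "--conf": spark_conf,
        },
    },
)
```

Because the agent hooks into Spark through a query execution listener, no changes to the ETL script itself are required.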
We develop a data lineage API that is compatible with the Spline agent, facilitating lineage data insertion into the Neptune database and enabling graph extraction for visualization. The Spline agent requires three HTTP endpoints:
- /status – For health checks
- /execution-plans – For submitting captured Spark execution plans post-job submission
- /execution-events – For transmitting job run metrics upon job completion
We also establish additional endpoints to manage various metadata associated with the data lake, such as storage layer names and dataset classifications.
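To illustrate the producer side, here is a minimal, simplified sketch of an AWS Lambda handler behind API Gateway that routes the three Spline endpoints listed above. The handler and storage function names are hypothetical, and persistence into Neptune is stubbed out rather than reproducing the sample code's actual implementation.

```python
import json

def handler(event, context):
    """Route Spline agent requests coming through API Gateway (proxy integration)."""
    path = event.get("path", "")
    method = event.get("httpMethod", "GET")

    if path.endswith("/status") and method in ("GET", "HEAD"):
        # Health check performed by the Spline agent at initialization.
        return {"statusCode": 200, "body": ""}

    if path.endswith("/execution-plans") and method == "POST":
        # Execution plan extracted from the Spark logical plan at job submission.
        store_execution_plan(json.loads(event["body"]))
        return {"statusCode": 201, "body": ""}

    if path.endswith("/execution-events") and method == "POST":
        # Run metrics (for example, the number of output rows) sent on job completion.
        store_execution_events(json.loads(event["body"]))
        return {"statusCode": 201, "body": ""}

    return {"statusCode": 404, "body": "not found"}


def store_execution_plan(plan):
    """Placeholder: persist the plan as vertices and edges in Neptune."""
    ...


def store_execution_events(events):
    """Placeholder: attach run metrics to the corresponding job run vertex."""
    ...
```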
When a Spark SQL statement is executed or a DataFrame action is triggered, Spark’s optimization engine, known as Catalyst, generates various query plans: a logical plan, an optimized logical plan, and a physical plan, all of which can be examined using the EXPLAIN statement. During a job run, the Spline agent analyzes the logical plan to create a JSON lineage object, which includes:
- A unique job run ID
- A reference schema (attribute names and data types)
- A list of operations
- Additional system metadata, such as the Spark version and the Spline agent version
A run plan outlines the steps the Spark job performs, from reading the data sources, through the intermediate transformations, to writing the job’s output to a storage location.
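The following illustrative PySpark job (the table paths and column names are invented for the example) shows the kind of pipeline the agent observes; calling explain() prints the same Catalyst plans, logical, optimized logical, and physical, that the Spline agent mines for lineage at runtime.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-etl").getOrCreate()

# Read from the raw storage tier (hypothetical locations).
orders = spark.read.parquet("s3://raw-zone/orders/")
customers = spark.read.parquet("s3://raw-zone/customers/")

daily_sales = (
    orders.join(customers, "customer_id")            # join
          .filter(F.col("status") == "COMPLETED")    # filter
          .groupBy("order_date", "region")           # aggregate
          .agg(F.sum("amount").alias("total_amount"))
)

# Print the logical, optimized logical, and physical plans that Catalyst generates.
daily_sales.explain(mode="extended")

# Write to the curated storage tier; this final step anchors the lineage output.
daily_sales.write.mode("overwrite").parquet("s3://curated-zone/daily_sales/")
```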
In summary, the Spline agent captures not only job metadata (like job name and execution date and time), input and output tables (including data format, physical location, and schema), but also detailed business logic information (SQL-like operations such as join, filter, project, and aggregate).
Data Modeling and Storage
Data modeling commences with business requirements and use cases, mapping these needs into a structure for data storage and organization. In the context of data lineage for data lakes, the relationships between data assets (jobs, tables, and columns) are as crucial as their metadata. Therefore, graph databases are effective for modeling such interconnected entities, facilitating the understanding of complex relationships within the data.
Neptune is a fast, reliable, and fully managed graph database service that simplifies building and running applications with highly connected datasets. It allows the creation of sophisticated, interactive graph applications capable of querying billions of relationships in milliseconds. Neptune supports three prominent graph query languages: Apache TinkerPop Gremlin and openCypher for property graphs, and SPARQL for the W3C’s RDF data model. In our solution, we employ the property graph primitives (including vertices, edges, labels, and properties) to model objects and utilize the gremlinpython library for graph interaction.
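As a rough sketch of how lineage entities might be modeled and queried with gremlinpython, consider the following. The Neptune endpoint, vertex labels, and edge labels are illustrative placeholders, not the exact schema used by the sample code.

```python
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __

# Hypothetical Neptune cluster endpoint.
conn = DriverRemoteConnection(
    "wss://my-neptune-cluster.cluster-xyz.us-east-1.neptune.amazonaws.com:8182/gremlin", "g"
)
g = traversal().withRemote(conn)

# Model a job and its input/output tables as vertices ...
job = g.addV("Job").property("name", "sales-etl-job").property("runId", "run-001").next()
src = g.addV("Table").property("name", "raw.orders").next()
dst = g.addV("Table").property("name", "curated.daily_sales").next()

# ... and the data flow between them as edges.
g.V(src.id).addE("READ_BY").to(__.V(job.id)).iterate()
g.V(job.id).addE("WRITES_TO").to(__.V(dst.id)).iterate()

# Downstream lineage of raw.orders: follow the outgoing edges from the source table.
downstream = g.V(src.id).out("READ_BY").out("WRITES_TO").values("name").toList()
print(downstream)  # ['curated.daily_sales']

conn.close()
```

Traversals like the last one are what the consumer endpoints and Neptune notebooks run to extract a lineage subgraph for visualization.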