How ENGIE Expands Their Data Ingestion Pipelines with Amazon MWAA

ENGIE, a major utility provider in France and a key player in the zero-carbon energy transition, is involved in the production, transportation, and sale of electricity, gas, and energy services. With a workforce of 160,000 globally, ENGIE operates as a decentralized organization with 25 business units that enjoy significant autonomy. This decentralized structure has resulted in a vast accumulation of data across its global customer base, necessitating a more intelligent and cohesive approach to ensure that the data is ingestible, organized, governed, sharable, and actionable across its various business units.

In 2018, the leadership at ENGIE made a strategic decision to fast-track its digital transformation through enhanced data utilization and innovation to become fully data-driven. Chief Digital Officer Pierre Dupont articulated the company’s mission: “For ENGIE, sustainability is both our foundation and our ultimate goal. We assist major corporations and cities worldwide in their transition to zero carbon as swiftly as possible, as this is indeed the foremost challenge facing humanity today.”

Like many large enterprises, ENGIE employs various extract, transform, and load (ETL) tools to funnel data into their AWS data lake. However, these tools often come with costly licensing fees. Gregory Martin, the Chief Technology Officer overseeing ENGIE’s data initiatives, noted, “We required a standardized approach to data collection and analysis to assist our clients in optimizing their value chains.” ENGIE sought a license-free solution that would seamlessly integrate with various technologies and feature a continuous integration and continuous delivery (CI/CD) pipeline to facilitate scalability in their ingestion processes.

To address these challenges, ENGIE began utilizing Amazon Managed Workflows for Apache Airflow (Amazon MWAA), migrating data from multiple sources, including on-premises applications and ERPs, AWS services such as Amazon Redshift, Amazon RDS, and Amazon DynamoDB, and external services such as Salesforce, to a centralized data lake built on Amazon Simple Storage Service (Amazon S3).

Amazon MWAA is specifically employed to consolidate and store harmonized operational and corporate data from diverse on-premises and Software as a Service (SaaS) data sources into a unified data lake. This data lake serves as a “group performance cockpit,” enabling efficient data-driven analyses and facilitating informed decision-making by ENGIE’s management team.

In this article, we outline how ENGIE established a CI/CD pipeline for an Amazon MWAA project template using an AWS CodeCommit repository, which is integrated with AWS CodePipeline to automate the building, testing, and packaging of code and custom plugins. Notably, we developed a custom plugin to ingest data from Salesforce, drawing inspiration from the open-source Airflow Salesforce plugin.

Solution Overview

The accompanying diagrams depict the architecture of the implemented Amazon MWAA environment and its associated pipelines, as well as the customer’s use case of ingesting Salesforce data into Amazon S3.

The architecture is fully deployed via infrastructure as code (IaC) and includes the following components:

  • Amazon MWAA environment – A customizable environment that includes plugins and requirements, configured securely.
  • Provisioning pipeline – The administrative team can manage the Amazon MWAA environment using this CI/CD provisioning pipeline, which incorporates a CodeCommit repository linked to CodePipeline for continuous updates of the environment, along with its plugins and requirements.
  • Project pipeline – This CI/CD pipeline features a CodeCommit repository that triggers CodePipeline to continually build, test, and deploy Directed Acyclic Graphs (DAGs) developed by users, making them available in the Amazon MWAA environment after deployment (see the sketch after this list).
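
As an illustration, the following minimal AWS CDK (Python) sketch shows the general shape of the project pipeline’s wiring. It is a simplified, hypothetical example rather than ENGIE’s actual stack: the repository name mwaa-project-dags is a placeholder, and the build project is assumed to pick up a buildspec.yml from the repository that runs the tests and copies the DAGs to the Amazon MWAA DAGs bucket.

from aws_cdk import Stack
from aws_cdk import aws_codebuild as codebuild
from aws_cdk import aws_codecommit as codecommit
from aws_cdk import aws_codepipeline as codepipeline
from aws_cdk import aws_codepipeline_actions as pipeline_actions
from constructs import Construct


class ProjectPipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Repository holding the user-developed DAGs
        repo = codecommit.Repository(
            self, "DagsRepo", repository_name="mwaa-project-dags"
        )

        # Build project that tests and packages the DAGs (buildspec.yml in the repo)
        build_project = codebuild.PipelineProject(self, "TestAndPackageDags")

        source_output = codepipeline.Artifact()
        codepipeline.Pipeline(
            self, "ProjectPipeline",
            stages=[
                codepipeline.StageProps(
                    stage_name="Source",
                    actions=[pipeline_actions.CodeCommitSourceAction(
                        action_name="Source",
                        repository=repo,
                        output=source_output,
                    )],
                ),
                codepipeline.StageProps(
                    stage_name="TestAndDeploy",
                    actions=[pipeline_actions.CodeBuildAction(
                        action_name="TestAndDeploy",
                        project=build_project,
                        input=source_output,
                    )],
                ),
            ],
        )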

The data ingestion workflow is illustrated in the following diagram, which includes these steps (a code sketch of the retrieve-and-store logic follows the list):

  1. The DAG is triggered by Amazon MWAA, either manually or on a schedule.
  2. Amazon MWAA computes the data collection parameters and determines the batch sizes.
  3. Amazon MWAA distributes the processing tasks among its workers.
  4. Data is retrieved from Salesforce in batches.
  5. Amazon MWAA assumes an AWS Identity and Access Management (IAM) role with the necessary permissions to store the collected data into the designated S3 bucket.
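
Steps 4 and 5 map naturally onto a custom Airflow operator. The following Python sketch shows what such an operator could look like, inspired by the open-source Salesforce provider; the connection IDs salesforce_default and aws_default are assumptions, and the real plugin’s batching, file-format, and error handling are omitted for brevity.

import json

from airflow.models import BaseOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.salesforce.hooks.salesforce import SalesforceHook


class SalesforceToS3Operator(BaseOperator):
    """Run a SOQL query against Salesforce and land the results in Amazon S3."""

    # Allow Jinja templating of the query and the destination key
    template_fields = ("query", "s3_key")

    def __init__(self, *, query, s3_bucket, s3_key,
                 salesforce_conn_id="salesforce_default",
                 aws_conn_id="aws_default", **kwargs):
        super().__init__(**kwargs)
        self.query = query
        self.s3_bucket = s3_bucket
        self.s3_key = s3_key
        self.salesforce_conn_id = salesforce_conn_id
        self.aws_conn_id = aws_conn_id

    def execute(self, context):
        # Step 4: retrieve the records from Salesforce
        sf_hook = SalesforceHook(salesforce_conn_id=self.salesforce_conn_id)
        response = sf_hook.make_query(self.query)

        # Step 5: store the collected data in the designated S3 bucket
        s3_hook = S3Hook(aws_conn_id=self.aws_conn_id)
        s3_hook.load_string(
            string_data=json.dumps(response["records"]),
            key=self.s3_key,
            bucket_name=self.s3_bucket,
            replace=True,
        )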

This AWS Cloud Development Kit (AWS CDK) construct is implemented with key security best practices:

  • Adhering to the principle of least privilege, permissions are granted only to resources or actions necessary for users to complete their tasks.
  • S3 buckets are deployed with security compliance measures, including encryption, versioning, and public access restrictions.
  • Authentication and authorization are managed using AWS Single Sign-On (AWS SSO).
  • Airflow securely stores connections to external sources either in its default secrets backend or alternative backends like AWS Secrets Manager or AWS Systems Manager Parameter Store.
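
For example, the S3 bucket compliance measures and a Secrets Manager backend can be expressed along the following lines in AWS CDK (Python). This is a hedged sketch, not the project’s actual code: the construct ID and the airflow/connections prefix are illustrative, while the secrets.backend option names are the ones Apache Airflow documents for its Secrets Manager backend.

from aws_cdk import aws_s3 as s3
from constructs import Construct


def add_data_lake_bucket(scope: Construct) -> s3.Bucket:
    # Encryption, versioning, and public access blocking, per the measures above
    return s3.Bucket(
        scope, "DataLakeBucket",
        encryption=s3.BucketEncryption.S3_MANAGED,
        versioned=True,
        block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
        enforce_ssl=True,
    )


# Airflow configuration options pointing the environment at Secrets Manager,
# passed to the Amazon MWAA environment at creation time
airflow_configuration_options = {
    "secrets.backend":
        "airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend",
    "secrets.backend_kwargs": '{"connections_prefix": "airflow/connections"}',
}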

For this discussion, we will detail a use case that involves ingesting data from Salesforce into ENGIE’s data lake to facilitate transformation and the creation of business reports.
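
A DAG for this use case could wire the operator sketched earlier into a daily schedule. Again, this is a hypothetical sketch: the module name, bucket name, and SOQL query are placeholders rather than ENGIE’s actual pipeline code.

from datetime import datetime

from airflow import DAG

# Hypothetical import of the custom operator sketched earlier
from salesforce_to_s3 import SalesforceToS3Operator

with DAG(
    dag_id="salesforce_to_data_lake",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_accounts = SalesforceToS3Operator(
        task_id="ingest_accounts",
        query="SELECT Id, Name, Industry FROM Account",
        s3_bucket="example-data-lake-bucket",  # placeholder bucket name
        s3_key="salesforce/account/{{ ds }}.json",
    )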

Prerequisites for Deployment

To follow this walkthrough, you will need the following prerequisites:

  • Basic familiarity with the Linux operating system.
  • Access to an AWS account with administrator or power user IAM role policies.
  • Access to a shell environment, such as AWS CloudShell.

Deploying the Solution

To deploy and execute the solution, follow these steps:

  1. Install AWS CDK.
  2. Bootstrap your AWS account.
  3. Define your AWS CDK environment variables.
  4. Deploy the stack.

Install AWS CDK

The solution is deployed entirely using AWS CDK, an open-source software development framework that allows you to model and provision cloud application resources with familiar programming languages. If you’re new to AWS CDK, you can start with the AWS CDK Workshop. Install AWS CDK with these commands:

npm install -g aws-cdk
# To verify installation
cdk --version

Bootstrap Your AWS Account

First, ensure the environment you plan to deploy the solution to has been bootstrapped. This only needs to be done once per environment for AWS CDK applications, and the command is safe to rerun if you’re unsure:

cdk bootstrap aws://YOUR_ACCOUNT_ID/YOUR_REGION

Define Your AWS CDK Environment Variables

On Linux or macOS, define your environment variables as follows:

export CDK_DEFAULT_ACCOUNT=YOUR_ACCOUNT_ID
export CDK_DEFAULT_REGION=YOUR_REGION

On Windows, use the following commands:

setx CDK_DEFAULT_ACCOUNT YOUR_ACCOUNT_ID
setx CDK_DEFAULT_REGION YOUR_REGION

Deploy the Stack

By default, the stack deploys a basic Amazon MWAA environment along with the associated pipelines described earlier. It creates a new VPC to host the Amazon MWAA resources.
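
With the environment variables defined and the account bootstrapped, deploy from the project root:

# Deploy the stack, reviewing and confirming the IAM changes when prompted
cdk deploy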
