Creating a Serverless Trigger-Based Data Movement Pipeline with Apache NiFi, DataFlow Functions, and AWS Lambda


Organizations today face diverse data processing challenges: collecting information from various sources, transforming it, and distributing it to meet different business requirements. The key challenge lies in managing data collection and distribution in a streamlined, secure, scalable, and cost-effective manner.

To address these challenges, Cloudera offers Cloudera DataFlow for the Public Cloud (CDF-PC). Cloudera is an AWS Data and Analytics Competency Partner that provides a fast, easy, and secure platform for customers to leverage data to tackle complex business problems.

CDF-PC is a cloud-native service for Apache NiFi integrated into the Cloudera Data Platform (CDP). This service empowers organizations to manage their data flows effectively and eliminate ingestion silos by enabling developers to connect to any data source, regardless of structure, process it, and send it to any destination—all through a low-code authoring experience.

With the capability to rapidly deliver data anywhere, businesses can process real-time data more efficiently, deploy applications quicker, and derive insights from data more effectively.

Recently, CDF-PC introduced support for DataFlow Functions (DFF), allowing developers to create data movement pipelines using Apache NiFi’s user-friendly Flow Designer and deploy them as serverless functions. With DFF, users can choose to deploy NiFi flows as long-running, auto-scaling Kubernetes clusters on Amazon EKS or as temporary functions on AWS Lambda. DFF offers a low-code user interface (UI) that increases agility and broadens the range of use cases.

In this article, we will explore how to construct a cost-effective, trigger-based, and scalable serverless application using NiFi flows to operate as Cloudera DataFlow functions within AWS Lambda.

Use Cases: Event-Driven, Batch, and Microservices

Since its inception in 2015, Cloudera DataFlow, powered by Apache NiFi, has been adopted by over 400 enterprise organizations to address their data distribution challenges. These scenarios often involve high-throughput streaming data that needs to be delivered to destinations with minimal latency, necessitating always-running clusters.

There are also use cases that do not require continuous NiFi flows. These include event-driven processing for object storage, microservices for serverless web applications, Internet of Things (IoT) data handling, asynchronous API gateway request management, batch file processing, and job automation using cron/timer scheduling.

For these specific scenarios, NiFi flows must be treated as jobs with a defined start and end. The trigger for these jobs could be an event such as a file being uploaded to Amazon S3, the initiation of a cron event, or an API gateway being accessed. Once the job is completed, the compute resources should be deactivated.
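To make the trigger model concrete, the sketch below shows the shape of a Lambda handler invoked by an S3 `ObjectCreated` event. This is an illustrative assumption, not the actual DataFlow Functions runtime (which wraps the NiFi flow for you); the bucket and key names are hypothetical.

```python
import json

def lambda_handler(event, context):
    """Handle an S3 'ObjectCreated' event notification.

    Each record identifies the bucket and object key that fired the
    trigger; a DataFlow Function would hand these to the NiFi flow.
    """
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # In a real deployment, the function fetches the object and runs
        # the flow against it; here we only record which object arrived.
        processed.append(f"s3://{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps(processed)}
```

Because the function only runs while an event is being handled, compute resources exist exactly for the duration of the job, which is the pay-for-compute property described above.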

One of the primary applications of DataFlow Functions is processing files upon their upload to Amazon S3 through S3-based event triggers. For instance, telemetry data from various sensors is batched into files and directed to a landing zone in an S3 bucket. Each line in the telemetry file corresponds to a single sensor event.

These telemetry files are sent periodically throughout the day. When a telemetry file arrives in S3, the events within must be routed, filtered, enriched, and transformed into Parquet format before being stored back in S3. Since the files are transmitted at intervals and do not necessitate constant resource availability, a genuine pay-for-compute model is essential. After processing, the function and all associated resources should be deactivated, and costs should only be incurred for the duration of the function’s execution. This use case calls for an economical solution, accepting lower throughput as the trade-off.

Key Functional Requirements:

  • Routing: Direct events in the telemetry file to various S3 locations based on the “eventSource” value.
  • Filtering: Exclude certain events based on specific criteria (e.g., speed > x).
  • Enrichment: Augment geo events with geographical data based on latitude/longitude using a lookup service.
  • Format Conversion: Transform events from JSON into Parquet format according to the specified schema.
  • Delivery: Ensure the filtered and enriched data in Parquet format is sent to different S3 locations.
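In the NiFi flow these requirements are expressed with processors rather than code, but the routing, filtering, and enrichment logic can be sketched in plain Python. The field names (`eventSource`, `speed`, `latitude`, `longitude`), the speed threshold, and the geo lookup table below are assumptions for illustration; the Parquet conversion step (typically done with a library such as pyarrow) is omitted.

```python
import json

SPEED_LIMIT = 100  # hypothetical threshold for the "speed > x" filter

def route_key(event):
    # Routing: choose an S3 prefix from the event's "eventSource" value.
    return f"processed/{event.get('eventSource', 'unknown')}/"

def enrich(event, geo_lookup):
    # Enrichment: attach geographical data looked up by lat/lon.
    coords = (event.get("latitude"), event.get("longitude"))
    event["geo"] = geo_lookup.get(coords, "unknown")
    return event

def process_telemetry_file(lines, geo_lookup):
    """Each line is one sensor event in JSON; returns events grouped by
    destination prefix. Format conversion to Parquet would happen
    before delivery and is not shown here."""
    routed = {}
    for line in lines:
        event = json.loads(line)
        if event.get("speed", 0) > SPEED_LIMIT:  # Filtering: drop fast events
            continue
        event = enrich(event, geo_lookup)
        routed.setdefault(route_key(event), []).append(event)
    return routed
```

The grouped output maps directly onto the delivery requirement: each key corresponds to a distinct S3 location that receives its own Parquet file.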

Key Non-Functional Requirements:

  • Agile Low-Code Development: Provide a low-code environment for developing processing logic, including strong SDLC capabilities for local development and testing with sample data before cloud deployment.
  • Serverless Operation: Run the telemetry processing code without the need to provision or manage infrastructure.
  • Trigger-Based Processing: Activate the processing code and related resources only when a new file is uploaded, shutting down all resources post-processing.
  • Pay for Compute Time: Charge only for the compute time consumed by the processing code, eliminating the requirement to provision infrastructure in advance for peak capacity.
  • Scalability: Accommodate any scale from processing a handful of files per day to hundreds of files each second.

Implementation

In the following use case, a company collects data from numerous telemetry sensors. The telemetry data is batched and sent to an S3 data lake throughout the day. This data needs to be processed, converted into Parquet format, and sent to another set of S3 buckets for further analytics. The objective is to provide an agile low-code development environment for building a cost-efficient, trigger-based, scalable serverless architecture.
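One way the S3-to-function trigger in this architecture could be wired is with S3 bucket notifications. The snippet below builds the notification configuration that tells S3 to invoke a Lambda function whenever a matching object lands in the bucket; the bucket name, function ARN, and key prefix/suffix are placeholders, and applying the configuration (commented out) requires boto3 and AWS credentials.

```python
# Hypothetical names; substitute your own bucket and function ARN.
LANDING_BUCKET = "telemetry-landing-zone"
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:telemetry-dff"

def build_notification_config(function_arn, prefix="telemetry/", suffix=".json"):
    """S3 notification configuration that invokes the given Lambda
    function whenever a matching object is created in the bucket."""
    return {
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": function_arn,
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": prefix},
                {"Name": "suffix", "Value": suffix},
            ]}},
        }]
    }

# Applying it against AWS would look like:
# import boto3
# boto3.client("s3").put_bucket_notification_configuration(
#     Bucket=LANDING_BUCKET,
#     NotificationConfiguration=build_notification_config(FUNCTION_ARN),
# )
```

The prefix/suffix filter keeps the trigger scoped to the landing-zone telemetry files, so writing processed Parquet output back to other prefixes in the same account does not re-trigger the function.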

Cloudera implemented this use case using key components and services:

  • Apache NiFi: Apache NiFi’s UI flow designer was used to develop and locally test the flow, which addressed the functional requirements for filtering, routing, enrichment, format conversion, and delivery of the transformed Parquet data to various S3 buckets.

