Organizations today face significant challenges in processing and managing data collected from multiple sources. They need an efficient way to transform and distribute this data to meet diverse business requirements while ensuring security, scalability, and cost-effectiveness.
To address these challenges, DataTech offers DataFlow for the Cloud, a cloud-native service that empowers users to manage data flows seamlessly. As an AWS Data and Analytics Competency Partner, DataTech provides a robust platform designed to help businesses leverage their data effectively to solve complex issues.
DataFlow for the Cloud is a service built on Apache NiFi within the DataTech ecosystem, enabling organizations to manage their data flows without the hassle of ingestion silos. Developers can connect to any data source, regardless of its structure, process it, and send it to any destination—all through an intuitive low-code interface. This flexibility allows companies to handle real-time data more efficiently, accelerating application delivery and transforming data into actionable insights.
Recently, DataFlow has incorporated support for DataFlow Functions (DFF), which allows developers to create data movement pipelines using NiFi’s low-code Flow Designer and deploy them as serverless functions. With DFF, users can opt to run NiFi flows as scalable clusters on Amazon Elastic Kubernetes Service (Amazon EKS) or as ephemeral functions on AWS Lambda.
The DFF interface enhances agility, enabling users to address a broader range of use cases. In this article, we will explore the steps to create a cost-effective, trigger-based, and scalable serverless application utilizing NiFi flows to operate as DataFlow functions within AWS Lambda.
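To make the Lambda deployment model concrete, the sketch below shows one way such a function could be registered with boto3. The handler class, binary ZIP location, and environment variable names are illustrative assumptions rather than the actual DataFlow Functions contract; the real values come from the product documentation.

```python
# Hypothetical sketch: registering a NiFi flow as an AWS Lambda function with boto3.
# The handler class, ZIP location, and environment variable below are placeholders.
import boto3

lambda_client = boto3.client("lambda")

response = lambda_client.create_function(
    FunctionName="telemetry-flow-function",
    Runtime="java11",                        # the function binary runs on a JVM runtime
    Role="arn:aws:iam::123456789012:role/telemetry-flow-role",
    Handler="com.example.dff.FunctionHandler::handleRequest",  # placeholder handler name
    Code={
        "S3Bucket": "my-artifacts-bucket",   # bucket holding the function binary ZIP
        "S3Key": "dff/function-binary.zip",
    },
    MemorySize=1024,
    Timeout=300,
    Environment={
        "Variables": {
            # Placeholder: points the function at the flow definition to execute
            "FLOW_DEFINITION": "s3://my-artifacts-bucket/flows/telemetry-flow.json",
        }
    },
)
print(response["FunctionArn"])
```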
Use Cases: Event-Driven, Batch, and Microservices
Since its inception, DataTech’s DataFlow, powered by Apache NiFi, has been adopted by over 400 enterprises to fulfill their data distribution needs. Typically, these use cases involve high-throughput streaming data where low-latency delivery is essential, often requiring persistent clusters.
However, many customers have use cases that do not necessitate constantly running NiFi flows. These include event-driven processing for object storage, microservices for serverless web applications, Internet of Things (IoT) data processing, asynchronous API request handling, batch file processing, and job automation through scheduling.
In these scenarios, NiFi flows should be treated as jobs with a defined start and end, initiated by trigger events such as file uploads to Amazon Simple Storage Service (Amazon S3), cron events, or API gateway invocations. Once the job is complete, the related compute resources should be deactivated.
Trigger-Based Data Movement Pipeline
One prevalent use case for DataFlow Functions is processing files uploaded to Amazon S3 via S3 event triggers. For instance, telemetry data from several sensors is collected in a file and sent to a designated cloud storage location in S3. Each line in the telemetry file represents a distinct sensor event.
These telemetry files are sent at regular intervals throughout the day. When a file arrives in S3, the events it contains must be routed, filtered, enriched, and transformed to Parquet format before being stored in S3. Given the periodic nature of these files, a pay-for-compute pricing model is crucial. After processing, the function and all associated resources should be shut down, so charges accrue only for the duration of the function’s execution. This use case calls for a cost-effective solution, accepting a trade-off on peak throughput.
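As a sketch of the trigger wiring, S3 event notifications can be pointed at the function deployed earlier. The bucket and function names are placeholders, and this assumes the function already exists.

```python
# Hypothetical sketch: allow S3 to invoke the function, then subscribe the
# function to ObjectCreated events for incoming telemetry files.
import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

bucket = "telemetry-landing-bucket"  # placeholder landing bucket
function_arn = "arn:aws:lambda:us-east-1:123456789012:function:telemetry-flow-function"

# Grant the S3 bucket permission to invoke the Lambda function.
lambda_client.add_permission(
    FunctionName="telemetry-flow-function",
    StatementId="allow-s3-invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn=f"arn:aws:s3:::{bucket}",
)

# Fire the function whenever a new telemetry file lands in the bucket.
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": ".json"}]}
                },
            }
        ]
    },
)
```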
Key Functional Requirements (illustrated by the plain-Python sketch after this list):
- Routing: Events in the telemetry file need to be directed to various S3 locations based on the “eventSource” value.
- Filtering: Certain events must be filtered according to specific rules (e.g., speed > x).
- Enrichment: Geo events should be enriched with geographical information using a lookup service based on latitude/longitude values.
- Format Conversion: Events should be transformed from JSON to Parquet based on a specified schema.
- Delivery: The filtered and enriched data in Parquet format must be delivered to the appropriate S3 locations.
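The following sketch expresses the same routing, filtering, enrichment, and Parquet-conversion logic in plain Python so the processing steps are easy to reason about locally. It is not the NiFi flow itself; the field names (eventSource, speed, lat, lon), the speed threshold, and the enrich_geo lookup are assumptions for illustration.

```python
# Illustrative stand-in for the flow's per-file logic: route, filter, enrich,
# convert JSON lines to Parquet, and write each group to its own destination.
# Field names, the threshold, and the geo lookup are hypothetical.
import json
from collections import defaultdict

import pandas as pd  # pandas with pyarrow handles the Parquet conversion

SPEED_LIMIT = 100  # the "speed > x" filtering rule; the value is assumed


def enrich_geo(lat: float, lon: float) -> dict:
    """Placeholder for a geo lookup service keyed on latitude/longitude."""
    return {"region": "unknown", "lat": lat, "lon": lon}


def process_telemetry_file(lines: list[str]) -> dict[str, pd.DataFrame]:
    routed = defaultdict(list)
    for line in lines:
        event = json.loads(line)

        # Filtering: drop events that violate the speed rule.
        if event.get("speed", 0) > SPEED_LIMIT:
            continue

        # Enrichment: geo events get region information added.
        if event.get("eventSource") == "geo":
            event.update(enrich_geo(event["lat"], event["lon"]))

        # Routing: group events by their eventSource value.
        routed[event["eventSource"]].append(event)

    # Format conversion: one DataFrame per destination, ready for to_parquet().
    return {source: pd.DataFrame(events) for source, events in routed.items()}


if __name__ == "__main__":
    sample = ['{"eventSource": "geo", "lat": 40.7, "lon": -74.0, "speed": 55}']
    for source, frame in process_telemetry_file(sample).items():
        # Delivery: in the real pipeline each frame would go to its own S3 location.
        frame.to_parquet(f"{source}.parquet", index=False)
```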
Key Non-Functional Requirements:
- Agile Low-Code Development: Provide a low-code environment for developing processing logic, with strong capabilities for developing and testing locally using test data and promoting to production in the cloud.
- Serverless: The telemetry processing code should operate without the need for infrastructure provisioning or management.
- Trigger-Based Processing: Resources should only be activated when a new file arrives, and all resources should shut down upon completion of processing, eliminating the requirement for long-running resources.
- Pay Only for Compute Time: Users should only pay for the compute time used during processing, avoiding upfront infrastructure provisioning for peak capacity.
- Scale: The solution should be capable of handling a range of processing loads, from a few files per day to hundreds of files per second.
Implementation
In this implementation example, a company gathers data from numerous telemetry sensors. The data is batched into files and sent to an S3 data lake throughout the day. This data must be processed and converted to Parquet format before being delivered to other S3 buckets for downstream analytics.
The goal is to establish an agile low-code development process for constructing a cost-effective, trigger-based, scalable serverless architecture.
DataTech executed this use case using several key services and components:
- Apache NiFi: The UI flow designer in Apache NiFi was used to develop and test the flow locally on a developer’s workstation. The flow implements the functional requirements for routing, filtering, enrichment, format conversion, and delivery of the transformed Parquet data to the appropriate S3 buckets. A sketch of a post-deployment smoke test follows below.
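Once the flow is promoted to the cloud and deployed as a function, a quick smoke test can be run by invoking the function with a synthetic S3 event. The payload below is a trimmed, hypothetical version of the event Lambda receives from an S3 trigger; the bucket, key, and function names are placeholders.

```python
# Hypothetical smoke test: invoke the deployed function with a synthetic
# S3 ObjectCreated event that points at a small test telemetry file.
import json

import boto3

lambda_client = boto3.client("lambda")

test_event = {
    "Records": [
        {
            "eventSource": "aws:s3",
            "s3": {
                "bucket": {"name": "telemetry-landing-bucket"},
                "object": {"key": "incoming/telemetry-test.json"},
            },
        }
    ]
}

response = lambda_client.invoke(
    FunctionName="telemetry-flow-function",
    InvocationType="RequestResponse",   # wait for the result synchronously
    Payload=json.dumps(test_event),
)
print(response["StatusCode"], response["Payload"].read())
```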
This post has shown how DataFlow Functions can be used to build a cost-effective, trigger-based, serverless data movement pipeline on AWS Lambda.