How TMAP Mobility Successfully Migrated 2.4 PB of Hadoop Data with AWS DataSync

Established in 2002, TMAP Mobility stands as Korea’s premier mobility platform, boasting 20 million registered users and 14 million active users each month. TMAP offers navigation services powered by extensive real-time traffic data. Previously, the Data Intelligence team at TMAP operated a mobility-data platform using a Hadoop Distributed File System (HDFS) in their data center, providing analytical services to various departments that required processed data. After discovering Amazon EMR, TMAP was impressed by the ease of managing Hadoop systems on AWS, prompting their decision to migrate to the cloud.

In this article, we detail how TMAP Mobility transitioned their on-premises Hadoop data to Amazon S3 using AWS DataSync, which simplified their data movement and transfers for efficient cloud processing.

Customer Challenge

TMAP Mobility aimed to transfer 2.4 PB of data from their on-premises Cloudera HDFS cluster to Amazon S3 within two months to kick off a data lake project that would use AWS managed analytics services for both real-time and batch processing. The Data Intelligence team, made up of data and machine learning engineers, needed a migration solution they could configure and manage on their own, one that would meet the tight launch deadline without disrupting existing services or causing any downtime in the Hadoop cluster. They also needed to handle both the initial copy and subsequent incremental changes to maintain service continuity.

Solution Overview

Before migrating, TMAP needed to weigh the advantages and disadvantages of online and offline migration options and choose an approach. Once the decision was made, the team configured AWS DataSync and completed the migration in about two months with two developers who had little prior experience with AWS DataSync.

Choosing a Migration Method

TMAP Mobility was managing approximately 7.2 PB of data on HDFS, including three replication copies. The team was unsure which of AWS's data transfer options would be the fastest and simplest for their case. For large data migrations, AWS DataSync and the AWS Snow Family are the primary choices: AWS DataSync handles online transfers, while AWS Snowball Edge serves offline needs. Data volume and available network bandwidth are the critical factors in deciding which service to use.

In TMAP's situation, they had 10 Gbps of AWS Direct Connect bandwidth and preferred not to handle physical devices, as most team members were focused on service planning and application development. They initially considered open-source migration tools such as DistCp and HDFS-FUSE, which are commonly used to transfer data to Amazon S3. However, these tools are either limited to single-threaded transfers or require custom scripts, so the team chose AWS DataSync, which uses a parallel, multithreaded architecture for faster data transfer. As a result, TMAP Mobility migrated their data to the cloud without handling physical devices or making configuration changes, following the AWS Storage Blog guide Using AWS DataSync to move data from Hadoop to Amazon S3.

Configuring AWS DataSync

TMAP began by installing a DataSync agent, which can run on VMware ESXi Hypervisor, Microsoft Hyper-V Hypervisor, Linux Kernel-based Virtual Machine (KVM), or an Amazon Elastic Compute Cloud (Amazon EC2) instance. Since their data center ran on VMware infrastructure, they deployed two DataSync agents on ESXi Hypervisor, with one actively transferring data and the other serving as a backup. Each agent was provisioned with 16 vCPUs, 32 GB of RAM, and 80 GB of disk space; transfers exceeding 20 million files require an agent with 64 GB of RAM, as detailed in the agent requirements documentation.

Following installation, the agent had to be activated to establish a secure connection between the agent and the DataSync service, which required opening specific network ports in the firewall. TMAP Mobility's Hadoop network sat behind the on-premises firewall, so outbound traffic on TCP ports 1024–1064 had to be allowed from the DataSync agent to the VPC endpoint. They also needed to open TCP port 443 to the entire subnet in which the VPC endpoint resides, because DataSync's data transfer network interfaces (ENIs) are assigned dynamic IP addresses within that subnet. For complete network requirements, refer to the network requirements documentation.
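For teams that prefer to script this step, agent activation against a VPC endpoint can also be done through the DataSync API. The following is a minimal boto3 sketch, not TMAP's actual configuration: the Region, activation key, VPC endpoint ID, subnet ARN, and security group ARN are all placeholder assumptions.

```python
import boto3

# Placeholder Region; TMAP's actual Region is not stated in this article.
datasync = boto3.client("datasync", region_name="ap-northeast-2")

# Activate the on-premises agent against a VPC endpoint so traffic stays on the
# private network (for example, over Direct Connect). The activation key is read
# from the agent's local console after it is deployed on the hypervisor.
response = datasync.create_agent(
    ActivationKey="EXAMPLE-ACTIVATION-KEY",   # placeholder
    AgentName="hdfs-migration-agent-1",
    VpcEndpointId="vpce-0123456789abcdef0",   # placeholder VPC endpoint
    SubnetArns=["arn:aws:ec2:ap-northeast-2:111122223333:subnet/subnet-0abc1234"],
    SecurityGroupArns=["arn:aws:ec2:ap-northeast-2:111122223333:security-group/sg-0abc1234"],
)
print(response["AgentArn"])
```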

Create Location

With the agent activated, the next step was to configure the DataSync locations for the task. In DataSync, a location represents a storage system or service that DataSync reads from or writes to; here, HDFS was the source location and Amazon S3 the destination. To give the agent access to HDFS, the Hadoop NameNode's RPC port (default 8020) had to be opened, and an authentication type had to be chosen. DataSync supports simple authentication and Kerberos authentication, and TMAP used simple, username-based authentication. The block size was left at the 128 MiB default, and the replication factor was set to 3, matching the 3-way replication of their on-premises HDFS.
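As a rough sketch of what that source configuration looks like through the API, the boto3 call below creates an HDFS location with simple authentication, a 128 MiB block size, and a replication factor of 3. The NameNode hostname, HDFS user, source path, and agent ARN are placeholders, not TMAP's values.

```python
import boto3

datasync = boto3.client("datasync", region_name="ap-northeast-2")  # placeholder Region

# HDFS source location using simple (username-based) authentication.
# Block size and replication factor mirror the settings described above.
hdfs_location = datasync.create_location_hdfs(
    NameNodes=[{"Hostname": "namenode.example.internal", "Port": 8020}],  # placeholder host
    AuthenticationType="SIMPLE",
    SimpleUser="hdfs",            # placeholder HDFS user
    BlockSize=134217728,          # 128 MiB
    ReplicationFactor=3,          # matches the 3-way replicated cluster
    Subdirectory="/warehouse",    # placeholder source path
    AgentArns=["arn:aws:datasync:ap-northeast-2:111122223333:agent/agent-0abc1234"],
)
print(hdfs_location["LocationArn"])
```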

Amazon S3 served as the destination location, requiring minimal configuration—simply specifying the bucket and path. Notably, they opted for the Intelligent-Tiering storage class in Amazon S3 to optimize costs.
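The destination side is similarly small. The sketch below creates an S3 location that writes objects directly into the Intelligent-Tiering storage class; the bucket name, prefix, and IAM role ARN are assumptions for illustration.

```python
import boto3

datasync = boto3.client("datasync", region_name="ap-northeast-2")  # placeholder Region

# S3 destination location writing into the Intelligent-Tiering storage class.
s3_location = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-datalake-bucket",  # placeholder bucket
    Subdirectory="/hdfs-migration",                      # placeholder prefix
    S3StorageClass="INTELLIGENT_TIERING",
    S3Config={
        # IAM role that grants DataSync access to the bucket (placeholder ARN).
        "BucketAccessRoleArn": "arn:aws:iam::111122223333:role/datasync-s3-access"
    },
)
print(s3_location["LocationArn"])
```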

Create Task

As previously noted, TMAP needed to transfer 2.4 PB of data out of the 7.2 PB total that included three replicas. When they first executed the task, it ran at full speed and consumed the entire 10 Gbps link, which disrupted production workloads. To address this, TMAP throttled the transfer using DataSync's built-in bandwidth controls, setting a limit of 400 MiB/s per task (roughly 3.4 Gbps of the 10 Gbps link) so the migration could not consume all available bandwidth. They also applied a 4 Gbps bandwidth cap on the VMware host as an additional safeguard.
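To illustrate the throttling, the sketch below creates a task with a 400 MiB/s bandwidth limit and starts an execution. The location ARNs are placeholders standing in for the HDFS and S3 locations created earlier, and the option values beyond BytesPerSecond are illustrative choices, not TMAP's documented settings.

```python
import boto3

datasync = boto3.client("datasync", region_name="ap-northeast-2")  # placeholder Region

# Task that copies from the HDFS location to the S3 location with a bandwidth cap.
# BytesPerSecond is expressed in bytes: 400 MiB/s = 400 * 1024 * 1024.
task = datasync.create_task(
    Name="hdfs-to-s3-migration",
    SourceLocationArn="arn:aws:datasync:ap-northeast-2:111122223333:location/loc-hdfs0001",   # placeholder
    DestinationLocationArn="arn:aws:datasync:ap-northeast-2:111122223333:location/loc-s30001",  # placeholder
    Options={
        "BytesPerSecond": 400 * 1024 * 1024,     # throttle each task to ~400 MiB/s
        "TransferMode": "CHANGED",               # copy only changed data on later runs
        "VerifyMode": "ONLY_FILES_TRANSFERRED",  # verify the files that were transferred
    },
)

# Kick off an execution; the bandwidth limit can also be overridden per run.
datasync.start_task_execution(TaskArn=task["TaskArn"])
```

Keeping a conservative limit on the task and overriding it per execution (for example, raising it during off-peak hours) is a common way to balance migration speed against production traffic.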
