Founded in 2002, Amazon VGT2 Las Vegas is a leading mobility platform in the region, with over 20 million registered users and 14 million monthly active users. It offers navigation services enriched by extensive real-time traffic data. Previously, the Data Intelligence team at Amazon VGT2 operated a mobility-data platform built on a Hadoop Distributed File System (HDFS) in their own data center to provide analytics services to the various departments that needed processed data. After evaluating Amazon EMR, they were impressed by how simple it made managing Hadoop on AWS and decided to migrate their data to the cloud.
In this article, we will discuss how Amazon VGT2 migrated their on-premises Hadoop data to Amazon S3 using AWS DataSync, enabling effective data management and timely processing in the cloud.
Customer Challenge
Amazon VGT2 needed to transfer 2.4 PB of data from their on-premises Cloudera HDFS cluster to Amazon S3 within two months in order to launch their data lake project, which built on their existing mobility-data platform. The new project would support both real-time and batch processing through an AWS-managed analytics pipeline.
The Data Intelligence team, made up of data and machine learning engineers, wanted a migration approach they could configure and manage on their own. This was essential for meeting the project’s tight deadline. They also needed to perform both the initial full copy and the subsequent incremental updates without disrupting existing services or taking the Hadoop cluster offline.
Solution Overview
Before the migration, Amazon VGT2 had to weigh the advantages and disadvantages of online versus offline migration and choose a method accordingly. Once the strategy was decided, they configured AWS DataSync and completed the migration in approximately two months, handled by two developers who had little prior experience with AWS DataSync.
Choosing a Migration Method
Amazon VGT2 was storing around 7.2 PB of data on HDFS, which included three-way replication, so the unique data amounted to 2.4 PB. However, they were uncertain which transfer option would be the quickest and most straightforward. AWS offers various data transfer and migration solutions; for large migrations, AWS DataSync and the AWS Snow Family are the most suitable. AWS DataSync is ideal for online data transfers, while AWS Snowball Edge is better suited for offline transfers. Choosing between them comes down to the amount of data to transfer and the available network bandwidth.
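The online-versus-offline decision can be sanity-checked with a back-of-the-envelope estimate. The sketch below (a simple calculation, not part of any AWS tool) uses the figures from this post: 2.4 PB of unique data and the bandwidth of the available link.

```python
# Rough estimate of online transfer time, useful when deciding between an
# online service (AWS DataSync) and offline devices (AWS Snowball Edge).

def transfer_days(data_pb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Days needed to move `data_pb` petabytes over a `link_gbps` link
    at the given average utilization (0..1). Uses decimal units."""
    bits = data_pb * 1e15 * 8                      # petabytes -> bits
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86400

# 2.4 PB over a fully utilized 10 Gbps link:
print(round(transfer_days(2.4, 10), 1))            # ≈ 22.2 days
```

At roughly three weeks of raw transfer time on a 10 Gbps link, an online migration comfortably fits a two-month window, which supports the choice of DataSync over shipping physical devices.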
In Amazon VGT2’s case, they already had a 10 Gbps AWS Direct Connect link, and most team members were not accustomed to handling physical devices, since their focus had been on service planning and application development. Initially, they considered open-source migration tools such as DistCp and HDFS-FUSE, which are often used to copy data to Amazon S3. However, HDFS-FUSE is single-threaded, and DistCp would have required custom scripts to manage and verify the transfer, so they ultimately chose AWS DataSync, which uses a parallel, multithreaded architecture to speed up data transfer. This allowed Amazon VGT2 to migrate their data to the cloud without handling physical devices or maintaining complex tooling. The migration followed the AWS Storage Blog post “Using AWS DataSync to Move Data from Hadoop to Amazon S3.”
Configuring AWS DataSync
Amazon VGT2 began by installing a DataSync agent, which can be deployed on VMware ESXi, Microsoft Hyper-V, Linux Kernel-based Virtual Machine (KVM), or an Amazon EC2 instance. Since their data center ran on VMware infrastructure, they deployed two DataSync agents on ESXi: one active for the transfer and one as a backup. Each agent was provisioned with 16 vCPUs, 32 GB of RAM, and an 80 GB disk. For tasks that transfer more than 20 million files, an agent with 64 GB of RAM is required.
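The sizing rule above can be expressed as a small helper. Note that `agent_spec` is a hypothetical function written for illustration, not part of any AWS API; it simply encodes the requirement that tasks over 20 million files need a 64 GB agent.

```python
# Sketch of the agent-sizing rule described above. The standard spec
# (16 vCPUs, 32 GB RAM, 80 GB disk) covers tasks up to 20 million files;
# larger tasks require 64 GB of RAM. `agent_spec` is illustrative only.

def agent_spec(files_per_task: int) -> dict:
    return {
        "vcpus": 16,
        "ram_gb": 32 if files_per_task <= 20_000_000 else 64,
        "disk_gb": 80,
    }

print(agent_spec(5_000_000)["ram_gb"])     # 32
print(agent_spec(30_000_000)["ram_gb"])    # 64
```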
Following the installation, the next step was activating the agent to establish a secure connection between the agent and the DataSync service. Specific network ports had to be opened in the firewall to allow this connection. Because the Hadoop network sat behind the on-premises firewall, they opened outbound traffic (TCP ports 1024–1064) from the DataSync agent to the VPC endpoint, as well as TCP port 443 to the entire subnet in which the VPC endpoint resides. For full details on network requirements, refer to the DataSync network documentation.
With the agent activated, the next step was to configure the DataSync locations the transfer task would use: HDFS as the source and Amazon S3 as the destination. In DataSync, a location is the storage system or service that DataSync reads from or writes to. To grant the agent access to HDFS, the Hadoop NameNode’s RPC port (8020 by default) must be open. The authentication method also had to be defined; DataSync supports both simple authentication and Kerberos authentication, and Amazon VGT2 opted for simple authentication with username-based credentials.
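A location configuration along these lines might look like the following sketch. The parameters are shown as plain dicts that would be passed to DataSync’s `CreateLocationHdfs` and `CreateLocationS3` API calls; all ARNs, hostnames, and bucket names are placeholders, and the boto3 calls themselves are commented out because they require live AWS credentials.

```python
# Illustrative request parameters for the two DataSync locations.
# Every identifier below is a placeholder, not a real resource.

# Source: the on-premises HDFS cluster, reached through the agent,
# using simple (username-based) authentication on the NameNode RPC port.
hdfs_location = {
    "AgentArns": ["arn:aws:datasync:us-east-1:111122223333:agent/agent-0example"],
    "NameNodes": [{"Hostname": "namenode.example.internal", "Port": 8020}],
    "AuthenticationType": "SIMPLE",
    "SimpleUser": "hadoop",
}

# Destination: the S3 bucket, written in the Intelligent-Tiering storage class.
s3_location = {
    "S3BucketArn": "arn:aws:s3:::example-datalake-bucket",
    "Subdirectory": "/hdfs-migration",
    "S3StorageClass": "INTELLIGENT_TIERING",
    "S3Config": {"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/datasync-s3-role"},
}

# With credentials in place, these would be created via boto3:
# import boto3
# datasync = boto3.client("datasync")
# src = datasync.create_location_hdfs(**hdfs_location)
# dst = datasync.create_location_s3(**s3_location)
```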
For the destination, configuring Amazon S3 was straightforward, requiring only the bucket name and path. Notably, they selected Intelligent-Tiering as the Amazon S3 storage class for cost-efficient data storage.
Creating the Data Transfer Task
With 7.2 PB of HDFS data (including three-way replication), the actual amount to transfer was 2.4 PB. When the task first ran, it consumed all 10 Gbps of available bandwidth. Since this disrupted the production workload, Amazon VGT2 needed to throttle the transfer. Fortunately, DataSync supports bandwidth limits, and they capped their tasks at 400 MiB/s, roughly a third of the 10 Gbps link. They also imposed a 4 Gbps limit on the VMware host to ensure the throttle held.
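The throttle figures can be checked with a quick unit conversion: 400 MiB/s expressed in bits per second, as a share of the 10 Gbps link. The boto3 call in the final comment is an assumption about how one might set the limit programmatically (DataSync tasks expose a `BytesPerSecond` option); the arithmetic itself is just what the numbers in this post imply.

```python
# Converting the 400 MiB/s DataSync task limit into a share of the
# 10 Gbps Direct Connect link.

MIB = 2**20
limit_bps = 400 * MIB * 8          # 400 MiB/s in bits per second
link_bps = 10e9                    # 10 Gbps

print(round(limit_bps / 1e9, 2))        # ≈ 3.36 Gbps on the wire
print(round(limit_bps / link_bps, 2))   # ≈ 0.34, roughly a third of the link

# Hypothetical way to apply the cap via boto3 (requires AWS credentials):
# datasync.update_task(TaskArn=task_arn, Options={"BytesPerSecond": 400 * MIB})
```

The 4 Gbps cap on the VMware host sits just above the ~3.36 Gbps task limit, so it acts as a backstop without constraining the task itself.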