Update (2023) – Please note that the information in this blog post may be outdated. For the latest insights, refer to the Cluster Configuration Guidelines.
We are excited to announce that Amazon Elastic MapReduce (EMR) now works with Amazon EC2 Spot Instances. The combination lets you run managed Hadoop clusters on unused EC2 capacity, which suits long-running job flows, cost-driven workloads, data-critical processing, and application testing, typically at savings of 50% to 66%.
The EC2 instances utilized for running an Elastic MapReduce job flow are categorized into three distinct groups:
- Master – This group consists of a single EC2 instance responsible for scheduling Hadoop tasks across Core and Task nodes.
- Core – Comprising one or more EC2 instances, this group utilizes HDFS for data storage associated with the job flow and executes mapper and reducer tasks as defined. It can be expanded to hasten job flow execution.
- Task – This group can have zero or more EC2 instances, responsible for executing mapper and reducer tasks without storing data. Consequently, it can adjust in size during the job flow.
You can choose On-Demand or Spot Instances for each instance group in a job flow. Spot Instances in the Master or Core group are terminated if the market price rises above your bid; losing the Master ends the job flow, and losing Core instances can lose data held in HDFS. Spot Instances in the Task group carry less risk: if they are terminated, their unfinished work is returned to the processing queue.
For those who have invested in EC2 Reserved Instances, Elastic MapReduce seamlessly incorporates them, ensuring cost efficiency (this is not a new feature, but it’s worth mentioning).
Getting Started with Elastic MapReduce on Spot Instances:
- Long-running Job Flows and Data Warehouses – If you maintain a long-term Elastic MapReduce cluster with predictable load variations, consider utilizing Spot Instances to manage peak demand cost-effectively. Run your Master and Core groups on On-Demand instances, while supplementing the cluster with Spot Instances during high-demand periods.
- Cost-Driven Workloads – For shorter job flows (typically several hours or less) where cost trumps completion time and some loss of work is acceptable, consider running the entire job flow on Spot Instances to maximize savings.
- Data-Critical Workloads – If you want to keep costs down but cannot afford to lose partial work, run your Master and Core groups on On-Demand instances and use enough Core instances to hold all of your data in HDFS. Then add Spot Instances in the Task group to speed up processing and lower the overall cost.
- Application Testing – When testing an application before production deployment, run the entire job (including Master and Core groups) on Spot Instances.
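The four scenarios above can be summarized as a small lookup table. The following is only a sketch in Python: the workload keys and the `groups_at_risk` helper are hypothetical names chosen for illustration, not part of any Elastic MapReduce API. It shows which instance groups could be reclaimed under each approach if the Spot price rises above your bid.

```python
# Hypothetical labels; the mapping restates the four scenarios in the list above.
ON_DEMAND, SPOT = "ON_DEMAND", "SPOT"

PLANS = {
    "long_running":  {"MASTER": ON_DEMAND, "CORE": ON_DEMAND, "TASK": SPOT},
    "cost_driven":   {"MASTER": SPOT,      "CORE": SPOT,      "TASK": SPOT},
    "data_critical": {"MASTER": ON_DEMAND, "CORE": ON_DEMAND, "TASK": SPOT},
    "app_testing":   {"MASTER": SPOT,      "CORE": SPOT,      "TASK": SPOT},
}


def groups_at_risk(workload):
    """Return the instance groups that run on Spot for the given workload,
    i.e. the groups that can be terminated when the price exceeds the bid."""
    plan = PLANS[workload]
    return sorted(role for role, market in plan.items() if market == SPOT)


if __name__ == "__main__":
    for name in PLANS:
        print(name, "->", groups_at_risk(name))
```

Only the cost-driven and testing plans put the Master and Core groups at risk, which is why they are recommended for work you can afford to rerun.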
You can initiate the use of Spot Instances for part or all of a job flow by specifying a bid price for the instance groups. This can be done via the AWS Management Console, command line, or Elastic MapReduce APIs. For insight on how your bid price compares to historical Spot Prices, check the Spot Price history for the past 90 days available through the EC2 API and the AWS Management Console.
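As a rough illustration of specifying a bid per instance group, the Python sketch below builds instance-group entries and checks a bid against a list of past prices. The dictionary keys (`InstanceRole`, `Market`, `BidPrice`, and so on) follow the shape that, as far as I know, the boto3 `run_job_flow` call accepts for instance groups; treat that as an assumption and check the current API reference. The helper names, the instance type, and the price figures are made up for the example, and nothing here calls AWS.

```python
def instance_group(role, instance_type, count, bid_price=None):
    """Build one instance-group entry; supplying a bid price makes it a Spot group."""
    group = {
        "Name": f"{role.lower()}-group",
        "InstanceRole": role,               # "MASTER", "CORE", or "TASK"
        "InstanceType": instance_type,
        "InstanceCount": count,
        "Market": "ON_DEMAND" if bid_price is None else "SPOT",
    }
    if bid_price is not None:
        # Assumption: the bid is passed as a string in USD per instance-hour.
        group["BidPrice"] = f"{bid_price:.3f}"
    return group


def bid_coverage(bid_price, price_history):
    """Fraction of historical price points at or below the bid."""
    if not price_history:
        raise ValueError("price_history is empty")
    covered = sum(1 for price in price_history if price <= bid_price)
    return covered / len(price_history)


if __name__ == "__main__":
    groups = [
        instance_group("MASTER", "m1.large", 1),
        instance_group("CORE", "m1.large", 4),
        instance_group("TASK", "m1.large", 6, bid_price=0.05),
    ]
    # Illustrative prices only, not real Spot Price history.
    history = [0.031, 0.035, 0.040, 0.120]
    print(groups[-1])
    print(bid_coverage(0.04, history))  # 0.75
```

A coverage figure like this describes only the past; it does not predict whether a bid will hold in the future.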
Additionally, you can add new Task instance groups to an ongoing job flow, specifying a bid price for each group you add. This allows for a layered bidding strategy. Note that job flows are limited to 20 EC2 instances by default; to expand, you must complete the instance request form.
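To make the layered bidding idea concrete, here is a minimal Python sketch. The layer structure, the bid values, and both function names are hypothetical; it simply counts how many Task instances would keep running at a given market price, and checks a planned cluster against the default limit of 20 instances mentioned above.

```python
def surviving_task_instances(task_groups, market_price):
    """Count Task instances whose bid is at or above the market price.
    Groups with a bid below the market price are assumed to be terminated."""
    return sum(g["count"] for g in task_groups if g["bid"] >= market_price)


def within_default_limit(on_demand_count, task_groups, limit=20):
    """Check the planned total instance count against the default job flow limit."""
    total = on_demand_count + sum(g["count"] for g in task_groups)
    return total <= limit


if __name__ == "__main__":
    # Illustrative layers: a low bid for cheap capacity, higher bids as a fallback.
    layers = [
        {"bid": 0.05, "count": 4},
        {"bid": 0.10, "count": 4},
        {"bid": 0.20, "count": 2},
    ]
    for price in (0.04, 0.08, 0.25):
        print(price, surviving_task_instances(layers, price))
    # 1 Master + 5 Core instances on On-Demand, plus 10 Task instances on Spot.
    print(within_default_limit(6, layers))
```

With layers like these, a price spike removes the lowest-bid groups first and leaves the higher-bid groups working, so the job flow slows down rather than losing all of its extra capacity at once.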
Who Can Benefit?
We anticipate that users of Elastic MapReduce with diverse job flows will find Spot Instances particularly beneficial. Ideal use cases include:
- Batch-processing tasks that are not time-sensitive, such as image and video processing, scientific research data analysis, and financial modeling.
- Data warehouses experiencing fluctuating workloads during peak times.
Companies like Fliptop utilize Spot Instances to convert email lists into social media profiles, achieving over 50% cost savings. Foursquare processes over 3 million daily check-ins with Elastic MapReduce, Spot Instances, Amazon S3, MongoDB, and Apache Flume. Matthew Rathbone from Foursquare noted, “Elastic MapReduce considerably reduced the time, effort, and expense of Hadoop for customer insights. By leveraging Spot Instances, we cut analytics costs by over 50% while improving processing times.”
We have curated a new video demonstrating how to run an Elastic MapReduce job using both On-Demand and Spot Instances.
Finally, I am a strong advocate for Spot Instances and am eager to hear how our customers implement them creatively. You now have the chance to refine your business processes, striking a balance between costs and completion time, while managing the implications of fluctuating market prices. As an IT professional, you gain access to innovative tools designed to enhance efficiency and save costs.
What are your thoughts?
— Chanci Turner