Using CombineInputFormat to Tackle Hadoop’s Small Files Challenge | Amazon VGT2 Las Vegas


Many Amazon EMR users run architectures that track events and streams and store the data in S3, which often produces a large number of small files. Hadoop is well known to perform poorly on small files, and the problem can get worse when moving from Hadoop 1 to Hadoop 2: jobs that finished in a reasonable time under Hadoop 1 may suddenly take much longer. The reason is that Hadoop 2, unlike its predecessor, does not reuse task JVMs, so each small file is processed in a new YARN container, which adds startup overhead and lengthens the job. Aggregating the files first with S3DistCp can mitigate the issue, but it adds extra steps and time to the workflow.

However, many applications can avoid this problem without changing the input by using Hadoop’s built-in CombineTextInputFormat. This input format class builds splits that each contain multiple files and sends one split to each mapper, up to the size limit set by mapreduce.input.fileinputformat.split.maxsize. For instance, a WordCount application run on a dataset of 77,000 small files (each under 64 MB) showed a notable performance improvement.
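To make the effect of the size limit concrete, here is a minimal sketch of the packing idea in plain Java. It is not Hadoop's actual algorithm: the real CombineFileInputFormat also takes node and rack locality into account and can break up large files, both of which are ignored here. The class name, the method name, and the 1 MB file size are hypothetical choices made only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; this is NOT Hadoop's implementation.
class SplitPackingSketch {

    // Greedily groups file sizes (in bytes) into splits of at most maxSplitBytes.
    // A single file larger than the cap ends up alone in its own split.
    static List<List<Long>> packIntoSplits(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would exceed the cap.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long maxSplit = 256 * mb; // 268435456 bytes
        // Assumption for illustration: 77,000 files of exactly 1 MB each.
        List<Long> files = Collections.nCopies(77_000, mb);
        System.out.println(packIntoSplits(files, maxSplit).size()); // prints 301
    }
}
```

Under these assumed sizes, the 77,000 files collapse into 301 splits, so the job would start 301 map tasks instead of one per file. Actual split counts depend on the real file sizes and on how Hadoop places the data.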

Implementing CombineTextInputFormat

Adopting CombineTextInputFormat is straightforward: set it as the job’s input format class and set mapreduce.input.fileinputformat.split.maxsize (a size in bytes) in the job configuration:

import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
...
// Cap each combined split at 256 MB (268435456 bytes).
conf.set("mapreduce.input.fileinputformat.split.maxsize", "268435456");
...
// Read input through CombineTextInputFormat so that several small files share one mapper.
job.setInputFormatClass(CombineTextInputFormat.class);

Additionally, with the latest EMR releases, it’s possible to set configurations at launch by supplying a configuration object.
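As a sketch, assuming an EMR release that accepts JSON configuration objects with the mapred-site classification, such a configuration might look like the following. The split size simply mirrors the 256 MB value used in the code above and should be tuned for your own workload.

```json
[
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapreduce.input.fileinputformat.split.maxsize": "268435456"
    }
  }
]
```

A configuration object like this is supplied when the cluster is created, for example through the AWS CLI's --configurations option. Note that it only sets the split size; the job itself still has to select CombineTextInputFormat as its input format class.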


