Too Small Data — Solving the Small Files Issue Using Spark

I am pretty sure you have come across the small files issue while working with big data frameworks like Spark, Hive, etc.

What is a small file?

A small file can be defined as a data file that is considerably smaller than the default block size of the underlying file system (e.g. 128 MB by default in CDH).

Sources of small files

Small files originate both from upstream data sources and as the output of data processing jobs themselves.

Problems due to small files

In addition to causing inefficient storage (particularly in HDFS, where every file adds metadata overhead on the NameNode), small files hurt the compute performance of a job considerably: they force the compute engine to perform many more file opens and disk seeks while running a computation.

For example, in Spark a task within an executor reads and processes one partition at a time, and by default one partition is created per HDFS block, so a single concurrent task runs for every partition of an RDD. If you have a lot of small files, each file ends up in its own partition, which causes substantial task scheduling overhead compounded by lower throughput per CPU core.
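As a quick illustration (the path is a placeholder, and a SparkSession named spark is assumed to be in scope, as in spark-shell): with the RDD API, every small file yields at least one partition, and therefore at least one task to schedule.

```scala
// Hypothetical example: reading a directory full of small log files.
// With the Hadoop-based RDD API, every file produces at least one partition,
// so thousands of small files mean thousands of tiny tasks.
val rdd = spark.sparkContext.textFile("/data/events/*.log")
println(s"Partitions (and therefore tasks) to schedule: ${rdd.getNumPartitions}")
```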

Solution

The solution to these problems is threefold: prevent small files at the source, identify the ones that already exist, and compact them offline.

  • First, stop the problem at its root cause.

To avoid small files in the first place, make sure the processes producing the data are well tuned. If the producer is a Spark job, use repartition() or coalesce(), or set the spark.sql.shuffle.partitions property appropriately, so that the output is written as a reasonable number of reasonably sized files, as sketched below.
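Here is a rough sketch of what those three options look like in practice; the paths and partition counts are placeholders, not recommendations, and the snippet assumes Parquet data.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AvoidSmallFiles").getOrCreate()

// Option 1: lower the number of shuffle output partitions globally (default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "64")

val df = spark.read.parquet("/data/input")   // placeholder input path

// Option 2: coalesce() narrows partitions without a full shuffle -- cheap,
// but the resulting files can be unevenly sized.
df.coalesce(16).write.mode("overwrite").parquet("/data/output_coalesced")

// Option 3: repartition() triggers a full shuffle but produces evenly sized files.
df.repartition(16).write.mode("overwrite").parquet("/data/output_repartitioned")
```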

Second, if the damage is already done, let's understand how to identify these small files. We will use HDFS as the storage layer here.

The shell script below can be run by any user with list privileges on the base path under which the small files need to be identified.
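A minimal sketch of such a script could look like the following; the 128 MB small-file threshold is an assumption, and the base path and depth level are passed in as arguments.

```bash
#!/usr/bin/env bash
# Sketch: count files smaller than the HDFS block size under each directory
# at a given depth and write "<directory>,<small_file_count>" to final_list.csv.

BASE_PATH="${1:?base path required}"          # e.g. /data/warehouse
DEPTH_LEVEL="${2:-2}"                         # directory depth to aggregate at
SMALL_FILE_LIMIT=$((128 * 1024 * 1024))       # 128 MB threshold

hdfs dfs -ls -R "$BASE_PATH" \
  | awk -v limit="$SMALL_FILE_LIMIT" -v depth="$DEPTH_LEVEL" '
      # Keep only regular files (permission string starts with "-") below the limit.
      $1 ~ /^-/ && ($5 + 0) < limit {
        # Truncate the file path to the requested depth to get its parent bucket.
        n = split($8, parts, "/")
        dir = ""
        for (i = 2; i <= depth + 1 && i <= n - 1; i++) dir = dir "/" parts[i]
        count[dir]++
      }
      END { for (d in count) print d "," count[d] }
    ' \
  | sort -t"," -k2,2 -nr > final_list.csv
```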

Finally, now that final_list.csv contains the list of directories at the chosen depth level and the number of small files in each, sorted in decreasing order, we can run a Spark job as an offline process to concatenate those small files. Below is a snippet of the same.
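A minimal sketch of such a compaction job could look like this; the bucket_cal implementation shown here, the 128 MB target file size, and the assumption that the data is stored as Parquet are illustrative placeholders rather than fixed requirements.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object SmallFileCompactor {

  // Aim for roughly one HDFS block (128 MB) per output file.
  val TargetFileBytes: Long = 128L * 1024 * 1024

  // Derive the number of output files ("buckets") from the directory's total size.
  def bucket_cal(fs: FileSystem, dir: String): Int = {
    val totalBytes = fs.getContentSummary(new Path(dir)).getLength
    math.max(1, math.ceil(totalBytes.toDouble / TargetFileBytes).toInt)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SmallFileCompactor").getOrCreate()
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    // final_list.csv holds "<directory>,<small_file_count>" rows from the script
    // above; it is assumed to be readable by Spark (e.g. uploaded to HDFS).
    val dirs = spark.read.csv("final_list.csv").collect().map(_.getString(0))

    dirs.foreach { dir =>
      val buckets = bucket_cal(fs, dir)
      val df = spark.read.parquet(dir)                // assumes Parquet data
      val tmpDir = dir + "_compacted"

      // Rewrite the directory's contents as a small number of large files,
      // then swap the compacted output into place.
      df.coalesce(buckets).write.mode("overwrite").parquet(tmpDir)
      fs.delete(new Path(dir), true)
      fs.rename(new Path(tmpDir), new Path(dir))
    }

    spark.stop()
  }
}
```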

This Scala snippet merges the small files into larger ones by dynamically calculating the number of output files for each target directory from its total size, using the bucket_cal function.

Hope this helps. Do let me know if you have any questions.
