If you have worked with big data frameworks like Spark or Hive, you have almost certainly run into the small files problem.
What is a small file?
A small file is a data file that is considerably smaller than the default block size of the underlying file system (e.g. 128 MB by default in CDH).
Sources of small files?
Small files can arrive with the source data itself, or be produced as the output of data processing jobs.
Problems due to small files
In addition to making storage inefficient (in HDFS, every file carries metadata overhead on the NameNode regardless of its size), small files significantly hurt compute performance: reading many small files requires far more disk seeks than reading the same data from a few large files.
For example, in Spark a task within an executor reads and processes one partition at a time, and by default each block becomes one partition. Since one concurrent task runs per partition of an RDD, a lot of small files means each file is read in its own partition, causing substantial task scheduling overhead compounded by lower throughput per CPU core.
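To make this concrete, here is a small illustration (the function `estimateTasks` is not from the article; it simply encodes the rule that each file occupies at least one partition, so the task count grows with the file count rather than the data volume):

```scala
// Each file occupies at least one partition; files larger than a block
// span multiple partitions. Hypothetical helper for illustration only.
def estimateTasks(fileSizes: Seq[Long], blockSize: Long = 128L * 1024 * 1024): Int =
  fileSizes.map(s => math.max(1L, math.ceil(s.toDouble / blockSize).toLong)).sum.toInt
```

With this rule of thumb, a single 1 GB file yields 8 tasks, while 1000 files of 1 MB each (less total data) yield 1000 tasks.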
The solution to these problems is three-fold:
- First, stop the problem at its root by fixing the processes that produce the small files.
- Second, identify where the small files live and how many there are.
- Finally, compact the small files into larger files close to the block size, or to an efficient partition size for the processing framework.
To avoid small files in the first place, make sure the processes producing the data are well tuned. If the producer is a Spark job, use repartition() or coalesce(), or set the spark.sql.shuffle.partitions property appropriately, so the output is written as reasonably sized files.
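As a hedged sketch of the tuning step (the helper name `shufflePartitions` and the 128 MB target are my own assumptions, not Spark APIs or values from the article): a common rule of thumb is to size each shuffle partition at roughly one block of data.

```scala
// Pick a partition count so each output partition holds ~targetPartitionBytes.
// Illustrative helper; the name and default are assumptions.
def shufflePartitions(totalBytes: Long, targetPartitionBytes: Long = 128L * 1024 * 1024): Int =
  math.max(1, math.ceil(totalBytes.toDouble / targetPartitionBytes).toInt)

// In a real Spark job (requires a SparkSession; not runnable standalone):
//   spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions(estimatedBytes))
//   df.repartition(shufflePartitions(estimatedBytes)).write.parquet(outputPath)
```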
If the damage is already done, the next step is to identify the small files. We will consider HDFS as the storage framework here.
The shell script below can be run by any user with list permission on the base path under which small files need to be identified.
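The original script is not reproduced here; below is a minimal sketch under some assumptions (the column layout of `hdfs dfs -ls -R` output, the 128 MB threshold, and the depth handling are illustrative). It groups files smaller than a threshold by their ancestor directory at a given depth and prints a `dir,count` CSV sorted by count:

```shell
#!/usr/bin/env bash
# Hedged sketch of the small-file audit (names and layout are assumptions).
#
# count_small_files DEPTH THRESHOLD_BYTES
#   Reads `hdfs dfs -ls -R` output on stdin, keeps regular files smaller
#   than THRESHOLD_BYTES, groups them by their directory at DEPTH levels
#   below the root, and prints "dir,count" sorted by count descending.
count_small_files() {
  local depth="$1" threshold="$2"
  awk -v d="$depth" -v t="$threshold" '
    # ls -R lines: perms replication owner group size date time path
    $1 ~ /^-/ && $5 + 0 < t + 0 {
      n = split($8, parts, "/")            # parts[1] is empty (leading "/")
      dir = ""
      for (i = 2; i <= d + 1 && i < n; i++) dir = dir "/" parts[i]
      if (dir != "") counts[dir]++
    }
    END { for (dir in counts) printf "%s,%d\n", dir, counts[dir] }
  ' | sort -t, -k2,2nr
}

# On a real cluster (needs list permission on the base path):
#   hdfs dfs -ls -R "$BASE_PATH" | count_small_files "$DEPTH_LEVEL" 134217728 > final_list.csv
```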
Now that final_list.csv lists the directories at the chosen depth_level together with the number of small files in each, sorted by count in decreasing order, we can run an offline Spark job to compact these small files. Below is a snippet of the same.
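The original snippet is not reproduced here; the following is a minimal sketch under stated assumptions (the exact logic of bucket_cal, the input/output paths, and the 128 MB block size are illustrative; the Spark calls are shown as comments because they need a live cluster):

```scala
// bucket_cal: how many output files the compacted data should occupy,
// given the total bytes under the target directory (at least one).
// Hedged reconstruction; the article's actual implementation is not shown.
def bucket_cal(totalBytes: Long, blockBytes: Long = 128L * 1024 * 1024): Int =
  math.max(1, math.ceil(totalBytes.toDouble / blockBytes).toInt)

// In a real Spark job (requires a SparkSession and HDFS access):
//   val dirBytes = fs.getContentSummary(new Path(inputDir)).getLength  // Hadoop FileSystem API
//   spark.read.parquet(inputDir)
//     .repartition(bucket_cal(dirBytes))
//     .write.mode("overwrite").parquet(tmpDir)
//   // then swap tmpDir into place of inputDir once the job succeeds
```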
This code snippet written in Scala merges the small files into larger ones, dynamically calculating the number of target files from the directory's size using the bucket_cal function.
Hope this helps. Do let me know if you have any questions.