In the Hadoop ecosystem we store files under folders in HDFS, and most of the time the folder is named after the application. A file is considered small when it is smaller than the HDFS block size: for example, if the block size is 64 MB or 128 MB, any file smaller than that counts as a small file. When a large number of files are smaller than the block size, we face problems at both the HDFS level and the MapReduce level.
When we store files and directories in HDFS, the corresponding metadata is kept in the NameNode's memory, and each file, directory, and block object occupies approximately 150 bytes. Suppose you have 1 million files, each using about one block or less: every file then costs one file object plus one block object, so the metadata occupies roughly 1,000,000 x 2 x 150 bytes, or about 300 MB of NameNode memory. With enough small files, a large share of NameNode memory is consumed, at some point the threshold of the current hardware is reached, and performance certainly degrades.
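As a quick way to gauge how many files and directories a folder contributes to the NameNode, you can use the count command (the path below is illustrative, following the examples later in this section):

hadoop fs -count /hdfs_path/pbibhu

The output columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE and PATHNAME, so FILE_COUNT x 2 x 150 bytes gives a rough ballpark of the NameNode memory that folder's small files consume.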
During MapReduce execution, one mapper is launched for each input split, and a file smaller than or equal to the block size forms its own split. A large number of small files therefore launches a large number of mappers, each processing only a small chunk of data, so overall processing time grows. Reading and writing many small files also increases seek time (the time taken for a disk drive to locate the area on the disk where the data to be read is stored), and seeks are generally expensive operations. Since Hadoop is designed to run over your entire dataset, it is best to minimize seeks by using large files.
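To see how many blocks (and hence, with the default input formats, how many map tasks) a folder of small files will generate, fsck can list each file together with its blocks; again the path is illustrative:

hdfs fsck /hdfs_path/pbibhu -files -blocks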
Remediation plan
We can merge all the small files into one big file using the HDFS getmerge command. Getmerge copies all the files in an HDFS folder into a single concatenated file on the local file system. Once the files are concatenated locally, you can place the resulting file back into HDFS using the HDFS put command. Please find the example below.
hadoop fs -getmerge /hdfs_path/pbibhu/school_info_* /local_path/pbibhu/school_inf.txt
hadoop fs -put /local_path/pbibhu/school_inf.txt /hdfs_path/pbibhu/school_inf.txt
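Once the merged file is back in HDFS, you can verify it and, if everything looks correct, remove the original small files so the NameNode reclaims their metadata; the paths below follow the example above, so double-check them before deleting anything:

hadoop fs -ls /hdfs_path/pbibhu/school_inf.txt
hadoop fs -rm /hdfs_path/pbibhu/school_info_*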