How to list only files, and not directories, in Hadoop

Monday, March 23, 2020 By Avdhesh Bansal

A guide to listing only files, and not directories, from HDFS data storage
Logic : pipe the HDFS listing output through standard Linux (Ubuntu/CentOS) commands to get the desired result

Differentiator:
1- Hadoop returns the output of the ls command in an 8-column form
2- Directories and regular files can be told apart by the first column of the output
3- In the output, a directory entry starts with d, followed by the permissions of the directory
4- In the output, a regular file entry starts with -
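
The four points above can be sketched in plain shell. This is only an illustration: classify_entries and sample_listing are hypothetical names, and the two listing lines are made up to look like `hadoop fs -ls -R` output rather than copied from a real cluster. The eight columns are assumed to be permissions, replication, owner, group, size, date, time and path.

```shell
# Hypothetical helper: read `hadoop fs -ls -R` style lines on stdin and
# label each one using the first character of the permissions column.
classify_entries() {
  while read -r perms repl owner group size date time path; do
    case "$perms" in
      d*) printf 'dir %s\n' "$path" ;;   # directory: permissions start with d
      -*) printf 'file %s\n' "$path" ;;  # regular file: permissions start with -
    esac
  done
}

# Made-up sample lines (an assumption for this sketch, not real cluster output)
sample_listing() {
  cat <<'EOF'
drwxr-xr-x   - hdfs supergroup          0 2020-03-23 10:15 /tmp/logs
-rw-r--r--   3 hdfs supergroup       1024 2020-03-23 10:16 /tmp/logs/app.log
EOF
}

sample_listing | classify_entries
```

On a real cluster the input would come from something like `hadoop fs -ls -R /tmp | classify_entries` instead of the sample function.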

The following command lists only regular files in HDFS :

hadoop fs -ls -R <source_directory> | sed 's/  */ /g' | cut -d\  -f <number_of_column_from_op> --output-delimiter=',' | grep ^- | cut -d, -f<number_of_column_from_op>

Example :

hadoop fs -ls -R /tmp | sed 's/  */ /g' | cut -d\  -f 1,8 --output-delimiter=',' | grep ^- | cut -d, -f2
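
To see what each stage of the example contributes, the same pipeline can be run against a couple of made-up listing lines, with no cluster needed. This is a sketch under assumptions: only_files is a hypothetical wrapper name, the sample lines are invented, and --output-delimiter requires GNU cut.

```shell
# Hypothetical wrapper around the pipeline from the example; reads listing lines on stdin.
only_files() {
  sed 's/  */ /g' |                            # squeeze each run of spaces into one space
    cut -d' ' -f 1,8 --output-delimiter=',' |  # keep permissions and path, joined by a comma (GNU cut)
    grep '^-' |                                # keep lines whose permissions start with -
    cut -d, -f2                                # drop the permissions, leaving only the path
}

# Made-up sample lines standing in for `hadoop fs -ls -R /tmp` output
printf '%s\n' \
  'drwxr-xr-x   - hdfs supergroup          0 2020-03-23 10:15 /tmp/logs' \
  '-rw-r--r--   3 hdfs supergroup       1024 2020-03-23 10:16 /tmp/logs/app.log' |
  only_files
```

For this input the directory line is dropped by grep and only /tmp/logs/app.log is printed. Since only column 8 is kept, a path that contains spaces would be cut off at its first space.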

The following command lists only directories in HDFS :

hadoop fs -ls -R <source_directory> | sed 's/  */ /g' | cut -d\  -f <number_of_column_from_op> --output-delimiter=',' | grep ^d | cut -d, -f<number_of_column_from_op>

Example :

hadoop fs -ls -R /tmp | sed 's/  */ /g' | cut -d\  -f 1,8 --output-delimiter=',' | grep ^d | cut -d, -f2
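
The file and directory commands differ only in the character given to grep, so the two can be folded into one filter. The sketch below swaps the sed/cut/grep chain for awk, which is a different technique from the commands above, not a restatement of them. list_by_type is a hypothetical name and the sample lines are made up. It also rejoins everything from column 8 onward, so a path containing single spaces is kept whole (a run of several spaces inside a path would still be collapsed to one).

```shell
# Hypothetical helper: $1 is the type character to keep,
# "-" for regular files or "d" for directories.
list_by_type() {
  awk -v t="$1" '
    substr($0, 1, 1) == t {
      path = $8
      for (i = 9; i <= NF; i++) path = path " " $i   # rejoin a path that contains spaces
      print path
    }
  '
}

# Made-up sample lines standing in for `hadoop fs -ls -R /tmp` output
printf '%s\n' \
  'drwxr-xr-x   - hdfs supergroup          0 2020-03-23 10:15 /tmp/my dir' \
  '-rw-r--r--   3 hdfs supergroup       1024 2020-03-23 10:16 /tmp/my dir/app.log' |
  list_by_type d
```

Against a real cluster, usage would look like `hadoop fs -ls -R /tmp | list_by_type -` for files and `hadoop fs -ls -R /tmp | list_by_type d` for directories.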

More on using the cut command efficiently : http://www.folkstalk.com/2012/02/cut-command-in-unix-linux-examples.html
and on the sed command : https://www.tutorialspoint.com/unix/unix-regular-expressions.htm