
How to list only files (and not directories) in Hadoop

Logic: pipe the HDFS output into Linux (Ubuntu/CentOS) commands to get the desired results.

Differentiators:
1- Hadoop returns the output of the ls command in an 8-column form
2- Directories and regular files can be told apart using the first column of the output
3- In the output, the first column of a directory starts with d, followed by the permissions of the directory
4- In the output, the first column of a regular file starts with - (a hyphen)
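
The rules above can be sketched on sample output. The listing lines below are made up to illustrate the 8-column layout (they are not taken from a real cluster), and `classify_entry` is a hypothetical helper name used only for this sketch.

```shell
# Hypothetical helper: label one line of `hadoop fs -ls` style output
# by looking at the first character of the permissions column.
classify_entry() {
  case "$1" in
    d*) echo "directory" ;;
    -*) echo "file" ;;
    *)  echo "other" ;;   # any line that is not a file or directory entry
  esac
}

# Illustrative sample lines (made up for this sketch).
classify_entry "drwxr-xr-x   - hdfs supergroup          0 2017-03-01 10:15 /tmp/logs"
classify_entry "-rw-r--r--   3 hdfs supergroup       1024 2017-03-01 10:16 /tmp/logs/app.log"
```

The first call prints "directory" and the second prints "file", because only the leading character of the line is inspected.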

The following command lists only regular files in HDFS:

hadoop fs -ls -R <source_directory> | sed 's/  */ /g' | cut -d\  -f <number_of_column_from_op> --output-delimiter=',' | grep '^-' | cut -d, -f<number_of_column_from_op>

Example :

hadoop fs -ls -R /tmp | sed 's/  */ /g' | cut -d\  -f 1,8 --output-delimiter=',' | grep ^- | cut -d, -f2
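
To see what each stage of the pipeline does, here is a minimal sketch that runs the same sed, cut and grep steps on a single line. The listing line is made up for illustration, and the variable names (`line`, `squeezed`, `pair`, `path`) are not part of the original command. It assumes GNU cut, since `--output-delimiter` is a GNU option.

```shell
# One made-up listing line in the 8-column layout (illustrative only).
line='-rw-r--r--   3 hdfs supergroup       1024 2017-03-01 10:16 /tmp/logs/app.log'

# Step 1: squeeze runs of spaces so each column is separated by one space.
squeezed=$(printf '%s\n' "$line" | sed 's/  */ /g')

# Step 2: keep column 1 (permissions) and column 8 (path), joined by a comma.
pair=$(printf '%s\n' "$squeezed" | cut -d' ' -f 1,8 --output-delimiter=',')

# Step 3: keep the line only if it starts with "-", then print the path.
path=$(printf '%s\n' "$pair" | grep '^-' | cut -d, -f2)

printf '%s\n' "$squeezed" "$pair" "$path"
```

Because the columns are split on single spaces and then on a comma, a path that itself contains a space or a comma would come out truncated with this approach.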

The following command lists only directories in HDFS:

hadoop fs -ls -R <source_directory> | sed 's/  */ /g' | cut -d\  -f <number_of_column_from_op> --output-delimiter=',' | grep ^d | cut -d, -f<number_of_column_from_op>

Example :

hadoop fs -ls -R /tmp | sed 's/  */ /g' | cut -d\  -f 1,8 --output-delimiter=',' | grep '^d' | cut -d, -f2
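
As an alternative sketch (not the method used above), the same filtering can be done in a single awk step that also keeps paths containing spaces intact. `hdfs_filter` is a hypothetical helper name, and the sample listing is made up for illustration; the sketch assumes the path is everything after the seventh column.

```shell
# Hypothetical helper: read `hadoop fs -ls -R` style lines on stdin and print
# the path of every entry whose permissions column starts with "$1"
# ("d" for directories, "-" for regular files).
hdfs_filter() {
  awk -v kind="$1" '
    NF >= 8 && substr($1, 1, 1) == kind {
      line = $0
      # Strip the first seven columns; the remainder is the full path,
      # including any spaces it contains.
      for (i = 1; i <= 7; i++) sub(/^[ \t]*[^ \t]+[ \t]+/, "", line)
      print line
    }'
}

# Made-up sample listing (illustrative only); prints the directory entry.
hdfs_filter d <<'EOF'
drwxr-xr-x   - hdfs supergroup          0 2017-03-01 10:15 /tmp/logs
-rw-r--r--   3 hdfs supergroup        512 2017-03-01 10:17 /tmp/my report.txt
EOF
```

Against a real cluster, the helper would sit at the end of the pipe, for example `hadoop fs -ls -R /tmp | hdfs_filter d` for directories or `hadoop fs -ls -R /tmp | hdfs_filter -` for regular files.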

Using the cut command efficiently: http://www.folkstalk.com/2012/02/cut-command-in-unix-linux-examples.html
Using the sed command: https://www.tutorialspoint.com/unix/unix-regular-expressions.htm
