
Wednesday, March 28, 2012

Find duplicate (redundant) files: bash / linux

Recently I was implementing MT940-extract parsers for a variety of banks. That meant thousands of files, each containing hundreds of entries. Sometimes the entries had unique identification numbers... sometimes they did not.

Problems occurred when, due to some random events, the extract storage started to contain duplicate files. As a result many redundant entries were loaded by the parser (this had major consequences on the whole processing).

I have implemented a few mechanisms to prevent this situation. One of them is a Linux shell script that locates duplicate files under a selected directory (unlimited depth):

  1 if [ -z "$1" ]
  2 then
  3     echo "This script finds duplicate files in the selected directory"
  4     echo "Usage: ./find_duplicate.sh <base dir>"
  5     exit 1
  6 fi
  7 # Hash every regular file that has an extension; keep only repeated hashes
  8 all_duplicate=$(find "$1" -type f | \
  9     grep -E "\.[a-zA-Z0-9]+$" | \
 10     xargs md5sum 2>/dev/null | sed 's/ $/\n/g' | \
 11     sed 's/  /;/g' | sort | uniq -w32 -D)
 12
 13 last_hash=""
 14 # Word splitting is intentional here: one "hash;filename" entry per word
 15 for file in $all_duplicate
 16 do
 17     cur_hash=$(echo "$file" | cut -d ";" -f1)
 18     if [ "$cur_hash" = "$last_hash" ]
 19     then
 20         echo "$file" | cut -d ";" -f2-
 21     fi
 22     last_hash=$cur_hash
 23 done

Lines 8-11 produce a sorted list of "hash;filename" entries for every file whose MD5 hash occurs more than once. Since we only want to locate the redundant copies, further processing is needed. In the second phase we iterate over that sorted list and print each file name whose predecessor has the same hash value, leaving exactly one file name unprinted within each group of duplicates.
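To see the second phase in isolation, here is a minimal sketch that runs the same "compare with the previous hash" idea on hand-written input. The function name print_redundant, the pr_ variable names and the sample hashes are made up for illustration and are not part of the script above.

```shell
# Hypothetical helper: reads sorted "hash;filename" lines on stdin and
# prints every file name whose predecessor carries the same hash.
print_redundant() {
    pr_last=""
    while IFS=';' read -r pr_hash pr_name
    do
        # Skip empty lines so they never match the initial empty hash
        if [ -n "$pr_hash" ] && [ "$pr_hash" = "$pr_last" ]
        then
            printf '%s\n' "$pr_name"
        fi
        pr_last=$pr_hash
    done
}

# Example call (the first file of each group stays unprinted):
#   printf '%s\n' 'aaa;./a.txt' 'aaa;./b.txt' 'bbb;./c.txt' | print_redundant
```

Because the input is sorted by hash, remembering only the previous hash is enough to tell whether the current file is a redundant copy.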

This script ain't perfect. For example, it will not work on file names that contain whitespace, because both xargs and the for loop split their input on it... anyway, who uses whitespace to name files? :-)
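For anyone who does have spaces in file names, one possible workaround is sketched below. It is a rough variant, not the script above: it assumes GNU md5sum and GNU uniq (for -w and -D), assumes file names contain no newlines or backslashes, and drops the extension filter for brevity. The function name find_redundant and the fr_ variable names are made up for illustration.

```shell
# Hypothetical whitespace-tolerant variant of the duplicate finder.
# Assumes GNU md5sum / GNU uniq and file names without newlines or backslashes.
find_redundant() {
    fr_last=""
    # find hands the file names straight to md5sum, so nothing gets re-split
    find "$1" -type f -exec md5sum {} + 2>/dev/null |
        sort |
        uniq -w32 -D |
        while IFS= read -r fr_line
        do
            fr_hash=${fr_line%% *}    # text before the first space: the hash
            fr_name=${fr_line#*  }    # text after the two-space separator
            if [ "$fr_hash" = "$fr_last" ]
            then
                printf '%s\n' "$fr_name"
            fi
            fr_last=$fr_hash
        done
}

# Example call:
#   find_redundant /path/to/extracts
```

Reading whole lines and cutting them apart with parameter expansion keeps spaces inside the file name intact, which the for loop over a word-split string cannot do.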

Feel free to correct/modify/share this code!

1 comment:

  1. In this situation I used DuplicateFilesDeleter to great effect. It searches for two or more duplicate files in one or more selected search paths and removes them.
