Recently I was implementing MT940 extract parsers for a variety of banks. Thousands of files, each containing hundreds of entries. Sometimes the entries had unique identification numbers... sometimes they did not.
Problems occurred when, due to various mishaps, the extract storage started to accumulate duplicate files. As a result, many redundant entries were loaded by the parser, which had serious consequences for the rest of the processing.
I have implemented a few mechanisms to prevent this situation; one of them is a Linux shell script that locates duplicate files in a selected directory (to unlimited depth):
1 if [ $# -ne 1 ]
2 then
3 echo "This script finds duplicate files in the selected directory"
4 echo "Usage: ./find_duplicate.sh <base dir>"
5 exit
6 fi
7
8 all_duplicate=$(find $1 | \
9 egrep "\.[a-zA-Z0-9]+$" | \
10 xargs md5sum 2>/dev/null | sed 's/  / /g' | \
11 sed 's/ /;/g' | sort | uniq -w32 -D)
12
13 last_hash=""
14
15 for file in $all_duplicate
16 do
17 cur_hash=$(echo $file | cut -d ";" -f1)
18 if [ "$cur_hash" = "$last_hash" ]
19 then
20 echo $(echo $file | cut -d ";" -f2)
21 fi
22 last_hash=$cur_hash
23 done
So lines 8-11 produce a list of all duplicate files as sorted "hash;filename" entries. Since we only want to locate the redundant copies, further processing is needed: in the second phase we iterate over the sorted entries and print every file name whose predecessor carries the same hash, so exactly one file name in each group of duplicates is left unprinted. A made-up example is shown below.
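For illustration, suppose the extract directory is ./extracts and it holds three copies of the same statement (the hash and file names here are made up). After lines 8-11 the all_duplicate variable would contain sorted "hash;filename" entries such as:

9a0364b9e99bb480dd25e1f0284c8555;./extracts/2010/stmt_001.sta
9a0364b9e99bb480dd25e1f0284c8555;./extracts/2010/stmt_001_copy.sta
9a0364b9e99bb480dd25e1f0284c8555;./extracts/backup/stmt_001.sta

The loop then prints only the second and third file names, so ./extracts/2010/stmt_001.sta stays unreported as the single copy to keep.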
This script ain't perfect; for example, it will not work on file names that contain whitespace... anyway, who uses whitespace in file names? :-) If you do, a whitespace-tolerant variant is sketched below.
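Here is one possible whitespace-tolerant variant of the same two-phase idea (just a sketch, not the original script; the extension filter from line 9 is left out for brevity). It feeds NUL-separated names to md5sum and reads the sorted output line by line instead of relying on word splitting:

#!/bin/bash
# Sketch: same algorithm, but safe for file names containing spaces.
# <base dir> is still the only argument.
if [ $# -ne 1 ]
then
    echo "Usage: ./find_duplicate.sh <base dir>"
    exit
fi

last_hash=""
find "$1" -type f -print0 | xargs -0 md5sum 2>/dev/null | sort | uniq -w32 -D | \
while IFS= read -r line
do
    cur_hash=${line:0:32}     # the md5 hash is the first 32 characters
    cur_file=${line:34}       # the name starts after the hash and two spaces
    if [ "$cur_hash" = "$last_hash" ]
    then
        echo "$cur_file"
    fi
    last_hash=$cur_hash
done

File names containing newlines would still confuse it, but ordinary spaces are handled.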
Feel free to correct/modify/share this code!