
Wednesday, March 28, 2012

Find duplicate (redundant) files: bash / linux

Recently I was implementing MT940-extract parsers for a variety of banks. Thousands of files, each containing hundreds of entries. Sometimes the entries had unique identification numbers... sometimes they did not.

Problems occurred when, due to some random events, the extract storage started to contain duplicate files. As a result many redundant entries were loaded by the parser (this had major consequences on the whole processing).

I have implemented a few mechanisms to prevent this situation. One of them is a Linux shell script that locates duplicate files in a selected directory and all of its subdirectories (unlimited depth):

  1 if [ -z "$1" ]
  2 then
  3     echo "This script finds duplicate files in the selected directory"
  4     echo "Usage: $0 <base dir>"
  5     exit 1
  6 fi
  8 all_duplicate=$(find "$1" -type f | \
  9     egrep "\.[a-zA-Z0-9]+$" | \
 10     xargs md5sum 2>/dev/null | sed 's/ $/\n/g' | \
 11     sed 's/  /;/g' | sort | uniq  -w32 -D)
 13 last_hash=""
 15 for file in $all_duplicate
 16 do
 17     cur_hash=$(echo "$file" | cut -d ";" -f1)
 18     if [ "$cur_hash" = "$last_hash" ]
 19     then
 20         echo "$file" | cut -d ";" -f2
 21     fi
 22     last_hash=$cur_hash
 23 done

So lines 8-11 produce a list of all duplicate files, sorted by hash. Since we only want to locate the redundant files, further processing is needed. In the second phase we iterate over the sorted "hash;filename" entries and print every file name whose predecessor has the same hash value. Within each group of duplicates this leaves exactly one file name unprinted.

This script ain't perfect. For example, it will not work on file names that contain whitespace, because both xargs and the for loop split their input on it... anyway, who uses whitespace to name files? :-)
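For file names with whitespace, the same two-phase idea can be sketched in Python. This is only a rough equivalent, not a port of the script above: the function names are made up for this example, it does not filter by file extension, and it keeps the first path (in sorted order) of every group of duplicates.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_redundant_files(base_dir):
    """Return the paths that repeat the content of another file.

    Phase one groups all files below base_dir by their MD5 hash.
    Phase two keeps the first path (sorted order) of each group and
    reports the rest as redundant.
    """
    by_hash = defaultdict(list)
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            path = os.path.join(root, name)
            by_hash[file_md5(path)].append(path)

    redundant = []
    for paths in by_hash.values():
        redundant.extend(sorted(paths)[1:])
    return sorted(redundant)
```

Calling find_redundant_files("some/dir") returns a plain list of paths, so spaces in names are no longer a problem.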

Feel free to correct/modify/share this code!

Monday, March 26, 2012

iPod music recovery based on ID3 tags

I'm not really a fan of 'Apple stuff', but a few years ago I acquired an iPod (5G). The process of uploading music files using iTunes wasn't exactly pleasant, but eventually the music was stored on the device.

The problem I would like to mention occurred when I unrecoverably lost all music from my hard drive. I was disappointed, but hey! I had a backup on my iPod...

When I scanned the device for mp3 files I discovered that the file names had not been preserved in the upload process:

~ find  [ipod mount point] | grep mp3$


As it turned out, iTunes does not support fetching the music back from the iPod. Anyone who wants to do that has to purchase a 3rd-party application. Since I was a student I did not intend to pay for it... so I wrote my own piece of software that helped me out. I came upon this code recently and decided to share it.

The cleaned-up and recently tuned version is available at:

I don't claim it is going to work with every iPod there is, but the general approach should prove usable.
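To give an idea of what "based on ID3 tags" can mean, here is a minimal sketch that is not the code mentioned above. It assumes the files carry an ID3v1 trailer (the last 128 bytes of the file, starting with "TAG"); files tagged only with ID3v2 would need a proper tag library. The function names and the "Artist - Album - Title.mp3" naming scheme are assumptions made for this example.

```python
import os
import re


def read_id3v1(path):
    """Return (artist, album, title) from an ID3v1 trailer, or None if absent."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        if handle.tell() < 128:
            return None
        handle.seek(-128, os.SEEK_END)
        tag = handle.read(128)
    if tag[:3] != b"TAG":
        return None

    def field(start, length):
        # Fields are fixed-width and padded with NUL bytes or spaces.
        raw = tag[start:start + length].split(b"\x00", 1)[0]
        return raw.decode("latin-1").strip()

    title, artist, album = field(3, 30), field(33, 30), field(63, 30)
    return artist, album, title


def recovered_name(path):
    """Build 'Artist - Album - Title.mp3', or None when no usable tag exists."""
    tags = read_id3v1(path)
    if tags is None or not tags[2]:
        return None
    # Replace characters that are awkward in file names.
    parts = [re.sub(r'[\\/:*?"<>|]', "_", part) for part in tags if part]
    return " - ".join(parts) + ".mp3"
```

A recovery run would then walk the iPod mount point, call recovered_name() for each mp3 file and copy the file to its new name.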

Wednesday, March 21, 2012

A django login_required decorator that preserves URI query parameters

The default django auth module provides some awesome function decorators: login_required and permission_required. These decorators provide a flexible authorisation mechanism. Each time a user tries to access a resource he is not permitted to view (or modify, or do anything else with), he is redirected to the login page. There the user has a chance to authenticate himself or to provide new credentials (in case he was already authenticated but lacked permissions), and if the authentication completes successfully the resource may be accessed.

In most cases using these decorators solves the problem of protecting resources while keeping the code (and the user interface) clean.

A problem occurs when we try to access a protected resource with some URI query parameters attached. We get redirected to the login page, and after providing our credentials we get redirected back to ... and the query parameters are gone!

Sadly the default login_required decorator does not preserve them... we have to provide our own decorator:

   from django.http import HttpResponseRedirect

1. def resource_login_required(some_view):
2.     def wrapper(request, *args, **kw):
3.         if not request.user.is_authenticated():
4.             params = map(lambda x: "%s=%s&" % \
                  (x[0],x[1]), request.GET.items())
5.             return HttpResponseRedirect( \
                  "/login/?%snext=%s" % \
                  ("".join(params), request.path))
6.         else:
7.             return some_view(request, *args, **kw)
8.     return wrapper

Line 4 is the key instruction: the lambda expression maps the parameter key-value pairs to their URI query-string representation. Next we concatenate the parameters with the original request path, and that is it. After a successful login we should be redirected to the requested resource along with the request query parameters.

This decorator may be used just like the previously mentioned ones:

1.  @resource_login_required
2.  def my_view(request):
3.      #your view code
4.      pass

Feel free to adjust this decorator to your needs. 
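One possible adjustment is to URL-encode the values. The sketch below uses only the Python 3 standard library and takes a slightly different route than the decorator above: instead of passing the parameters next to next, it packs the whole original URL (path plus query) into the next parameter. The function name login_redirect_url and the default "/login/" address are assumptions made for this example.

```python
from urllib.parse import urlencode


def login_redirect_url(path, query_items, login_url="/login/"):
    """Build a login URL whose `next` parameter keeps the original query."""
    target = path
    query = urlencode(list(query_items))
    if query:
        target = "%s?%s" % (path, query)
    # urlencode percent-escapes '/', '?', '&' and '=' inside `next`, so the
    # login page receives the whole original URL as one parameter value.
    return "%s?%s" % (login_url, urlencode({"next": target}))
```

Inside a decorator this could be called with request.path and request.GET.items(), and the result handed to HttpResponseRedirect.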

Saturday, March 17, 2012

Cross-site ajax requests and security issues

Today I would like to share some of my experience regarding AJAX.

Ajax is commonly used to asynchronously fetch data from web applications deployed on the same domain (the user is not forced to reload the whole page while awaiting new data). So what happens when the target URI domain differs from the domain on which the ajax script is running (cross-site)?

Before sending such a request the web browser first makes a preliminary (preflight) request using the OPTIONS method, with the following headers set:

X-Requested-With: XMLHttpRequest

The browser informs the requested resource about the type of the request (Ajax) and its origin (sent in the Origin header). If the target site accepts such requests, it should attach the following response headers:

Access-Control-Allow-Origin: [requesting site origin]
Access-Control-Allow-Headers: X-Requested-With
Access-Control-Allow-Methods: GET, OPTIONS
Access-Control-Max-Age: 3600

The first three headers determine the resource accessibility for certain requests: in this example only GET / OPTIONS Ajax requests from the allowed origin may be performed. The fourth header states how long (in seconds) the browser may cache this response. So if the intended AJAX request meets these requirements, the browser sends a follow-up request using the appropriate method and processes the response.

If the response headers do not indicate that the request is allowed, or there are no such headers at all (not all of them are obligatory), the second request is not performed. This security mechanism is implemented in all commonly used browsers... let's face it, without it AJAX would be a very dangerous technology...

Summing up: if you ever want to enable cross-site ajax requests for your site you must configure the Access-Control-Allow-* policy on the target server.

Tip: If you want your server to accept such requests from all sites, you may use the following policy:

Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, POST, DELETE, HEAD, OPTIONS        
Access-Control-Max-Age: 1209600
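How a server might pick these headers can be sketched as a plain Python function. This is only an illustration under a few assumptions: the function name, the shape of the policy dictionary and the origin https://app.example are invented for this example, and a real server would wire the result into its own response object.

```python
# Hypothetical policy: one allowed origin, the methods and header from the post.
POLICY = {
    "origins": ["https://app.example"],
    "methods": ["GET", "OPTIONS"],
    "headers": ["X-Requested-With"],
    "max_age": 3600,
}


def preflight_response_headers(origin, method, requested_headers, policy):
    """Return the Access-Control-* headers for a preflight, or {} to refuse."""
    allowed_origins = policy["origins"]
    if "*" not in allowed_origins and origin not in allowed_origins:
        return {}
    if method.upper() not in policy["methods"]:
        return {}
    # Header names are case-insensitive, so compare them in lower case.
    allowed_headers = {name.lower() for name in policy["headers"]}
    if any(name.lower() not in allowed_headers for name in requested_headers):
        return {}
    return {
        "Access-Control-Allow-Origin": "*" if "*" in allowed_origins else origin,
        "Access-Control-Allow-Methods": ", ".join(policy["methods"]),
        "Access-Control-Allow-Headers": ", ".join(policy["headers"]),
        "Access-Control-Max-Age": str(policy["max_age"]),
    }
```

An empty result means no Access-Control-* headers are attached, so the browser will not perform the second request.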

Wednesday, March 7, 2012

Logging mercurial (hg) update/merge history

Greetings fellow readers!

I finally managed to get my blog running... and no, this ain't another fashion blog... and this ain't another blog about cooking... Is it about voyages? Nice try... but it's not. This blog is about old-school programming :-)

Considering that this is my first post I'll start with something simple, yet very helpful -- a hook for logging the hg update and hg merge commands.

Why should this feature be helpful? Imagine a production server dependent on a large code repository (many programmers, many branches, you can hardly commit any changes without merging changesets). Having an update/merge history on such a server could save a lot of time and nerves when something goes wrong -- we have a list of previous stable versions.

In order to log the mentioned activities you have to insert two lines in the [hooks] section of the .hgrc file from your home directory (or .hg/hgrc in your project directory):

~ cat ~/.hgrc

[hooks]
preupdate.pre_up = echo $(date +%D\ %T) "update [$(hg branch)]" $(hg id -i) >> .hg_update.log
update.post_up =  if [ $HG_ERROR -eq 0 ] ; then echo "$(date +%D\ %T) success: [$(hg branch)]" $HG_PARENT1 $HG_PARENT2; else echo "Errors occurred" ; fi >> .hg_update.log

These hooks log a message containing the current date, time, branch and changeset before and after the update/merge is performed. The history is stored in .hg_update.log (current directory). And a test:

~ hg update -C new_feature
~ cat .hg_update.log

03/07/12 22:22:12 update [default] 19478f194c18
03/07/12 22:22:12 success: [new_feature] 2c050459a85f

You can now easily revert those changes by executing:

~ hg update -C  19478f194c18
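Once the log grows, the list of previous versions can be pulled out with a few lines of Python. This is just a sketch that assumes the exact line format shown above; the function name previous_revisions is made up for this example.

```python
import re

# Matches lines such as "03/07/12 22:22:12 update [default] 19478f194c18".
LINE = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<event>update|success:) "
    r"\[(?P<branch>[^\]]*)\]\s*(?P<revs>.*)$"
)


def previous_revisions(log_lines):
    """Return (date, time, branch, revision) for every logged pre-update state."""
    history = []
    for line in log_lines:
        match = LINE.match(line.strip())
        if match and match.group("event") == "update":
            revs = match.group("revs").split()
            history.append((match.group("date"), match.group("time"),
                            match.group("branch"), revs[0] if revs else ""))
    return history
```

Feeding it the lines of .hg_update.log gives the changesets that were checked out before each update, newest last.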

Hope you find my solution useful.
