
Thursday, August 30, 2012

Extracting data from Excel spreadsheets (xls) using xlrd

So what do you usually say when somebody asks you to extract some data from an Excel spreadsheet for additional processing? I believe something like: "--beep--, can't you convert it to CSV?" (censored). Usually converting the spreadsheet to CSV is THE way to go.

Unfortunately, sometimes such spreadsheets have to stay up to date and people keep modifying them. And when formulas are involved, keeping the spreadsheet is a must (CSV preserves only the final values).

I had such a situation in one of my past projects: clients uploaded XLS files via a web form, and a python script had to decode the appropriate evaluated cells and load them into the database. There are many platform-dependent (windows) packages for this purpose that require some M$ Office DLLs to be present, but since I didn't want to deploy this application on windows, I started looking for a cross-platform solution. I decided to use xlrd, a cross-platform tool compatible with python 2.7. You can get it using pip:

~ $ sudo pip install xlrd

Now for an example:

import xlrd

#open the spreadsheet
workbook = xlrd.open_workbook("test.xls")

#select the first sheet
sheet = workbook.sheet_by_index(0)

#get data from column C (2), row 10 (9)
#note: cell indices are zero-based
data = sheet.cell(colx=2,rowx=9).value
print "Cell C,10: %s" % data

It's quite easy; however, it's best to implement some mapping function so you don't have to evaluate the column number for every letter combination by hand (cell AXC, 19221 could be quite tricky ;-))
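Such a mapping is easy to sketch. Here is a minimal helper (my own, not part of xlrd) that converts a column name into the zero-based index that sheet.cell() expects:

```python
def col_index(letters):
    """Convert a spreadsheet column name ('A', 'C', 'AXC') to a zero-based index."""
    index = 0
    for ch in letters.upper():
        # base-26 positional system where 'A' is 1 and 'Z' is 26
        index = index * 26 + (ord(ch) - ord('A') + 1)
    return index - 1
```

With this, sheet.cell(colx=col_index('C'), rowx=9).value reads cell C10 without counting letters by hand.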


Passing django variables to javascript via templates

I can fairly say that reasonable web applications can't exist without a bit of javascript. We all remember those static web pages that were popular in the late '90s. Well, their time is up: a professional website's frontend is built using HTML/XHTML/HTML5 + CSS + JS (sometimes Flash).

AJAX requests or newer pull/push methods enable some dynamic connectivity between frontends and backends. It is also possible to pass some initialization data this way; however, if your website architecture is not real-time client-server (e.g. WebSockets), this may be inefficient.

A good practice is initializing JS variables while rendering the page, so additional initialization requests are not required. A simple way of implementing this mechanism is passing JSON-encoded data to the template and decoding it in JavaScript. Let's implement the appropriate django view:

import json
from django.shortcuts import render_to_response

def init_js_template(request):
    json_data = json.dumps({'test': '12345', 'arr': [1, 2, 3]})
    return render_to_response('init_json.html', {'json_data': json_data})

And the corresponding javascript initialization in the init_json.html template:
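The original snippet seems to have been lost here; below is a minimal sketch of what init_json.html could contain, using django's escapejs filter so the JSON survives embedding in a string literal and JSON.parse to decode it on the client:

```html
<script type="text/javascript">
    var init_data = JSON.parse('{{ json_data|escapejs }}');
    // init_data.test and init_data.arr are now available to your scripts
</script>
```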

This way you will save some resources by reducing the number of AJAX calls. Just keep in mind that this will only work for arrays/dictionaries containing primitive types or other arrays.
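If you do need to pass something beyond primitives, json.dumps accepts a default callback for objects it can't serialize natively. A minimal sketch (stringifying is just one possible policy):

```python
import json
import datetime

def safe_json(data):
    # fall back to str() for anything the encoder can't handle natively,
    # e.g. datetime objects
    return json.dumps(data, default=str)

safe_json({'test': '12345', 'when': datetime.datetime(2012, 8, 30)})
```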


Sorting digital photos according to the real creation time using EXIF tags.

Recently I was on holiday with my friends. Since there were a lot of thrilling places to see, almost everybody took a camera. Finally, after 8 days of intensive hiking, we had over 900 photos and wanted to make a chronological album.

The first problem that occurred to us was that the global image namespace was not continuous (5 different cameras / cellphones able to take photos), so sorting images by their generated names did not solve it.

Next we tried sorting the images by the file modification date. This was a reasonable approach, but there were a few photos which had been rotated after fetching them from the memory card, making their modification date inadequate for the mentioned problem. Furthermore, 2 cameras were configured with an incorrect date / time value.

A few years ago I was implementing an EXIF tag viewer for images, and I remembered that there are tags that store the original image creation time (and quite a few other parameters). I did some research and located a library with python bindings that handles these tags well: pyexiv2.

You may install it using apt-get:

~ $ sudo apt-get install python-pyexiv2

Next I implemented a simple script that updates file names so they can be sorted lexicographically according to the real creation time. Have a look at the following code:

import pyexiv2
import sys
import os
import shutil
import datetime
from dateutil.relativedelta import relativedelta

class ExifSort(object):
    @classmethod
    def _test_extension(cls, filename):
        return filename.split(".")[-1].lower() in ['jpg', 'jpeg', 'png']

    @classmethod
    def sort_images(cls, path, prefix=None):
        files = []
        for file in os.listdir(path):
            if not cls._test_extension(file):
                continue
            metadata = pyexiv2.ImageMetadata(os.path.join(path, file))
            metadata.read()
            try:
                d = metadata['Exif.Photo.DateTimeOriginal'].value
            except KeyError:
                continue
            # corrections for the two cameras with misconfigured clocks
            # (offsets evaluated empirically)
            make = metadata['Exif.Image.Make'].value
            if make == 'Canon':
                pass  # offset for this camera omitted here
            elif make == 'FUJIFILM':
                d += relativedelta(years=1)
                d -= datetime.timedelta(hours=39)
            files.append((file, d))

        # sort by the original creation time read from EXIF
        files.sort(key=lambda x: x[1])
        mag = len(str(len(files)))
        counter = 1
        tmp_prefix = "tmp_"
        # rename in two passes so new names can't clobber the originals
        for data in files:
            shutil.move(os.path.join(path, data[0]),
                        os.path.join(path, tmp_prefix + data[0]))
        for data in files:
            ext = data[0].split(".")[-1]
            shutil.move(os.path.join(path, tmp_prefix + data[0]),
                        os.path.join(path, "%s%s.%s" % (prefix or "",
                            str(counter).rjust(mag, '0'), ext)))
            counter += 1

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print "Usage: %s [directory]" % sys.argv[0]
        sys.exit(1)
    ExifSort.sort_images(sys.argv[1])

This script iterates over all files in the given directory (sys.argv[1]), filters jpg/png files, reads their EXIF creation date and stores (filename, creation datetime) pairs in a list. Finally the list is sorted by the real creation datetime, and the images are renamed (with or without a specific prefix).

The camera-make checks in sort_images() are a specific exception: we had 2 cameras with an incorrect date/time, so the value read from EXIF has to be corrected (offsets evaluated empirically) before it can be compared with photos from the other cameras.


Friday, August 17, 2012

A quick peek into new django 1.4 features

It's been a few months since the stable release of django 1.4 was published. Many of my projects were based on django 1.3, and some even on 1.1. Since they worked, there was no sense in migrating to the latest version, but recently I started a new project and decided to try out the latest stable release.

The first visible difference is the directory structure. If you, for example, execute:

~ $ django-admin.py startproject test

the following project directory will be created:

test/
    manage.py
    test/
        __init__.py
        settings.py
        urls.py
        wsgi.py

You may now add your apps under the test/test or test directories. This affects the package namespace, which is a nice improvement since it allows you to share your modules not only between your applications but also with other django projects.

Now let's move on to some other major improvements that have a greater impact on the process of building/managing web applications.

The first feature that significantly improves the quality of your code is the ability to implement in-browser test cases. In previous versions you had to set up curl or write dedicated scripts that simulated user behavior. Django 1.4 integrates with selenium - one of the most popular browser automation systems, designed for website testing. Yeah, no more situations where somebody actually forgot to commit a template to the production repository.
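Such a test can be sketched with django 1.4's LiveServerTestCase (the /login/ URL and page title below are hypothetical, and selenium must be installed):

```python
from django.test import LiveServerTestCase
from selenium.webdriver.firefox.webdriver import WebDriver

class FrontendTest(LiveServerTestCase):
    @classmethod
    def setUpClass(cls):
        cls.selenium = WebDriver()
        super(FrontendTest, cls).setUpClass()

    @classmethod
    def tearDownClass(cls):
        cls.selenium.quit()
        super(FrontendTest, cls).tearDownClass()

    def test_login_page_renders(self):
        # drives a real browser against a live test server
        self.selenium.get('%s%s' % (self.live_server_url, '/login/'))
        self.assertIn('Login', self.selenium.title)
```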

Another important feature improves the ORM: as of django 1.4, QuerySets have a select_for_update() method. Now what does it do? I'm speaking rhetorically - of course it implements the SQL SELECT ... FOR UPDATE, which locks the selected rows to prevent other transactions from modifying them. This greatly improves transactional processing.
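The pattern looks roughly like this - a sketch with a hypothetical Account model, wrapped in commit_on_success so the lock is held until the whole operation commits:

```python
from django.db import transaction

@transaction.commit_on_success
def withdraw(account_id, amount):
    # SELECT ... FOR UPDATE: the row stays locked until
    # the surrounding transaction commits
    account = Account.objects.select_for_update().get(pk=account_id)
    account.balance -= amount
    account.save()
```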

More features can be found on the official release page.

As for requirements, the first thing I need to mention is that django 1.4 drops support for python 2.4. Most projects nowadays run on python 2.6 or 2.7, so this shouldn't be a problem. Support for python 3.X is not included, and it probably won't be in the next release either. So for now I guess python 2.7 and django 1.4 is the strongest choice.


Tuesday, August 7, 2012

Creating multiple isolated python environments on a single OS using virtualenv

If you're doing a lot of coding in python, you've probably come to a situation where you had to maintain many projects concurrently. In such situations it is hard to provide the appropriate dependent library versions for each project (especially if those different versions of a particular library lack a compatible API, just like python 2.X and 3.X).

A naive solution would be creating a separate user profile (or using a virtual machine) for each maintained project. Ok, seriously - don't even consider it.

If you want to have multiple python development/deployment environments, you should have a place to store each appropriate set of libraries and interpreters. Secondly, you should implement some scripts that run your project with the right environment variables (using the right interpreter).

This approach is better, but it sounds like a lot of work. Happily, someone has already done the job for you. All you need to do is install virtualenv:

~ $ sudo pip install virtualenv

Now you can configure your isolated python environment:

~ $ virtualenv my_env --quiet --no-site-packages && find my_env -maxdepth 2


This is a fully functional python environment; in order to activate it you have to load it with the source command:

~ $ source my_env/bin/activate

Your command prompt should now look similar to this:

(my_env) ~ $

The environment has now been activated. You may now install new libraries using pip or easy_install without root privileges; all files will be installed in the appropriate directory under my_env. Running the python command will run the interpreter from the currently active virtual environment.

When you're done having fun and want to return to your system environment, just run:

(my_env) ~ $ deactivate 
~ $ 

You can have as many virtual environments as you like; the current version even supports moving/copying an environment.
