
Wednesday, May 30, 2012

Heterogeneous system administration issues

I know heterogeneous environments have become popular lately, but hey - let's talk about the drawbacks of such systems. So, for example, let us visualise a process that depends on Windows, Linux and Mac OS running on three separate machines.

First of all, each of these machines has to be properly configured (services, security, performance). System administrators are usually committed to a specific platform, so setting up the other configurations is more time consuming.

For example, it took me some time to get familiar with Darwin's launchctl service management framework. All I wanted to do was run my task periodically... cron is a great, simple tool capable of achieving that... however, all the other services/tasks were configured via launchctl, and for me, writing a huge XML configuration file instead of a simple crontab entry is overkill.
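To illustrate the difference: a cron job that runs every ten minutes is a single line like `*/10 * * * * /usr/local/bin/mytask.sh`, while the launchd equivalent is a whole property list file (the label and script path below are made up for the example):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <!-- unique job identifier -->
    <key>Label</key>
    <string>com.example.mytask</string>
    <!-- the program to run, argv-style -->
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/mytask.sh</string>
    </array>
    <!-- run every 600 seconds -->
    <key>StartInterval</key>
    <integer>600</integer>
</dict>
</plist>
```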

But anyway, let us presuppose that this Mac shares its HDD as a network drive available via SFTP and AFP. This hard drive contains data that needs to be processed by a dedicated piece of software that runs only on Windows... in case you found a fast way of solving this problem before I actually stated it: no, you can't change that piece of software.

So the user responsible for processing the data mounts the SFTP share and proceeds with his task. However, the user also needs access to his own resources in another location, which requires providing different credentials. It seems that Windows 7 is not capable of keeping multiple SFTP sessions with a single machine (in fact it caches the first one, making it difficult to log in again). Since we don't want to restart Windows every ten minutes (to clear the system cache), let's keep Windows on a virtual machine that may be accessed via rdesktop.

We execute a Python script responsible for preprocessing the data and starting the OS-specific processing software:

import os

# some data preprocessing
os.system('''<some operations> &&
             <a full path to the `specific software`.exe>
             <many parameters> %s''' % a_long_list_of_arguments)


And what do we get:

> The input line is too long

I managed to google out that the maximum command prompt line length on Windows varies from 2047 to 8191 characters, depending on the OS version. Now this is hilarious...
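As a side note (an assumption on my part, not what I ended up doing): from Python you can sidestep cmd.exe entirely by calling subprocess with an argument list, which goes straight to CreateProcess (whose limit is around 32K characters rather than 2047/8191). A sketch with a made-up argument list:

```python
import subprocess
import sys

# A deliberately long argument list (~11000 characters) that would
# overflow cmd.exe's line limit if it went through os.system().
args = ["--item%04d" % i for i in range(1000)]

# Passing a list to subprocess bypasses the shell entirely; here we
# just ask a child Python process to count the arguments it received.
ret = subprocess.call([sys.executable, "-c",
                       "import sys; print(len(sys.argv))"] + args)
```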

After all these adventures I installed Cygwin (a tool that makes Windows resemble an operating system) on the virtual machine and configured the script to run under the Linux-like environment (it finally worked).

So next time, remember: think twice (bah, thrice) before you decide to build a sophisticated process on a heterogeneous environment :-)

~KR

Monday, May 28, 2012

Make websites recognise python/urllib as a webbrowser - (part 1: http headers)

Today I want to share some of my experience regarding web crawlers and web robots / bots. With access to high-level HTTP client libraries (like urllib/urllib2) and basic knowledge of the HTTP protocol, it is easy to implement such a program in a relatively short time. You don't have to respect robots.txt (though in most cases, especially regarding web crawling, you should!), but a web application may still detect that your script is not a web browser. What if you actually want to pretend that your script is a web browser?


Let's write a simple program that requests some resource on localhost via HTTP protocol:


import urllib2

req = urllib2.Request("http://127.0.0.1:3333/")
urllib2.urlopen(req)


Now make netcat listen on port 3333:

~ netcat -l 3333


... and execute your script. Netcat captures the following data (it may differ from your output, but will be similar in general):


GET / HTTP/1.1
Accept-Encoding: identity
Host: 127.0.0.1:3333
Connection: close
User-Agent: Python-urllib/2.7


Let's repeat this experiment, but instead of using the presented Python script, open a web browser and navigate to http://127.0.0.1:3333/ .


This time the output generated by netcat looks like this:

GET / HTTP/1.1
Host: 127.0.0.1:3333
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.24) Gecko/20111107 Linux Mint/9 (Isadora) Firefox/3.6.24
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: pl,en;q=0.7,en-us;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-2,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive



The output may differ depending on your OS and browser version. Some headers may also be configured manually in your browser (accepted charsets, language etc.). The header usually used to identify the requesting client (browser, bot, crawler) is User-Agent. A way to cheat web applications into thinking that your script is a browser is to provide a User-Agent header extracted from a real web browser. You can achieve this as follows:



import urllib2

headers = {
    "User-Agent": "Mozilla/5.0 (X11; U; Linux i686; "
                  "en-US; rv:1.9.2.24) Gecko/20111107 "
                  "Linux Mint/9 (Isadora) Firefox/3.6.24",
}

req = urllib2.Request("http://127.0.0.1:3333/", headers=headers)
urllib2.urlopen(req)

You may also include other headers - the more the better (so the request looks just like one from Firefox). However, keep in mind that some headers affect the received content: for example, if you specify the Accept-Encoding header as Accept-Encoding: gzip,deflate, you shouldn't be surprised when the server serves you gzip-compressed data. More information about HTTP headers can be found in RFC 2616.
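For example, the dict below mirrors the Firefox headers captured earlier (minus Accept-Encoding, so the response stays uncompressed); it can be passed as the headers argument to urllib2.Request just like in the previous snippet:

```python
# Headers copied from the netcat dump above; Accept-Encoding is omitted
# on purpose so the server does not reply with gzip-compressed data.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.24) "
                  "Gecko/20111107 Linux Mint/9 (Isadora) Firefox/3.6.24",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "pl,en;q=0.7,en-us;q=0.3",
    "Accept-Charset": "ISO-8859-2,utf-8;q=0.7,*;q=0.7",
}
```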

This is just one trick, stay tuned - there will be more.

~KR

Tuesday, May 22, 2012

LastPy - a simple last.fm scrobbler

Anybody who loves music must have heard about a portal called last.fm. There are many social-music portals, but last.fm is a bit different - the music you listen to is not entered via a 'my favourite genres' form; you submit your music preferences by actually listening to music (plugins for music players, embedded radio stations etc.).

It's cool - most popular music players support last.fm. However, problems may occur while you're listening to Internet radio stations via a browser (i.e. using Flash plugins). I really loved a radio station called Epic Rock Radio, and I always wondered how to scrobble the songs I was currently listening to.

When I accidentally lost all my music files I was encouraged (besides recovering music from my iPad) to implement a scrobbler that could handle online radio stations.

First I did some research: I searched the last.fm discussion groups and developer guides, and I found just what I was looking for: the Audioscrobbler Realtime Submission Protocol (the v1.2 specification is available here).


It turned out that last.fm has a simple HTTP API for scrobbling songs. A simplified session may be presented as follows:

1. HANDSHAKE / AUTHENTICATION
2. "Now Playing"
3. SUBMIT SONG

Where steps 2-3 may be repeated (in any order) as long as the session key (obtained via the HANDSHAKE) is still valid.
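As a rough sketch of step 1 (assuming the field names from the v1.2 spec; "tst" is the spec's generic test client id), the handshake parameters can be built like this - the authentication token is md5(md5(password) + timestamp):

```python
import hashlib
import time

def handshake_params(user, password, now=None):
    """Build the GET parameters for the Audioscrobbler 1.2 handshake."""
    ts = str(int(time.time() if now is None else now))
    # auth token = md5(md5(password) + timestamp), both hex-encoded
    pw_hash = hashlib.md5(password.encode("utf-8")).hexdigest()
    token = hashlib.md5((pw_hash + ts).encode("utf-8")).hexdigest()
    return {"hs": "true", "p": "1.2", "c": "tst", "v": "1.0",
            "u": user, "t": ts, "a": token}
```

On a successful handshake the server replies with a few lines of plain text: an OK status, the session key, and the URLs to use for the "now playing" and submission requests.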

After I had a working audioscrobbler it was much easier to implement one for the mentioned radio station (the data is available in XML from live365.com) :-)

You can find my prototype implementation on github: LastPy

P.S. I wrote it a few years ago, it ain't clean and tidy, but it works ;-)
 

Wednesday, May 16, 2012

Interactive debugging in Django

A long time ago in a galaxy far, far away... programmers were debugging PHP applications using the echo function. Well, it actually wasn't that long ago, nor was the galaxy far away - we're talking about the late '90s; planet Earth.

Nowadays it is rather unthinkable: time is money, and spending whole days placing and removing echos is expensive (not to mention ineffective). So how do you efficiently debug Django views?

The first thing you need to do is set the following variable in settings.py:
   
   DEBUG = True
 
This option is enabled by default in a freshly generated project. If it is set, each time a view returns an HTTP 500 Internal Server Error, a debug view will be presented. It contains your current settings, the request parameters and finally a stack trace - usually this is enough to solve most problems.

If that is not enough, you may want to try the second option: pdb, the standard Python debugger. To insert a breakpoint, put the following line in your code and start the application via runserver:

    import pdb ; pdb.set_trace()

If you master the shortcut commands, this is a very pleasant tool. You can also check out the interactive version, ipdb (requires ipython).
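For illustration, here is a toy (non-Django) function with a breakpoint - the function and its numbers are made up; run it from a script and you land at the (Pdb) prompt just before the return:

```python
import pdb

def total(prices, tax=0.23):
    # a hypothetical computation worth inspecting
    subtotal = sum(prices)
    pdb.set_trace()  # execution pauses here; try 'p subtotal', then 'c'
    return subtotal * (1 + tax)
```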

However, if you do not feel like debugging in the console, or you want access to the whole stack trace without inserting hundreds of breakpoints - Werkzeug is the tool for you! It provides an awesome interactive in-browser debugger. You can get it with pip, along with django-extensions (which provides the runserver_plus command):

~ pip install django-extensions
~ pip install werkzeug

Now, after adding 'django_extensions' to INSTALLED_APPS, instead of using the runserver command you use the following:

~ ./manage.py runserver_plus

After this, each time you encounter an exception, the debug view appears... however, this is no ordinary Django debug view: it contains an in-browser debugger (like pdb) capable of stopping at every frame of the stack trace. This is just great - now if only I could integrate vim with Firefox as well... *kidding* :-)

~KR

Thursday, May 10, 2012

Dropping a multi-column unique constraint in MySQL with Django South

South is a great tool for managing database migrations (compatible with Django). It generates migrations by analysing the difference between the current data model and the previous one (stored in migration script files). However, strange things may occur if you try to drop a multi-column unique constraint (the Django model defines it as unique_together). For example, we have:


  1 class MyModel(models.Model):
  2     class Meta:
  3         app_label = 'myapp'
  4         unique_together = (('field1', 'field2',),)

We remove line 4, and run schemamigration:

~ ./manage.py schemamigration myapp --auto

South generates the following forward migration:

class Migration(SchemaMigration):
    def forwards(self, orm):
        db.delete_unique('myapp_mymodel', ['field1', 'field2'])

Let's try to execute it:

~ ./manage.py migrate myapp

And we get something like this:

ValueError: Cannot find a UNIQUE constraint on table myapp_mymodel, columns ['field1', 'field2']

We can investigate it using the mysql client:

mysql> SHOW CREATE TABLE myapp_mymodel;
(...)
UNIQUE KEY `myapp_mymodel_field1_667dc28f4f7b310_uniq` (`field1`,`field2`),
(...)

So the unique constraint really exists, but South fails to drop it. To solve this problem we have to drop the index ourselves. We should modify the forward migration in the following way:

class Migration(SchemaMigration):
    def forwards(self, orm):
        from south.db import db
        db.execute('DROP INDEX '
                   'myapp_mymodel_field1_667dc28f4f7b310_uniq ON myapp_mymodel;')


That does the trick! Farewell multi-column unique constraints! 

~KR