My thoughts on computers, programming, ancient spell scrolls and other magic devices...: June 2012

Thursday, June 28, 2012

Generating sounds using the Open Sound System audio interface

Recently I started implementing a tool for learning music scales. I've done some research and found a way of utilising /dev/dsp in python. If you don't know what that device file is responsible for, I'll give you a hint. Make sure you have your speakers on, and type in:

~ cat /dev/urandom > /dev/dsp

If the device ain't busy or otherwise locked you should hear a hum... it ain't bad for a random input stream. But with a little help from python and some basic knowledge about digital sound processing we can get much more than that.

Let's start with some basics. To generate a random hum (just like in the example above) using Python's ossaudiodev module you could try the following:

import ossaudiodev
import os

dsp = ossaudiodev.open('/dev/dsp', 'w')
dsp.write(os.urandom(5000))
dsp.close()

Let's move on to some signal processing theory. In order to generate a specific tone we should pass a discrete approximation of a sine function over the desired time range instead of a random array. In the presented source I will assume 44.1k samples per second (why is that? check the Nyquist-Shannon sampling theorem), a frequency of 440Hz (also known as A440, it serves as a general instrument tuning standard), and a tone duration of 5 seconds.

import ossaudiodev
import math
import struct

freq, sr, t = 440.0, 44100, 5.0
total_samples = int(sr * t)
# phase advance per sample, in radians
natural_freq = 2.0 * math.pi * freq / sr
# evaluate signal amplitudes for every sample; rounding the period to a
# whole number of samples (int(sr / freq)) would detune the tone to 441Hz
amp_data = [int(16000 * math.sin(natural_freq * n)) for n in range(total_samples)]
# pack as short, signed 16 bit little endian values
output_signal = struct.pack('<%dh' % total_samples, *amp_data)
dsp = ossaudiodev.open('/dev/dsp', 'w')
# signed 16 bit little endian coding, 1 channel, 44.1kHz
dsp.setparameters(ossaudiodev.AFMT_S16_LE, 1, sr)
dsp.write(output_signal)
dsp.close()

You can easily refactor this code to produce any tone you want (generating multi-tone sounds and effects is a bit more tricky). It may be a bit hard to get through without some background in digital signal processing, but with a working example in hand programmers can do magic.
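As a rough illustration of that refactoring, the sketch below wraps the sample generation in a function. The name tone_bytes and its parameters are made up for this example, and it assumes signed 16 bit little endian mono output (matching AFMT_S16_LE). It only builds the byte string; writing it to /dev/dsp is left out.

```python
import math
import struct


def tone_bytes(freq, duration, sample_rate=44100, amplitude=16000):
    """Return a sine tone as signed 16 bit little endian mono PCM bytes.

    Hypothetical helper for illustration, not part of ossaudiodev.
    """
    total_samples = int(sample_rate * duration)
    step = 2.0 * math.pi * freq / sample_rate  # phase advance per sample
    samples = [int(amplitude * math.sin(step * n)) for n in range(total_samples)]
    return struct.pack('<%dh' % total_samples, *samples)


# one second of A440 followed by one second of A880 (an octave higher)
signal = tone_bytes(440.0, 1.0) + tone_bytes(880.0, 1.0)
```

The resulting string can be handed to dsp.write() once the device has been configured for the same sample format and rate.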

A good way of optimising this code is using numpy arrays instead of python lists, because they support many matrix like transformations (no mapping function would be needed).

P.S. I'm still using the Mint 9 Isadora LTS version and I am happy to have a /dev/dsp device file. Ubuntu users are not so lucky: /dev/dsp has not been present in the kernel since v. 10.10... well... you can always try recompiling it :-)

~KR

Tuesday, June 26, 2012

Make websites recognise python/urllib as a web browser - (part 2: following links)

A key feature of every bot/crawler is the ability to follow hyperlinks and post data to the web application. If you don't implement additional features for such actions your bot is likely to get exposed. A typical bot implementation would look like this:

import urllib2

class MyBot(object):
    @classmethod
    def _get_default_headers(cls):
        return {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; '
                              'rv:1.9.2.24)'}

    def execute_request(self, uri, post_data=None, additional_headers=None):
        # dict.update() returns None, so build the headers in two steps
        headers = self._get_default_headers()
        headers.update(additional_headers or {})
        req = urllib2.Request(uri, post_data, headers)
        return urllib2.urlopen(req)

The MyBot class implements two methods: _get_default_headers, which provides the default HTTP headers, and execute_request, which executes a request.

The first thing that is quite suspicious is that all requests made by the bot are performed instantly. When you request a few resources one by one there will hardly be a pause between consecutive requests. It is almost impossible to achieve such timing using a regular browser (human action delay). The delay needs to be simulated by the script:

    def execute_request(self, uri, post_data=None, additional_headers=None):
        import time, random
        headers = self._get_default_headers()
        headers.update(additional_headers or {})
        req = urllib2.Request(uri, post_data, headers)
        # simulate the human delay: forms take longer than clicking a link
        if post_data:
            time.sleep(random.uniform(1.5, 5))
        else:
            time.sleep(random.uniform(1, 2.5))
        return urllib2.urlopen(req)

The improved execute_request method sleeps for a period of time before performing each request. The sleep time depends on the request method, since filling in web forms usually takes more time than clicking a hyperlink. In the presented code the sleep time is 1-2.5 seconds for GET requests and 1.5-5 seconds for POST requests.

Another thing that we have to be aware of is that a browser sets the Referer header when you follow a link to a URL. According to RFC 2616, section 14.36:

The Referer[sic] request-header field allows the client to specify, for the server's benefit, the address (URI) of the resource from which the Request-URI was obtained (...)

So all we have to do is remember the last URI. Let's modify the execute_request method in the following way:

    def execute_request(self, uri, post_data=None, additional_headers=None):
        import time, random
        # fall back to a default referer for the very first request
        referer = getattr(self, 'referer', None) or "http://google.com/"
        headers = self._get_default_headers()
        headers.update(additional_headers or {})
        headers['Referer'] = referer
        req = urllib2.Request(uri, post_data, headers)
        # the URI requested now becomes the referer of the next request
        self.referer = uri
        if post_data:
            time.sleep(random.uniform(1.5, 5))
        else:
            time.sleep(random.uniform(1, 2.5))
        return urllib2.urlopen(req)

The first lines of the method set the current Referer header; if no referer has been stored yet, the script uses http://google.com/ as a default. After generating the request object we store the URI as the referer for the next request.

With these two features we can make our script much harder for websites to detect. Just remember, if you're implementing a multi-threaded robot you should provide a MyBot instance for each thread, or the referer field will become a mess.
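The referer bookkeeping can be tried out without touching the network. The sketch below is an illustration under assumptions, not the MyBot class itself: RefererTracker and next_headers are hypothetical names, the example.com and example.org addresses are placeholders, and the code only builds the header dictionaries that would be passed to a request. It also shows why each thread needs its own instance: every tracker remembers only its own last URI.

```python
class RefererTracker(object):
    """Builds request headers and remembers the last URI (hypothetical helper)."""

    DEFAULT_REFERER = "http://google.com/"

    def __init__(self, default_headers=None):
        self.default_headers = dict(default_headers or {})
        self.referer = None

    def next_headers(self, uri, additional_headers=None):
        # copy, so neither the defaults nor the caller's dict get mutated
        headers = dict(self.default_headers)
        headers.update(additional_headers or {})
        headers['Referer'] = self.referer or self.DEFAULT_REFERER
        # the URI requested now is the referer of the next request
        self.referer = uri
        return headers


# two independent trackers, e.g. one per thread
a = RefererTracker({'User-Agent': 'Mozilla/5.0'})
b = RefererTracker({'User-Agent': 'Mozilla/5.0'})
first = a.next_headers("http://example.com/")
second = a.next_headers("http://example.com/page")
other = b.next_headers("http://example.org/")
```

Sharing a single tracker between threads would interleave the stored URIs, so a request could carry a referer that belongs to another thread's browsing path.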

~KR

Saturday, June 23, 2012

Mac printer pausing problem lp / cups

Recently I encountered another administrative problem regarding the Macintosh platform. My invoice printing script had been running successfully for quite a while, but one day the problems started: the network printer started ignoring print requests that had the mentioned Mac as their origin. What's more, lp (the print files command) did not report any error (non-zero return status) and yet nothing was printed.

After doing some research, it turned out that the printer was marked as 'paused' (although it was printing tons of other documents). The cups server logs weren't very helpful, but it became clear that the 'paused' status was set during a temporary printer unavailability period (busy / lack of paper / offline).

A simple solution would be checking the printer status and switching it to 'ready' in case the printer was paused. So I added the following instruction to the printing script:

if ! lpstat -p Lexmark_X363dn 2>/dev/null | grep -q enabled
then
    # braces keep "exit" tied to the failure branch only
    cupsenable Lexmark_X363dn || { echo "Error, printer unavailable"; exit 1; }
fi

The condition in the if statement is fulfilled if no printer named Lexmark_X363dn is available, or a printer exists but it is not enabled. If so we try to enable the printer, and show an error message and exit if the operation was not successful.

Well after running this script the operation was indeed not successful, cupsenable required setting the SUID bit:

~ chmod +s /usr/sbin/cupsenable

This solved the printing issues for now, but I bet it's just a matter of time before something else stops working... Macs...

~KR

Wednesday, June 13, 2012

Django "is substring" like filter (MySQL)

Recently I came across the problem of selecting rows from a table on the condition that one of the fields is a substring of a given phrase. It's a bit hard to explain, but anyway I wanted to achieve the equivalent of:

filter(lambda x: "some phrase".find(x.some_field) >= 0, MyModel.objects.all()) 

on the SQL/ORM level.

So I searched the django documentation again and again and I failed to find anything useful. If the problem cannot be solved on the ORM level it must be solved with raw SQL:

SELECT * FROM myapp_mymodel WHERE "some phrase" LIKE CONCAT('%',some_field,'%');

Since the presented where clause cannot be generated using the QuerySet.filter method we have to use the extra method instead. A django equivalent would look like this:

MyModel.objects.extra(where=["%s LIKE CONCAT('%%', some_field, '%%')"],
                      params=["some phrase"])

We should remember that in MySQL the LIKE operator is case insensitive under the default collations (in this case it is desirable). If you want a case sensitive filter, try using LIKE BINARY instead.
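The reversed LIKE can be tried without a MySQL server. The following is only a sketch using Python's sqlite3 module with an in-memory table; the table name mymodel and its rows are invented for the example, and SQLite spells concatenation as || instead of CONCAT. Like MySQL under its default collations, SQLite's LIKE ignores case for ASCII letters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mymodel (some_field TEXT)")
conn.executemany("INSERT INTO mymodel VALUES (?)",
                 [("some",), ("phrase",), ("PHRASE",), ("other",)])

# the phrase is the left operand, the column supplies the pattern
rows = conn.execute(
    "SELECT some_field FROM mymodel "
    "WHERE ? LIKE '%' || some_field || '%'",
    ("some phrase",)).fetchall()
matches = [row[0] for row in rows]
conn.close()
```

One thing to keep in mind with this pattern: a % or _ stored inside some_field is interpreted as a wildcard, because the column value is used as the LIKE pattern.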

~KR

Monday, June 4, 2012

Removing accents in unicode strings

It wouldn't be very insightful if I said that the key feature of most programs / scripts is the ability to process data the way it is expected. In many cases the data comes from users, and every programmer should know that users are very creative and can crash your script in ways you didn't think were possible. All in all programmers have to spend a lot of time on boring things like data validation/preprocessing to prevent such situations.

Some problems are caused by regional and special characters. This doesn't concern storage, since many databases support utf-8. Problems occur when you need to convert your data to another encoding... well, it wouldn't be so bad if you were positive about the target encoding. Some systems however (especially those implemented a few decades ago, e.g. bank systems) accept encodings that you don't see every day, such as IBM-852 or IBM-850. So if you are not 100% sure what the target encoding should be, ASCII is the safest pick. Although it supports few characters, it is usually enough to keep the data context.

A simple way of getting an ASCII string is to drop all regional/special characters that require more than 7 bits to encode. Example python code:

data = u"aąbcćdef"
# ASCII covers code points 0-127, so keep everything below 128
ascii_data = "".join(map(lambda x: x if ord(x) < 128 else "", data))

After these operations ascii_data has the following value: "abcdef". This result may also be obtained by performing:

ascii_data = data.encode('ASCII', 'ignore')

Using this method we lose the context, especially when processing personal data: first names / last names / cities often have native characters, and without them the data is incomplete.

The best way of converting the data to ASCII would be stripping the accents. This may be achieved by using the Normalization Form Compatibility Decomposition, also known as NFKD (described in an annex to the Unicode standard). This decomposition splits characters containing accents into (usually) two components: the base character and the accent. For example the letter ą would be split into a and ̨ (U+0328, known as the combining ogonek). This is just the first part of the solution, because there are still non-ASCII characters in the string. The second step is converting the string to ASCII, ignoring all special characters (the simple method presented earlier). Have a look at the presented python script:

import unicodedata

data = u'aąbcćdef'
ascii_data = unicodedata.normalize("NFKD", data).encode('ASCII', 'ignore')

After running this script, ascii_data has been assigned the following value: "aabccdef", perfect! Hope you find this solution useful.
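For reference, here is the same idea as a small function in Python 3 syntax, where encode returns bytes that have to be decoded back to text. The name strip_accents is just an illustrative choice, not something from the unicodedata module.

```python
import unicodedata


def strip_accents(text):
    """Decompose accented characters (NFKD) and drop whatever is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


ascii_data = strip_accents("aąbcćdef")
```

Characters that have no decomposition, such as ł, are dropped rather than replaced by a base letter, so they need a manual mapping if they matter for your data.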

~KR