
Thursday, November 29, 2012

NumPy arrays vs regular arrays - common operations

NumPy is a set of scientific computing tools for Python. I'm not going to talk about them all; I mention NumPy because it comes with an optimized array type, numpy.array. Unlike regular Python lists, numpy arrays work well with matrix/vector operations.

Lists and numpy arrays do not share a common interface, but both support iteration and the __setitem__ / __getitem__ methods. The first thing that differs is the initialization (obviously):

>>> import numpy as np
>>> a = np.array([1,2,3,4,5])
>>> b = [1,2,3,4,5]
>>> a
array([1, 2, 3, 4, 5])
>>> b
[1, 2, 3, 4, 5]

It's almost the same: you can iterate over both, and set and get elements. However, when you want to add or remove elements, you have to act differently with NumPy - by using the functions it provides:

>>> np.delete(a, 2)
array([1, 2, 4, 5])
>>> del b[2]
>>> b
[1, 2, 4, 5]

The prime difference here is that deleting a list element modifies the object in place, while np.delete leaves the original untouched and returns a new array with the change applied. numpy.append works in a similar manner, except it concatenates arrays (or appends elements).
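A quick sketch of the np.append behaviour just mentioned, using the same array a as above - note that a itself stays untouched:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
c = np.append(a, [6, 7])  # returns a new array; a is not modified
```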

Now let's look at the filter function; these two instructions are equivalent:

>>> filter(lambda x: x>2, b)
[3, 4, 5]
>>> a[a>2]
array([3, 4, 5])
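The boolean mask used above can also be combined with other masks; a sketch with the same a, using & for an elementwise AND (the parentheses are required by operator precedence):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
mask = (a > 2) & (a < 5)  # elementwise AND of two boolean masks
filtered = a[mask]
```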

And for the map function:

>>> np_map = np.vectorize(lambda x: x**2)
>>> np_map(a)
array([ 1,  4,  9, 16, 25])
>>> map(lambda x: x**2, b)
[1, 4, 9, 16, 25]
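For simple arithmetic like x**2, np.vectorize isn't strictly needed: the operator itself applies elementwise. A sketch with the same a as above:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
squares = a ** 2  # elementwise power, no explicit map needed
```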

Let's move on and see where NumPy arrays show more of their potential: vector operations. Suppose we want to sum two vectors. Using regular Python lists you have to combine elements with the zip / sum functions, or just iterate and build the sums on the fly. An example solution:

>>> b = [1,2,3,4,5]
>>> d = [3,3,3,4,4]
>>> map(lambda x: sum(x), zip(b,d))
[4, 5, 6, 8, 9]

Now for the NumPy solution:

>>> a = np.array([1,2,3,4,5])
>>> b = np.array([3,3,3,4,4])
>>> a + b
array([4, 5, 6, 8, 9])

And this works for all Python binary operators, which is just awesome. When doing 'scientific' stuff, it's good to use appropriate data types. What's more, NumPy is very popular among other projects, so seeing numpy arrays from time to time surprises nobody. I guess matrices and more complex operations will be covered some other day.
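A few more of those elementwise operators, sketched with the same two vectors (plus np.dot, since * is elementwise rather than a dot product):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([3, 3, 3, 4, 4])

diff = a - b        # elementwise subtraction
prod = a * b        # elementwise (Hadamard) product
dot = np.dot(a, b)  # true dot product: sum of elementwise products
```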


Tuesday, November 27, 2012

Exporting MySQL queries into a CSV file.

Every now and then I need to do some non-standard data processing, which requires grabbing a bit of data from a database. Usually these "non-standard" jobs are simple scripts, and enhancing them to get data directly from a database would kill their main advantages: simplicity and a fast setup time.

CSV files are way better for this kind of processing. Since the data is located in a database, not a CSV file, we'll have to export it. Now there are two basic ways:

a) The MySQL geek way.
b) The bash/sh geek way.

So let's start with the pure MySQL way. Usually, when queries are made using the standard mysql console client, the retrieved data is formatted into something that should resemble a table... well, maybe it does, but "broad" result tables aren't readable anyway. Fortunately, you can have MySQL write the result straight to a file:

select * from some_table into outfile "/tmp/output.csv" \
fields terminated by ";" lines terminated by "\n";

All we have to do is change the field and line terminators to match our needs. As for the file, pass a path that is writable for the mysql server user. If the file already exists, or the MySQL server has no write permissions, the query will fail.

Let's move on to the bash geek way - personally I don't fancy MySQL tricks, so I usually rely on bash. Here we go:

echo "select * from some_table;" | \
mysql -u my_usr -p my_db | sed 's/\t/;/g' | \
tail -n +2 > /home/chriss/output.csv

We execute a query, passing it via an input stream to the MySQL client. The client returns one row per line, with fields separated by tabs; sed replaces those tabs with semicolons. Next we dispose of the first line (using tail), since it contains the column names. Finally we redirect the output to a file. Unlike in the MySQL way, you write the file with your own permissions, not the mysql user's.
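The sed/tail part of the pipeline can be tried without any database by faking the client's tab-separated batch output (the sample rows here are made up, of course):

```shell
# Fake the mysql client's batch output (tab-separated, header first),
# then apply the same sed/tail transformation as above.
printf 'id\tname\n1\tAlice\n2\tBob\n' | sed 's/\t/;/g' | tail -n +2
```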

Both approaches have some advantages and disadvantages, so their usage should be dependent on the context.

Friday, November 23, 2012

Django admin: handling relations with large tables

One of the reasons that Django became so popular is that it provides a set of flexible administrative tools. You can set up your data management dashboard in a relatively short time using some predefined classes.

A typical object admin view consists of a ModelForm for the appropriate model and some basic actions (save, add, delete). As you probably know, foreign keys are often represented as HTML select inputs. This works fine when the related table is not vertically long - for example, electronic equipment brands. In such cases the combo box works well.

Now imagine a situation where the foreign key points to a user record, a purchase order, or any other relation that does not have a limited number of rows. Following the previous example, the combo box would have thousands of options. This is a real performance killer: the ORM has to fetch the whole user/purchase-order table just to display a single object in an editable admin view. Don't be surprised if your query gets killed with the "2006 - MySQL server has gone away" error.

There is a simple way to solve this: instead of selecting the whole table to present the available options, we present only the currently related object (its primary key). To achieve this, we mark the foreign keys as raw_id_fields. Below is a sample usage:

class PurchaseReportAdmin(admin.ModelAdmin):
    # (...) other fields
    raw_id_fields = ['purchase_order']

admin.site.register(PurchaseReport, PurchaseReportAdmin)

Yes, it's as simple as that. Keep in mind that in many cases using raw id fields won't be necessary, but when it comes to vertically huge tables, it's a good time/performance-saving solution.


Wednesday, November 14, 2012

Selenium Firefox WebDriver and proxies.

Selenium is a functional web application testing system. It really helps in developing test cases for apps that have a lot of logic implemented in the frontend layer. Selenium provides WebDrivers for the most popular browsers, like IE, Chrome and, last but not least, Firefox. Of course Selenium has Python bindings, which makes it even better.

Suppose you need to use proxy servers for some of your scenarios (unique visitors). Since each test case may use many proxies, changing the global system proxy settings is a rather bad idea. Fortunately HTTP proxies may be configured in a web browser. If you inspect the Firefox about:config panel, you will notice many properties that configure how your browser really acts. We'll be interested in the network.proxy.* settings.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

proxy = ""  # proxy host, left empty here
port = 8080

fp = webdriver.FirefoxProfile()
fp.set_preference('network.proxy.ssl_port', int(port))
fp.set_preference('network.proxy.ssl', proxy)
fp.set_preference('network.proxy.http_port', int(port))
fp.set_preference('network.proxy.http', proxy)
fp.set_preference('network.proxy.type', 1)

try:
    browser = webdriver.Firefox(firefox_profile=fp)
    print browser.find_element_by_id('my_div').text
except TimeoutException as te:
    print "timeout"
except Exception as ex:
    print ex.message

First we configure our profile by setting the HTTP and HTTPS proxies. Secondly, we tell Firefox to actually use them: network.proxy.type -> 1. Finally we create an instance of Firefox with our proxy settings and execute our test routine.
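Since each test case may use many proxies, the preference setup is worth factoring into a small helper. A sketch (the function name and the example host are mine, not Selenium API); the resulting dict can be applied to any FirefoxProfile with set_preference in a loop:

```python
def proxy_preferences(host, port):
    # Collect the about:config keys used above into one dict.
    return {
        'network.proxy.type': 1,  # 1 = manual proxy configuration
        'network.proxy.http': host,
        'network.proxy.http_port': int(port),
        'network.proxy.ssl': host,
        'network.proxy.ssl_port': int(port),
    }

prefs = proxy_preferences('127.0.0.1', 8080)
```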

Have fun!


Thursday, November 8, 2012

Python 2.7 forward compatibility with python 3.X

As many of you know, Python is such a strange language that releases are not backward compatible. OK, it's really not that bad; the problem only occurs when migrating from Python 2.X to 3.X. Python 3 brings interpreter performance improvements (such as faster decimal number processing) and other useful mechanisms (e.g. function annotations, exception chaining). Some of these things were technically not possible to achieve in Python 2.X, so that's where the conflicts come from.

So if Python 3 is that awesome, why do people still use Python 2.7 (or older versions)? Most existing Python frameworks and libraries do not support Python 3 yet; they are being ported, but it will take some time for the whole Python community to move on to the current version.

So if you plan on migrating to Python 3 in the future, you'd best use the __future__ module. It provides some basic forward compatibility for older versions of Python. Two major differences that may crash your code are the print statement and string literal encoding. You can see __future__ in action below; first, the print function:

>>> print "test"
test
>>> from __future__ import print_function
>>> print "test"
  File "<stdin>", line 1
    print "test"
               ^
SyntaxError: invalid syntax
>>> print("test")
test

After importing print_function you are obliged to use print as a function. And now for unicode literals:

>>> isinstance("test", str)
True
>>> isinstance("test", unicode)
False
>>> from __future__ import unicode_literals
>>> isinstance("test", str)
False
>>> isinstance("test", unicode)
True
>>> "test"
u'test'

All string literals are now by default encoded as unicode... personally I'm happy with it, since I really hated the "'ascii' codec can't decode..." errors.
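print_function and unicode_literals aren't the only switches __future__ offers; division is another one worth knowing, since in Python 2 the / operator on two ints silently floors. A sketch:

```python
from __future__ import division

# With this import, / is true division like in Python 3 (3 / 2 == 1.5),
# while // remains floor division in both versions.
result = 3 / 2
floor = 3 // 2
```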
