Everybody familiar with the MySQL database knows mysqldump, the ultimate MySQL backup tool. It has some more interesting features, though: notably, it can be used to dump only specific tables, or even to selectively extract data from the database.
So if you want to dump a single table, you can execute:
mysqldump -u db_user -p db_name table_name > table_and_data.sql
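To restore such a dump later, you can simply feed it back to the mysql client, for example:
mysql -u db_user -p db_name < table_and_data.sql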
This default invocation is quite useful for recreating tables or making backups. But as mentioned before, we just want the data. Fortunately mysqldump comes with many options: --no-create-info, for instance, will dump only the data, wrapped in INSERT INTO statements, while --where lets you provide a custom row filtering condition (regular MySQL WHERE syntax). Let's try a query that dumps only those rows of a table that are related, via a foreign key, to rows in another table matching a WHERE condition.
mysqldump -u db_user -p db_name table_name --no-create-info \
--lock-all-tables --where 'id in (select tn.id from table_name as tn \
join related_table as rt on tn.related_table_id = rt.id \
where rt.some_field = 1)' > data.sql
The idea is to use a subselect to get the table_name rows that have foreign keys pointing at related_table rows meeting the rt.some_field = 1 requirement. This works well; however, if you access tables not specified in mysqldump's command parameters, you should lock them yourself. In this example I use --lock-all-tables, which is overkill (if you're using a transactional storage engine such as InnoDB, you can use --single-transaction instead - no table locking will be required).
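For completeness, the InnoDB-friendly variant mentioned above would look like this - the same command, just with --single-transaction swapped in for --lock-all-tables:
mysqldump -u db_user -p db_name table_name --no-create-info \
--single-transaction --where 'id in (select tn.id from table_name as tn \
join related_table as rt on tn.related_table_id = rt.id \
where rt.some_field = 1)' > data.sql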
Cheers!
~KR
Wednesday, April 25, 2012
Managing Django transaction processing (autocommit vs performance)
Django has some great ORM tools embedded that enable safe and simple ways of managing your database. There are also modules responsible for transaction processing, mainly TransactionMiddleware. By default it is present in the settings file, and I see no reason why it shouldn't be - this middleware provides a very simple yet powerful mechanism that treats your view processing as a single transaction. This pseudo-code presents the main idea:

try:
    start_transaction()
    # (your view code)
    commit()
except:
    rollback()
This is great, but since it is a middleware module, it is not applicable to background processing; there, autocommit-on-save is used instead. This may be ineffective when you are processing large amounts of data (using celery, or some cron-scheduled script). Each time you commit some changes, the DBMS has to perform some lazy tasks that usually require I/O operations (recreating indexes, checking constraints etc.). This greatly increases the total processing time, yet if the process dies you still have some data... which tends to be useless without the missing data. So why not apply the above pattern to this problem? Django supports manual transaction management; all you have to do is use the commit_on_success or commit_manually decorators:
from django.db import transaction


@transaction.commit_on_success
def sophisticated_background_processing():
    # your code comes here :-)
    # (...)
    pass


@transaction.commit_manually
def some_other_background_processing():
    try:
        # your code
        # (...)
        transaction.commit()
    except SomeException as se:  # SomeException: whatever your code may raise
        # handle the exception, then commit what was processed so far
        transaction.commit()
    except:
        # unhandled exceptions - discard the changes
        transaction.rollback()
The commit_on_success decorator acts just like TransactionMiddleware does for views, and in most cases it will do. If we need more flexibility we can always try the commit_manually decorator: it lets you commit or roll back whenever you want. Just make sure all execution paths end with an appropriate action, or Django will raise an exception.
Using manual commits instead of autocommit increased my accounting script's performance about 5-10x, depending on the instance size (the processing is specific, and the data model is rather horizontal).
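Just to illustrate the batching idea (this is not the actual accounting script - fetch_pending and process_record are hypothetical placeholders), a commit_manually-wrapped job can commit once per chunk instead of relying on per-row autocommit:

from django.db import transaction


@transaction.commit_manually
def process_pending_records(chunk_size=1000):
    try:
        for i, record in enumerate(fetch_pending(), start=1):  # hypothetical data source
            process_record(record)  # hypothetical per-record logic
            if i % chunk_size == 0:
                transaction.commit()  # flush the whole chunk in one go
        transaction.commit()  # commit the last (possibly partial) chunk
    except:
        transaction.rollback()  # no execution path may leave the transaction dangling
        raise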