My thoughts on computers, programming, ancient spell scrolls and other magic devices...: April 2012

Thursday, April 26, 2012

Convert HTML/CSS to PDF (preserving layout) using python pisa

A few days ago I published a post regarding PDF file generation. I mentioned that I might say a few words about positioning elements using pisa at-keywords. The first thing you should know is that although CSS is supposed to be supported, most positioning properties won't work... until you apply them inside an appropriate block. Take a look at this head section:

1 <head>
2     <style type="text/css">
3        @page {
4             size: a4;
5             @frame {
6                 -pdf-frame-content: "footer";
7                 top: 25.8cm;
8                 margin-right: 2.8cm;
9                 margin-left: 2.8cm;
10             }
11             @frame {
12                -pdf-frame-content: 'date_box';
13                top: 2.4cm;
14                left: 2.8cm;
15             }
16             @frame {
17                -pdf-frame-content: "content";
18                top: 16.7cm;
19                margin-right: 2.8cm;
20                margin-left: 2.8cm;
21             }
22        }
23     </style>
24     <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
25 </head>
The @page block contains some general document/page properties, such as the page format (line 4) - in this example we want our document to be A4, i.e. 210x297 mm (width and height respectively). The @frame keywords define layout positioning information. For example, lines 5-10 could be represented in regular CSS like this:

div#footer {
    top: 25.8cm;
    margin-right: 2.8cm;
    margin-left: 2.8cm;
}

So it is now easy to imagine what this layout looks like: it contains a date box, regular content (in the middle of the document) and a footer.
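To make the geometry concrete, here is a small sketch that does the arithmetic for the frames above. It assumes the margins are measured from the page edges and that the page is A4 (21.0 x 29.7 cm); the helper names are made up for this illustration and are not part of pisa.

```python
# A4 page size in centimetres (width, height).
A4_WIDTH_CM = 21.0
A4_HEIGHT_CM = 29.7

# "top" offsets of the three frames, copied from the @page block.
FRAME_TOPS_CM = {"footer": 25.8, "date_box": 2.4, "content": 16.7}


def frames_top_to_bottom(tops):
    """Return the frame names ordered from the top of the page down."""
    return sorted(tops, key=tops.get)


def usable_width_cm(margin_left, margin_right, page_width=A4_WIDTH_CM):
    """Width left for a frame once both side margins are subtracted."""
    return round(page_width - margin_left - margin_right, 2)


def space_below_cm(top, page_height=A4_HEIGHT_CM):
    """Distance from a frame's top edge to the bottom of the page."""
    return round(page_height - top, 2)


print(frames_top_to_bottom(FRAME_TOPS_CM))
print(usable_width_cm(2.8, 2.8))
print(space_below_cm(FRAME_TOPS_CM["footer"]))
```

Under these assumptions the content and footer frames are 15.4 cm wide, and the footer starts 3.9 cm above the bottom edge of the page.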

You can place your data in the corresponding box the following way:

<div id="date_box" style="text-align:right; font-size: 20px;">
    2012/04/26
</div>

<div id="content" style="text-align:left; font-size: 14px;">
    This is a pisa layout test
</div>

<div id="footer" style="text-align:center; font-size: 10px;">
    This should be a footer
</div>

In order to generate the PDF from the command line, simply run:

~ pisa my_html_file.html

This will create my_html_file.pdf in the same directory.
You may also achieve this the pythonic way - see my previous post about PDF generation.

For more information about layout definition and supported CSS styles check the official pisa documentation.

~KR

P.S. I don't usually do front-end stuff, but I really like this tool.

Wednesday, April 25, 2012

Managing Django transaction processing (autocommit vs performance)

Django has some great ORM tools embedded that enable safe and simple methods of managing your database. There are also modules responsible for transaction processing, mainly TransactionMiddleware. It is commonly enabled in the settings file, and I see no reason why it shouldn't be - this middleware provides a very simple, yet powerful mechanism that treats your view processing as a single transaction. This pseudo-code presents the main idea:

try:
    start_transaction()
    # (your view code)
    commit()
except:
    rollback()

This is great, but since it's a middleware module it's not applicable to background processing. Instead, autocommit on save is used. This may be inefficient when you are processing large amounts of data (using celery, or some cron-scheduled script). Each time you commit some changes, the DBMS has to perform extra work that usually requires I/O operations (updating indexes, checking constraints etc.). This greatly increases the total processing time, and if the process dies you are left with partial data... which tends to be useless without the missing part. So why not apply the above pattern to this problem? Django supports manual transaction management; all you have to do is use the commit_on_success or commit_manually decorators:

from django.db import transaction


class SomeException(Exception):
    """Placeholder for an error you know how to recover from."""


@transaction.commit_on_success
def sophisticated_background_processing():
    # your code goes here :-)
    # (...)
    pass


@transaction.commit_manually
def some_other_background_processing():
    try:
        # your code
        # (...)
        transaction.commit()
    except SomeException as se:
        # handle the exception, then keep what was done so far
        transaction.commit()
    except:
        # unhandled exceptions: discard everything
        transaction.rollback()

The commit_on_success decorator acts just like TransactionMiddleware does for views, and in most cases it will do. If we need more flexibility we can always try the commit_manually decorator. It enables committing or rolling back data whenever you want. Just make sure all execution paths end with an appropriate action, or Django will raise an exception.

Using manual-commit instead of auto-commit increased my accounting script performance about 5-10x, depending on the instance size (the processing is specific, and the data model is rather horizontal).
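
The difference between the two strategies can be tried outside Django as well. The following is only a sketch using the standard sqlite3 module, not the Django API; the table name and all function names are invented for this example, and a None value stands in for a failure in the middle of a batch.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with one table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE item (value INTEGER)")
    conn.commit()
    return conn


def insert_row(conn, value):
    if value is None:
        # simulate a failure in the middle of the batch
        raise ValueError("bad row")
    conn.execute("INSERT INTO item (value) VALUES (?)", (value,))


def insert_autocommit(conn, values):
    """Commit after every row, like autocommit on save."""
    for value in values:
        insert_row(conn, value)
        conn.commit()


def insert_single_transaction(conn, values):
    """Commit once at the end, or roll everything back on any error."""
    try:
        for value in values:
            insert_row(conn, value)
        conn.commit()
    except Exception:
        conn.rollback()
        raise


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM item").fetchone()[0]
```

Feeding the batch [1, None, 3] to insert_autocommit leaves one row behind before the error surfaces, while insert_single_transaction leaves none. The second variant also commits only once, which is where the time saving comes from.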

Monday, April 23, 2012

Django view serving dynamically generated PDF files.

Serving static files is cool. However, static files have a drawback - mainly they tend to be static. I'm not saying that you should avoid serving static content or anything similar - static files have a variety of important applications, which are not the topic of this post.

So what can you do when you want to serve a dynamically generated PDF file to the user? First of all you need a package capable of generating such files, so unless you want to spend quite a few days implementing your own tool, you should try using ReportLab. ReportLab is quite powerful, but it takes a lot of effort to create a good layout. If you also want your PDF file to look sexy - you should choose pisa (it requires ReportLab as a dependency). Pisa is an HTML/CSS to PDF converter - which is just what we need (be warned - not all CSS styles are supported, but that's another story).

Let's have a look at this code:

import cStringIO

import ho.pisa
from django.http import HttpResponse
from django.template import Context
from django.template.loader import get_template


def generate_pdf(template_file, context=None):
    # an in-memory buffer is used to avoid a temporary file
    pdf = cStringIO.StringIO()
    template = get_template(template_file)

    # a None default avoids sharing one mutable dict between calls
    html_response = template.render(Context(context or {}))

    pdf_status = ho.pisa.pisaDocument(cStringIO.StringIO(html_response), pdf)

    if pdf_status.err:
        # catch it in the invoking view
        raise Exception('Oops! Something went wrong!')
    return HttpResponse(pdf.getvalue(), mimetype='application/pdf')

This is a generic function that can render any template with an appropriate context and return an HttpResponse. It may be used the following way:

def generate_report_view(request):
    '''
        (...) some code
    '''
    return generate_pdf('reports/sample_report.html',
        {'owner' : request.user, 'some_param' : 'Hello PDF Generator'})

There are two tricks here that enable downloading the generated PDF directly from the view. Firstly, cStringIO (a faster version of StringIO) is used instead of a file, so no disk I/O is needed. Secondly, we set the response mimetype to application/pdf, which tells the browser that the response is a PDF document rather than a regular web page.
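
The in-memory buffer trick works with anything that writes to a file-like object. Below is a minimal sketch using io.BytesIO; fake_render_pdf is a hypothetical stand-in for a converter such as pisa and does not produce a valid PDF, it only shows that the "file" never touches the disk.

```python
import io


def fake_render_pdf(html, dest):
    """Hypothetical stand-in for a real converter: writes bytes to dest."""
    dest.write(b"%PDF-1.4\n")
    dest.write(html.encode("utf-8"))
    dest.write(b"\n%%EOF\n")


def build_pdf_bytes(html):
    # the buffer lives in memory and behaves like an open binary file
    buffer = io.BytesIO()
    fake_render_pdf(html, buffer)
    # getvalue() returns everything written so far as one bytes object
    return buffer.getvalue()


payload = build_pdf_bytes("<p>Hello PDF Generator</p>")
print(len(payload))
```

The returned bytes can then be handed to the response object in the same way as pdf.getvalue() in the view above.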

~KR

Thursday, April 19, 2012

Search and replace many files based on regular expressions

In the past weeks our team has put a lot of effort into scaling our system, so it can handle greater amounts of traffic. A good way to decrease the load on the primary server is to serve static files from another machine. Django supports such mechanisms: one can specify the MEDIA_URL parameter in the settings file, which acts as a prefix when it comes to loading media files, including images, css and js files... well, at least it should. As it turned out, the template files (HTML) contained absolute URI paths. I decided this would be a great opportunity to modify those templates. The number of files that had to be checked was rather high, and each of those files contained many media file references.

This is where bash lends a helping hand. Since I did not intend to spend the whole morning copy-pasting through hundreds of entries, I wrote a script that does the magic for me:

1 for f in $(find . | grep html$ | xargs egrep '"/media[^""]+[a-z]"' | cut -d ":" -f1 | sort | uniq)
2 do
3     echo $f
4     cat $f | sed -r 's/(src|href)=\"\/media\/([^""]*\.[a-z]+)\"/\1=\"{{MEDIA_URL}}\2\"/g' > $f.tmp
5     mv $f.tmp $f
6 done

So what does it do? The script iterates over all html files found in the current directory and its subdirectories that contain an absolute path starting with /media surrounded by quotation marks (the sort and uniq stages of the pipeline ensure that each file is processed at most once). Line no. 4 is responsible for replacing the absolute path with a template variable. For example it changes:

(...) src="/media/some_path/some_image.jpg" (...), to
(...) src="{{MEDIA_URL}}some_path/some_image.jpg" (...)

Using back references (groups) is essential. Without them the search/replace context would be insufficient, which could result in modifying parts that don't refer to static media content. The first group covers the attribute (href or src), while the second group covers the file path.
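
For comparison, the same substitution can be sketched in Python with the re module. The pattern mirrors the sed expression from the script; the function name use_media_url is made up for this example.

```python
import re

# group 1: the attribute name, group 2: the path below /media/
MEDIA_ATTR = re.compile(r'(src|href)="/media/([^"]*\.[a-z]+)"')


def use_media_url(html):
    """Replace absolute /media/ paths in src/href attributes."""
    return MEDIA_ATTR.sub(r'\1="{{MEDIA_URL}}\2"', html)


print(use_media_url('<img src="/media/some_path/some_image.jpg">'))
print(use_media_url('<img alt="/media/some_path/some_image.jpg">'))
```

The first call rewrites the path, while the second leaves the string untouched, because the quoted path does not follow a src or href attribute.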

Cheers!

Wednesday, April 4, 2012

Preventing python script from running multiple times (unix)

Recently I was implementing many command line scripts that were scheduled in crontab. What a programmer has to know is that cron is a ruthless daemon and it doesn't care about your tasks... they are executed periodically according to a schedule - with no exceptions... even if the previous task is still running. In most cases this shouldn't be a problem, yet problems may occur when the following conditions are met:

the average execution time of script s is greater than the defined crontab execution period,
multiple instances of script s can't be executed simultaneously.

Critical sections are cool for solving similar problems, but processes spawned by cron don't share memory. A frequently used mechanism for solving such cases is file locking. Take a look at this python decorator (comments provided):

import functools
import os


class SingleRun(object):
    class InstanceRunningException(Exception):
        pass

    def __init__(self, lock_file):
        # define the lock file name
        self.lock_file = "/tmp/%s.pid" % lock_file

    def __call__(self, func):
        @functools.wraps(func)
        def f(*args, **kwargs):
            if os.path.exists(self.lock_file):
                # get the process id stored in the existing lock file
                with open(self.lock_file, "rt") as lock:
                    pid = lock.read().strip()
                if not os.path.exists("/proc/%s" % pid):
                    # the process is not alive: remove the stale lock file
                    os.unlink(self.lock_file)
                else:
                    # the process is still running
                    raise self.InstanceRunningException(pid)
            try:
                # store our own process id
                with open(self.lock_file, "wt") as lock:
                    lock.write(str(os.getpid()))
                # execute the wrapped function and pass its result through
                return func(*args, **kwargs)
            finally:
                if os.path.exists(self.lock_file):
                    os.unlink(self.lock_file)
        return f

So for example if we define the following functions:

1 @SingleRun(lock_file="x")
2 def a():
3     pass
4
5 @SingleRun(lock_file="x")
6 def b():
7     pass
8
9 @SingleRun(lock_file="y")
10 def c():
11     pass

Only one of a() and b() may be executed at a time, since they share the same lock file. Function c() may be executed along with a() or b(), but only if no other process is executing c().

The presented solution is OS dependent (Unix): it relies on /tmp and on the /proc filesystem.
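
A different flavour of file locking, shown here only as a sketch and not as a replacement for the decorator above, is to let the kernel hold the lock via fcntl.flock. The lock disappears when the process exits, so no stale pid check is needed. This is also Unix-only and written for Python 3; the names AlreadyRunning, acquire_lock and release_lock are invented for this example.

```python
import fcntl
import os


class AlreadyRunning(Exception):
    """Raised when another holder already owns the lock."""


def acquire_lock(path):
    """Open path and take an exclusive, non-blocking lock on it."""
    handle = open(path, "a+")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # somebody else holds the lock right now
        handle.close()
        raise AlreadyRunning(path)
    # record our pid for humans inspecting the file
    handle.seek(0)
    handle.truncate()
    handle.write(str(os.getpid()))
    handle.flush()
    # the caller must keep the handle open: closing it releases the lock
    return handle


def release_lock(handle):
    """Give the lock back and close the file."""
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()
```

A script would call acquire_lock once at startup, exit quietly on AlreadyRunning, and call release_lock (or simply terminate) when done.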