My thoughts on computers, programming, ancient spell scrolls and other magic devices...: Make websites recognise python/urllib as a webbrowser

Today I want to share some of my experience with web crawlers and web robots (bots). With a high-level HTTP client library (like urllib/urllib2) and basic knowledge of the HTTP protocol, it is easy to implement such a program in a relatively short time. Nothing forces your script to respect robots.txt (although in most cases, especially when crawling, you should!), but a web application may still detect that your script is not a web browser. What if you actually want your script to pass for a web browser?

Let's write a simple program that requests a resource on localhost over HTTP:

import urllib2

# Build a plain GET request; urllib2 fills in its default headers.
req = urllib2.Request("http://127.0.0.1:3333/")
urllib2.urlopen(req)

Now make netcat listen on port 3333:

~ netcat -l 3333

... and execute your script. Netcat captures the following data (your output may differ, but it will be similar in general):

GET / HTTP/1.1
Accept-Encoding: identity
Host: 127.0.0.1:3333
Connection: close
User-Agent: Python-urllib/2.7
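
If netcat is not at hand, the same capture can be approximated in Python itself. The following is only a minimal sketch and not part of the original experiment: it assumes Python 3 (where urllib2 became urllib.request), and the helper names capture_one_request and plain_urllib_request are made up for illustration. It listens on a free local port instead of 3333, records the first request it receives and answers with an empty reply so the client does not hang.

```python
import socket
import threading
import urllib.request


def capture_one_request(make_request):
    """Listen on a free local port, call make_request(url), return the raw request text."""
    captured = {}

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    server.settimeout(5)  # do not wait forever if no client connects
    port = server.getsockname()[1]

    def serve():
        try:
            conn, _ = server.accept()
        except OSError:  # timed out or socket closed
            return
        with conn:
            conn.settimeout(5)
            data = b""
            # Read until the blank line that ends the header block.
            while b"\r\n\r\n" not in data:
                chunk = conn.recv(4096)
                if not chunk:
                    break
                data += chunk
            captured["raw"] = data.decode("iso-8859-1")
            # Minimal reply so the client gets a complete response.
            conn.sendall(
                b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n"
            )

    worker = threading.Thread(target=serve, daemon=True)
    worker.start()
    try:
        make_request("http://127.0.0.1:%d/" % port)
    finally:
        worker.join(timeout=5)
        server.close()
    return captured.get("raw", "")


def plain_urllib_request(url):
    """Fetch url with urllib's default headers, ignoring any configured proxy."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    opener.open(urllib.request.Request(url), timeout=5).close()


if __name__ == "__main__":
    print(capture_one_request(plain_urllib_request))
```

Under Python 3 the captured User-Agent reads Python-urllib/3.x rather than 2.7, and the header order may differ from the netcat dump above.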

Let's repeat the experiment (start netcat again if it has exited), but this time, instead of running the Python script, open a web browser and navigate to http://127.0.0.1:3333/.

This time the output generated by netcat looks like this:

GET / HTTP/1.1
Host: 127.0.0.1:3333
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.24) Gecko/20111107 Linux Mint/9 (Isadora) Firefox/3.6.24
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: pl,en;q=0.7,en-us;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-2,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
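
To see at a glance how the two captures differ, the dumps can be compared programmatically. This is a rough sketch, assuming Python 3; parse_headers is a hypothetical helper written for this post, and the two dumps are simply pasted from the netcat output above.

```python
def parse_headers(raw_request):
    """Return {lower-cased header name: value} from a raw HTTP request dump."""
    headers = {}
    lines = raw_request.strip().splitlines()
    for line in lines[1:]:  # skip the request line ("GET / HTTP/1.1")
        if not line.strip():
            break  # a blank line ends the header block
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


SCRIPT_DUMP = """GET / HTTP/1.1
Accept-Encoding: identity
Host: 127.0.0.1:3333
Connection: close
User-Agent: Python-urllib/2.7
"""

BROWSER_DUMP = """GET / HTTP/1.1
Host: 127.0.0.1:3333
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.24) Gecko/20111107 Linux Mint/9 (Isadora) Firefox/3.6.24
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: pl,en;q=0.7,en-us;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-2,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
"""

script = parse_headers(SCRIPT_DUMP)
browser = parse_headers(BROWSER_DUMP)

# Headers the browser sends that the script does not send at all.
missing = sorted(set(browser) - set(script))
# Headers both send, but with different values.
different = sorted(
    name for name in set(browser) & set(script) if browser[name] != script[name]
)

print(missing)    # ['accept', 'accept-charset', 'accept-language', 'keep-alive']
print(different)  # ['accept-encoding', 'connection', 'user-agent']
```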

The output may differ depending on your OS and browser version. Some headers (accepted charsets, language etc.) can also be configured manually in your browser. The header usually used to identify the requesting client (browser, bot, crawler) is User-Agent. One way to trick a web application into thinking that your script is a web browser is to send a User-Agent header copied from a real web browser. You can achieve this as follows:

import urllib2

# User-Agent string copied from the Firefox request captured above.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; U; Linux i686; "
                  "en-US; rv:1.9.2.24) Gecko/20111107 "
                  "Linux Mint/9 (Isadora) Firefox/3.6.24",
}

req = urllib2.Request("http://127.0.0.1:3333/", headers=headers)
urllib2.urlopen(req)
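
The script above targets Python 2. A rough Python 3 equivalent is sketched below, assuming only that urllib2 was merged into urllib.request; the names BROWSER_USER_AGENT and build_browser_like_request are made up for this example. It also reads the header back from the request object, which lets you check what will be sent without a listener running.

```python
import urllib.request

# User-Agent string copied from the Firefox capture shown earlier.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.24) "
    "Gecko/20111107 Linux Mint/9 (Isadora) Firefox/3.6.24"
)


def build_browser_like_request(url):
    """Return a Request that carries the browser's User-Agent instead of urllib's default."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_USER_AGENT})


req = build_browser_like_request("http://127.0.0.1:3333/")

# urllib stores header names capitalized, so the lookup key is "User-agent".
print(req.get_header("User-agent"))

# To actually send the request (needs something listening on port 3333):
# urllib.request.urlopen(req)
```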

You may also include other headers: the more, the better (so that the request looks just like one from Firefox). However, keep in mind that some headers affect the received content. For example, if you send Accept-Encoding: gzip,deflate, you shouldn't be surprised when the server serves you gzip-compressed data, which urllib2 will not decompress for you. More information about HTTP headers can be found in RFC 2616.
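
If you do send Accept-Encoding: gzip,deflate, the response body has to be decoded by hand. A minimal sketch of that step, assuming Python 3 and a hypothetical helper named decode_body that takes the raw bytes and the value of the Content-Encoding response header:

```python
import gzip
import zlib


def decode_body(raw_body, content_encoding):
    """Undo gzip/deflate compression according to the Content-Encoding value."""
    encoding = (content_encoding or "identity").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(raw_body)
    if encoding == "deflate":
        try:
            # "deflate" is specified as zlib-wrapped data...
            return zlib.decompress(raw_body)
        except zlib.error:
            # ...but some servers send a raw deflate stream without the wrapper.
            return zlib.decompress(raw_body, -zlib.MAX_WBITS)
    return raw_body  # identity or unknown: return the bytes unchanged


# Typical use with a response object from urllib.request.urlopen(req):
# body = decode_body(response.read(), response.headers.get("Content-Encoding"))
```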

This is just one trick, stay tuned - there will be more.

~KR


Monday, May 28, 2012

Make websites recognise python/urllib as a webbrowser - (part 1: http headers)
