
Tuesday, June 26, 2012

Make websites recognise python/urllib as a web browser - (part 2: following links)

A key feature of every bot/crawler is the ability to follow hyperlinks and post data to the web application. If you don't take any extra precautions while doing so, your bot is likely to get exposed. A typical bot implementation would look like this:

  1 import urllib2
  2 
  3 class MyBot():
  4     @classmethod
  5     def _get_default_headers(cls):
  6         return {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; '
  7                               'en-US; rv:1.9.2.24)'}
  8 
  9     def execute_request(self, uri, post_data=None, additional_headers={}):
 10         headers = self._get_default_headers()
 11         headers.update(additional_headers)  # merge caller-supplied headers
 12         req = urllib2.Request(uri, post_data, headers)
 13         return urllib2.urlopen(req)


The MyBot class implements two methods: _get_default_headers, which provides the default HTTP headers, and execute_request, which performs the actual request.
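
For reference, a minimal usage sketch (the target URI is just a placeholder):

bot = MyBot()
# the request goes out with the spoofed User-Agent from _get_default_headers
response = bot.execute_request("http://example.com/")
print response.read()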

The first thing that looks suspicious is that all requests made by the bot are performed instantly. When you request a few resources one after another, there is hardly any pause between consecutive requests - timing that is almost impossible to reproduce with a regular browser (human reaction delay). The delay needs to be simulated by the script:

  9     def execute_request(self, uri, post_data=None, additional_headers={}):
 10         import time, random
 11         headers = self._get_default_headers()
 12         headers.update(additional_headers)
 13         req = urllib2.Request(uri, post_data, headers)
 14         if post_data:
 15             # filling in a form (POST) takes longer than clicking a link
 16             time.sleep(random.uniform(1.5, 5))
 17         else:
 18             time.sleep(random.uniform(1, 2.5))
 19         return urllib2.urlopen(req)


The improved execute_request method sleeps for a period of time before performing each request. The sleep time depends on the request method, since filling in a web form usually takes longer than clicking a hyperlink. In the presented code the sleep time varies from 1-2.5 seconds for GET requests to 1.5-5 seconds for POST requests.
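
As a quick illustration, here is a rough sketch of a crawl loop over a made-up list of URIs; every request now pauses for a human-looking interval before it is sent:

bot = MyBot()
uris = ["http://example.com/page/%d" % i for i in range(1, 4)]  # placeholder URIs
for uri in uris:
    # each call sleeps 1-2.5 s (GET) before the request goes out,
    # so consecutive requests are no longer instantaneous
    response = bot.execute_request(uri)
    print uri, response.getcode()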

Another thing we have to be aware of is that a browser sets the Referer header every time you navigate to a URL. According to RFC 2616, section 14:

The Referer[sic] request-header field allows the client to specify, for the server's benefit, the address (URI) of the resource from which the Request-URI was obtained (...)

So all we have to do is remember the last URI. Let's modify the execute_request method the following way:

  9     def execute_request(self, uri, post_data=None, additional_headers={}):
 10         import time, random
 11         referer = getattr(self, 'referer', None)
 12         if not referer:
 13             referer = "http://google.com/"
 14         headers = self._get_default_headers()
 15         headers.update(additional_headers)
 16         headers['Referer'] = referer
 17         req = urllib2.Request(uri, post_data, headers)
 18         self.referer = uri
 19         if post_data:
 20             time.sleep(random.uniform(1.5, 5))
 21         else:
 22             time.sleep(random.uniform(1, 2.5))
 23         return urllib2.urlopen(req)

Lines 11-16 build the request headers and set the current Referer; if no referer was stored yet, the script falls back to http://google.com/ as the default. After creating the request object we store the current URI as the referer for the next request.
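
To illustrate how the Referer chain builds up across consecutive calls (the URIs below are just examples):

bot = MyBot()
bot.execute_request("http://example.com/")         # sends Referer: http://google.com/
bot.execute_request("http://example.com/about")    # sends Referer: http://example.com/
bot.execute_request("http://example.com/contact")  # sends Referer: http://example.com/about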

With these two features our script becomes much harder for websites to detect. Just remember: if you're implementing a multi-threaded robot you should create a separate MyBot instance for each thread, or the referer field will become a mess.
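
A rough sketch of what that could look like with the standard threading module (the URI lists are placeholders); each thread owns its bot, so the referer chains don't interleave:

import threading

def crawl(uris):
    bot = MyBot()  # a separate instance per thread keeps the referer chain consistent
    for uri in uris:
        bot.execute_request(uri)

uri_lists = [["http://example.com/a"], ["http://example.com/b"]]  # placeholder work
threads = [threading.Thread(target=crawl, args=(uris,)) for uris in uri_lists]
for t in threads:
    t.start()
for t in threads:
    t.join()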

~KR

Monday, May 28, 2012

Make websites recognise python/urllib as a web browser - (part 1: http headers)

Today I want to share some of my experience regarding web crawlers and web robots / bots. With access to high-level HTTP client libraries (like urllib/urllib2) and basic knowledge of the HTTP protocol, it is easy to implement such a program in a relatively short time. You don't have to respect robots.txt (though in most cases, especially when web crawling, you should!), yet a web application may still detect that your script is not a web browser. What if you actually want to pretend that your script is a web browser?


Let's write a simple program that requests a resource on localhost via the HTTP protocol:


  1 import urllib2
  2 
  3 req = urllib2.Request("http://127.0.0.1:3333/")
  4 urllib2.urlopen(req)


Now make netcat listen on port 3333:

~ netcat -l 3333


... and execute your script. Netcat captures the following data (it may differ from your output, but will be similar in general):


GET / HTTP/1.1
Accept-Encoding: identity
Host: 127.0.0.1:3333
Connection: close
User-Agent: Python-urllib/2.7


Let's reenact this experiment, but instead of using the presented Python script, open a web browser and navigate to http://127.0.0.1:3333/.


This time the output generated by netcat looks like this:

GET / HTTP/1.1
Host: 127.0.0.1:3333
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.24) Gecko/20111107 Linux Mint/9 (Isadora) Firefox/3.6.24
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: pl,en;q=0.7,en-us;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-2,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive



The output may differ depending on your OS and browser version. Some headers may also be configured manually in your browser (accepted charsets, language etc.). The header that is usually used to identify the requesting client (browser, bot, crawler) is User-Agent. A way to trick web applications into thinking that your script is a web browser is to provide a User-Agent header extracted from a real web browser. You can achieve this as follows:



  1 import urllib2
  2 
  3 headers = {
  4     "User-Agent" : "Mozilla/5.0 (X11; U; Linux i686; " + \
  5                     "en-US; rv:1.9.2.24) Gecko/20111107 " + \
  6                     "Linux Mint/9 (Isadora) Firefox/3.6.24",
  7     }
  8 
  9 req = urllib2.Request("http://127.0.0.1:3333/", headers=headers)
 10 urllib2.urlopen(req)

You may also include other headers - the more the better (so the request looks just like one from Firefox). However, keep in mind that some headers may have an impact on the received content. For example, if you specify the Accept-Encoding header as Accept-Encoding: gzip,deflate, you shouldn't be surprised if the server serves you gzip-compressed data. More information about HTTP headers can be found in RFC 2616.
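
If you do send Accept-Encoding: gzip,deflate, be prepared to decompress the response body yourself. A sketch using only the standard library (the target URL is a placeholder):

import gzip
import StringIO
import urllib2

headers = {
    "User-Agent": "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.24) "
                  "Gecko/20111107 Linux Mint/9 (Isadora) Firefox/3.6.24",
    "Accept-Encoding": "gzip,deflate",
}
response = urllib2.urlopen(urllib2.Request("http://example.com/", headers=headers))
body = response.read()
# the server may honour Accept-Encoding and return compressed data
if response.info().getheader("Content-Encoding") == "gzip":
    body = gzip.GzipFile(fileobj=StringIO.StringIO(body)).read()
print body[:200]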

This is just one trick, stay tuned - there will be more.

~KR