1  import urllib2
2
3  class MyBot(object):
4      @classmethod
5      def _get_default_headers(cls):
6          return {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; '
7                                'en-US; rv:1.9.2.24)'}
8
9      def execute_request(self, uri, post_data=None, additional_headers=None):
10         req = urllib2.Request(uri, post_data, dict(
11             self._get_default_headers(), **(additional_headers or {})))
12         return urllib2.urlopen(req)
The MyBot class implements two methods: _get_default_headers, which provides the default HTTP headers, and execute_request, which performs a request.
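One subtlety worth noting before we extend the code: dict.update() modifies a dictionary in place and returns None, so the result of an expression like self._get_default_headers().update(additional_headers) must never be passed to urllib2.Request as the headers argument. A safe pattern is to build a fresh copy first; the sketch below illustrates it (merge_headers is a hypothetical helper name, not part of MyBot):

```python
def merge_headers(defaults, extra=None):
    # dict.update() mutates in place and returns None, so build a copy
    # rather than passing the result of update() around.
    merged = dict(defaults)
    merged.update(extra or {})
    return merged
```

This also avoids mutating the caller's dictionary, which matters once additional_headers is reused between calls.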
The first thing that is quite suspicious is that all requests made by the bot are performed instantly. When it fetches a few resources one by one, there is hardly any pause between consecutive requests, and such timing is almost impossible to produce with a regular browser, because human actions introduce delays. The script should therefore simulate a delay:
10         import time, random
11         req = urllib2.Request(uri, post_data, dict(
12             self._get_default_headers(), **(additional_headers or {})))
13         if post_data:
14             time.sleep(random.uniform(1.5, 5))
15         else:
16             time.sleep(random.uniform(1, 2.5))
17         return urllib2.urlopen(req)
The improved execute_request method sleeps for a period of time before performing each request. The sleep time depends on the request method: filling out a web form usually takes longer than clicking a hyperlink. In the presented code the sleep time varies from 1-2.5 seconds for GET requests to 1.5-5 seconds for POST requests.
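The timing logic can also be factored out so it is easy to test in isolation. The helper below only picks the pause length, using the same ranges as the snippet above (human_delay is an illustrative name, not part of MyBot):

```python
import random

def human_delay(is_post):
    # Submitting a form (POST) takes a human longer than following a
    # link (GET), so POST requests get the longer pause range.
    if is_post:
        return random.uniform(1.5, 5)
    return random.uniform(1, 2.5)
```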
Another thing we have to be aware of is that a browser sets the Referer header every time you navigate to a URL. According to RFC 2616, section 14:
The Referer[sic] request-header field allows the client to specify,
for the server's benefit, the address (URI) of the resource from
which the Request-URI was obtained (...)
So all we have to do is remember the last URI. Let's modify the execute_request method the following way:
10         import time, random
11         referer = getattr(self, 'referer', None)
12         if not referer:
13             referer = "http://google.com/"
14         headers = dict(additional_headers or {}, Referer=referer)
15         req = urllib2.Request(uri, post_data, dict(
16             self._get_default_headers(), **headers))
17         self.referer = uri
18         if post_data:
19             time.sleep(random.uniform(1.5, 5))
20         else:
21             time.sleep(random.uniform(1, 2.5))
22         return urllib2.urlopen(req)
Lines 11-14 set the current Referer header; if no referer has been stored yet, the script falls back to http://google.com/ as a default. After generating the request object we store the URI as the referer for the next request.
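The referer bookkeeping is easy to get wrong when the same additional_headers dictionary is reused between calls, so it can help to isolate it. The sketch below keeps the chain in one place and never mutates the caller's dictionary (RefererTracker is a hypothetical name, not part of MyBot):

```python
class RefererTracker(object):
    """Remembers the previous URI so each request can send it as Referer."""

    def __init__(self, default_referer='http://google.com/'):
        self.referer = default_referer

    def headers_for(self, uri, additional_headers=None):
        # Copy first so the caller's dictionary is never modified.
        headers = dict(additional_headers or {})
        headers['Referer'] = self.referer
        self.referer = uri  # the next request will refer back to this one
        return headers
```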
With these two features our script becomes much harder for websites to detect. Just remember, if you're implementing a multi-threaded robot you should provide a separate MyBot instance for each thread, or the referer field will become a mess.
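One way to keep the per-thread state straight without wiring up a bot instance per thread by hand is threading.local. This is only a sketch of the idea (BotState is a hypothetical wrapper; a MyBot instance per thread works just as well):

```python
import threading

class BotState(object):
    # Each thread sees its own 'referer' attribute on the local object,
    # so concurrent workers cannot overwrite each other's history.
    def __init__(self):
        self._local = threading.local()

    def remember(self, uri):
        self._local.referer = uri

    def last_referer(self):
        return getattr(self._local, 'referer', None)
```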
~KR