
Tuesday, June 26, 2012

Make websites recognise python/urllib as a web browser - (part 2: following links)

A key feature of every bot/crawler is the ability to follow hyperlinks and post data to the web application. If you don't implement additional safeguards for such actions, your bot is likely to get exposed. A typical bot implementation would look like this:

  1 import urllib2
  2
  3 class MyBot(object):
  4     @classmethod
  5     def _get_default_headers(cls):
  6         # headers that make the bot look like a regular browser
  7         return {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; '
  8                               'rv:1.9.2.24)'}
  9
 10     def execute_request(self, uri, post_data=None, additional_headers={}):
 11         # dict.update() returns None, so merge the headers into a local dict
 12         headers = self._get_default_headers()
 13         headers.update(additional_headers)
 14         req = urllib2.Request(uri, post_data, headers)
 15         return urllib2.urlopen(req)


The MyBot class implements two methods: _get_default_headers, which provides the default HTTP headers, and execute_request, which performs the actual request.
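To give a quick idea of how the class is used, here is a minimal sketch (the URLs and form data are just placeholders):

bot = MyBot()

# plain GET request
response = bot.execute_request('http://example.com/')
html = response.read()

# POST request with an extra header
response = bot.execute_request('http://example.com/login',
                               post_data='user=john&pass=secret',
                               additional_headers={'Accept-Language': 'en-US,en'})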

The first suspicious thing is that all requests made by the bot are performed instantly. When the bot fetches several resources one after another, there is hardly any pause between consecutive requests - timing that is almost impossible to achieve with a regular browser (a human needs time to react). The delay has to be simulated by the script:

  9     def execute_request(self, uri, post_data=None, additional_headers={}):
 10         import time, random
 11         headers = self._get_default_headers()
 12         headers.update(additional_headers)
 13         req = urllib2.Request(uri, post_data, headers)
 14         # simulate the "human" delay before the request is actually sent
 15         if post_data:
 16             time.sleep(random.uniform(1.5, 5))
 17         else:
 18             time.sleep(random.uniform(1, 2.5))
 19         return urllib2.urlopen(req)


The improved execute_request method sleeps for a while before performing each request. The sleep time depends on the request method - filling in a web form usually takes longer than clicking a hyperlink. In the code above the delay is drawn from the 1-2.5 second range for GET requests and the 1.5-5 second range for POST requests.
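If you want to sanity-check the delays, a quick sketch like the one below (assuming the MyBot class above and reachable URLs - the addresses are placeholders) measures the gap between two consecutive GET requests:

import time

bot = MyBot()
start = time.time()
bot.execute_request('http://example.com/')
bot.execute_request('http://example.com/contact')
elapsed = time.time() - start
# with a 1-2.5 s pause per GET the two requests should take roughly 2-5 s
# (plus network time), instead of finishing almost instantly
print 'two GET requests took %.2f seconds' % elapsed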

Another thing we have to be aware of is that a browser sets the Referer header every time you navigate to a URL. According to RFC 2616, section 14:

The Referer[sic] request-header field allows the client to specify, for the server's benefit, the address (URI) of the resource from which the Request-URI was obtained (...)

So all we have to do is remember the last requested URI. Let's modify the execute_request method as follows:

  9     def execute_request(self, uri, post_data=None, additional_headers={}):
 10         import time, random
 11         referer = getattr(self, 'referer', None)
 12         if not referer:
 13             referer = "http://google.com/"
 14         additional_headers.update({'Referer': referer})
 15         headers = self._get_default_headers()
 16         headers.update(additional_headers)
 17         req = urllib2.Request(uri, post_data, headers)
 18         # remember this URI - it becomes the Referer of the next request
 19         self.referer = uri
 20         if post_data:
 21             time.sleep(random.uniform(1.5, 5))
 22         else:
 23             time.sleep(random.uniform(1, 2.5))
 24         return urllib2.urlopen(req)

Lines 11-14 set the Referer header for the current request; if no referer has been stored yet, the script falls back to http://google.com/ as the default. After building the request object we store the requested URI as the referer for the next request.
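To illustrate the resulting Referer chain (the URLs are placeholders):

bot = MyBot()
bot.execute_request('http://example.com/')       # Referer: http://google.com/ (default)
bot.execute_request('http://example.com/page1')  # Referer: http://example.com/
bot.execute_request('http://example.com/page2')  # Referer: http://example.com/page1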

With these two features our script becomes much harder for websites to detect. Just remember: if you're implementing a multi-threaded robot, create a separate MyBot instance for each thread (as sketched below), or the referer field will become a mess.
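A rough sketch of the one-instance-per-thread approach (the URL lists are illustrative):

import threading

def crawl(uris):
    # each thread gets its own MyBot instance, so the referer chain stays consistent
    bot = MyBot()
    for uri in uris:
        bot.execute_request(uri)

url_chunks = [['http://example.com/a'], ['http://example.com/b']]
threads = [threading.Thread(target=crawl, args=(chunk,)) for chunk in url_chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()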

~KR
