1  import urllib2
2
3  class MyBot(object):
4      @classmethod
5      def _get_default_headers(cls):
6          return {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; '
7                                'en-US; rv:1.9.2.24)'}
8
9      def execute_request(self, uri, post_data=None, additional_headers=None):
10         req = urllib2.Request(uri, post_data, dict(
11             self._get_default_headers(), **(additional_headers or {})))
12         return urllib2.urlopen(req)
The MyBot class implements two methods: _get_default_headers, which provides the default HTTP headers, and execute_request, which performs a request.
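One subtlety worth noting before we extend the code: dict.update() modifies a dictionary in place and returns None, so the result of an expression like self._get_default_headers().update(additional_headers) must never be passed to urllib2.Request as the headers argument. A safe pattern is to build a fresh copy first; the sketch below illustrates it (merge_headers is a hypothetical helper name, not part of MyBot):

```python
def merge_headers(defaults, extra=None):
    # dict.update() mutates in place and returns None, so build a copy
    # rather than passing the result of update() around.
    merged = dict(defaults)
    merged.update(extra or {})
    return merged
```

This also avoids mutating the caller's dictionary, which matters once additional_headers is reused between calls.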
The first thing that is quite suspicious is that all requests made by the bot are performed instantly. When it fetches a few resources one by one, there is hardly any pause between consecutive requests, and such timing is almost impossible to produce with a regular browser, because human actions introduce delays. The script should therefore simulate a delay:
10         import time, random
11         req = urllib2.Request(uri, post_data, dict(
12             self._get_default_headers(), **(additional_headers or {})))
13         if post_data:
14             time.sleep(random.uniform(1.5, 5))
15         else:
16             time.sleep(random.uniform(1, 2.5))
17         return urllib2.urlopen(req)
The improved execute_request method sleeps for a period of time before performing each request. The sleep time depends on the request method: filling out a web form usually takes longer than clicking a hyperlink. In the presented code the sleep time varies from 1-2.5 seconds for GET requests to 1.5-5 seconds for POST requests.
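The timing logic can also be factored out so it is easy to test in isolation. The helper below only picks the pause length, using the same ranges as the snippet above (human_delay is an illustrative name, not part of MyBot):

```python
import random

def human_delay(is_post):
    # Submitting a form (POST) takes a human longer than following a
    # link (GET), so POST requests get the longer pause range.
    if is_post:
        return random.uniform(1.5, 5)
    return random.uniform(1, 2.5)
```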
Another thing we have to be aware of is that a browser sets the Referer header every time you navigate to a URL. According to RFC 2616, section 14:
The Referer[sic] request-header field allows the client to specify,
for the server's benefit, the address (URI) of the resource from
which the Request-URI was obtained (...)
So all we have to do is remember the last URI. Let's modify the execute_request method the following way:
10         import time, random
11         referer = getattr(self, 'referer', None)
12         if not referer:
13             referer = "http://google.com/"
14         headers = dict(additional_headers or {}, Referer=referer)
15         req = urllib2.Request(uri, post_data, dict(
16             self._get_default_headers(), **headers))
17         self.referer = uri
18         if post_data:
19             time.sleep(random.uniform(1.5, 5))
20         else:
21             time.sleep(random.uniform(1, 2.5))
22         return urllib2.urlopen(req)
Lines 11-14 set the current Referer header; if no referer has been stored yet, the script falls back to http://google.com/ as a default. After generating the request object we store the URI as the referer for the next request.
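The referer bookkeeping is easy to get wrong when the same additional_headers dictionary is reused between calls, so it can help to isolate it. The sketch below keeps the chain in one place and never mutates the caller's dictionary (RefererTracker is a hypothetical name, not part of MyBot):

```python
class RefererTracker(object):
    """Remembers the previous URI so each request can send it as Referer."""

    def __init__(self, default_referer='http://google.com/'):
        self.referer = default_referer

    def headers_for(self, uri, additional_headers=None):
        # Copy first so the caller's dictionary is never modified.
        headers = dict(additional_headers or {})
        headers['Referer'] = self.referer
        self.referer = uri  # the next request will refer back to this one
        return headers
```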
With these two features our script becomes much harder for websites to detect. Just remember, if you're implementing a multi-threaded robot you should provide a separate MyBot instance for each thread, or the referer field will become a mess.
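One way to keep the per-thread state straight without wiring up a bot instance per thread by hand is threading.local. This is only a sketch of the idea (BotState is a hypothetical wrapper; a MyBot instance per thread works just as well):

```python
import threading

class BotState(object):
    # Each thread sees its own 'referer' attribute on the local object,
    # so concurrent workers cannot overwrite each other's history.
    def __init__(self):
        self._local = threading.local()

    def remember(self, uri):
        self._local.referer = uri

    def last_referer(self):
        return getattr(self._local, 'referer', None)
```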
~KR