Let's write a simple program that requests a resource on localhost over HTTP:
import urllib2

# Build a plain GET request for the local listener and send it.
req = urllib2.Request("http://127.0.0.1:3333/")
urllib2.urlopen(req)
Now make netcat listen on port 3333:
~ netcat -l 3333
... and execute your script. Netcat captures the following data (your output may differ in details, but will be similar overall):
GET / HTTP/1.1
Accept-Encoding: identity
Host: 127.0.0.1:3333
Connection: close
User-Agent: Python-urllib/2.7
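If netcat is not at hand, a few lines of Python can play the same role. This is only a minimal sketch, not a netcat replacement: open_listener and read_request are hypothetical helper names made up for this example. The listener never sends a response, so the requesting script will typically end with an error once the connection is closed.

```python
import socket


def open_listener(port=3333, host="127.0.0.1"):
    """Bind a TCP socket and start listening (roughly what `netcat -l` does)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def read_request(server):
    """Accept one connection and return the raw request head as text."""
    conn, _ = server.accept()
    data = b""
    try:
        # Keep reading until the blank line that ends the HTTP headers.
        while b"\r\n\r\n" not in data:
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    finally:
        conn.close()
        server.close()
    return data.decode("latin-1")
```

To mirror the netcat step, print the result of read_request(open_listener(3333)) in one terminal and run the script from another.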
Let's repeat this experiment, but instead of running the Python script, open a web browser and navigate to http://127.0.0.1:3333/ .
This time the output generated by netcat looks like this:
GET / HTTP/1.1
Host: 127.0.0.1:3333
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.24) Gecko/20111107 Linux Mint/9 (Isadora) Firefox/3.6.24
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: pl,en;q=0.7,en-us;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-2,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
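To see exactly where the two captures differ, you can parse them and compare the header sets. The sketch below uses the two netcat outputs shown above as input; parse_headers is a hypothetical helper written for this example, and it assumes one header per line with no folded (multi-line) values.

```python
def parse_headers(raw_request):
    """Return the header fields of a raw HTTP request as a dict."""
    headers = {}
    # Skip the request line ("GET / HTTP/1.1"); stop at the first blank line.
    for line in raw_request.splitlines()[1:]:
        if not line.strip():
            break
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


SCRIPT_CAPTURE = (
    "GET / HTTP/1.1\n"
    "Accept-Encoding: identity\n"
    "Host: 127.0.0.1:3333\n"
    "Connection: close\n"
    "User-Agent: Python-urllib/2.7\n"
)

BROWSER_CAPTURE = (
    "GET / HTTP/1.1\n"
    "Host: 127.0.0.1:3333\n"
    "User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.24) "
    "Gecko/20111107 Linux Mint/9 (Isadora) Firefox/3.6.24\n"
    "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\n"
    "Accept-Language: pl,en;q=0.7,en-us;q=0.3\n"
    "Accept-Encoding: gzip,deflate\n"
    "Accept-Charset: ISO-8859-2,utf-8;q=0.7,*;q=0.7\n"
    "Keep-Alive: 115\n"
    "Connection: keep-alive\n"
)

script = parse_headers(SCRIPT_CAPTURE)
browser = parse_headers(BROWSER_CAPTURE)

# Headers the browser sends that the script does not send at all.
only_in_browser = sorted(set(browser) - set(script))
# Headers both send, but with different values.
different = sorted(k for k in script if k in browser and script[k] != browser[k])

print("only in browser:", only_in_browser)
print("different values:", different)
```

For these two captures, Host is the only header with an identical value on both sides.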
The output may differ depending on your OS and browser version. Some headers (accepted charsets, language, etc.) can also be configured manually in your browser. The header usually used to identify the requesting client (browser, bot, crawler) is User-Agent. One way to trick a web application into thinking that your script is a real browser is to send a User-Agent header copied from one. You can achieve this as follows:
import urllib2

# Reuse the User-Agent string captured from a real browser (Firefox 3.6).
headers = {
    "User-Agent": "Mozilla/5.0 (X11; U; Linux i686; "
                  "en-US; rv:1.9.2.24) Gecko/20111107 "
                  "Linux Mint/9 (Isadora) Firefox/3.6.24",
}

# Pass the custom headers along with the request.
req = urllib2.Request("http://127.0.0.1:3333/", headers=headers)
urllib2.urlopen(req)
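The script above targets Python 2. In Python 3, urllib2 was merged into urllib.request, so the same trick might look roughly like the sketch below. BROWSER_UA and build_request are names made up for this example, and the urlopen call is left commented out so the snippet also runs when nothing is listening on port 3333.

```python
import urllib.request

# User-Agent string copied from the browser capture shown earlier.
BROWSER_UA = (
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.24) "
    "Gecko/20111107 Linux Mint/9 (Isadora) Firefox/3.6.24"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://127.0.0.1:3333/")
# Request stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
# urllib.request.urlopen(req)  # uncomment once a listener is running
```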
This is just one trick; stay tuned, there will be more.
~KR