Showing posts with label http.

Saturday, October 13, 2012

Serve text file via HTTP protocol using netcat

Unix-based systems provide a lot of cool network tools; today I'd like to show the potential of netcat. Netcat is a utility that may be used for "just about anything under the sun involving TCP and UDP" (netcat BSD manual). That's a pretty broad description, so let's move on to some practical stuff.

If we want to make a useful script, we should make it speak a popular protocol such as HTTP. That way we don't have to worry about writing a client application - an ordinary web browser is enough to test the script. More formal information about the HTTP protocol can be found in RFC 2616.

So a bash script that generates a simple HTTP response and serves a text file may look like this:



#!/bin/bash

if [ -z "$1" ]
then
    echo "Usage: $0 <text file to serve>"
    exit 1
fi

filename="$1"

echo "HTTP/1.1 200 OK
Date: $(LANG=en_US date -u)
Server: NetCatFileServe
Last-Modified: $(LANG=en_US date -u)
Accept-Ranges: bytes
Content-Length: $(wc -c < "$filename")
Content-Type: application/force-download
Content-Disposition: attachment; filename=\"$filename\"

$(cat "$filename")"

If an HTTP client receives such a response, it should offer to save the attached data as a file (the typical save-file-as dialog in a web browser).

Now that we can generate an HTTP response, we need a server that will deliver it - that's where netcat comes in handy. To serve the data we run netcat in listening mode. For example:

~ $ ./ncserve.sh test.txt | nc -l 8888

If you now enter http://127.0.0.1:8888 (or another IP of your machine) in a web browser, you should be able to download the test.txt file. You may also test it using curl (the -i flag prints the response headers along with the body):

~ $ curl -i 127.0.0.1:8888
HTTP/1.1 200 OK
Date: Sat Oct 13 10:40:27 UTC 2012
Server: NetCatFileServe
Last-Modified: Sat Oct 13 10:40:27 UTC 2012
Accept-Ranges: bytes
Content-Length: 71
Content-Type: application/force-download
Content-Disposition: attachment; filename="test.txt"

This is a simple text file
bla bla bla
downloaded via netcatserver
:-)


This script serves the file only once and then exits; if you want it to act like a regular HTTP server, run it in an infinite loop (for example, a while loop around the netcat invocation).

Cheers!

~KR

Tuesday, July 24, 2012

Cheating online games / polls / contests by using anonymous HTTP proxies / python

This post is indeed about cheating. You know those browser-based game profile refs that give you some benefit each time a person clicks them? That's right, everybody spammed them here and there; some people got many visitors, some did not. I wanted to gain some extra funds too, but the only thing I hate more than not having extra funds is spamming... I just felt bad about posting that ref anywhere possible, like:

Check out these hot chicks <a href="#stupid_game_ref#">photos</a>

It's quite obvious that clicking the link yourself again and again didn't have much effect - only requests coming from unique IP addresses (daily) generated profit. So the question was how to access a URI from many IPs using one PC (my PC, that is). The answer is simple: by using anonymous HTTP proxies.

There are many sites that aggregate lists of free proxies, like proxy-list.org. It's best to find a site that lets you fetch the proxy IP/port data in a script-processable way (many sites have captchas).

Let's get to the fun part: the following script executes a specific action for each proxy in the provided list.


  1 import socket
  2 import urllib
  3 import time
  4    
  5 HEADERS = {"User-Agent" : "Mozilla/5.0" }
  6    
  7 proxies = ["http://10.10.1.1:8080", "http://192.168.1.1:80"]  # note the http:// scheme
  8    
  9 #timeout in case proxy does not respond
 10 socket.setdefaulttimeout(5)
 11
 12 for proxy in proxies:
 13     opener = urllib.FancyURLopener({"http" : proxy})
 14     opener.addheaders = HEADERS.items()
 15     try:
 16         res = opener.open("http://some.uri/?ref=123456")
 17         res.read()
 18         time.sleep(3)
 19     except:
 20         print "Proxy %s is probably dead :-(" % proxy
 21        

Line 5 contains a basic User-Agent header; you can find more about setting appropriate headers here. Line 10 sets the default socket timeout to 5 seconds - many proxies tend not to work 24/7, so it's best to catch those exceptions. Finally, we create an opener for each proxy and request some resource (our ref link); you might replace this simple request with a whole set of actions, or even make bots that act via proxies. Just make sure your proxy is truly anonymous (easy to verify by a simple PHP script).
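If you don't feel like hosting your own header-echoing PHP script, a rough Python sketch of the same check is shown below. It assumes a public endpoint such as httpbin.org/headers (which returns the request headers it received as JSON); the request is sent through the proxy and the echoed headers are inspected for anything that leaks your real address, like X-Forwarded-For or Via.

import json
import socket
import urllib

socket.setdefaulttimeout(5)

def looks_anonymous(proxy):
    # the proxy entry must include the scheme, e.g. "http://10.10.1.1:8080"
    opener = urllib.FancyURLopener({"http": proxy})
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    try:
        # httpbin.org/headers echoes back the request headers it received
        received = json.loads(opener.open("http://httpbin.org/headers").read())
    except Exception:
        return False  # dead or unreachable proxy
    echoed = received.get("headers", {})
    # a transparent proxy usually reveals the client in one of these headers
    leaky = ("X-Forwarded-For", "Via", "X-Real-Ip", "Forwarded")
    return not any(header in echoed for header in leaky)

# example: print looks_anonymous("http://10.10.1.1:8080")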


This may be called cheating, but at least it's not spamming :-)


~KR





Tuesday, June 26, 2012

Make websites recognise python/urllib as a webbrowser - (part 2: following links)

A key feature of every bot/crawler is the ability to follow hyperlinks and post data to the web application. If you don't implement additional safeguards around such actions, your bot is likely to get exposed. A typical bot implementation would look like this:

  1 import urllib2
  2
  3 class MyBot():
  4     @classmethod
  5     def _get_default_headers(cls):
  6         return {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; '
  7                               'en-US; rv:1.9.2.24)'}
  8
  9     def execute_request(self, uri, post_data=None, additional_headers={}):
 10         # dict.update() returns None, so build the header dict first
 11         headers = self._get_default_headers()
 12         headers.update(additional_headers)
 13         req = urllib2.Request(uri, post_data, headers)
 14         return urllib2.urlopen(req)


The MyBot class implements a method providing default HTTP headers (_get_default_headers) and a method executing a request (execute_request).

The first thing that is quite suspicious is that all requests made by the bot are performed instantly. When you request a few resources one by one, there is hardly any pause between consecutive requests - it is almost impossible to produce such statistics with a regular browser (human reaction delay). The delay needs to be simulated by the script:

  9     def execute_request(self, uri, post_data=None, additional_headers={}):
 10         import time, random
 11         headers = self._get_default_headers()
 12         headers.update(additional_headers)
 13         req = urllib2.Request(uri, post_data, headers)
 14         if post_data:
 15             time.sleep(random.uniform(1.5, 5))
 16         else:
 17             time.sleep(random.uniform(1, 2.5))
 18         return urllib2.urlopen(req)


The improved execute_request method sleeps for a while before performing each request. The sleep time depends on the request method - filling in a web form usually takes more time than clicking a hyperlink. In the presented code the sleep time varies from 1-2.5 seconds for GET requests to 1.5-5 seconds for POST requests.

Another thing we have to be aware of is that a browser sets the Referer header every time you navigate to a URL. According to RFC 2616, section 14:

The Referer[sic] request-header field allows the client to specify, for the server's benefit, the address (URI) of the resource from which the Request-URI was obtained (...)

So all we have to do is remember the last URI. Let's modify the execute_request method the following way:

  9     def execute_request(self, uri, post_data=None, additional_headers={}):
 10         import time, random
 11         referer = getattr(self, 'referer', None)
 12         if not referer:
 13             referer = "http://google.com/"
 14         additional_headers.update({'Referer': referer})
 15         headers = self._get_default_headers()
 16         headers.update(additional_headers)
 17         req = urllib2.Request(uri, post_data, headers)
 18         self.referer = uri
 19         if post_data:
 20             time.sleep(random.uniform(1.5, 5))
 21         else:
 22             time.sleep(random.uniform(1, 2.5))
 23         return urllib2.urlopen(req)

Lines 11-14 set the current Referer header; if no referer has been stored yet, the script falls back to http://google.com/ as the default. After building the request object we store the current URI as the referer for the next request.

With these two features our script becomes much harder for websites to detect. Just remember: if you're implementing a multi-threaded robot, you should create a separate MyBot instance for each thread, or the referer field will become a mess.
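For illustration, a minimal sketch of that per-thread setup (the URIs below are made up):

import threading

def crawl(uris):
    bot = MyBot()  # each thread gets its own instance, so referers don't mix
    for uri in uris:
        bot.execute_request(uri).read()

# one hypothetical batch of URIs per thread
batches = [["http://example.com/a", "http://example.com/b"],
           ["http://example.com/c", "http://example.com/d"]]

threads = [threading.Thread(target=crawl, args=(batch,)) for batch in batches]
for t in threads:
    t.start()
for t in threads:
    t.join()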

~KR

Monday, May 28, 2012

Make websites recognise python/urllib as a webbrowser - (part 1: http headers)

Today I want to share some of my experience regarding web crawlers and robots/bots. With access to high-level HTTP client libraries (like urllib/urllib2) and basic knowledge of the HTTP protocol, it is easy to implement such a program in a relatively short time. You don't have to respect robots.txt (although in most cases, especially for web crawling, you should!), yet a web application may still detect that your script is not a web browser. What if you actually want to pretend that it is one?


Let's write a simple program that requests some resource on localhost via HTTP protocol:


  1 import urllib2
  2 
  3 req = urllib2.Request("http://127.0.0.1:3333/")
  4 urllib2.urlopen(req)


Now make netcat listen on port 3333:

~ netcat -l 3333


... and execute the script. Netcat captures the following data (it may differ from your output, but will be similar in general):


GET / HTTP/1.1
Accept-Encoding: identity
Host: 127.0.0.1:3333
Connection: close
User-Agent: Python-urllib/2.7


Let's repeat the experiment, but instead of using the Python script, open a web browser and navigate to http://127.0.0.1:3333/ .


This time the output generated by netcat looks like this:

GET / HTTP/1.1
Host: 127.0.0.1:3333
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.24) Gecko/20111107 Linux Mint/9 (Isadora) Firefox/3.6.24
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: pl,en;q=0.7,en-us;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-2,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive



The output may differ depending on your OS and browser version, and some headers may be configured manually in the browser as well (accepted charsets, language, etc.). The header that is usually used to identify the requesting client (browser, bot, crawler) is User-Agent. A way to trick web applications into thinking that your script is a browser is to provide a User-Agent header taken from a real web browser. You can achieve this as follows:



  1 import urllib2
  2 
  3 headers = {
  4     "User-Agent" : "Mozilla/5.0 (X11; U; Linux i686; " + \
  5                     "en-US; rv:1.9.2.24) Gecko/20111107 " + \
  6                     "Linux Mint/9 (Isadora) Firefox/3.6.24",
  7     }
  8 
  9 req = urllib2.Request("http://127.0.0.1:3333/", headers=headers)
 10 urllib2.urlopen(req)

You may also include other headers - the more the better (so the request looks just like one coming from Firefox). However, keep in mind that some headers affect the received content; for example, if you send Accept-Encoding: gzip,deflate you shouldn't be surprised when the server serves you gzip-compressed data. More information about HTTP headers can be found in RFC 2616.
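If you do advertise gzip support, a minimal sketch of decompressing such a response in Python 2 might look like this (the URL is just the local test endpoint used above; a real server is needed to actually get compressed data):

import gzip
import urllib2
from StringIO import StringIO

headers = {
    "User-Agent": "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.24) "
                  "Gecko/20111107 Linux Mint/9 (Isadora) Firefox/3.6.24",
    "Accept-Encoding": "gzip,deflate",
}

req = urllib2.Request("http://127.0.0.1:3333/", headers=headers)
res = urllib2.urlopen(req)
body = res.read()

# decompress only if the server actually used gzip
if res.info().get("Content-Encoding") == "gzip":
    body = gzip.GzipFile(fileobj=StringIO(body)).read()

print body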

This is just one trick, stay tuned - there will be more.

~KR

Tuesday, May 22, 2012

LastPy - a simple last.fm scrobbler

Anybody who loves music must have heard about a portal called last.fm. There are many social music portals, but last.fm is a bit different - the music you listen to is not entered via a 'my favourite genres' form; you submit your music preferences by actually listening to music (plugins for music players, embedded radio stations, etc.).

It's cool, and most popular music players support last.fm, but problems arise when you listen to Internet radio stations in a browser (i.e. via flash plugins). I really loved a radio station called Epic Rock Radio, and I always wondered how to scrobble the songs I was currently listening to.

When I accidentally lost all my music files I was encouraged (besides recovering music from my iPad) to implement a scrobbler that could handle online radio stations.

First I did some research - I searched the last.fm discussion groups and developer guides and found just what I was looking for: the Audioscrobbler Realtime Submission Protocol (the v1.2 specification is available here).


It turned out that last.fm has a REST-like API for scrobbling songs. A simplified session may be presented as follows:

1. HANDSHAKE / AUTHENTICATION
2. "Now Playing"
3. SUBMIT SONG

Where steps 2-3 may be repeated (in any order) as long as the session key (obtained via the HANDSHAKE) is still valid.
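As a rough illustration of step 1 - a sketch based on my reading of the v1.2 spec (parameter names and the handshake host are taken from the spec, so double-check them against the document linked above) - the handshake boils down to a single GET request with a time-based authentication token:

import hashlib
import time
import urllib
import urllib2

def handshake(user, password, client_id="tst", client_version="1.0"):
    # auth token: md5(md5(password) + timestamp), as described in the v1.2 spec
    timestamp = str(int(time.time()))
    auth = hashlib.md5(hashlib.md5(password).hexdigest() + timestamp).hexdigest()
    params = urllib.urlencode({
        "hs": "true", "p": "1.2", "c": client_id, "v": client_version,
        "u": user, "t": timestamp, "a": auth,
    })
    response = urllib2.urlopen("http://post.audioscrobbler.com/?" + params)
    lines = response.read().splitlines()
    # on success the first line is "OK", followed by the session key,
    # the now-playing URL and the submission URL
    if lines and lines[0] == "OK":
        return lines[1], lines[2], lines[3]
    raise RuntimeError("handshake failed: %s" % (lines or ["no response"])[0])

According to the spec, steps 2 and 3 are then plain POST requests to the two returned URLs, each carrying the obtained session key.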

After I had a working audioscrobbler it was much easier to implement one for the mentioned radio station (data available in XML from live365.com) :-)

You can find my prototype implementation on github: LastPy

P.S. I wrote it a few years ago, it ain't clean and tidy, but it works ;-)
 

Wednesday, March 21, 2012

A django login_required decorator that preserves URI query parameters

The default django auth module provides some awesome function decorators: login_required and permission_required. These decorators provide a flexible authorisation mechanism - each time a user tries to access a resource he is not permitted to view (or modify, or do anything else with), he is redirected to the login page. There the user has a chance to authenticate himself or provide new credentials (in case he was already authenticated but lacked permissions), and if the authentication succeeds, the resource may be accessed.

In most cases using these decorators solves the problem of protecting resources while keeping the code (and the user interface) clean.
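For reference, the stock decorator from django.contrib.auth is applied like this:

from django.contrib.auth.decorators import login_required

@login_required
def my_view(request):
    # anonymous users are redirected to the login page with ?next=<requested path>
    pass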

A problem occurs when we try to access a protected resource with some URI query parameters attached, e.g. http://example.com/resource/?foo=1&bar=example. We get redirected to the login page, and after providing our credentials we get redirected back to http://example.com/resource/ ... and the query parameters are gone!

Sadly the default login_required decorator does not preserve them... we have to provide our own decorator:

1. def resource_login_required(some_view):
2.     def wrapper(request, *args, **kw):
3.         if not request.user.is_authenticated():
4.             params = map(lambda x: "%s=%s&" % (x[0], x[1]),
                            request.GET.items())
5.             return HttpResponseRedirect("/login/?%snext=%s" %
                                           ("".join(params), request.path))
6.         else:
7.             return some_view(request, *args, **kw)
8.     return wrapper

Line 4 is the key instruction: the lambda expression maps each key-value pair to its URI query-string representation. Next we concatenate the parameters with the original request path - that's it; after a successful login we should be redirected to the requested resource along with its query parameters.

This decorator may be used just like the previously mentioned ones:

1.  @resource_login_required
2.  def my_view(request):
3.      #your view code
4.      pass

Feel free to adjust this decorator to your needs. 
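As a possible refinement (a sketch under the same assumptions as above, i.e. the login URL is hard-coded to /login/), here is the decorator again with the required import and with urllib.urlencode taking care of escaping the parameter values:

import urllib

from django.http import HttpResponseRedirect

def resource_login_required(some_view):
    def wrapper(request, *args, **kw):
        if not request.user.is_authenticated():
            # urlencode escapes the values, so characters like '&' or spaces survive the round trip
            params = urllib.urlencode(request.GET.items())
            if params:
                params += "&"
            return HttpResponseRedirect("/login/?%snext=%s" % (params, request.path))
        return some_view(request, *args, **kw)
    return wrapper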

Saturday, March 17, 2012

Cross-site ajax requests and security issues

Today I would like to share some of my experience regarding AJAX.

Ajax is commonly used to asynchronously fetch data from a web application deployed on the same domain (the user is not forced to reload the whole page while waiting for new data). So what happens when the target URI's domain differs from the domain the ajax script is running on (cross-site)?

The web browser first makes a preliminary (preflight) request using the OPTIONS method, with headers such as the following set:


X-Requested-With: XMLHttpRequest
Origin: http://my.script.domain.com


The browser informs the requested resource about the type of the request (Ajax) and its origin (my.script.domain.com). If the target site accepts such requests, it should attach the following response headers:

Access-Control-Allow-Origin: http://my.script.domain.com
Access-Control-Allow-Headers: X-Requested-With
Access-Control-Allow-Methods: GET, OPTIONS        
Access-Control-Max-Age: 3600

The first three headers determine the resource's accessibility for certain requests: in this example, only GET/OPTIONS Ajax requests originating from my.script.domain.com may be performed. The fourth header tells the browser how long (in seconds) it may cache the result of the preflight check. So if the intended AJAX request meets these requirements, the browser sends the actual request using the appropriate method and processes the response.

If the response headers do not indicate that the request is allowed, or there are no such headers at all (not all of them are obligatory), the second request is not performed. This security mechanism is implemented in all commonly used browsers... let's face it, without it AJAX would be a very dangerous technology...

Summing up: if you ever want to enable cross-site ajax requests for your site you must configure the Access-Control-Allow-* policy on the target server.


Tip: If you want your server to accept such requests from all sites, you may use the following policy:


Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, POST, DELETE, HEAD, OPTIONS        
Access-Control-Max-Age: 1209600
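How you attach these headers depends on your stack; as a minimal self-contained sketch (plain Python 2 BaseHTTPServer with the permissive policy above - the port and the JSON body are arbitrary, and this is not production code), it could look like this:

from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

class CorsHandler(BaseHTTPRequestHandler):

    def _send_cors_headers(self):
        # the permissive policy from the tip above
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Access-Control-Allow-Methods", "GET, POST, DELETE, HEAD, OPTIONS")
        self.send_header("Access-Control-Allow-Headers", "X-Requested-With")
        self.send_header("Access-Control-Max-Age", "1209600")

    def do_OPTIONS(self):
        # answer the preflight request with the policy and an empty body
        self.send_response(200)
        self._send_cors_headers()
        self.end_headers()

    def do_GET(self):
        body = '{"message": "hello cross-site world"}'
        self.send_response(200)
        self._send_cors_headers()
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 3333), CorsHandler).serve_forever()

The key point is that every response, including the one to the OPTIONS preflight, carries the Access-Control-* headers.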

