
Open Proxy Harvesting

I like things that reach out to the web and harvest other things. I somehow got involved in a project to compile a blacklist of open proxies for an EFnet admin. The code he had so far was a joke; I honestly couldn't work out how it even ran. He refused to give me any details about how this would actually be deployed, so I took the approach that it would be a daemonized spider that populates databases.

This is a stripped-down version of the code I turned over. It is used in conjunction with Ghettocode's Lib::Google. The finished spider maintains a MySQL table of search terms relating to open proxies, tracks parsed URLs in a similar table, and does not check them again for a dynamic period of time. It was also meant to populate a database with the IPs and ports it finds and/or tests to be still 'activated'.

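package Lib::Spider;

use strict;
use warnings;

use WWW::Mechanize;

sub new {
        my $package = shift;
        my $self = {
                url     => "",
                report  => 0,
                verbose => 0,
                # stack_depth => 0 keeps no page history;
                # autocheck => 0 stops a failed fetch from killing the spider
                mech    => WWW::Mechanize->new( stack_depth => 0,
                                                autocheck   => 0,
                                                agent       => "Proxy Harvest")
        };
        # Two-argument bless so subclasses end up in the right package
        bless($self, $package);
        return $self;
}

# Fetch a URL and return every "ip:port" or "ip port" string found in it
sub grab {
        my $self = shift;
        $self->{url} = shift;
        my @proxies;

        $self->{mech}->get($self->{url});
        print "Fetching ". $self->{url} . " - " . $self->{mech}->status() . " - " . $self->{mech}->ct() . "\n" if ($self->{verbose});
        # Nothing to parse if the request failed
        return @proxies unless $self->{mech}->success();

        my $content = $self->{mech}->content();
        my @lines = split(/\n/, $content);
        foreach my $line (@lines) {
                # \d+ at the end: an address with no port is not a usable proxy
                while ($line =~ /(\d+\.\d+\.\d+\.\d+(?::|\s+)\d+)/g) {
                        push @proxies, $1;
                }
        }
        return @proxies;
}

1;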

Here is a small example that ties in Lib::Google.

#!/usr/bin/perl
#GHETTO!!!
use strict;
use warnings;
 
use Lib::Google;
use Lib::Spider;
 
my $obj = Lib::Spider->new();
my $search_object = Lib::Google->new();
 
#$obj->{verbose} = 1;
#$search_object->{verbose} = 1;
 
my @urls = $search_object->search("socks proxy list", 5);
 
foreach my $url (@urls) {
        my @proxies = $obj->grab($url);
 
        # scalar(@proxies) is the count; $#proxies is the last index (count - 1)
        print "Found ". scalar(@proxies) ." proxies\n";
        foreach my $proxy (@proxies) {
                print "$proxy\n";
        }
}

I was surprised how many proxies this spider is able to grab in a short period of time. Its ability to do its job depends on the quantity and quality of the search terms you provide it.

If you plan to use this in conjunction with a database, note that I had planned to resolve duplicate proxy addresses in the database itself. I suppose you could instead check for an entry's existence before the insert.
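
A rough sketch of that kind of check, done in Perl before anything touches the database: normalize each hit to ip:port, throw away anything that can't be a real address, and remember what has already been seen. The name clean_proxies is made up for this example and is not part of Lib::Spider, and the in-memory hash only stands in for whatever lookup the real table would give you.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lib::Spider): normalize, validate and
# de-duplicate the raw strings that grab() returns.
sub clean_proxies {
    my @raw = @_;
    my (%seen, @clean);
    foreach my $entry (@raw) {
        # Accept "ip:port" as well as "ip   port"
        next unless $entry =~ /^(\d+)\.(\d+)\.(\d+)\.(\d+)(?::|\s+)(\d+)$/;
        my @octets = ($1, $2, $3, $4);
        my $port   = $5;
        # Drop matches that cannot be a real address or port
        next if grep { $_ > 255 } @octets;
        next if $port < 1 || $port > 65535;
        # Adding 0 strips leading zeros so "010" and "10" compare equal
        my $proxy = join('.', map { $_ + 0 } @octets) . ':' . ($port + 0);
        # %seen plays the role of the "does it already exist?" lookup
        push @clean, $proxy unless $seen{$proxy}++;
    }
    return @clean;
}
```

You would call it as clean_proxies($obj->grab($url)) and only insert what comes back.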

I've also noticed that a large number of proxy sites try to defeat spiders by displaying their data in tables. We've decided to use HTML::TokeParser to get around this.
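
To give an idea of what the HTML::TokeParser approach can look like, here is a minimal sketch. It assumes the IP and the port sit in neighbouring td cells and that every row is closed with a /tr tag; real sites will vary. The sub name proxies_from_tables is invented for this example.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return "ip:port" for every table row in which a cell
# holding an IP is immediately followed by a cell holding a port.
# Assumes each row is closed with </tr>.
sub proxies_from_tables {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html) or return;
    my @proxies;
    while ($parser->get_tag('tr')) {
        my @cells;
        # Walk the row, collecting the text of each <td> up to its </td>
        while (my $tag = $parser->get_tag('td', '/tr')) {
            last if $tag->[0] eq '/tr';
            push @cells, $parser->get_trimmed_text('/td');
        }
        # Look for an IP cell directly followed by an all-digit port cell
        foreach my $i (0 .. $#cells - 1) {
            next unless $cells[$i]     =~ /^\d+\.\d+\.\d+\.\d+$/;
            next unless $cells[$i + 1] =~ /^\d+$/;
            push @proxies, join(':', $cells[$i], $cells[$i + 1]);
        }
    }
    return @proxies;
}
```

Because get_trimmed_text reads through nested markup, a cell like td/b/10.0.0.2 still comes back as plain text.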

And lastly, some proxies are written onto the page by JavaScript. The browser is expected to solve a math problem, and the solution is used to display the port number (or something like that). This is a little more difficult to get around because the JavaScript tends to differ slightly from site to site. The logic is usually pretty simple though.

It looks like whatever is responsible for rendering the page does something tantamount to splitting the port number in two and renders the JavaScript accordingly, leaving the browser to put the pieces back together.

Example:

document.write("IP: 127.0.0.1");
document.write("PORT: " + (2000 + 400));

I would recommend creating several JavaScript filters and iterating through each one, supplying the page content. If it matches, it matches; if it doesn't, it doesn't. Data mining is what it is.
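
A minimal sketch of that filter idea follows. Each filter is just a sub that takes the page content and returns whatever ip:port strings it recognizes. The two patterns (port written as a sum, port written as a product) are illustrative guesses, not lifted from any particular site, and make_filter and harvest_js are names invented for this example.

```perl
use strict;
use warnings;

# Build a filter for pages that write the port as "<number> <op> <number>".
# $op is the literal operator character, $combine does the arithmetic.
sub make_filter {
    my ($op, $combine) = @_;
    # The IP string, then at most 40 characters, then the PORT string
    # followed by a concatenation ("+" or ".") and the expression.
    my $pattern = qr/IP:\s*(\d+\.\d+\.\d+\.\d+)".{0,40}?PORT:\s*"\s*[.+]\s*\(?\s*(\d+)\s*\Q$op\E\s*(\d+)/s;
    return sub {
        my ($content) = @_;
        my @found;
        while ($content =~ /$pattern/g) {
            push @found, "$1:" . $combine->($2, $3);
        }
        return @found;
    };
}

# Hypothetical filter list; add one entry per obfuscation style you run into.
my @js_filters = (
    make_filter('+', sub { $_[0] + $_[1] }),   # e.g. "PORT: " + (2000 + 400)
    make_filter('*', sub { $_[0] * $_[1] }),   # e.g. "PORT: " + (1564 * 2)
);

# Run every filter over the page and collect whatever matched.
sub harvest_js {
    my ($content) = @_;
    return map { $_->($content) } @js_filters;
}
```

Keeping each filter as its own sub means a new obfuscation style is one more entry in the list, and a filter that doesn't match simply returns nothing.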

Let me know if you have any unique takes on this process or something similar. Of course, we'd love to see any spin-offs you make of our code. Please let us know.

Attachment           Size
Spider.pm            780 bytes
mod_spider.pl.txt    424 bytes