Wrox Programmer Forums
Welcome to the p2p.wrox.com Forums.

You are currently viewing the Perl section of the Wrox Programmer to Programmer discussions. This is a community of software programmers and website developers including Wrox book authors and readers. New member registration was closed in 2019. New posts were shut off and the site was archived into this static format as of October 1, 2020. If you require technical support for a Wrox book please contact http://hub.wiley.com
 
Old November 7th, 2008, 01:13 PM
Registered User
 
Join Date: Oct 2008
Posts: 5
Thanks: 0
Thanked 0 Times in 0 Posts
Snag webpage content

Hello,

I need to snag the content of some webpages (.php files), and I heard that Perl can fetch a webpage and store it as a text file. I am totally new to this. Any hint or example would be greatly appreciated.

Peter
 
Old November 20th, 2008, 01:06 PM
Friend of Wrox
 
Join Date: Dec 2003
Posts: 488
Thanks: 0
Thanked 3 Times in 3 Posts

Yeah, that's easy enough. You'll want to install a CPAN module called LWP::Simple for this. Do perl -MCPAN -e 'install LWP::Simple' from a command prompt first.

Then open a new perl file in your fave text editor.

Cut and paste it; the explanations are in the code comments.

Code:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

# While we keep getting lines
while(<>) {
  # Do the following with each line
  next unless(/^http:\/\//);  # skip everything if line 
                              # doesn't start http:
  chomp();                    # get rid of the line-ending
  print "Retrieving $_\n";    # Tell user what's going on
  my $filename = $_;          # use url as filename
  $filename =~ s/http:\/\///; # get rid of http:// bit
  $filename =~ s/\//_/g;      # change /s into _s
  my $page = get($_);         # retrieve the page
  unless (defined $page) {    # couldn't get the page:
    warn "Couldn't retrieve $_\n"; # warn and move on to the next url
    next;                     # (so one bad url doesn't kill the rest)
  }
  open(my $out, '>', $filename) # Open file for writing
    or die("Couldn't open $filename\n$!\n"); # or die, explaining why we can't
  print $out $page;           # print page to file
  close $out;                 # close file
  print "Wrote file $filename\n"; # Tell user what's going on
}
Save that as snag.pl

(if you're using a unix or linux box, make it executable chmod a+x snag.pl)

You can now run snag.pl by typing ./snag.pl (or perl snag.pl)

It'll wait for you to type a url, then it'll retrieve the page and tell you the file where it's put it.

Alternatively, fire up a text editor and make a file of urls, one per line (say urls.txt). That would look like this:
Code:
http://google.com
http://charlieharvey.org.uk/index.pl
Then do:
Code:
$ ./snag.pl urls.txt 
Retrieving http://google.com
Wrote file google.com
Retrieving http://charlieharvey.org.uk/index.pl
Wrote file charlieharvey.org.uk_index.pl
--
Charlie Harvey's website - linux, perl, java, anarchism and punk rock: http://charlieharvey.org.uk










Powered by vBulletin®
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.
Copyright (c) 2020 John Wiley & Sons, Inc.