Wrox Programmer Forums
Welcome to the p2p.wrox.com Forums.

You are currently viewing the Perl section of the Wrox Programmer to Programmer discussions. This is a community of software programmers and website developers including Wrox book authors and readers. New member registration was closed in 2019. New posts were shut off and the site was archived into this static format as of October 1, 2020. If you require technical support for a Wrox book please contact http://hub.wiley.com
 
Old November 7th, 2008, 01:13 PM
Registered User
 
Join Date: Oct 2008
Posts: 5
Thanks: 0
Thanked 0 Times in 0 Posts
Snag webpage content

Hello,

I need to snag the content of some webpages (.php files), and I heard that Perl can fetch a webpage and store it as a text file. I am totally new to this. Any hint or example would be greatly appreciated.

Peter
 
Old November 20th, 2008, 01:06 PM
Friend of Wrox
 
Join Date: Dec 2003
Posts: 488
Thanks: 0
Thanked 3 Times in 3 Posts

Yeah, that's easy enough. You'll want to install a CPAN module called LWP::Simple for this. Do perl -MCPAN -e 'install LWP::Simple' from a command prompt first.

Then open a new perl file in your fave text editor.

Cut and paste it; the explanations are in the code comments.

Code:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

# While we keep getting lines
while(<>) {
  # Do the following with each line
  next unless(/^http:\/\//);  # skip everything if line 
                              # doesn't start http:
  chomp();                    # get rid of the line-ending
  print "Retrieving $_\n";    # Tell user what's going on
  my $filename = $_;          # use url as filename
  $filename =~ s/http:\/\///; # get rid of http:// bit
  $filename =~ s/\//_/g;      # change /s into _s
  my $page = get($_);         # retrieve the page
  unless (defined $page) {    # couldn't get the page:
    warn "Couldn't retrieve $_\n"; # warn and move on to the next url
    next;                     # (so one bad url doesn't kill the rest)
  }
  open(my $out, '>', $filename) # Open file for writing
    or die("Couldn't open $filename\n$!\n"); # or die, explaining why we can't
  print $out $page;           # print page to file
  close $out;                 # close file
  print "Wrote file $filename\n"; # Tell user what's going on
}
Save that as snag.pl

(if you're using a unix or linux box, make it executable chmod a+x snag.pl)

You can now run snag.pl by typing ./snag.pl (or perl snag.pl)

It'll wait for you to type a url, then it'll retrieve the page and tell you the file where it's put it.

Alternatively, fire up a text editor and make a file of urls, one per line (say urls.txt). That would look like this:
Code:
http://google.com
http://charlieharvey.org.uk/index.pl
Then do:
Code:
$ ./snag.pl urls.txt 
Retrieving http://google.com
Wrote file google.com
Retrieving http://charlieharvey.org.uk/index.pl
Wrote file charlieharvey.org.uk_index.pl
--
Charlie Harvey's website - linux, perl, java, anarchism and punk rock: http://charlieharvey.org.uk










Powered by vBulletin®
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.
Copyright (c) 2020 John Wiley & Sons, Inc.