
asp_web_howto thread: Retrieving content from another page...


Message #1 by "Paul McKeever" <paul@f...> on Mon, 22 Oct 2001 14:14:56
Hey guys,



I'm trying to implement a news service for a site we're working on 

(www.nistudent.com).



A national UK newspaper provides a free news feed service through its
site - http://www.guardian.co.uk. The way it works is that they give you a
URL from which to harvest the content to reproduce on your own site, e.g.

http://www.guardian.co.uk/Distribution/Artifact_Trail_Block/0,5184,179821-0-,00.html



As I see it, there are 4 distinct stages to the process:



1. Retrieve content from guardian.co.uk

2. Format content into headlines/supporting text

3. Insert into database

4. Retrieve information on a page within the site



The problem I'm having is that I have no idea how to create a script
that runs automatically to do stage 1.



They provide a sample Perl script to collect the content, but I have only 

a little experience with Perl, and don't know how to replicate this in 

VBScript.



I've appended the Perl script at the bottom.



Any help, thoughts, ideas, suggestions, comments or anything that may 

point me in the right direction would be really appreciated.



Thanks in advance,







Paul McKeever

http://www.front-online.com



PERL SCRIPT--------------------------------





#!/usr/local/bin/perl -Tw

###########################################################################
#
# Content Harvester
#
# v1.1 - Jamie Unwin, Kieran Topping
# Guardian Unlimited, Guardian Newspapers Limited 2000
#
# Automatically harvest distributed Guardian Unlimited content
#
# =========================================================================
# IMPORTANT NOTES - PLEASE READ
# =========================================================================
# This script is provided "as is" and as an EXAMPLE only.
#
# This script will need to be modified in order to fit into your
# particular environment, and to add an appropriate level of error
# checking.
#
# Guardian Unlimited cannot offer technical support for implementing this
# script.
#
# Modification and execution should only be attempted by the Webmaster or
# Sys-Admin of your site, and then only if they have experience and
# responsibility in the following fields:
#
#     * Perl
#     * LWP module
#     * Webserver & (your particular) operating system.
#
# No responsibility can be accepted by Guardian Unlimited for any damage
# caused to your website or computer systems arising from use of this
# script.
#
# If in doubt, DO NOT EXECUTE THIS SCRIPT.
#
# See http://www.guardianunlimited.co.uk/distribution
# for further conditions of use
#
###########################################################################

# load required modules
use strict;          # this turns on strict error checking
use LWP::UserAgent;  # this loads the LWP module (used to retrieve a web page)

###########################################################################
# Global variables
###########################################################################

# --This scalar will need to be edited to suit your particular environment--
# Path to the document root of your web space
# (on your local file system)
#
my $doc_root = '/www/htdocs';

# --This scalar will need to be edited to suit your particular environment--
# Path to local directory relative to your document root
# (this is where the retrieved pages will be stored)
#
my $content_directory = $doc_root . '/content';

# --This hash will need to be edited to suit your particular environment--
# URLs of the content you wish to retrieve.
#
# The format is - 'local filename' => 'remote url'
#
# The 'local filename' is the name that the file will have on your
# webspace. This is chosen by you, and is specified relative to the
# content directory.
# e.g. 'guardian_news.html'
#
# The 'remote url' is the URL of the content you wish to retrieve.
# You can obtain these URLs by following the instructions at
# http://www.guardianunlimited.co.uk/distribution
# These will look like
# http://www.guardianunlimited.co.uk/Distribution/[...].html
#
my %content_to_retrieve = (
   'guardian_news.html'
      => 'http://www.guardianunlimited.co.uk/Distribution/[...].html',
   'guardian_tv_radio.html'
      => 'http://www.guardianunlimited.co.uk/Distribution/[...].html'
);

###########################################################################
# Main
###########################################################################

# create a user agent (this is like a browser)
my $ua = LWP::UserAgent->new;
$ua->agent('ContentHarvester/1.1 (GU)');

# loop through each piece of content to be harvested
foreach my $local_filename (keys %content_to_retrieve) {

   my $remote_url = $content_to_retrieve{$local_filename};

   # get the page (retrieve content)
   my $request  = HTTP::Request->new('GET', $remote_url);
   my $response = $ua->request($request);
   my $content  = $response->content;

   # check we got the page
   # (method calls are not interpolated inside a double-quoted string,
   # so the error text is concatenated explicitly)
   unless ($response->is_success) {
      die "$remote_url, " . $response->error_as_HTML . "\n";
   }

   # save the file to the local file system
   open (CONTENT, ">$content_directory/$local_filename")
      or die "Can't store the retrieved file locally, $content_directory/$local_filename, $!\n";
   print CONTENT $content . "\n";
   close CONTENT;
}



Message #2 by Nick Charlesworth <nick@f...> on Mon, 22 Oct 2001 14:31:43 +0100
Hi Paul,



I am the web developer for www.fizzin.co.uk and we use the guardian service

for our news.



See our frontpage and news pages to see how the feeds appear.



I will send you the script I have used to harvest the news feeds and save
them to files on our server.



If anyone else is interested then let me know.



Message #3 by "Alex Shiell, ITS, EC, SE" <alex.shiell@s...> on Mon, 22 Oct 2001 14:54:24 +0100
You would need to do steps 1-3 in a .vbs file that can be automated to run
under Windows Script Host.

This script uses a Perl module (LWP::UserAgent) to retrieve the web page
from the site.  If you want to replicate this functionality in VBScript,
you would need to use ServerXMLHTTP (available in MSXML3), or some kind of
third-party component.



Have you tried using the script itself? You might save yourself a lot of

hassle...  



You can download the Perl script engine from
http://www.activestate.com/Products/ActivePerl/, and http://www.perl.com
should contain all the resources you need.
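
As a rough illustration of the ServerXMLHTTP route mentioned above, here is a minimal sketch of stage 1 (fetch the feed and save it to a file) as a Windows Script Host script. It assumes MSXML 3.0 is installed on the machine. The URL is the example from the original post; the local path and the file name harvest.vbs are hypothetical placeholders, and the error handling is deliberately thin.

```vbscript
' harvest.vbs - run from a command prompt with: cscript //nologo harvest.vbs
Option Explicit

' Assumption: edit both values for your own site. The local path is a
' made-up example, not something taken from the Guardian's instructions.
Const REMOTE_URL = "http://www.guardian.co.uk/Distribution/Artifact_Trail_Block/0,5184,179821-0-,00.html"
Const LOCAL_FILE = "C:\Inetpub\wwwroot\content\guardian_news.html"

Dim http, fso, outFile

' Fetch the page with a synchronous GET (third argument False = not async)
Set http = CreateObject("MSXML2.ServerXMLHTTP")
http.open "GET", REMOTE_URL, False
http.send

' Stop if the server did not return 200 OK
If http.status <> 200 Then
    WScript.Echo "Fetch failed: " & http.status & " " & http.statusText
    WScript.Quit 1
End If

' Save the response body; the second argument True overwrites an existing file.
' CreateTextFile writes ANSI by default, so pass True as a third argument
' (Unicode) if the feed contains characters outside the local code page.
Set fso = CreateObject("Scripting.FileSystemObject")
Set outFile = fso.CreateTextFile(LOCAL_FILE, True)
outFile.Write http.responseText
outFile.Close
```

The saved file plays the same role as the file written by the Perl script: an ASP page can then read or include it, and the parsing and database steps (stages 2 and 3) can be added to the same .vbs file.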



Message #4 by <odempsey@b...> on Mon, 22 Oct 2001 21:52:44 +0100
If your background is in ASP using VBScript, it would be easier for you to
write a program in VB.  You wouldn't even need to automate the execution of
the program: somebody would only need to start it once a day, and the
program would take care of the rest.





Kind Regards

Oliver Dempsey







Message #5 by "Alex Shiell, ITS, EC, SE" <alex.shiell@s...> on Tue, 23 Oct 2001 09:17:35 +0100
Why have someone start it every day when it's so easy to schedule?



Message #6 by <odempsey@b...> on Wed, 24 Oct 2001 09:45:28 +0100
Can you schedule from Win 98 or Win Millennium?




Message #7 by "Alex Shiell, ITS, EC, SE" <alex.shiell@s...> on Wed, 24 Oct 2001 10:28:30 +0100
Yes, there is a Task Scheduler in My Computer. On Windows 98 it is installed
with IE5; on ME it comes as standard.


