asp_web_howto thread: Retrieving content from another page...
Message #1 by "Paul McKeever" <paul@f...> on Mon, 22 Oct 2001 14:14:56
Hey guys,
I'm trying to implement a news service for a site we're working on
(www.nistudent.com).
A national UK newspaper provides a free news feed service through its
site - http://www.guardian.co.uk. The way it works is that they give you a
URL from which to harvest the content to reproduce on your own site, e.g.
http://www.guardian.co.uk/Distribution/Artifact_Trail_Block/0,5184,179821-0-,00.html
As I see it, there are 4 distinct stages to the process:
1. Retrieve content from guardian.co.uk
2. Format content into headlines/supporting text
3. Insert into database
4. Retrieve information on a page within the site
The problem I'm having is that I have no idea how to create a script
that runs automatically and does stage 1.
They provide a sample Perl script to collect the content, but I have only
a little experience with Perl, and don't know how to replicate this in
VBScript.
I've appended the Perl script at the bottom.
Any help, thoughts, ideas, suggestions, comments or anything that may
point me in the right direction would be really appreciated.
Thanks in advance,
Paul McKeever
http://www.front-online.com
PERL SCRIPT--------------------------------
#!/usr/local/bin/perl -Tw
#################################################################################
#
# Content Harvester
#
# v1.1 - Jamie Unwin, Kieran Topping
# Guardian Unlimited, Guardian Newspapers Limited 2000
#
# Automatically harvest distributed Guardian Unlimited content
#
# ===============================================================================
# IMPORTANT NOTES - PLEASE READ
# ===============================================================================
# This script is provided "as is" and as an EXAMPLE only.
#
# This script will need to be modified in order to fit into your particular
# environment, and to add an appropriate level of error checking.
#
# Guardian Unlimited cannot offer technical support for implementing this script.
#
# Modification and execution should only be attempted by the Webmaster or
# Sys-Admin of your site, and then only if they have experience and
# responsibility in the following fields:
#
# * Perl
# * LWP module
# * Webserver & (your particular) operating system.
#
# No responsibility can be accepted by Guardian Unlimited for any damage caused
# to your website or computer systems arising from use of this script.
#
# If in doubt, DO NOT EXECUTE THIS SCRIPT.
#
# See http://www.guardianunlimited.co.uk/distribution
# for further conditions of use
#
#################################################################################
# load required modules
use strict;          # this turns on strict error checking
use LWP::UserAgent;  # this loads the LWP module (used to retrieve a web page)
#################################################################################
# Global variables
#################################################################################
# --This scalar will need to be edited to suit your particular environment--
# Path to the document root of your web space
# (on your local file system)
#
my $doc_root = '/www/htdocs';
# --This scalar will need to be edited to suit your particular environment--
# Path to local directory relative to your document root
# (this is where the retrieved pages will be stored)
#
my $content_directory = $doc_root . '/content';
# --This hash will need to be edited to suit your particular environment--
# URLs of the content you wish to retrieve.
#
# The format is - 'local filename' => 'remote url'
#
# The 'local filename' is the name that the file will have on your webspace.
# This is chosen by you, and is specified relative to the content directory.
# e.g. 'guardian_news.html'
#
# The 'remote url' is the URL of the content you wish to retrieve.
# You can obtain these URLs by following the instructions at
# http://www.guardianunlimited.co.uk/distribution
# These will look like
# http://www.guardianunlimited.co.uk/Distribution/[...].html
#
my %content_to_retrieve = (
    'guardian_news.html'
        => 'http://www.guardianunlimited.co.uk/Distribution/[...].html',
    'guardian_tv_radio.html'
        => 'http://www.guardianunlimited.co.uk/Distribution/[...].html'
);
#################################################################################
# Main
#################################################################################
# create a user agent (this is like a browser)
my $ua = new LWP::UserAgent;
$ua->agent('ContentHarvester/1.1 (GU)');
# loop through each piece of content to be harvested
foreach my $local_filename (keys %content_to_retrieve) {
    my $remote_url = $content_to_retrieve{$local_filename};
    # get the page (retrieve content)
    my $request  = new HTTP::Request('GET', $remote_url);
    my $response = $ua->request($request);
    my $content  = $response->content;
    # check we got the page
    # (a method call is not interpolated inside a string, so concatenate it)
    unless ($response->is_success) {
        die "$remote_url, " . $response->error_as_HTML . "\n";
    }
    # save the file to the local file system
    open (CONTENT, ">$content_directory/$local_filename")
        or die "Can't store the retrieved file locally, $content_directory/$local_filename, $!\n";
    print CONTENT $content . "\n";
    close CONTENT;
}
Message #2 by Nick Charlesworth <nick@f...> on Mon, 22 Oct 2001 14:31:43 +0100
Hi Paul,
I am the web developer for www.fizzin.co.uk and we use the Guardian service
for our news.
See our front page and news pages to see how the feeds appear.
I will send you the script I have used to harvest the news feeds and save
them to files on our server.
If anyone else is interested then let me know.
Message #3 by "Alex Shiell, ITS, EC, SE" <alex.shiell@s...> on Mon, 22 Oct 2001 14:54:24 +0100
You would need to do steps 1-3 in a .vbs file that can be scheduled to run
under Windows Script Host.
The Perl script uses a Perl module (LWP::UserAgent) to retrieve the
web page from the site. To replicate this functionality in
VBScript, you would need to use ServerXMLHTTP (available in MSXML3) or some
kind of third-party component.
Have you tried using the Perl script itself? You might save yourself a lot of
hassle...
You can download the Perl script engine from
http://www.activestate.com/Products/ActivePerl/, and http://www.perl.com
should contain all the resources you need.
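For the VBScript route, a minimal sketch of the ServerXMLHTTP approach might look like the following. It mirrors the loop in the Perl harvester and is untested against the live feed. It assumes MSXML3 and ADO are installed on the machine. The CONTENT_DIR path is a made-up example, and the [...] in the URL is the same placeholder the Guardian script uses, so substitute the feed URL you were given. ADODB.Stream is used here only as one way to save the raw response bytes unchanged; nothing in the Guardian script prescribes it.

```vbscript
Option Explicit

' ASSUMPTION: example folder only; it must already exist on your server.
Const CONTENT_DIR = "C:\Inetpub\wwwroot\content"

' ADODB.Stream constants (declared here because WSH does not define them).
Const adTypeBinary = 1
Const adSaveCreateOverWrite = 2

Dim feeds, http, fso, stream, localName, targetPath

' 'local filename' => 'remote url', as in the Perl hash.
' The [...] is a placeholder: replace it with your real feed URL.
Set feeds = CreateObject("Scripting.Dictionary")
feeds.Add "guardian_news.html", _
    "http://www.guardianunlimited.co.uk/Distribution/[...].html"

Set http = CreateObject("MSXML2.ServerXMLHTTP")
Set fso = CreateObject("Scripting.FileSystemObject")

For Each localName In feeds.Keys
    ' Synchronous GET (third argument False = wait for the response).
    ' A connection failure raises a runtime error and stops the script.
    http.open "GET", feeds(localName), False
    http.send

    ' Check we got the page.
    If http.status <> 200 Then
        WScript.Echo feeds(localName) & ", HTTP " & http.status & " " & http.statusText
        WScript.Quit 1
    End If

    ' Save the response bytes to the local file system, overwriting any old copy.
    targetPath = fso.BuildPath(CONTENT_DIR, localName)
    Set stream = CreateObject("ADODB.Stream")
    stream.Type = adTypeBinary
    stream.Open
    stream.Write http.responseBody
    stream.SaveToFile targetPath, adSaveCreateOverWrite
    stream.Close
Next
```

Saved under a name such as harvest.vbs (hypothetical), this could be run from a command prompt with "cscript //nologo harvest.vbs" and then set up as a scheduled task.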
Message #4 by <odempsey@b...> on Mon, 22 Oct 2001 21:52:44 +0100
If your background is in ASP using VBScript, it would be easier for you to
write a program in VB. You wouldn't even need to automate the execution of
the program: somebody would only need to start it once a
day, and the program would take care of the rest.
Kind Regards
Oliver Dempsey
Message #5 by "Alex Shiell, ITS, EC, SE" <alex.shiell@s...> on Tue, 23 Oct 2001 09:17:35 +0100
Why have someone start it every day when it's so easy to schedule?
Message #6 by <odempsey@b...> on Wed, 24 Oct 2001 09:45:28 +0100
Can you schedule from Win 98 or Win Millennium?
Message #7 by "Alex Shiell, ITS, EC, SE" <alex.shiell@s...> on Wed, 24 Oct 2001 10:28:30 +0100
Yes, there is a task scheduler in My Computer. On Windows 98 it is installed
with IE5, and it comes as standard with ME.