pro_php thread: screen scraping with PHP


Message #1 by "Christopher Janney" <wroxlist@a...> on Sat, 1 Mar 2003 13:03:57 -0800
You don't even need to do any socket stuff if allow_url_fopen is turned 
on, just use

$data = file( "http://www.somesite.com/sompage.html" );

and then you've got everything in the page in the array $data.
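If you do want to search that array directly, a minimal sketch might look like the following. The helper name find_matching_lines() is made up for illustration, and the literal array stands in for what file() would return (note that file() leaves the newline on the end of each element).

```php
<?php
// Hypothetical helper: return the lines that contain $needle,
// keyed by their 1-based line number.
function find_matching_lines(array $lines, $needle)
{
    $matches = array();
    foreach ($lines as $index => $line) {
        // strpos() returns false on no match; offset 0 is a valid hit,
        // so compare with !== rather than relying on truthiness.
        if (strpos($line, $needle) !== false) {
            $matches[$index + 1] = rtrim($line, "\r\n");
        }
    }
    return $matches;
}

// Stand-in for $data = file( "http://..." ); the sample markup is invented.
$data = array("<html>\n", "<p>Price: 10</p>\n", "<p>Price: 12</p>\n", "</html>\n");
$hits = find_matching_lines($data, 'Price:');
// $hits is array(2 => '<p>Price: 10</p>', 3 => '<p>Price: 12</p>')
```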

You could use fopen()/fgets() and just search each line one at a time 
until you find what you're looking for, but that means holding the 
connection open while you make repeated reads. Since PHP doesn't 
have any firm upper limit on the size of a string (beyond available 
memory), I'd just use file(), glom it together into a single string 
with implode() and search the string -- if you're that worried about 
memory, do

$content = implode( '', $data );
unset( $data );
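Once everything is in one string, the search itself can be plain strpos()/substr() work. Here is a rough sketch, assuming you want the text between two known markers; extract_between() is a hypothetical helper and the sample markup is invented.

```php
<?php
// Hypothetical helper: return the text between the first $open marker
// and the next $close marker, or null if either marker is missing.
function extract_between($content, $open, $close)
{
    $pos = strpos($content, $open);
    if ($pos === false) {
        return null;
    }
    $start = $pos + strlen($open);
    $end = strpos($content, $close, $start);
    if ($end === false) {
        return null;
    }
    return substr($content, $start, $end - $start);
}

// Stand-in for the array that file() would return.
$data = array("<html>\n", "<title>Widget prices</title>\n", "</html>\n");

$content = implode('', $data); // empty glue: file() kept the newlines
unset($data);                  // drop the array once the string exists

$title = extract_between($content, '<title>', '</title>');
// $title is "Widget prices"
```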

If you're running PHP 4.3 or above, you can get the entire contents of 
the file as a single string $content using

$content = file_get_contents( "http://www.somesite.com/sompage.html" );

and save yourself the intermediate step.
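Remote fetches do fail now and then, so it is probably worth checking the return value. The sketch below is one way to do it, not the only one: fetch_page() is a made-up name, the 10-second timeout is an arbitrary assumption, and the stream-context argument needs PHP 5 or later.

```php
<?php
// Hypothetical helper: fetch a page as one string, or null if the fetch fails.
// Remote URLs still require allow_url_fopen to be enabled.
function fetch_page($url, $timeout_seconds = 10)
{
    // The 'timeout' HTTP context option keeps a slow server from
    // hanging the script; it is ignored for local files.
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeout_seconds),
    ));
    // file_get_contents() returns false on failure; @ hides the
    // warning it raises at the same time.
    $content = @file_get_contents($url, false, $context);
    return ($content === false) ? null : $content;
}

// Usage (not run here, since it needs network access):
// $content = fetch_page("http://www.somesite.com/sompage.html");
// if ($content === null) { /* handle the failure */ }
```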

Since your typical Web page doesn't usually get *that* big (maybe 
75-100k of text max, if that much?) I really wouldn't obsess over the 
memory issue too much. (Not that you should waste it flagrantly, 
but no need to cringe over every byte, either!)

You can't inspect any content that you don't somehow load into one or 
more variables, if that's what you mean. One thing I would watch so far 
as resource management goes: use str_replace() rather than regexps 
whenever possible -- regular expressions can eat up a lot of memory in 
a hurry, if you let them get out of hand.
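As a quick illustration of where each one fits (the sample markup is made up): literal substitutions need only str_replace(), which accepts arrays of search and replacement strings, while preg_replace() is reserved for text that actually varies.

```php
<?php
// Invented sample input.
$html = "<b>Total:</b>&nbsp;42&nbsp;USD";

// Literal search strings: no regex engine involved.
$plain = str_replace(
    array('<b>', '</b>', '&nbsp;'),
    array('',    '',     ' '),
    $html
);
// $plain is "Total: 42 USD"

// A pattern is only needed when the text varies, e.g. any run of digits.
$masked = preg_replace('/\d+/', '#', $plain);
// $masked is "Total: # USD"
```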

Just my 2 cents' Australian, I guess. :)

j.

professional php digest wrote:

> 
> Subject: screen scraping with PHP
> From: "Christopher Janney" <E-MAIL REMOVED>
> Date: Sat, 1 Mar 2003 13:03:57 -0800
> X-Message-Number: 2
> 
> I'm trying to build a 'screen scraping' class that is general enough to
> access a broad range of sites for the same info.  That's for me to figure
> out.  The question is what is the best approach to getting the page?  Open
> an HTTP socket, request the page, dump the page into an array, search the
> array and pow!  done?  That sounds like a lot of wasted memory to me, but
> I've only built a shopping cart and dynamic pages in PHP so far.  No socket
> stuff yet.
> 
> 
> TIA,
> 
> -ctj

-- 
jon stephens
<zontar@m...>

http://hiveminds.info/ HiveMinds Group
http://phpuddi.sourceforge.net/ phpUDDI Project
http://www.wrox.com/ Wrox Press "Programmer To Programmer"
http://www.glasshaus.com/ glasshaus "Web Developer To Web Developer"

