Wrox Programmer Forums

#1
December 13th, 2009, 01:02 PM
Registered User

Join Date: Jun 2008
Location: London
Posts: 4
Thanks: 0
Thanked 0 Times in 0 Posts
Handling redirects in crawler script

I am pretty new to PHP, coming from a DB background, and I am trying to write a PHP script to act as an AJAX proxy that handles cross-domain requests.

The code works, but I am unsure about the section that supposedly handles redirects. It reads the response headers and looks for a Location: header; if it finds one, it decrements a counter and calls the function again recursively. If the counter is set to 2, it should follow only 2 redirects.

I have created a few test pages. The initial script is called from an HTML page by a JS AJAX call that loads a page called redirect1.php. This redirects to redirect2.php, then to redirect3.php, then to redirect4.php, then to the final page where the content is. That is 4 redirects.

The redirects are all done with a line like the following:

Code:
<?php
header('Location: http://localhost/redirect4.php');
?>
I wanted to test the $maxredirs counter by setting it to 1. Since it takes 4 redirects to reach the content, I expected the script to follow only 1 redirect, stop, and not return the content. However, no matter what value I set $maxredirs to, the content is always returned. It's as if the code that handles the redirects is never actually used and the redirects are handled automatically anyway.

Is this because the browser is handling the redirects, or am I doing something wrong? Can someone please explain when this code that handles the Location header would ever run? I took the code from another site and modified it slightly.

Code:
<?php

$url = $_REQUEST["u"];

if(!empty($url)){
    $html = mycrawler_single($url);
    echo $html["html"];
}else{
    echo "";
}

function mycrawler_single($url, $useragent="",$timeout=10, $maxredirs=1) 
{
    $urlinfo = parse_url($url);
                 
    if (empty($urlinfo["scheme"])) {$urlinfo = parse_url("http://".$url);}                                                                  
    if (empty($urlinfo["path"])) {$urlinfo["path"]="/";}
              
    if (empty($urlinfo["port"]))
    {
            switch($urlinfo["scheme"])
            {
                case "http":
                    $urlinfo["port"] = 80;
                break;  
                case "https":
                    $urlinfo["port"] = 443;
                break;                
            }
    }

    // default to current browsers agent if none supplied
    if (empty($useragent)) $useragent = $_SERVER["HTTP_USER_AGENT"];

    if (isset($urlinfo["query"]))
    {
        $request = "GET ".$urlinfo["path"]."?".$urlinfo["query"]." ";
    } else {   
        $request = "GET ".$urlinfo["path"]." ";
    }

    //echo "request = ".$request;

    $request .= "HTTP/1.0\r\n";
    $request .= "Host: ".$urlinfo["host"]."\r\n";
    $request .= "User-Agent: ".$useragent."\r\n";
    $request .= "Connection: close\r\n\r\n";
    
    //echo "open ".$urlinfo["host"].$urlinfo["port"];

    $fp = fsockopen($urlinfo["host"], $urlinfo["port"], $errno, $errstr, $timeout);

    if (!$fp)
    {
        echo "(".$errno.")".$errstr."\n";           
    }
    else
    {   
        
        // $request;

        fwrite($fp, $request);
        
        while (!feof($fp)) 
        {
            if(isset($data)){
                $data .= fgets($fp, 4096);                      
            }else{
                $data = fgets($fp, 4096);
            }
        }

        fclose($fp);   
        
        //echo "response = ".$data;

        $tmp = explode("\r\n\r\n", $data, 2);
        
        $urlinfo["header"] = $tmp[0];
        $urlinfo["html"] = $tmp[1]; 
        
        //echo "html = ".$urlinfo["html"];
        
        // Handle redirects by reading in the header looking for the location
        // this code never seems to run even when I am doing 4 redirects and $maxredirs=1
        if ((stripos($urlinfo["header"], "location:")) && ($maxredirs > 0))
        {
            preg_match("/\r\nlocation:(.*)/i", $urlinfo["header"], $match);

            if ($match)
            {    
                $redirect = trim($match[1]);
                
                //echo "Redirecting to ".$redirect."\n";
                
                // decrease counter
                $maxredirs--;
                
                // call the function again to follow the redirect
                return mycrawler_single($redirect, $useragent, $timeout, $maxredirs);
            }
        }       

        // return array of header/html
        return $urlinfo;          
    }        
}
?>
Is the code that parses the headers for a Location header and then makes a recursive call even required? If so, when is it used? I cannot see a use for it.

Any help would be much appreciated.
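
One way to check whether the redirect branch can fire at all is to run the header parsing on a captured response, outside the crawler. The following is a minimal sketch, not the original script: inspect_redirect is a hypothetical helper and the sample response is made up for illustration. It splits the response the same way the crawler does and reports the status code and any Location header.

```php
<?php
// Hypothetical helper (not part of the original script): split a raw HTTP
// response and report the status code and any Location header it carries.
function inspect_redirect(string $rawResponse): array
{
    // Headers end at the first blank line, as in the crawler's explode() call.
    $parts = explode("\r\n\r\n", $rawResponse, 2);
    $header = $parts[0];

    // The status line looks like "HTTP/1.1 302 Found".
    $status = 0;
    if (preg_match('#^HTTP/\d(?:\.\d)?\s+(\d{3})#', $header, $m)) {
        $status = (int) $m[1];
    }

    // Match a Location header at the start of a line, case-insensitively.
    $location = null;
    if (preg_match('/^Location:\s*(.+)$/mi', $header, $m)) {
        $location = trim($m[1]);
    }

    return [
        'status'     => $status,
        'location'   => $location,
        'isRedirect' => $status >= 300 && $status < 400 && $location !== null,
    ];
}

// Made-up sample response, assumed for illustration only.
$sample = "HTTP/1.1 302 Found\r\n"
        . "Location: http://localhost/redirect2.php\r\n"
        . "Content-Length: 0\r\n\r\n";

$info = inspect_redirect($sample);
echo $info['status'], ' -> ', $info['location'], "\n";
```

Echoing the raw header that the crawler actually receives and feeding it to a check like this should show whether a 3xx response ever reaches the function.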
#2
December 13th, 2009, 09:47 PM
Authorized User
 
Join Date: Dec 2008
Location: London
Posts: 50
Thanks: 1
Thanked 5 Times in 5 Posts

Hi,

Can you please make it a bit clearer, as it seems confusing? Please also label the code with file names.

You can do 4 redirects as follows:

page1.php
PHP Code:
<?php header("Location:/page2.php");

page2.php
PHP Code:
<?php header("Location:/page3.php");

page3.php
PHP Code:
<?php header("Location:/page4.php");
The header("Location: <any-page>") call, which you set in your server-side code, is handled by the browser.

Please paste your JavaScript as well. It would also help if you could explain the execution flow and which part you need help with.

Kind regards,
#3
December 14th, 2009, 01:04 AM
Registered User

Join Date: Jun 2008
Location: London
Posts: 4
Thanks: 0
Thanked 0 Times in 0 Posts
Redirect code doesn't seem to do anything

My question is whether any of the following code in the crawler function is necessary:

Code:
// Handle redirects by reading in the header looking for the location
// this code never seems to run even when I am doing 4 redirects and $maxredirs=1 which should only allow 1 redirect!! so is it necessary at all?
if ((stripos($urlinfo["header"], "location:")) && ($maxredirs > 0))
{
    preg_match("/\r\nlocation:(.*)/i", $urlinfo["header"], $match);

    if ($match)
    {
        $redirect = trim($match[1]);

        //echo "Redirecting to ".$redirect."\n";

        // decrease counter
        $maxredirs--;

        // call the function again to follow the redirect
        return mycrawler_single($redirect, $useragent, $timeout, $maxredirs);
    }
}
I am doing my redirects in all the test pages exactly as you specified, e.g.:

Code:
 <?php header("Location:/redirect1.php");?> 
Code:
 <?php header("Location:/redirect2.php");?> 
and so on.

What I am saying is that the browser seems to follow all of these redirects (4 of them) even without my code (the part that handles the Location header) being in the function.

Therefore I am asking whether that code is necessary and, if it is, what will cause it to fire.

Say I have 4 pages that each redirect to the next, with a final HTML page at the end that I am trying to retrieve. In theory I would need to set the $maxredirs var to at least 4 to allow for 4 redirects. However, if I set it to 1 the redirects are still all followed. Even if I comment out the whole block of code, the redirects are still all followed. So what is going on? Do I even need to bother handling redirects myself by reading the headers and looking for the Location header value, or is something else going on that I am unaware of?
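
The expected effect of the counter can be checked without any network traffic. The sketch below is a simulation under stated assumptions, not the crawler itself: the $pages map stands in for the test pages, the name final.html is invented for the last page, and follow_redirects is a hypothetical function that only models the hop counting.

```php
<?php
// Hypothetical stand-in for the network: each URL maps to the URL it
// redirects to, or to null when it serves the final content.
// The name final.html is an assumption; the thread does not name that page.
$pages = [
    'http://localhost/redirect1.php' => 'http://localhost/redirect2.php',
    'http://localhost/redirect2.php' => 'http://localhost/redirect3.php',
    'http://localhost/redirect3.php' => 'http://localhost/redirect4.php',
    'http://localhost/redirect4.php' => 'http://localhost/final.html',
    'http://localhost/final.html'    => null,
];

// Follow at most $maxredirs hops and report where the walk stopped.
function follow_redirects(array $pages, string $start, int $maxredirs): array
{
    $url = $start;
    $hops = 0;

    // Keep going while the current page redirects and the budget allows it.
    while ($hops < $maxredirs && ($pages[$url] ?? null) !== null) {
        $url = $pages[$url];
        $hops++;
    }

    return [
        'url'      => $url,
        'hops'     => $hops,
        'finished' => ($pages[$url] ?? null) === null,
    ];
}

foreach ([1, 4] as $limit) {
    $r = follow_redirects($pages, 'http://localhost/redirect1.php', $limit);
    printf("maxredirs=%d stopped at %s after %d hop(s)\n", $limit, $r['url'], $r['hops']);
}
```

In this model a limit of 1 stops at redirect2.php and only a limit of 4 or more reaches the final page. If the real setup returns the content with a limit of 1, that suggests something other than this counter is following the redirects.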

The JavaScript is just a jQuery AJAX call to the .php page containing the code in my first post, with the URL of the first page, redirect1.php, passed as the value for u.

<a href="AjaxProxy.php?u=redirect1.php">click me</a>

would do the same job. Or you can just hardcode the first page as the value of the $url var, e.g.

$url = "redirect1.php";

Is that enough?
#4
December 31st, 2009, 12:04 AM
Authorized User
 
Join Date: Dec 2008
Location: London
Posts: 50
Thanks: 1
Thanked 5 Times in 5 Posts

Hi monkeymagix,

Sorry for the late reply; you might have already solved the issue.

The code is required to avoid an infinite redirect loop (deadlock).

In your JavaScript, pass the full URL (path) to the AjaxProxy.php file:

HTML Code:
<a href="AjaxProxy.php?u=http://www.somehost.com/redirect1.php">click me</a>
This change should trigger the redirect check.
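
A possible reason the full URL matters can be seen by running the scheme check from the top of mycrawler_single() on both forms of the URL. This is a sketch only: resolve_target is a hypothetical name, and it reproduces just the parse_url steps from the posted function.

```php
<?php
// Hypothetical wrapper that mirrors the scheme and path checks at the top
// of mycrawler_single(); it performs no network access.
function resolve_target(string $url): array
{
    $info = parse_url($url);

    // No scheme: the posted script prepends "http://" and parses again.
    if (empty($info['scheme'])) {
        $info = parse_url('http://' . $url);
    }

    // No path: the posted script falls back to the site root.
    if (empty($info['path'])) {
        $info['path'] = '/';
    }

    return $info;
}

$relative = resolve_target('redirect1.php');
$absolute = resolve_target('http://www.somehost.com/redirect1.php');

// Print the host and path that fsockopen and the GET line would be given.
echo $relative['host'], ' ', $relative['path'], "\n";
echo $absolute['host'], ' ', $absolute['path'], "\n";
```

With the bare file name, the file name itself ends up as the host and the path becomes /, so fsockopen would be asked to connect to a host called redirect1.php. With the full URL, the host and path are what the test pages expect.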

Thanks, and Happy New Year