Handling redirects in crawler script

monkeymagix · December 13th, 2009, 01:02 PM

I am pretty new to PHP, coming from a DB background, and I am trying to write a PHP script to be used as an AJAX proxy for cross-domain requests.

The code works; however, I am unsure about the section that supposedly handles redirects. It reads the headers looking for a location: URL and, if one is found, makes a recursive call to the function after decreasing a counter. If this counter is set to 2 then it should only follow 2 redirects.

I have created a few test pages where the initial script is called from an HTML page by a JS AJAX call to load a page called redirect1.php. This redirects to redirect2.php, then to redirect3.php, then to redirect4.php, then to the final page where the content is. This is 4 redirects.

The redirects are all done with the following line

Code:

<?php
header('Location: http://localhost/redirect4.php');
?>

I wanted to test the $maxredirs counter by setting it to 1, thinking that it wouldn't return the content: it takes 4 redirects to reach the content, so the function should follow only 1 redirect and then stop. However, no matter what value I set $maxredirs to, the content is always returned. It's as if the code to handle the redirects is never actually used and the redirects are handled automatically anyway.

Is this because the browser is handling the redirects, or am I doing something wrong? Can someone please explain to me when this code to handle the location header would ever be run? I took the code from another site and modified it slightly.

Code:

<?php

$url = $_REQUEST["u"];

if(!empty($url)){
    $html = mycrawler_single($url);
    echo $html["html"];
}else{
    echo "";
}

function mycrawler_single($url, $useragent="",$timeout=10, $maxredirs=1) 
{
    $urlinfo = parse_url($url);
                 
    if (empty($urlinfo["scheme"])) {$urlinfo = parse_url("http://".$url);}                                                                  
    if (empty($urlinfo["path"])) {$urlinfo["path"]="/";}
              
    if (empty($urlinfo["port"]))
    {
            switch($urlinfo["scheme"])
            {
                case "http":
                    $urlinfo["port"] = 80;
                break;  
                case "https":
                    $urlinfo["port"] = 443;
                break;                
            }
    }

    // default to current browsers agent if none supplied
    if (empty($useragent)) $useragent = $_SERVER["HTTP_USER_AGENT"];

    if (isset($urlinfo["query"]))
    {
        $request = "GET ".$urlinfo["path"]."?".$urlinfo["query"]." ";
    } else {   
        $request = "GET ".$urlinfo["path"]." ";
    }

    //echo "request = ".$request;

    $request .= "HTTP/1.0\r\n";
    $request .= "Host: ".$urlinfo["host"]."\r\n";
    $request .= "User-Agent: ".$useragent."\r\n";
    $request .= "Connection: close\r\n\r\n";
    
    //echo "open ".$urlinfo["host"].$urlinfo["port"];

    $fp = fsockopen($urlinfo["host"], $urlinfo["port"], $errno, $errstr, $timeout);

    if (!$fp)
    {
        echo "(".$errno.")".$errstr."\n";           
    }
    else
    {   
        
        // $request;

        fwrite($fp, $request);
        
        while (!feof($fp)) 
        {
            if(isset($data)){
                $data .= fgets($fp, 4096);                      
            }else{
                $data = fgets($fp, 4096);
            }
        }

        fclose($fp);   
        
        //echo "response = ".$data;

        $tmp = explode("\r\n\r\n", $data, 2);
        
        $urlinfo["header"] = $tmp[0];
        $urlinfo["html"] = $tmp[1]; 
        
        //echo "html = ".$urlinfo["html"];
        
         // Handle redirects by reading in the header looking for the location                                   
        // this code never seems to run even when I am doing 4 redirects and $maxredirs=1
        if ((stripos($urlinfo["header"], "location:")) && ($maxredirs > 0))
        {
            preg_match("/\r\nlocation:(.*)/i", $urlinfo["header"], $match);

            if ($match)
            {    
                $redirect = trim($match[1]);
                
                //echo "Redirecting to ".$redirect."\n";
                
                // decrease counter
                $maxredirs--;
                
                // call the function again to follow the redirect
                return mycrawler_single($redirect, $useragent, $timeout, $maxredirs);
            }
        }       

        // return array of header/html
        return $urlinfo;          
    }        
}
?>

Is this code, which parses the headers looking for a location and then makes a recursive call, even required? If so, when is it used? I cannot see a use for it.

Any help would be much appreciated.

zeronexxx · December 13th, 2009, 09:47 PM

Hi,

Can you please make this a bit clearer, as it seems confusing? Please also label the code with file names.

You can do 4 redirects as follows:

page1.php

PHP Code:

<?php header("Location:/page2.php");

page2.php

PHP Code:

<?php header("Location:/page3.php");

page3.php

PHP Code:

<?php header("Location:/page4.php");

The header("Location:<any-page>") call is set in your server-side code, and the resulting redirect is handled by the browser.

Please paste your JavaScript as well. It would also help if you could explain the execution flow and which part you need help with.
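As a rough illustration of what a Location header looks like to a script that reads the raw response itself (as a socket client using fsockopen would), here is a minimal sketch. The helper name parse_redirect and the sample responses are made up for this example and are not part of the original script. Nothing in it follows the redirect; it only reports where the server asked the client to go.

```php
<?php
// Hypothetical helper: inspect a raw HTTP response and return the redirect
// target, or null if the response is not a 3xx redirect with a Location header.
function parse_redirect(string $raw): ?string
{
    // Headers and body are separated by the first blank line.
    $parts = explode("\r\n\r\n", $raw, 2);
    $header = $parts[0];

    // The status line looks like "HTTP/1.0 302 Found".
    if (!preg_match('#^HTTP/\d\.\d\s+(\d{3})#', $header, $status)) {
        return null;
    }
    $code = (int) $status[1];
    if ($code < 300 || $code >= 400) {
        return null;
    }

    // Header names are case-insensitive; match "Location:" at the start of a line.
    if (preg_match('/^location:\s*(.*)$/mi', $header, $m)) {
        return trim($m[1]);
    }
    return null;
}

// Sample responses (invented for illustration).
$redirect = "HTTP/1.0 302 Found\r\nLocation: http://localhost/redirect2.php\r\n\r\n";
$plain    = "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>";

var_dump(parse_redirect($redirect)); // string: http://localhost/redirect2.php
var_dump(parse_redirect($plain));    // NULL
```

Checking the status code as well as the header is a design choice in this sketch; the original function only looks for the header text.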

Kind regards,

monkeymagix · December 14th, 2009, 01:04 AM

My question is whether any of the following code in the crawler function is necessary:

Code:

// Handle redirects by reading in the header looking for the location                                   
// this code never seems to run even when I am doing 4 redirects and $maxredirs=1 which should only allow 1 redirect!! so is it necessary at all?
if ((stripos($urlinfo["header"], "location:")) && ($maxredirs > 0))
{
    preg_match("/\r\nlocation:(.*)/i", $urlinfo["header"], $match);

    if ($match)
    {    
        $redirect = trim($match[1]);

        //echo "Redirecting to ".$redirect."\n";

        // decrease counter
        $maxredirs--;

        // call the function again to follow the redirect
        return mycrawler_single($redirect, $useragent, $timeout, $maxredirs);
    }
}

I am doing my redirects in all the test pages exactly as you specified, e.g.

Code:

 <?php header("Location:/redirect1.php");?>

Code:

 <?php header("Location:/redirect2.php");?>

and so on.
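One thing to watch with redirects written this way: a Location value such as /redirect2.php carries no scheme or host, so a crawler that passes it straight back into its fetch function has nothing to connect to. Below is a rough sketch of resolving such a value against the URL that was just requested. The helper name resolve_location is hypothetical and not part of the original script, and the sketch assumes the base URL is absolute (it has both a scheme and a host).

```php
<?php
// Hypothetical helper: turn a Location header value into an absolute URL,
// using the URL that produced the redirect as the base.
// Assumes $base is absolute, e.g. "http://localhost/redirect1.php".
function resolve_location(string $base, string $location): string
{
    // Already absolute (starts with a scheme): use it unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $location)) {
        return $location;
    }

    $b = parse_url($base);
    $origin = $b['scheme'] . '://' . $b['host'];
    if (isset($b['port'])) {
        $origin .= ':' . $b['port'];
    }

    // Root-relative ("/redirect2.php"): keep the origin, replace the path.
    if (strpos($location, '/') === 0) {
        return $origin . $location;
    }

    // Document-relative ("redirect2.php"): keep the base path's directory.
    $path = $b['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $location;
}

echo resolve_location('http://localhost/redirect1.php', '/redirect2.php'), "\n";
// http://localhost/redirect2.php
echo resolve_location('http://localhost/a/b.php', 'c.php'), "\n";
// http://localhost/a/c.php
```

This does not handle "../" segments or protocol-relative values ("//host/path"); it is only meant to show the idea.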

What I am saying is that the browser seems to follow all these redirects (4 of them) without my code (the part that handles the location header) even being in the function.

So I am asking whether that code is necessary and, if it is, what will cause it to fire.

I have 4 pages that redirect to each other, with a final HTML page at the end that I am trying to retrieve. In theory I would need to set the $maxredirs var to at least 4 to allow for 4 redirects. However, if I set it to 1 the redirects are still all followed. Even if I comment out the whole block of code, the redirects are still all followed. So what is going on? Do I even need to bother handling redirects myself by reading the headers and looking for the location header value, or is something else going on that I am unaware of?

The JavaScript is just a jQuery AJAX call to the .php page containing the code in my first
post, with the URL of the first page, redirect1.php, passed as the value for u.

<a href="#" onclick="AjaxProxy.php?u=redirect1.php">click me</a>

would do the same job. Or you can just hardcode the first page as the value
of the $url var, e.g.

$url = "redirect1.php";

Is that enough?

zeronexxx · December 31st, 2009, 12:04 AM

Hi monkeymagix,

Sorry for the late reply, you might have already solved the issue.

The code is required to avoid infinite redirects (a redirect loop).

In your JavaScript, pass the full URL (path) to the AjaxProxy.php file:
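A small sketch of why the counter matters, using an in-memory table in place of real HTTP requests. The function name follow_redirects, the $pages table and the /final.html and /loop.php entries are all invented for this illustration; only the redirect1.php to redirect4.php chain mirrors the test pages described earlier in the thread.

```php
<?php
// Hypothetical stand-in for the network: each URL maps either to a redirect
// target ('location') or to page content ('html').
function follow_redirects(array $pages, string $url, int $maxredirs): array
{
    $hops = 0;
    while (isset($pages[$url]['location'])) {
        if ($hops >= $maxredirs) {
            // Limit reached: stop here instead of following forever.
            return ['url' => $url, 'html' => null, 'hops' => $hops];
        }
        $url = $pages[$url]['location'];
        $hops++;
    }
    return ['url' => $url, 'html' => $pages[$url]['html'] ?? null, 'hops' => $hops];
}

$pages = [
    '/redirect1.php' => ['location' => '/redirect2.php'],
    '/redirect2.php' => ['location' => '/redirect3.php'],
    '/redirect3.php' => ['location' => '/redirect4.php'],
    '/redirect4.php' => ['location' => '/final.html'],
    '/final.html'    => ['html' => 'content'],
    '/loop.php'      => ['location' => '/loop.php'], // redirects to itself
];

// Four hops are needed to reach the content.
print_r(follow_redirects($pages, '/redirect1.php', 4)); // html is 'content', hops is 4
// With a limit of 1 the walk stops at /redirect2.php without content.
print_r(follow_redirects($pages, '/redirect1.php', 1)); // html is null, hops is 1
// A self-redirecting page ends after the limit rather than never returning.
print_r(follow_redirects($pages, '/loop.php', 5));      // html is null, hops is 5
```

Without the limit, the /loop.php case would never return.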

HTML Code:

<a href="#" onclick="AjaxProxy.php?u=http://www.somehost.com/redirect1.php">click me</a>

This change should trigger the redirect check.
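To show why the full URL may matter, here is a sketch of what the first lines of mycrawler_single() do with each form of input. The wrapper name crawler_target is made up for this example; the parse_url() logic is copied from the posted function. With only a file name, the code prepends "http://" and the file name ends up being treated as the host.

```php
<?php
// Hypothetical wrapper around the scheme-defaulting logic from mycrawler_single().
function crawler_target(string $url)
{
    $urlinfo = parse_url($url);
    if (empty($urlinfo['scheme'])) {
        // Same fallback as the posted code: assume http:// when no scheme is given.
        $urlinfo = parse_url('http://' . $url);
    }
    return $urlinfo;
}

$relative = crawler_target('redirect1.php');
$absolute = crawler_target('http://www.somehost.com/redirect1.php');

print_r($relative); // scheme => http, host => redirect1.php (no path)
print_r($absolute); // scheme => http, host => www.somehost.com, path => /redirect1.php
```

This only explains how the URL is parsed; it is an assumption that this is what was affecting the test above.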

Thanks & Happy new year