Code critique invited.

Daniel Walker · April 30th, 2004, 09:13 AM

Hi all,
 I've had this chunk of code for some time, and I tend to use it quite a lot, as it is - and certainly it's proven useful as such. It's simple, reliable, pretty fast, effective... and a bit of a hack, so I'd welcome some impartial comments about how it could be improved, or whether it's any use to people, as it is.

 Basically what it's for, is finding all the words in a search string, and highlighting them when you output matching results - a bit like the Google highlighter. The highlighting is case-sensitive, finding instances of the search words in three possible guises:
 1. all-in-lower-case
 2. lower-case-with-first-letter-capitalised
and 3. all-in-capitals
 It preserves this in the highlighted results.

 The real hack is in how it avoids munging its own highlighting tags during sequential searches. As a disclaimer, I will say that it actually predates preg_replace, and has thus rather evolved, over time. In the interests of a) passing on a passable and useful bit of code that I think adds a visually compelling element to search results, and b) improving it without having to do any work myself (:)), I place it before you all for your scrutiny and withering critique.

 Synopsis:
 You'll have done a search for something in a database along the lines of:

$searchString = $_POST['searchString'];
$field = $_POST['field'];

// retrieve all contacts from DB
$sql = "SELECT $field ";
$sql .= "FROM table ";
// NB: $field and $searchString come straight from $_POST; whitelist/escape them before real use
$sql .= "WHERE $field LIKE '%$searchString%' ";
$sql .= "ORDER BY TRIM($field) ASC";

$result = $db->query($sql);

Where we specify a search string and a field to search. That's pretty straightforward, I know. However, with our list of searched fields, in each case, we can do this:

$searchedField = $row[$field];
$searchedField = highlighter(trim($searchString), $searchedField);

...which shoves it through this thing:

<?php
/* This function runs through the text given to it as '$haystack' and
highlights all matching instances of all words in the search string
'$needle'. It probably seems Byzantine, but I'm sure it will prove
useful, with further refinement. */
function highlighter($needle, $haystack){
 //Break the search string into single words...
 $needles = explode(" ", $needle);

 /*We create two easily matched strings. These mark the
 start and end of each highlighted section and will be replaced by
 tags in the final run through.*/
 $regstart = "#¬#~#";
 $regend = "#~#¬#";

 /*The rationale is that these strings are very unlikely to
 actually be part of the string we're searching. If we were to
 insert <span> tags directly, they would be liable to
 insertion, themselves, on each subsequent search&replace (if
 we were searching for fragments of "span class=", such as "a" or
 "an" - or the word "class", itself, of course!)*/

 //Then we pattern-match a maximum of four times for each word...
 foreach($needles as $needleword){
 //Skip the empty "words" that doubled-up spaces leave behind
 if($needleword == "") continue;

 /*Remember the word as entered, so that we never add the same
 pattern twice (which would wrap a match in two sets of markers)*/
 $entered = $needleword;

 /*Start building our search & replace string arrays, starting
 with the search text as first entered. preg_quote() stops
 characters like "/" or "." being treated as regex syntax...*/
 $patterns[] = "/" . preg_quote($needleword, "/") . "/";
 $replacements[] = $regstart .$needleword. $regend;

 /*Then, if the word isn't in lower case already, we'll search
 for it in lowercase*/
 if($needleword!=strtolower($needleword)){
 $patterns[] = "/" . preg_quote(strtolower($needleword), "/") . "/";
 $replacements[] = $regstart.strtolower($needleword).$regend;
 }

 /*Then, if the word doesn't have a capital letter for its
 first letter already, we search with the first letter
 capitalised*/
 if($needleword!=ucwords(strtolower($needleword))){
 $needleword = ucwords(strtolower($needleword));
 $patterns[] = "/" . preg_quote($needleword, "/") . "/";
 $replacements[] = $regstart .$needleword. $regend;
 }

 /*Then, finally, if the word isn't capitalised already (and
 wasn't entered that way), we search for it in ALL CAPITALS.*/
 if($needleword!=strtoupper($needleword) && $entered!=strtoupper($needleword)){
 $needleword=strtoupper($needleword);
 $patterns[] = "/" . preg_quote($needleword, "/") . "/";
 $replacements[] = $regstart .$needleword. $regend;
 }
 }//... we do this for each word in turn

 //Now perform the replacements...
 $haystack = preg_replace($patterns, $replacements, $haystack);

 //... then replace our delimiters with the actual tags...
 $haystack = str_replace($regstart, '<span class="highlight">', $haystack);
 $haystack = str_replace($regend, '</span>', $haystack);

 /*(Could probably use preg_replace for this, too, but creating the
 arrays in the first place probably takes just as long...)*/

 //...and then we return our modified string...
 return $haystack;
}
?>

Where "highlight" is, obviously, something pretty distinctive like yellow text on a dark maroon background (what sort of bunch of aesthetically inept losers would adopt a colour scheme like that?)
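
For anyone wanting to see why the marker strings matter, here is a minimal sketch that compares inserting the tags directly with the two-pass placeholder approach described above. The function names are made up for illustration, and the class name "highlight" is an assumption based on the description of the stylesheet.

```php
<?php
// Hypothetical demo functions (not part of the original highlighter).

// Naive version: each pass can match inside tags added by earlier passes.
function highlight_direct(array $words, string $text): string
{
    foreach ($words as $word) {
        $text = str_replace($word, '<span class="highlight">' . $word . '</span>', $text);
    }
    return $text;
}

// Placeholder version: wrap matches in unlikely marker strings first,
// then swap the markers for real tags once every word has been processed.
function highlight_with_placeholders(array $words, string $text): string
{
    $start = "#¬#~#";
    $end   = "#~#¬#";
    foreach ($words as $word) {
        $text = str_replace($word, $start . $word . $end, $text);
    }
    return str_replace(
        [$start, $end],
        ['<span class="highlight">', '</span>'],
        $text
    );
}

// Searching for "lamb" and then "an": the second word also occurs inside "span".
echo highlight_direct(['lamb', 'an'], 'an old lamb'), "\n";
echo highlight_with_placeholders(['lamb', 'an'], 'an old lamb'), "\n";
```

The first call mangles its own markup (the "an" inside "span" gets wrapped), while the second leaves the tags intact. Note that the markers only protect the tags; a search word that is itself part of another search word can still end up nested.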

Anyway, what do you reckon?

Dan

richard.york · May 1st, 2004, 06:33 PM

I dunno Dan, this looks pretty good to me. I use little hacks like that from time to time.. that is, invent placeholders for data. One of my favorites is to use characters that bear resemblance to HTML entities, such as: &id; where a unique id will be replaced later on. I've done a similar thing in my search program, but I don't think mine was case-sensitive. I just did a straight-up replacement of all search words using str_replace, so my approach wasn't quite as advanced. Obviously using preg_replace would be marginally faster, but who's counting the milliseconds?

My $0.02, anyway. Maybe Nik's listening in and has something to say.

Regards,
Rich

::::::::::::::::::::::::::::::::::::::::::
The Spicy Peanut Project
http://www.spicypeanut.net
::::::::::::::::::::::::::::::::::::::::::

Daniel Walker · May 2nd, 2004, 05:32 AM

Aye? Well, you're welcome to use it :). I'll "BSD licence" it.

It's quite cute, when you show your script to the client and perform a search for "ea", say, and the results instantly start filling the screen with all instances of "year", "Early" and "BEA Weblogic" highlighted - with the letters still in their correct case.

What I'd like is a suggestion of some means of highlighting the _actual_ search string. At the moment, if you do a search for "mary had a little lamb", the _database_ is searched for all instances of field LIKE '%mary had a little lamb%' (and the search is case-insensitive, by default, of course). However, the results are displayed with all instances of "mary", "a", "had", etc. highlighted (including in "mary's lamb was eaten by a wolf hiding in the shadows") - which rather detracts from the effect :P.

It probably just requires a bit of thought, but, at the moment, I notice that the sun is shining, outside, so I think I'll just go and do something very unprogrammer-like that involves getting dirty and tired, instead...

richard.york · May 2nd, 2004, 06:36 AM

Quote: Originally posted by Daniel Walker

What I'd like is a suggestion of some means of highlighting the _actual_ search string. At the moment, if you do a search for "mary had a little lamb", the _database_ is searched for all instances of field LIKE '%mary had a little lamb%' (and the search is case-insensitive, by default, of course). However, the results are displayed with all instances of "mary", "a", "had", etc. highlighted (including in "mary's lamb was eaten by a wolf hiding in the shadows") - which rather detracts from the effect :P.

Ah, I see what you're saying now (sorry I'm a bit dense for the details now and then). Well you probably need some more Google-ish syntax. Do you already have a mechanism in place to specify the search string literally and not as exploded terms?

"Mary had a little lamb" with exploded terms:

It goes in as
SELECT * FROM table
WHERE field LIKE '%word1%'
AND field LIKE '%word2%' ...etc.

First it's exploded on the space, then highlighted using that array.

Whereas, if the search string is delimited by quotations, "\"Mary had a little lamb\"", the parts enclosed with quotations are supposed to be taken literally.

It goes
SELECT * FROM table
WHERE field LIKE '%search_string%'

Then the term is exploded into bits based on where the quotations start and stop.

I haven't gotten around to implementing "Google" syntax like this myself; the following was my approach to it.

Code:

        /*
         * mixed explode_search(void) takes a search term and breaks it down into individual words
         * via the explode() function, this is then passed to an array and a where clause
         * is built from the search term array and a field array.
         *
        */

        function explode_search()
        {
            if (isset($_GET["search"]))
            {
                $search = urldecode($_GET["search"]);

                if (stristr($search, " "))
                {
                    $search = trim($search);
                    $search = explode(" ", $search);

                    // count() instead of each(): each() was removed in PHP 8 and never reset the array pointer
                    for ($n = 0; $n < count($this->search_fields); $n++)
                    {
                        if ($n == 0)    $where  = $this->loop_search($search, $n, $this->search_fields[$n]);
                        else            $where  .= $this->loop_search($search, $n, $this->search_fields[$n]);
                    }
                }
                else
                {
                    // count() instead of each(): each() was removed in PHP 8 and never reset the array pointer
                    for ($n = 0; $n < count($this->search_fields); $n++)
                    {
                        if ($n == 0)    $where  = $this->search_fields[$n]." LIKE '%".$search."%'";
                        else            $where  .= " OR ".$this->search_fields[$n]." LIKE '%".$search."%'";
                    }
                }

                return $where;
            }
        }

        /*
         * loop_search() is a function called upon by explode_search() to build a
         * where clause.
        */

        function loop_search($search, $n, $search_field)
        {
            $where = ""; // start empty so the first .= below never touches an undefined variable
            for ($i = 0; $i < count($search); $i++)
            {
                if ($n == 0 && $i == 0)
                {
                    $where = $search_field." LIKE '%".$search[$i]."%'";
                }
                else
                {
                    $where .= ($i == 0)?  " OR ".$search_field." LIKE '%".$search[$i]."%'" : " AND ".$search_field." LIKE '%".$search[$i]."%'";
                }
            }

            return $where;
        }

$where = $this->explode_search();

It takes a pre-defined list of fields and builds the whole WHERE query. It could be easily modified to do Google syntax, but I haven't yet gotten around to it.

With this approach you can save the search term array when it's built here, then pass it along to your highlighter function.. then there's no need to explode it there, and the regular expressions in that function will just deal with whatever words or phrases you pass to it, rather than only single exploded words.
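
A rough sketch of that idea, under some assumptions: build_search() is a made-up name, the field list is assumed to be a trusted, pre-defined array (as with $this->search_fields above), and the "?" placeholders assume a database layer with prepared statements (PDO, for example) rather than the string concatenation used in the code above.

```php
<?php
// Hypothetical helper: builds the WHERE clause once and hands back the
// same term array for the highlighter to reuse.
function build_search(string $search, array $fields): array
{
    // Split on any run of whitespace, dropping empty pieces.
    $terms = preg_split('/\s+/', trim($search), -1, PREG_SPLIT_NO_EMPTY);
    if ($terms === []) {
        return ['', [], []];
    }

    $groups = [];
    $params = [];
    foreach ($fields as $field) {
        $likes = [];
        foreach ($terms as $term) {
            // Every term must appear in this field...
            $likes[]  = $field . ' LIKE ?';
            $params[] = '%' . $term . '%';
        }
        // ...but a match in any one field is enough.
        $groups[] = '(' . implode(' AND ', $likes) . ')';
    }

    return [implode(' OR ', $groups), $params, $terms];
}

list($where, $params, $terms) = build_search('mary lamb', ['title', 'body']);
echo $where, "\n";
```

$where and $params go to the query, and $terms goes straight to the highlighter, so the search string is only taken apart in one place.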

That'd be the way I'd go about it anyway :).

Regards,
Rich


Daniel Walker · May 3rd, 2004, 07:20 AM

Right. I prefer the "exact match" approach, since it is usually what the user intends, in my
experience. However, having just read your code, it suddenly hit me like a brick on the forehead
why the Google search engine asks for double quotes around exact match searches: it
isn't just some quaint little shorthand that it happens to be using, but a request, on the part
of Google, for the user to insert a piece of regular expression (the " sign) that it can
then shove through its sausage maker. My mistake was doing the explode within the highlighter,
by default. I should make the explode an option for all portions of the text passed to it without
double quotes around it and push all double-quoted text through the mill, unexploded. Database
searches for text that wasn't double quoted could be handled by the code you have given, above
(explode it and then search for each LIKE '%word%'), while (as far as the database search was
concerned) text that was quoted could be handled by code like that in the original post.
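
A minimal sketch of that splitting step, assuming nothing beyond plain double quotes as the "exact match" marker. parse_search_terms() is a hypothetical name, not something from the code posted so far.

```php
<?php
// Hypothetical helper: split a search string into exact phrases (the
// double-quoted parts) and bare words (everything else).
function parse_search_terms(string $search): array
{
    // Alternative 1 captures the inside of "a quoted phrase";
    // alternative 2 captures any other run of non-space characters.
    preg_match_all('/"([^"]+)"|(\S+)/', $search, $matches, PREG_SET_ORDER);

    $terms = [];
    foreach ($matches as $match) {
        $term = ($match[1] !== '') ? $match[1] : $match[2];
        // Drop stray, unbalanced quotes and skip anything left empty.
        $term = trim($term, "\" \t");
        if ($term !== '') {
            $terms[] = $term;
        }
    }
    return $terms;
}

print_r(parse_search_terms('"mary had" a little "lamb"'));
```

The resulting array holds "mary had" as one unexploded term and the unquoted words as separate terms, so the same list can drive both the LIKE clauses and the highlighter.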

I suppose the reason I want to find out how this is done is partly because it's useful, in itself,
but mostly because, as far as I know, they (Google) use standard PHP/Apache running
on standard Linux boxes to achieve the same effect... so it must be doable :).

Daniel Walker · May 10th, 2004, 08:21 AM

For what it's worth, I completely replaced my somewhat cumbersome process of
three/four-way search by using stristr to parse out matching blocks of text
in a case-insensitive manner - preserving the exact matches and adding them
to an array of replacements on an if(!in_array(... basis.

This has fixed one of the longstanding problems with the code, since it finds
and highlights suitable matches for sources of bicapitalisation, such as
surnames like McDonnald, MacDonnald, O'Niel, D'Acre, etc., as well as camel-casing
in quoted code, shift-key-obsessive language names like JavaScript, etc. All of
these shared a sequence of capitalisation that did not come close to matching the
somewhat simplistic rules I had originally been using.
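
The actual replacement code isn't shown here, so what follows is only a guess at its general shape. It swaps in stripos() with an offset for the stristr() call mentioned above, since that makes it easier to step through every match, and find_case_variants() is a made-up name.

```php
<?php
// Hypothetical helper: collect each distinct spelling of $needle exactly as
// it appears in $haystack (case-insensitive for ASCII letters only).
function find_case_variants(string $needle, string $haystack): array
{
    $variants = [];
    $length   = strlen($needle);
    if ($length === 0) {
        return $variants;
    }

    $offset = 0;
    while (($pos = stripos($haystack, $needle, $offset)) !== false) {
        // Keep the text as it was actually written in the haystack.
        $found = substr($haystack, $pos, $length);
        if (!in_array($found, $variants, true)) {
            $variants[] = $found;
        }
        $offset = $pos + $length;
    }
    return $variants;
}

print_r(find_case_variants('javascript', 'JavaScript, javascript and Javascript'));
```

Each variant found this way can then be wrapped in the marker strings as before, which is how spellings like "JavaScript" get picked up without any fixed capitalisation rules.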

This, coupled with an improved method of searching that I'm building, using wildcards
that the user can input, and double quotes to indicate exact matches for word sequences
should make for a much more useful piece of code. I'll probably write this into a web
article and post it up, when I'm done, but what I've described here is a brief overview.

Daniel Walker · May 11th, 2004, 04:25 PM

Maintaining this dialogue with myself even further, I see how stupid I've been to
attempt this mechanistic approach to isolating matching strings in a procedural
manner, when standard Perl-compatible regular expressions could have done it for me.

By saying:

$haystack = preg_replace('/('.preg_quote($needle, '/').')/i','<span class="highlight">$1</span>',$haystack);

I'd have been able to wrap <span>s around all matching instances
of needle without needing to do all that elaborate search&replace stuff.
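
Extending that one-liner to several words or phrases at once might look something like the sketch below. highlight_terms() and its default class name "highlight" are assumptions for illustration, and the text is assumed to be safe to output as HTML already.

```php
<?php
// Hypothetical helper: highlight every term in a single case-insensitive
// pass, so inserted tags are never searched again.
function highlight_terms(array $terms, string $haystack, string $class = 'highlight'): string
{
    $quoted = [];
    foreach ($terms as $term) {
        if ($term !== '') {
            // Escape regex metacharacters and the "/" delimiter.
            $quoted[] = preg_quote($term, '/');
        }
    }
    if ($quoted === []) {
        return $haystack;
    }

    // Longest first, so a phrase wins over a shorter word it starts with.
    usort($quoted, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    $pattern = '/(' . implode('|', $quoted) . ')/i';
    // $1 is the text as it appears in the haystack, so its case is preserved.
    return preg_replace($pattern, '<span class="' . $class . '">$1</span>', $haystack);
}

echo highlight_terms(['ea'], 'year, Early and BEA Weblogic'), "\n";
```

Because all the terms sit in one alternation, there is only one replacement pass and the marker-string trick is no longer needed to protect the tags.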

Oh well, time to do some proper reading up on RegExs, I suppose.