p2p.wrox.com Forums


richard.york November 4th, 2003 09:27 PM

perl compatible regular expressions
 
I was wondering if anyone knew of a good tutorial on Perl-compatible regular expressions. I am trying to write a regular expression that would replace links in an email program with HTML-formatted links.

I tried Nik's example in the following thread:
http://p2p.wrox.com/topic.asp?TOPIC_ID=5482

But I seem to have run into a snag: I designed my program as a class and cannot find a way to call the callback function from within the class. I also tried defining the callback function in global scope, but the regular expression function didn't return the mail body. I've attempted several examples that I found on the web, and none of them bring back the message body.

Here is one example that I tried:
$msg_body = imap_fetchbody($this->mailbox, $mid, $pid);

$msg_body = preg_replace("/([\w\.]+)(@)([\S\.]+)\b/i","<a href=\"mailto:$0\">$0</a>", $msg_body);
$msg_body = preg_replace("(^)"<a href=\"http$3://$4$5\"target=\"_blank\">$2$4$5</a>", $msg_body);

Neither of these looks like a very good solution.

If I comment out the preg_replace calls, the message body shows up; when I use them, I get a blank message body.

I don't know much about regular expressions anyway, so I am at a loss to see where it might be going wrong. None of my PHP books seems to discuss Perl-compatible regular expressions in any detail, though they do talk quite a bit about POSIX-style regular expressions.

Thanks in advance!
: )
Rich


:::::::::::::::::::::::::::::::::
Smiling Souls
http://www.smilingsouls.net
:::::::::::::::::::::::::::::::::

richard.york November 4th, 2003 11:37 PM

I was able to figure out a way to get Nik's example working.

Apparently my decode function, which decodes the message body from quoted-printable, was creating a conflict, so I moved it to run before the regular expression replacement.

I used create_function() to use preg_replace_callback from within my class.

$msg_body = imap_fetchbody($this->mailbox, $mid, $pid);
$msg_body = $this->decode_message($msg_body, $this->encoding[$mid][$i]);

$pattern = '!\bhttps?://([\w\-]+\.)+[a-zA-Z]{2,3}(/(\S+)?)?\b!';

$msg_body = htmlspecialchars($msg_body);
$msg_body = preg_replace_callback($pattern, create_function('$matches', 'return "<a href=\'".$matches[0]."\' target=\'_new\'>".$matches[0]."</a>";'), $msg_body);

: )
Rich


nikolai November 5th, 2003 03:55 PM

Hey Rich,

I'd recommend reading through PHP's manual pages:
  http://www.php.net/pcre

Check out their "pattern syntax" and "pattern modifiers" pages. Also, search for 'perl regular expression tutorial' on Google; there are lots of hits.


I don't think you need to use create_function() in your case; the problem with that approach is that you create a new unnamed function EVERY time execution reaches that point. I don't think it causes a huge amount of excess overhead, but it's there nonetheless.
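For anyone reading this thread on a later PHP version: create_function() was deprecated in PHP 7.2 and removed in PHP 8.0, and anonymous functions (available since PHP 5.3) cover the same use. The following is only a sketch of that alternative, not what was available when this thread was written. The pattern is copied from Rich's post; the variable name $linkify and the sample text are made up for illustration.

```php
<?php
// Pattern copied from the earlier post in this thread.
$pattern = '!\bhttps?://([\w\-]+\.)+[a-zA-Z]{2,3}(/(\S+)?)?\b!';

// Hypothetical callback, defined once and reusable for every message.
$linkify = function ($matches) {
    return "<a href='" . $matches[0] . "' target='_new'>" . $matches[0] . "</a>";
};

// Sample text is an assumption; a real body would come from imap_fetchbody().
$msg_body = htmlspecialchars('See http://www.example.com/page for details');
$msg_body = preg_replace_callback($pattern, $linkify, $msg_body);

echo $msg_body, "\n";
```

Because the closure is assigned to a variable, it is created once and can be passed to preg_replace_callback() as many times as needed.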


I don't have the time to play with your original patterns, but I suspect a couple of reasons your patterns are failing:

1) You're using a dollar to access your back references. Perl-compatible regexes in PHP use a backslash and a number between 0 and 99 to access a back reference.

2) Your 2nd pattern isn't a valid string:
  "(^)"<a href=\"http$3://$4$5\"target=\"_blank\">$2$4$5</a>"

The 4th character of your pattern string is a double-quote character, which ends the string and should cause a parse error.
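A quick way to check both points is to try the replacement on a short test string. As far as the PHP manual describes preg_replace(), replacement strings in current PHP accept both the backslash form (\0, \1, ...) and the dollar form ($0, $1, ...), so it may be worth testing both before ruling either out. The sketch below uses a simplified email pattern and a made-up sample string; neither comes from the original posts. Single-quoted strings are used so that PHP itself leaves backslashes, dollar signs, and double quotes alone.

```php
<?php
// Sample text is an assumption for illustration.
$text = 'Write to rich@example.com today';

// Simplified pattern (not the one from the thread): local part, '@', host.
$pattern = '/[\w.]+@[\w.-]+\b/';

// Same replacement written with both reference styles.
$with_dollar    = preg_replace($pattern, '<a href="mailto:$0">$0</a>', $text);
$with_backslash = preg_replace($pattern, '<a href="mailto:\0">\0</a>', $text);

echo $with_dollar, "\n";
var_dump($with_dollar === $with_backslash);
```

Writing the pattern and replacement in single quotes also sidesteps the stray double-quote problem from point 2, since embedded double quotes no longer need escaping.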

Good luck, and let me know if any more problems come up.





Take care,

Nik
http://www.bigaction.org/

richard.york November 6th, 2003 12:33 AM

Thanks Nik,

I must have overlooked the pattern syntax links when I was looking through the manual. I have been trying out some patterns.

I saw in the user notes at http://www.php.net/preg_replace_callback someone suggested plugging in an array with two indices, the first being the class name and the second the function name.. well actually here is a quote:

Quote:
Also, if you want to use a *static* class method for the callback function, you can refer to it like this:
   preg_replace_callback(pattern, array('ClassName', 'methodName'), subject)

In PHP5, from within the class:
   preg_replace_callback(pattern, array('self', 'methodName'), subject)

I tried this and it works (the first method, that is); I'm waiting for PHP 5 to come out of beta before trying the second.

I have been poring over your pattern for a while and cannot seem to get it modified to accept any protocol.
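The same array form also accepts an object instead of a class name, which is one way to reach a non-static method from inside a class. This is a minimal sketch under assumptions: the class name Mailer and the methods linkify() and format() are hypothetical and do not come from the thread, and the pattern is the one posted earlier.

```php
<?php
// Hypothetical class; the names Mailer, linkify and format are illustrative only.
class Mailer
{
    // Callback: receives the match array and returns the replacement text.
    public function linkify($matches)
    {
        return "<a href='" . $matches[0] . "'>" . $matches[0] . "</a>";
    }

    public function format($body)
    {
        $pattern = '!\bhttps?://([\w\-]+\.)+[a-zA-Z]{2,3}(/(\S+)?)?\b!';
        // array($this, 'linkify') targets the method on the current object.
        return preg_replace_callback($pattern, array($this, 'linkify'), $body);
    }
}

$mailer = new Mailer();
$result = $mailer->format('Visit http://www.example.com now');
echo $result, "\n";
```

With array('ClassName', 'methodName'), the method should be declared static; PHP 8 raises an error when a non-static method is called that way.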

The original I think was this:
$pattern = '!\bhttps?://([\w\-]+\.)+[a-zA-Z]{2,3}(/(\S+)?)?\b!';

I tried changing it to this:
$pattern = '!\b(https?|telnet|ftp)(:\/\/)([\w\-]+\.)+[a-zA-Z]{2,3}(/(\S+)?)?\b!';

And I was also trying to include an optional '/' at the end of the URL, for cases where the URL contains only http://www.somesite.com/

I wrote this one for emails, which seems to work well. Actually, I took the example on the Zend website and modified it to include more addresses.

$body = preg_replace_callback('/[A-z0-9_\-\.]+[@][A-z0-9_\-]+([.][A-z0-9_\-]+)+[A-z0-9\-]+([.][A-z0-9_\-]+)?+[A-z]?/', array('library', 'mailify'), $body);

It matches dots in the address, optionally matches sub-domain addresses or double-suffix domains like .co.uk, and matches addresses attached to a mailto: statement.
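One detail in the character classes above may be worth checking: in ASCII, the range A-z also covers the six characters that sit between 'Z' and 'a' ([, \, ], ^, _ and the backtick), so [A-z] accepts more than letters. The sketch below shows the difference and a tightened variant of the address pattern. The variant is an assumption offered for comparison, not a complete email validator, and the sample string is made up.

```php
<?php
// '^' lies between 'Z' and 'a' in ASCII, so the A-z range accepts it.
$wide   = preg_match('/^[A-z]$/', '^');    // 1
$narrow = preg_match('/^[A-Za-z]$/', '^'); // 0
var_dump($wide, $narrow);

// Hypothetical tightened pattern: letters, digits, and a few symbols only.
$email = '/[A-Za-z0-9_.\-]+@[A-Za-z0-9\-]+(\.[A-Za-z0-9\-]+)+/';

preg_match($email, 'mailto:rich@example.co.uk', $m);
echo $m[0], "\n";
```

Spelling the ranges out as A-Za-z keeps addresses attached to mailto: and double-suffix domains matching while leaving out the stray punctuation.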

I would appreciate any comments you might be able to throw my way!

Thanks!
: )
Rich


nikolai November 6th, 2003 03:07 PM

Your modified version of the pattern works for recognizing telnet and ftp protocol declarations. The trailing slash doesn't get recognized because the transition from a slash to whitespace (or the end of the line) does NOT constitute a word boundary. I thought that it would...

Remove the last \b in the pattern and the slashes should be recognized.

When matching hostnames, most people find it sufficient to just require the top-level domain to be either 2 or 3 characters. All country domains (ws, tv, uk, en, jp, etc...) and US domain types (net, com, org, edu, gov, mil) will be matched.
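To see the effect of the trailing \b, the two variants can be run side by side. This sketch uses Rich's multi-protocol pattern with the redundant (:\/\/) group written as plain ://; the variable names and the sample sentence are made up for illustration.

```php
<?php
// Variant with the trailing \b, and the same pattern without it.
$with_boundary    = '!\b(https?|telnet|ftp)://([\w\-]+\.)+[a-zA-Z]{2,3}(/(\S+)?)?\b!';
$without_boundary = '!\b(https?|telnet|ftp)://([\w\-]+\.)+[a-zA-Z]{2,3}(/(\S+)?)?!';

// Sample text is an assumption.
$text = 'Go to http://www.somesite.com/ today';

preg_match($with_boundary, $text, $m1);
preg_match($without_boundary, $text, $m2);

echo $m1[0], "\n"; // http://www.somesite.com
echo $m2[0], "\n"; // http://www.somesite.com/
```

A slash followed by a space is two non-word characters in a row, so \b cannot match there and the engine backs off to the position after "com". One trade-off to keep in mind: without the trailing \b, \S+ will also pick up punctuation that directly follows a URL path, such as a closing period.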
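The 2-or-3 character rule does leave out longer top-level domains such as .info or .museum. If those matter for your mail, one option is to relax the upper bound. The sketch below compares the two quantifiers on a made-up URL; the patterns are trimmed-down versions of the one in this thread (host part only), and the variable names are hypothetical.

```php
<?php
// Host-only variants of the thread's pattern; only the TLD quantifier differs.
$strict = '!\bhttps?://([\w\-]+\.)+[a-zA-Z]{2,3}\b!';
$loose  = '!\bhttps?://([\w\-]+\.)+[a-zA-Z]{2,}\b!';

// Sample URL is an assumption.
$url = 'http://www.example.info';

$strict_hits = preg_match($strict, $url); // 0: "info" is four letters
$loose_hits  = preg_match($loose, $url);  // 1

var_dump($strict_hits, $loose_hits);
```

The looser {2,} accepts any run of two or more letters, so it trades some strictness for coverage.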


Take care,

Nik
http://www.bigaction.org/

richard.york November 6th, 2003 04:31 PM

Thanks Nik, that did the trick.



All times are GMT -4.
