p2p.wrox.com Forums

beginning_php thread: preg_replace to add target to URLs in string


Message #1 by "Victor Biro" <vabiro@y...> on Tue, 25 Feb 2003 19:08:45
Hi,

I am trying to change the links on a page whose HTML is held in a 
string.

For example, when a hyperlink is found (<a 
href="http://www.example.com">words here</a>), it should have a target 
added to it (<a href="http://www.example.com" target="new">words 
here</a>).

Here are some of the variations I have tried:

$data = preg_replace("/(<a.*)>(.*)$/","'$1 'target=\"new\">$2",$data);

$data = preg_replace("'<a $>'i","'<a $1 target=\"new\">'",$data);

As well as umpteen different variations. 

I am quite sure my problem is with finding the Perl-compatible regular 
expression that will carry the original data, including the original URL, 
into the replacement.

Any help would be appreciated. I have tried to use the explanations in the 
PHP docs, but they all seem to assume that the reader is already familiar 
with PCRE and simply wants to use it in PHP. 

Thanks in advance for any help.

Victor 
Message #2 by "Nikolai Devereaux" <yomama@u...> on Tue, 25 Feb 2003 11:17:10 -0800
The problem I see is that you're trying to match too much stuff.


Your expression: "/(<a.*)>(.*)$/" matches too much input for the first
parenthesized group, and nothing for the last.

Consider this line:

<a href="some url">some text</a>


The first parenthesized group will match this:

<a href="some url">some text</a

Then the > matches the > in your expression, and the last .* matches nothing,
since * means "zero or more".
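
To see this for yourself, here is a minimal sketch that runs the original pattern against the sample line and inspects the captured groups. The variable names $line and $m are only for illustration.

```php
<?php
// The sample line discussed above.
$line = '<a href="some url">some text</a>';

// The original pattern: both .* groups are greedy.
preg_match('/(<a.*)>(.*)$/', $line, $m);

// $m[1] holds almost the whole line: <a href="some url">some text</a
// $m[2] is the empty string, because group 1 already swallowed the text.
echo "group 1: [", $m[1], "]\n";
echo "group 2: [", $m[2], "]\n";
```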


You need to be more specific about what you want to match.  How does this
look to you?


|<[Aa]([^>]*)>([^<]*)</[Aa]>|


The parenthesized expressions use character classes that begin with ^ to
mean "not".  The first parenthesized expression matches everything within an
<a tag up to (but not including) the closing bracket.  Then we match the
closing bracket.

The second parenthesized expression matches all the text up to the first
open bracket, which we can assume will be the end tag.  (there are problems
with this, which I'll get to later.)

Then we just match the closing </a> tag.
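
As a rough illustration of what each group captures, the sketch below applies that expression to the sample line with preg_match. The names $pattern, $line and $m are assumptions made for the example.

```php
<?php
$pattern = '|<[Aa]([^>]*)>([^<]*)</[Aa]>|';
$line    = '<a href="some url">some text</a>';

preg_match($pattern, $line, $m);

// $m[1] is everything inside the opening tag after "<a",
// including the leading space:  href="some url"
// $m[2] is the link text: some text
echo "attributes: [", $m[1], "]\n";
echo "link text:  [", $m[2], "]\n";
```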

The problem with the above expression is that it will not properly match
link text with additional tags in it, for example:

<a href="some url">This is a <b>bold</b> link</a>


The 2nd paren grouping will match "This is a " and then stop, since it sees
the open angle bracket.  The expression will fail since that text is not
immediately followed by a "</a>".
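
A quick way to confirm the failure is to check the return value of preg_match, which is 0 when nothing matches. This is only a sketch; $nested and $found are names chosen for the example.

```php
<?php
$pattern = '|<[Aa]([^>]*)>([^<]*)</[Aa]>|';
$nested  = '<a href="some url">This is a <b>bold</b> link</a>';

// [^<]* stops at "<b>", and "</a>" does not follow, so there is no match.
$found = preg_match($pattern, $nested, $m);

echo $found === 0 ? "no match\n" : "matched\n";
```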


If you change the expression to match the middle more greedily:

|<[Aa]([^>]*)>(.*)</[Aa]>|


Then you will have problems when there is more than one link per line:

<a href="foo">foo</a> and <a href="bar">bar</a>


The middle paren expression will match 'foo</a> and <a href="bar">bar'.
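
The sketch below reproduces that over-match. It also tries a non-greedy quantifier (.*?), which is not something proposed above but is one common way to handle both cases. Treat it as an assumption to test against your own data. Note that . does not match newlines unless the s modifier is added.

```php
<?php
$line = '<a href="foo">foo</a> and <a href="bar">bar</a>';

// Greedy middle group: one match that spans both links.
$greedy = '|<[Aa]([^>]*)>(.*)</[Aa]>|';
$greedyCount = preg_match_all($greedy, $line, $g);
// $g[2][0] is: foo</a> and <a href="bar">bar

// Non-greedy middle group: stops at the first closing </a>.
$lazy = '|<[Aa]([^>]*)>(.*?)</[Aa]>|';
$lazyCount = preg_match_all($lazy, $line, $l);
// $l[2] is the array of link texts: foo, bar

echo "greedy matches: $greedyCount, lazy matches: $lazyCount\n";
```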


Hope this helps clear things up, good luck figuring out the best expression
for your needs!


Nik

Message #3 by "Victor Biro" <vabiro@y...> on Tue, 25 Feb 2003 20:18:41
Nik,

Thanks! You've come through again.

What I came up with was:

 $data = preg_replace("'<[Aa]([^>]*)>([^<]*)</[Aa]>'", "<a$1 target=\"new\">$2</a>", $data);

It's working like a charm. I used this version because there was little 
risk of additional tags being enclosed in the href anchor.

Another quick question regarding PCRE: I am trying to migrate some static 
HTML and text pages from an earlier version of my site into a MySQL 
database. What I would like to do is something like screen scraping, 
without the HTTP request, and then put the correct portions into the 
correct fields in the table.

I'm getting the impression that regular expressions might be the ticket to 
recognise a pattern (e.g. <h2>title of article</h2> or <pre>several 
paragraphs of text</pre>) and turn it into a variable that would be 
inserted.

I'm sure that I'm not the first person to move from static web pages to 
PHP, so the question I have is: is there a PHP function that would let me 
match a regular expression and then use the matched text as a variable? My 
concern is with large text blocks; these are academic papers, sometimes 
with formulas that might cause problems.

Any suggestions?

Victor
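
One possible approach to the extraction question is preg_match with a capturing group, which places the matched text into an array. The sketch below is only an illustration: the $html content is made up, and in practice it might be read from a file. The s modifier lets . span newlines, so multi-line blocks are captured.

```php
<?php
// Hypothetical page content standing in for one of the static pages.
$html = "<html><body>\n<h2>Title of article</h2>\n"
      . "<pre>First paragraph.\n\nSecond paragraph.</pre>\n</body></html>";

// 'i' ignores tag case; 's' lets . match newlines; .*? stops at the first closing tag.
$title = preg_match('|<h2>(.*?)</h2>|is', $html, $m) ? trim($m[1]) : null;
$body  = preg_match('|<pre>(.*?)</pre>|is', $html, $m) ? $m[1] : null;

echo "title: $title\n";
echo "body length: ", strlen($body), "\n";
```

The captured strings would still need to be escaped or bound as parameters before being inserted into the database.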
