'Twas brillig Thursday 18 July 2002 13:38, when you scrobe:
> Sean, as far as I can see this pattern ignores the urls with documents,
> like www.yahoo.com/index.htm. I can't ignore them, they should be parsed
> along.
>
> But I appreciate your help. Thanks a lot. (I couldn't find any solution
> yet, so the fun is not yet over :)
>
> >>>^www\.[a-z0-9][a-z0-9_-]*(\.[a-z0-9_-]+)*\.[a-z]{2,4}$
It shouldn't be too hard to add the subdirectories. Each one is really just a
string of characters preceded by a forward slash, and a document name such as
index.htm has the same shape, so something like
/[a-z0-9\._-]+
Since we want to allow zero or more subdirectories, we end up with something
like
^www\.[a-z0-9][a-z0-9_-]*(\.[a-z0-9_-]+)*\.[a-z]{2,4}(/[a-z0-9\._-]+)*$
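
A quick way to sanity-check the pattern is to run it against a few sample
strings. The sketch below uses Python's re module, which is an assumption on my
part since the thread doesn't say which language is in use; the helper name
is_url and every sample other than www.yahoo.com/index.htm are made up for
illustration.

```python
import re

# Host name followed by zero or more "/segment" parts, as in the pattern above.
URL_RE = re.compile(
    r"^www\.[a-z0-9][a-z0-9_-]*(\.[a-z0-9_-]+)*\.[a-z]{2,4}(/[a-z0-9\._-]+)*$"
)

def is_url(candidate):
    """Return True if candidate matches the host-plus-path pattern."""
    return URL_RE.match(candidate) is not None

print(is_url("www.yahoo.com"))                # True: zero path segments
print(is_url("www.yahoo.com/index.htm"))      # True: one path segment
print(is_url("www.yahoo.com/a/b/index.htm"))  # True: several segments
print(is_url("www.yahoo.com/"))               # False: a bare trailing slash has no segment
```

Note that the pattern as written is lowercase-only and rejects a bare trailing
slash, so adjust it if your data needs either of those.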
To extend this even further to allow URL parameters (the ?name=value&name=value
part), we can add (before the $ above) something like
(\?[a-z0-9\._-]+=[a-z0-9\._-]+(&[a-z0-9\._-]+=[a-z0-9\._-]+)*){0,1}
I may have missed escaping some special characters, so check whether your regex
engine needs the ?, /, & and = characters escaped and add the appropriate
escapes.
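
For what it's worth, here is a rough sketch of the whole thing put together,
again assuming Python's re module (the thread doesn't name a language). There,
only the ? needs a backslash outside a character class; the / only becomes an
issue in tools that use it as the pattern delimiter. The names HOST, PATH,
QUERY and is_url_with_params are invented for this example, and the optional
group is written with the ? escaped and its parentheses balanced.

```python
import re

# The three pieces discussed in this message, kept separate for readability.
HOST = r"www\.[a-z0-9][a-z0-9_-]*(\.[a-z0-9_-]+)*\.[a-z]{2,4}"
PATH = r"(/[a-z0-9\._-]+)*"
# Optional "?name=value" followed by zero or more "&name=value" pairs.
QUERY = r"(\?[a-z0-9\._-]+=[a-z0-9\._-]+(&[a-z0-9\._-]+=[a-z0-9\._-]+)*){0,1}"

FULL_RE = re.compile("^" + HOST + PATH + QUERY + "$")

def is_url_with_params(candidate):
    """Return True if candidate matches host, optional path, optional query."""
    return FULL_RE.match(candidate) is not None

print(is_url_with_params("www.yahoo.com/index.htm"))          # True
print(is_url_with_params("www.yahoo.com/index.htm?a=1&b=2"))  # True
print(is_url_with_params("www.yahoo.com?q=test"))             # True
print(is_url_with_params("www.yahoo.com/index.htm?a"))        # False: no "=value"
print(is_url_with_params("www.yahoo.com/index.htm?a=1&"))     # False: dangling "&"
```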
--
Sean Lamb, Software Engineer - sean@f...
while( ) { s/$badcode/$goodcode/g; }
"A day without laughter is a day wasted." -- Groucho Marx