
regular_expressions thread: URL Parsing


Message #1 by armagan@o... on Wed, 17 Jul 2002 13:03:03
'Twas brillig Thursday 18 July 2002 13:38, when you scrobe:
> Sean, as far as I can see this pattern ignores the urls with documents,
> like www.yahoo.com/index.htm. I can't ignore them, they should be parsed
> along.
>
> But I appreciate your help. Thanks a lot. (I couldn't find any solution
> yet, so the fun is not yet over :)
>
> >>>^www\.[a-z0-9][a-z0-9_-]*(\.[a-z0-9_-]+)*\.[a-z]{2,4}$

It shouldn't be too hard to add the subdirectories.  Each one is really a string 
of characters preceded by a forward slash (the dot in the character class lets 
file names such as index.htm match too), something like

	/[a-z0-9\._-]+

Since we want to allow zero or more subdirectories, we end up with something 
like

	^www\.[a-z0-9][a-z0-9_-]*(\.[a-z0-9_-]+)*\.[a-z]{2,4}(/[a-z0-9\._-]+)*$
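
As a quick check of the pattern above, here is a minimal sketch in Python. The choice of Python, the name is_url, and the re.IGNORECASE flag are assumptions for illustration and are not part of the original post.

```python
import re

# The pattern from the post: host name followed by zero or more
# "/segment" parts. re.IGNORECASE is an added assumption; the original
# character classes only list lowercase letters.
URL_RE = re.compile(
    r"^www\.[a-z0-9][a-z0-9_-]*(\.[a-z0-9_-]+)*\.[a-z]{2,4}(/[a-z0-9\._-]+)*$",
    re.IGNORECASE,
)


def is_url(text):
    """Return True if text matches the host-plus-path pattern."""
    return URL_RE.match(text) is not None


if __name__ == "__main__":
    for candidate in ("www.yahoo.com", "www.yahoo.com/index.htm",
                      "www.yahoo.com/", "yahoo.com"):
        print(candidate, is_url(candidate))
```

Note that a trailing slash with nothing after it (www.yahoo.com/) is rejected, because each slash must be followed by at least one character.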

To extend this even further to allow url-level parameters, we can add (before 
the $ above) something like

	(\?[a-z0-9\._-]+=[a-z0-9\._-]+(&[a-z0-9\._-]+=[a-z0-9\._-]+)*){0,1}

I may have missed escaping some special characters, so check whether your regex 
engine needs the ?, /, & and = characters escaped and add the appropriate escape.
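
Putting the pieces together, a sketch of the full pattern in Python might look like the following. The names HOST, PATH, PAIR, QUERY and matches_url are hypothetical, chosen only to keep the pattern readable. In Python's re module the ? has to be written as \? to match a literal question mark, while /, & and = need no escape; other engines may differ.

```python
import re

# Building blocks taken from the post; the variable names are illustrative.
HOST = r"www\.[a-z0-9][a-z0-9_-]*(\.[a-z0-9_-]+)*\.[a-z]{2,4}"
PATH = r"(/[a-z0-9\._-]+)*"
PAIR = r"[a-z0-9\._-]+=[a-z0-9\._-]+"

# Optional query string: a literal "?", one name=value pair, then any
# number of "&name=value" pairs. The trailing "?" makes the whole group
# optional, which is equivalent to {0,1}.
QUERY = r"(\?" + PAIR + r"(&" + PAIR + r")*)?"

FULL_RE = re.compile("^" + HOST + PATH + QUERY + "$")


def matches_url(text):
    """Return True if text is a www host with optional path and query."""
    return FULL_RE.match(text) is not None


if __name__ == "__main__":
    for candidate in ("www.yahoo.com/index.htm?a=1&b=2",
                      "www.yahoo.com/index.htm?a",
                      "www.yahoo.com/index.htm"):
        print(candidate, matches_url(candidate))
```

Splitting the pattern into named parts is only a readability choice; concatenating them gives the same expression as the single-line version.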

-- 
Sean Lamb, Software Engineer - sean@f...
         while( ) { s/$badcode/$goodcode/g; }
"A day without laughter is a day wasted." -- Groucho Marx
