My question concerns building a regular expression pattern.
I parse the page returned by a Google search and I want to extract only the links to the result pages, but the page also contains advertising and links to cached pages, pictures and so on. I started with the most popular regex for href:
a.*href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))
and modified it so that it does not include the trailing >:
a.*href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>[^>]*))
But the output contains some information I do not need:
http://www.allsports.com/ ok
http://translate.google.com/translate?hl=en&sl=fr&u=http://www.jeunessesports.gouv.fr/&prev=/search%3Fq%3Dsports%26num%3D50%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3DUTF-8%26sa%3DG
(not ok: the URL we need, http://www.jeunessesports.gouv.fr/, is buried inside it after u=)
http://www.sports-central.org/ ok
http://www.dsusa.org/ ok
/search?q=sports&num=50&hl=en&lr=&ie=UTF-8&oe=UTF-8&start=50&sa=N not ok
/about.html not ok
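To make the "ok" / "not ok" cases above concrete, here is a minimal sketch of a post-filter applied to the hrefs after the regex has run. It is written in Python only for illustration (the post does not say which language is used), and the function name clean_links is made up for this example. It assumes that relative links are never wanted and that a translate.google.com link always carries the real URL in its u= parameter, which is based only on the one sample above.

```python
from urllib.parse import urlsplit, parse_qs

def clean_links(hrefs):
    """Keep absolute http(s) links, unwrap translate links, drop the rest."""
    results = []
    for href in hrefs:
        parts = urlsplit(href)
        if parts.scheme not in ("http", "https"):
            # Relative links such as /about.html or /search?... are skipped.
            continue
        if parts.netloc == "translate.google.com":
            # Assumption: the wanted URL is the value of the u= parameter.
            target = parse_qs(parts.query).get("u")
            if target:
                results.append(target[0])
            continue
        results.append(href)
    return results
```

Filtering after extraction keeps the regex itself simple; the rules for what counts as "not ok" can then change without touching the pattern.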
The result page returned by Google is organized as:
<p class=g><a href=http://dmoz.org/Sports/>Open Directory - <b>Sports</b></....and some other
What we need is:
href=http://dmoz.org/Sports/
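One possible approach, given the structure shown above, is to anchor the pattern on the <p class=g> that precedes each result, so that other anchors (cached pages, navigation links) are never matched in the first place. The sketch below is in Python as an illustration only, so the group syntax differs from the (?<1>...) form in the pattern above. The name google_result_links is hypothetical, and the assumption that every result link directly follows <p class=g> comes only from the single sample line.

```python
import re

# Match an <a href=...> that directly follows <p class=g>.
# The href value may be double-quoted or bare (ending at whitespace or '>').
RESULT_LINK = re.compile(
    r'<p\s+class=g>\s*<a\s+[^>]*?href\s*=\s*(?:"([^"]*)"|([^\s>]+))',
    re.IGNORECASE,
)

def google_result_links(html):
    """Return href values of anchors that directly follow <p class=g>."""
    # findall yields (quoted, bare) tuples; exactly one of the two is non-empty.
    return [quoted or bare for quoted, bare in RESULT_LINK.findall(html)]
```

Note that the bare alternative uses [^\s>]+ rather than \S+, so the closing > of the tag is never captured.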
Also, I parse the page returned by a Yahoo search and I want to extract only the links to the result pages. I use the same regex, but the output again contains information I do not need.
Note: this is one line:
http://drs.yahoo.com/S=2766679/K=spo.../www.espn.com/
The result page returned by Yahoo is organized as:
<li><big><a href="http://drs.yahoo.com/S=2766679/K=sports/v=2/SID=w/l=WS1/R=36/H=0/*-http://www.jeunesse-sports.gouv.fr/"> ....and some others
What we need is:
http://www.jeunesse-sports.gouv.fr/
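For the Yahoo case, the wanted URL appears to sit after the "/*-" marker in the redirect link, so a plain string split after the href has been extracted may be enough, with no extra regex. This is a minimal Python sketch for illustration. The helper name unwrap_yahoo is made up, and the "/*-" marker is an assumption taken from the one sample link above, not from any documented Yahoo format.

```python
def unwrap_yahoo(href):
    """Return the target URL embedded after '/*-', or href unchanged."""
    # Assumption: the real destination follows the first '/*-' marker.
    _, marker, target = href.partition("/*-")
    return target if marker else href
```

Links that do not contain the marker are returned as they are, so the same function can be applied to every extracted href.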
Thank you.
Regards