asp_web_howto thread: Extracting URL from a page?


Message #1 by "David Murphy" <yomommaissofat@h...> on Wed, 16 Oct 2002 11:31:32
I'm currently writing a search for my intranet. I built one that searches
the text in each file, and then moved on to one that searches the text of
each page as it would be served (as a lot of the pages are dynamic).

I want to extend my search engine so that it will spider the site.  I've
modified the code and everything is ready to go, except that I can't work
out a way of extracting URLs from the pages.

I've got the whole code for the page as a string, strPageCode, and I have
an array into which I'll insert the URLs I find on each page, URLArray.
Everything else is ready to go; I just need to extract the hrefs from
strPageCode and insert them into URLArray.

Any thoughts?

David Murphy
NHS Direct Wales
Message #2 by "George Draper" <gdraper@c...> on Wed, 16 Oct 2002 11:38:58 -0400
David,

It's just a matter of looking for "href=" and parsing out the following
url.  I had good luck using the RegExp object to get the location of
each href, such as:

    Set regEx1 = New RegExp
    regEx1.Pattern = "href="   ' Set pattern.
    regEx1.IgnoreCase = True   ' Set case insensitivity.
    regEx1.Global = True   ' Set global applicability.

Then you can call regEx1.Execute(strPageCode) and For Each through the
Match objects in the collection it returns.  I used the InStr and Mid
functions to extract the URL string that follows each match.

- George
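
The approach George describes could be sketched roughly as follows. This is only an illustration under some assumptions: ExtractHrefs is a hypothetical helper name that does not appear in the thread, the pattern is George's literal "href=" (so whitespace around the equals sign is not handled), and only href values wrapped in single or double quotes are collected.

```vbscript
' Sketch only: returns an array of the quoted href values in strPageCode.
' ExtractHrefs is a hypothetical name; unquoted hrefs are skipped.
Function ExtractHrefs(strPageCode)
    Dim regEx1, colMatches, objMatch
    Dim arrUrls(), lngCount
    Dim lngStart, lngEnd, strQuote

    Set regEx1 = New RegExp
    regEx1.Pattern = "href="   ' Set pattern.
    regEx1.IgnoreCase = True   ' Set case insensitivity.
    regEx1.Global = True       ' Find every match, not just the first.

    lngCount = 0
    Set colMatches = regEx1.Execute(strPageCode)

    For Each objMatch In colMatches
        lngEnd = 0
        ' FirstIndex is zero-based; Mid and InStr are one-based.
        lngStart = objMatch.FirstIndex + objMatch.Length + 1
        strQuote = Mid(strPageCode, lngStart, 1)

        If strQuote = """" Or strQuote = "'" Then
            lngStart = lngStart + 1                          ' step past the opening quote
            lngEnd = InStr(lngStart, strPageCode, strQuote)  ' find the closing quote
        End If

        ' Keep the value only if a closing quote was found and it is not empty.
        If lngEnd > lngStart Then
            ReDim Preserve arrUrls(lngCount)
            arrUrls(lngCount) = Mid(strPageCode, lngStart, lngEnd - lngStart)
            lngCount = lngCount + 1
        End If
    Next

    If lngCount = 0 Then
        ExtractHrefs = Array()   ' empty array, so UBound returns -1
    Else
        ExtractHrefs = arrUrls
    End If
End Function
```

With this in place, something like URLArray = ExtractHrefs(strPageCode) would fill the array from the question, and a For loop from 0 To UBound(URLArray) would walk the results. Relative links come back exactly as written in the page, so a spider would still need to resolve them against the page's own address before requesting them.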


Message #3 by "David Murphy" <yomommaissofat@h...> on Wed, 16 Oct 2002 16:00:42 +0000
Thank you. I puzzled it out a bit and managed to get my head round it
after a while, but I'm glad to know I'm doing it the right way.  It comes
out as a nice small bit of code, which is usually a good sign :)

-David

