Wrox Programmer Forums
ASP.NET 2.0 Basics If you are new to ASP or ASP.NET programming with version 2.0, this is the forum to begin asking questions. Please also see the Visual Web Developer 2005 forum.
Welcome to the p2p.wrox.com Forums.

You are currently viewing the ASP.NET 2.0 Basics section of the Wrox Programmer to Programmer discussions. This is a community of software programmers and website developers including Wrox book authors and readers. New member registration was closed in 2019. New posts were shut off and the site was archived into this static format as of October 1, 2020. If you require technical support for a Wrox book please contact http://hub.wiley.com
 
July 19th, 2006, 07:56 AM
Authorized User
Extract text from webpages

Hi,
I am developing a web crawler/web spider in C# .NET 2005, and I am extracting text from web pages with the code below. The problem is that it only extracts text from HTML pages, because ASP.NET pages do not contain headings like <h1>…<h6>. So how can I extract text from ASP.NET and PHP pages?



Code for extracting text from HTML web pages (s is the string that contains the web page):


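// Requires: using System.IO; using System.Text.RegularExpressions;
// s holds the page HTML; u is not defined in this post (presumably the page URL).
// Match each heading up to its own closing tag (\1), allowing attributes and line breaks.
MatchCollection mPage = Regex.Matches(s, @"<(h[1-6])\b[^>]*>.+?</\1\s*>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);

// Open the index file once, in append mode, instead of once per match.
using (StreamWriter writer = new StreamWriter(@"C:\WebSpider\index.txt", true))
{
    foreach (Match mP in mPage)
    {
        writer.Write(mP.Value + "\t" + u + "\n");
    }
}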



asif
 
July 19th, 2006, 08:01 AM
Imar
Wrox Author

What makes you think an ASP.NET page does not contain h1 and other tags? An ASP.NET page can output any HTML you like, including headings like h1 and h2. It all depends on the page developer....

Maybe you can explain what you're doing in a bit more detail?

Imar
---------------------------------------
Imar Spaanjaars
Everyone is unique, except for me.
 
July 21st, 2006, 07:59 AM
Authorized User

Thanks for your attention.
You are right that it depends on the developer. I am developing a web search engine. The web crawler is an automatic browser that starts at a page such as wrox.com, follows all the links on that page, and so on. It also indexes the pages it visits, and to index a web page I extract the meta information, the <title> tag, and the h1 to h6 tags. My observation is that most ASP and ASP.NET pages do not contain h1 to h6 headings; developers use Label controls for headings instead. So how can I index ASP.NET and PHP web pages for my search engine?
Thanks again for your reply.

asif
 
July 21st, 2006, 10:48 AM
Authorized User

Search engines that I have seen typically strip most of the HTML from the page and look at the content for indexing purposes. Here is a function that will do that for you if it helps:

Code:
public static string StripHtml(string strHtml)
{
    if (strHtml == null)
        return string.Empty;

    // Matches any HTML tag, including tags that span several lines
    System.Text.RegularExpressions.Regex objRegExp
        = new System.Text.RegularExpressions.Regex("<(.|\n)+?>");

    // Replace all tags with a space, otherwise words either side
    // of a tag might be concatenated
    string strOutput = objRegExp.Replace(strHtml, " ");

    // Escape any remaining < and > as &lt; and &gt;
    strOutput = strOutput.Replace("<", "&lt;");
    strOutput = strOutput.Replace(">", "&gt;");

    // Collapse double spaces (a single pass, so longer runs are only shortened)
    strOutput = strOutput.Replace("  ", " ");
    return strOutput;
}
www.CoderForRent.com
 
July 21st, 2006, 03:44 PM
Imar
Wrox Author

Yeah, what coderforrent says makes sense; if a page doesn't contain an h1 tag, you'll never be able to extract it... ;)

Instead, focus on the HTML you do get, and get the text representation of the HTML.

Personally, I never noticed a difference between the usage of headings in static HTML and dynamic pages though. I know of many dynamic sites that use proper headings to divide the content....


Cheers,

Imar
Author of ASP.NET 2.0 Instant Results and Beginning Dreamweaver MX / MX 2004
 
July 22nd, 2006, 10:13 AM
Authorized User

Thanks for your code; it really works for me.
Thanks again.

asif
 
October 1st, 2007, 05:38 AM
Registered User

Hi, I am trying to download a detagging tool, but everything I find is a Windows application. I actually want to connect it to my Java program. I am doing a project on web document summarization, so I want code that detags HTML pages and keeps only the content. Does anybody have code to extract text from web pages in Java?



 
October 1st, 2007, 03:56 PM
dparsons
Wrox Author

Naureen, if you look at coderforrent's post, you will notice that the RegEx he uses takes care of any HTML tags it encounters. I am not a Java guy, but you should be able to use that RegEx (or a similar expression) inside your Java project to return just the text from a given website.
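
For what it's worth, a rough Java sketch of that idea might look like the code below. It is not taken from coderforrent's post: the class name HtmlTextExtractor and the method stripHtml are made up for illustration, and it relies only on java.util.regex. As an extra assumption about what should count as page text, it removes <script> and <style> blocks before stripping the remaining tags, and it collapses all runs of whitespace rather than just double spaces.

```java
import java.util.regex.Pattern;

// Hypothetical helper, named here only for illustration.
final class HtmlTextExtractor {

    // Whole <script>...</script> and <style>...</style> blocks: their bodies are code, not text.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Any remaining tag; DOTALL lets a tag span several lines, like "<(.|\n)+?>" does.
    private static final Pattern TAG = Pattern.compile("<.+?>", Pattern.DOTALL);

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripHtml(String html) {
        if (html == null) {
            return "";
        }
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        // Replace tags with a space so words on either side are not joined.
        text = TAG.matcher(text).replaceAll(" ");
        // Collapse every run of whitespace into a single space.
        text = WHITESPACE.matcher(text).replaceAll(" ");
        return text.trim();
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Demo</title><style>h1 { color: red; }</style></head>"
                + "<body><h1>Hello</h1>\n<p>Plain <b>text</b> only.</p></body></html>";
        System.out.println(stripHtml(page)); // prints "Demo Hello Plain text only."
    }
}
```

Note that this sketch leaves character entities such as &amp; untouched and does not escape stray < or > the way the C# function does, so treat it as a starting point rather than a complete detagger.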

hth.

===========================================================
Read this if you want to know how to get a correct reply for your question:
http://www.catb.org/~esr/faqs/smart-questions.html
===========================================================
Technical Editor for:
Professional Search Engine Optimization with ASP.NET
Professional IIS 7 and ASP.NET Integrated Programming
Wrox Blox: Introduction to Google Gears
Wrox Blox: Create Amazing Custom User Interfaces with WPF and .NET 3.0
===========================================================





Copyright (c) 2020 John Wiley & Sons, Inc.