ASP.NET 2.0 Basics
If you are new to ASP or ASP.NET programming with version 2.0, this is the forum to begin asking questions. Please also see the Visual Web Developer 2005 forum.
Welcome to the p2p.wrox.com Forums.
You are currently viewing the ASP.NET 2.0 Basics section of the Wrox Programmer to Programmer discussions. This is a community of software programmers and website developers including Wrox book authors and readers. New member registration was closed in 2019. New posts were shut off and the site was archived into this static format as of October 1, 2020. If you require technical support for a Wrox book please contact http://hub.wiley.com

July 19th, 2006, 07:56 AM
Authorized User
Join Date: Sep 2005
Posts: 21
Thanks: 0
Thanked 0 Times in 0 Posts
Extract text from webpages
Hi,
I am developing a web crawler/web spider in C# .NET 2005, and I am extracting text from web pages with the code below. The problem is that it only extracts text from HTML pages, because ASP.NET pages do not contain headers like <h1>…<h6>. So how do I extract text from ASP.NET and PHP pages?
Code for extracting text from HTML web pages:
s is a string which contains the web page.
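// s holds the page's HTML; u is defined elsewhere (presumably the page URL).
// Match h1-h6 elements, allowing attributes and content that spans lines.
MatchCollection mPage = Regex.Matches(s, @"<(h[1-6])\b[^>]*>.*?</\1\s*>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);
// Open the index file once (append mode) instead of once per match.
using (StreamWriter i = new StreamWriter(@"C:\WebSpider\index.txt", true))
{
    foreach (Match mP in mPage)
    {
        i.Write(mP.Groups[0].Value + "\t" + u + "\n");
    }
}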
asif
__________________
asif

July 19th, 2006, 08:01 AM
Wrox Author
Join Date: Jun 2003
Posts: 17,089
Thanks: 80
Thanked 1,576 Times in 1,552 Posts
What makes you think an ASP.NET page does not contain h1 and other tags? An ASP.NET page can output any HTML you like, including headings like h1 and h2. It all depends on the page developer....
Maybe you can explain what you're doing in a bit more detail?
Imar
---------------------------------------
Imar Spaanjaars
Everyone is unique, except for me.

July 21st, 2006, 07:59 AM
Authorized User
Join Date: Sep 2005
Posts: 21
Thanks: 0
Thanked 0 Times in 0 Posts
Thanks for your attention,
You are right, it depends on the developer. But I am developing a web search engine, and a web crawler is an automatic browser that traverses the internet by first visiting a page like wrox.com, then following all the links on wrox.com, and so on. The web crawler also indexes the pages it visits. To index a web page, I am extracting the meta information, the <title> tag, and the h1 to h6 tags. However, it is my observation that most ASP and ASP.NET pages do not contain h1 to h6 headers, and that developers use Label controls for headings instead. So how do I index ASP.NET and PHP web pages for my search engine?
Again thanks for your reply.
asif

July 21st, 2006, 10:48 AM
Authorized User
Join Date: Jul 2004
Posts: 69
Thanks: 0
Thanked 1 Time in 1 Post
Search engines that I have seen typically strip most of the HTML from the page and look at the content for indexing purposes. Here is a function that will do that for you if it helps:
Code:
public static string StripHtml(string strHtml)
{
    if (strHtml == null)
        return string.Empty;
    // Strips the HTML tags from strHtml
    System.Text.RegularExpressions.Regex objRegExp
        = new System.Text.RegularExpressions.Regex("<(.|\n)+?>");
    // Replace all tags with a space, otherwise words either side
    // of a tag might be concatenated
    string strOutput = objRegExp.Replace(strHtml, " ");
    // Replace all &lt; and &gt; with < and >
    strOutput = strOutput.Replace("&lt;", "<");
    strOutput = strOutput.Replace("&gt;", ">");
    // Replace non-breaking space entities with a plain space
    strOutput = strOutput.Replace("&nbsp;", " ");
    return strOutput;
}
www.CoderForRent.com
Get A Computer Job!
www.ComputersComplete.com
Computer Parts & Accessories

July 21st, 2006, 03:44 PM
Wrox Author
Join Date: Jun 2003
Posts: 17,089
Thanks: 80
Thanked 1,576 Times in 1,552 Posts
Yeah, what coderforrent says makes sense; if a page doesn't contain an h1 tag, you'll never be able to extract it... ;)
Instead, focus on the HTML you do get, and get the text representation of the HTML.
Personally, I never noticed a difference between the usage of headings in static HTML and dynamic pages though. I know of many dynamic sites that use proper headings to divide the content....
Cheers,
Imar
---------------------------------------
Imar Spaanjaars
Everyone is unique, except for me.
Author of ASP.NET 2.0 Instant Results and Beginning Dreamweaver MX / MX 2004
While typing this post, I was listening to: Lights by Editors (Track 1 from the album: The Back Room)

July 22nd, 2006, 10:13 AM
Authorized User
Join Date: Sep 2005
Posts: 21
Thanks: 0
Thanked 0 Times in 0 Posts
Thanks for your code, it really works for me.
Thanks again.
asif

October 1st, 2007, 05:38 AM
Registered User
Join Date: Oct 2007
Posts: 3
Thanks: 0
Thanked 0 Times in 0 Posts
Hi, I am trying to download a detagging tool, but everything comes as a Windows application. I actually want to connect it with my Java program. I am doing a project on web document summarization, so I want code to detag HTML pages and get only the contents. Does anybody have code to extract text from web pages in Java?

October 1st, 2007, 03:56 PM
Wrox Author
Join Date: Oct 2005
Posts: 4,104
Thanks: 1
Thanked 64 Times in 64 Posts
Naureen, if you look at coderforrent's post you will notice the RegEx that he uses; it will take care of any HTML tags it encounters. I am not a Java guy, but you should be able to use that RegEx (or a similar expression) inside your Java project to return just the text from any website.
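For what it's worth, here is a rough sketch of how that idea might look in Java. It is an untested-in-production port, not coderforrent's code: the class name HtmlStripper and the method name stripHtml are made up for illustration, and it uses the similar expression "<[^>]+>" rather than the original "<(.|\n)+?>", on the assumption that avoiding the alternation is kinder to Java's regex engine on large pages. It only decodes three entities.

```java
import java.util.regex.Pattern;

public class HtmlStripper {
    // Hypothetical helper: matches anything that looks like a tag,
    // i.e. "<", one or more characters that are not ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    public static String stripHtml(String html) {
        if (html == null) {
            return "";
        }
        // Replace each tag with a space so that words on either side
        // of a tag are not concatenated.
        String text = TAG.matcher(html).replaceAll(" ");
        // Decode a few common entities (assumption: this short list is enough
        // for a first pass; it is far from complete).
        text = text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&nbsp;", " ");
        return text;
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Title</h1><p>Some &lt;b&gt; text</p></body></html>";
        System.out.println(stripHtml(page).trim());
    }
}
```

Keep in mind that a regex like this is only an approximation: it will leave the contents of script and style blocks in the output, so a real HTML parser is probably the safer choice for a summarization project.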
hth.
===========================================================
Read this if you want to know how to get a correct reply for your question:
http://www.catb.org/~esr/faqs/smart-questions.html
===========================================================
Technical Editor for:
Professional Search Engine Optimization with ASP.NET
Professional IIS 7 and ASP.NET Integrated Programming
Wrox Blox: Introduction to Google Gears
Wrox Blox: Create Amazing Custom User Interfaces with WPF and .NET 3.0
===========================================================