Wrox Programmer Forums

You are currently viewing the Java Basics section of the Wrox Programmer to Programmer discussions. This is a community of software programmers and website developers including Wrox book authors and readers. New member registration was closed in 2019. New posts were shut off and the site was archived into this static format as of October 1, 2020. If you require technical support for a Wrox book please contact http://hub.wiley.com
 
Old October 1st, 2007, 05:35 AM
Registered User
 
Join Date: Oct 2007
Posts: 3
Thanks: 0
Thanked 0 Times in 0 Posts
How to extract text from HTML?

Hi, I am trying to download a detagging tool, but everything comes as a Windows application. I actually want to connect it to my Java program. I am doing a project on web document summarization, so I want code to detag HTML pages and get only the contents. Does anybody have code to extract text from web pages in Java?



 
Old October 1st, 2007, 12:16 PM
Friend of Wrox
 
Join Date: Dec 2003
Posts: 488
Thanks: 0
Thanked 3 Times in 3 Posts

You're much better off using the -dump option of lynx (http://lynx.browser.org/), really.
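If you go the lynx route and still want the text inside your Java program, something like the following might do. It is only a rough sketch: it assumes a lynx binary is installed and on the PATH, and that its output can be read as UTF-8. The class name LynxDump and its two methods are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;

class LynxDump {
  // Builds the argument list for: lynx -dump -nolist <url>
  // -dump prints the rendered page as plain text, -nolist drops the link list.
  static List<String> command(String url) {
    return List.of("lynx", "-dump", "-nolist", url);
  }

  // Runs lynx and returns everything it printed.
  // Assumption: lynx is on the PATH and its output is UTF-8.
  static String dump(String url) throws IOException, InterruptedException {
    Process p = new ProcessBuilder(command(url))
        .redirectErrorStream(true)   // merge stderr into stdout
        .start();
    StringBuilder out = new StringBuilder();
    try (BufferedReader r = new BufferedReader(
        new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = r.readLine()) != null) {
        out.append(line).append('\n');
      }
    }
    int exit = p.waitFor();
    if (exit != 0) {
      throw new IOException("lynx exited with status " + exit);
    }
    return out.toString();
  }

  public static void main(String[] args) throws Exception {
    String url = args.length == 1 ? args[0] : "http://p2p.wrox.com";
    System.out.println(dump(url));
  }
}
```

Since the URL is passed as its own argument rather than through a shell, no quoting is needed, but you would still want to validate URLs that come from untrusted input.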

But here is an attempt at a pure Java solution, using HTMLEditorKit.ParserCallback.

Code:
import java.io.*;
import java.net.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

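public class deTag extends HTMLEditorKit.ParserCallback {
  StringBuffer txt;   // text collected by the last parse() call
  Reader reader;      // source of the HTML to parse

  // empty default constructor; call setReader() before parse()
  public deTag() {}

  // more convenient constructor
  public deTag(Reader r) {
    setReader(r);
  }

  public void setReader(Reader r) { reader = r; }

  // parse the HTML from the reader; the parser calls back into handleText()
  public void parse() throws IOException {
    txt = new StringBuffer();
    ParserDelegator parserDelegator = new ParserDelegator();
    // third argument true: ignore any charset directive in the document
    parserDelegator.parse(reader, this, true);
  }

  // called by the parser for each run of text between tags
  @Override
  public void handleText(char[] text, int pos) {
    txt.append(text);
  }

  // the extracted text, or "" if parse() has not been called yet
  @Override
  public String toString() {
    return txt == null ? "" : txt.toString();
  }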

  public static void main (String[] argv) {
    try {
      // the HTML to convert: first argument, or a default URL
      URL toRead;
      if(argv.length==1)
        toRead = new URL(argv[0]);
      else
        toRead = new URL("http://p2p.wrox.com");

      deTag d = new deTag();
      // try-with-resources closes the stream even if parsing fails;
      // note that InputStreamReader uses the platform default charset here
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(toRead.openStream()))) {
        d.setReader(in);
        d.parse();
      }
      System.out.println(d.toString());
    }
    catch (Exception e) {
      e.printStackTrace();
    }
  }
}
Example usage:

Code:
charlie@charlie:~/maui/src/java$ java deTag
p2p.wrox.com Forums View Cart  |  My AccountSupport  |  Contact Us   
Search P2P for Advanced Search Members:Participate in 
discussions or edit your profile. Login:Password:  Remember MeForgot
 Your Password?New Users: Register NowForum ToolsView All ForumsView 
Active TopicsArchivesFAQTerms of UseNew Titles for ASP.NETASP.NET 
AJAX Programmer's Reference: with ASP.NET 2.0 or ASP.NET 
3.5Professional ASP.NET 2.0 Design: CSS, Themes, and Master Pages 
>  P2P Forum> p2p Community ForumsNeed to download code? View our 
list of code downloads. ForumTopicsPostsLast PostModerator(s) > Wrox 
Announcements and Feedback   > Books   > ASP and ASP.NET   > 
C#/C++   > Database   > .NET   > General   > Java   > Mac   > 
Microsoft Office   > Microsoft Servers   > Open Source   > 
PHP/MySQL   > SQL Server   > Visual Basic   > Web   > 
XML   Statistics 32139 of 68324 Members have made 199919 posts in 
344 forums, with the last post on 10/01/2007 11:57:36 AM by: 
shipero.There are currently 62429 topics.Please welcome our newest 
member: shipero.> Contains new posts since last visit.    > No new 
posts since the last visit.>p2p.wrox.com ForumsTerms of Service© 
2007 Wiley Publishing, Inc.>This page was generated in 0.22 
seconds.Server time: 10/01/2007  12:12:34 PM (EST)>TopicIndexDynamic 
Topic ListCopyright © 2000-2007 by John Wiley & Sons, Inc. or related
 companies. All rights reserved. Please read our Privacy Policy.
charlie@charlie:~/maui/src/java$
Or specify a URL on the command line:

Code:
charlie@charlie:~/maui/src/java$ java deTag http://perlmonks.com
PerlMonks - The Monastery Gates> >> > > > >laziness, impatience, and
 hubris   > >PerlMonks The Monastery Gates | Log in | Create a new 
user | The Monastery Gates | Super Search | >  | Seekers of Perl 
Wisdom | Meditations | PerlMonks Discussion | Snippets | >  | 
Obfuscation | Reviews | Cool Uses For Perl | Perl News | Q&A | 
Tutorials | >  | Code | Poetry | Recent Threads | Newest Nodes | 
Donate | What's New | >( #131=superdoc: print w/ replies, xml )Need 
Help??Donations gladly acceptedIf you're new here please read 
PerlMonks FAQ> and Create a new user.>Want Mega XP? Prepare to have 
your hopes dashed, join in on the: poll ideas quest 2007  (10702 days
 remain)New QuestionsHighest scalar = ???> on Oct 01, 2007 at 11:561 
replyby Anonymous MonkHowdy Monks! If I'm not using bignum, bigint, 
or biganything; what kind of limits should I expect my scalars to 
have? Do they have limits? EXAMPLE!! (insert your own 
impatience)#!/usr/bin/perl -w
use strict;
---------------------8<---------------
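One thing you can see in the output above: text from neighbouring elements runs together ("My AccountSupport", "Register NowForum Tools"), because handleText() appends each chunk with nothing in between. For summarization you probably want word boundaries kept. Here is a minimal sketch of one possible tweak, which puts a single space between chunks. The names SpacedDeTag and extract() are made up for this example, and it parses an inline HTML string instead of a URL so it can be tried offline.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SpacedDeTag extends HTMLEditorKit.ParserCallback {
  private final StringBuilder txt = new StringBuilder();

  // Called by the parser for each run of text between tags.
  // A space is added before every chunk except the first.
  @Override
  public void handleText(char[] text, int pos) {
    if (txt.length() > 0) {
      txt.append(' ');
    }
    txt.append(text);
  }

  // Parses the HTML from the reader and returns the collected text.
  static String extract(Reader reader) throws IOException {
    SpacedDeTag callback = new SpacedDeTag();
    // third argument true: ignore any charset directive in the document
    new ParserDelegator().parse(reader, callback, true);
    return callback.txt.toString();
  }

  public static void main(String[] args) throws IOException {
    String html = "<html><body><p>My Account</p><p>Support</p></body></html>";
    System.out.println(extract(new StringReader(html)));
  }
}
```

The trade-off is that a space is also inserted where an inline tag splits a word (for example `<b>foo</b>bar`), so a more careful version might only add a separator at block-level tags.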
Cheers,
Charlie

--
Charlie Harvey's website - linux, perl, java, anarchism and punk rock: http://charlieharvey.org.uk
 
Old October 2nd, 2007, 11:19 AM
Registered User
 
Join Date: Oct 2007
Posts: 3
Thanks: 0
Thanked 0 Times in 0 Posts

Hey, thanks a lot for that code. :)





