pro_java_server thread: javax.swing.text.html.parser workaround for "implied" html


Message #1 by "Geoff Howard" <ghoward@c...> on Tue, 10 Jul 2001 19:18:41
I'm working on a utility that will scan the large number of web pages 
listed in a directory application for problems: 404s, abandoned domains, 
etc.  I'm using the javax.swing.text.html.parser.ParserDelegator class, 
following the Spider example in ch. 24 of Professional Java Server 
Programming as a starting point.  I've noticed that a fair number of the 
listed sites redirect to another page, which is itself failing.  As a 
workaround, I'm attempting to recognize redirects and follow them to 
inspect the final page.  I have two questions:

1) The javax parser does not seem to work in the "implied html" case.  
That is, if an index page consists only of a 
<meta http-equiv="Refresh" content="0; url=http://newpage.forexample.com"> 
tag and nothing else (no <html> tag, for instance), the parser reports an 
implied html tag but does not recognize the meta tag.  Is there a way 
around this besides treating it as a special case and attempting to match 
a meta tag in otherwise ok pages?

2) The meta refresh covers many situations like this, but some sites use 
JavaScript for the same purpose.  Parsing all the possible ways this might 
be coded in JavaScript would be a daunting task.  Are there any 
suggestions for how to follow redirects in these instances?
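
For reference, the special-case fallback mentioned in question 1 might 
look roughly like the sketch below.  It skips the Swing parser entirely 
and scans the raw page text for a meta refresh tag with java.util.regex.  
The class name MetaRefreshSniffer and the method findRefreshTarget are 
made up for illustration, and the patterns only cover the common 
quoted-attribute form, so treat it as a starting point rather than a 
complete solution.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the book's Spider example): looks for a
// <meta http-equiv="Refresh" ...> tag in raw page text.
final class MetaRefreshSniffer {

    // Any <meta ...> tag, in any letter case.
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // http-equiv attribute whose value starts with "refresh".
    private static final Pattern HTTP_EQUIV_REFRESH =
        Pattern.compile("http-equiv\\s*=\\s*[\"']?refresh\\b",
                        Pattern.CASE_INSENSITIVE);

    // Quoted value of the content attribute (group 2).
    private static final Pattern CONTENT =
        Pattern.compile("content\\s*=\\s*([\"'])(.*?)\\1",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Target after "url=" inside the content value (group 1).
    // Assumption: the URL contains no quotes, semicolons, or whitespace.
    private static final Pattern URL_PART =
        Pattern.compile("url\\s*=\\s*['\"]?([^'\";\\s]+)",
                        Pattern.CASE_INSENSITIVE);

    // Returns the refresh target of the first matching meta tag, if any.
    static Optional<String> findRefreshTarget(String html) {
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            if (!HTTP_EQUIV_REFRESH.matcher(tag).find()) {
                continue; // some other kind of meta tag
            }
            Matcher content = CONTENT.matcher(tag);
            if (!content.find()) {
                continue; // refresh tag without a quoted content value
            }
            Matcher url = URL_PART.matcher(content.group(2));
            if (url.find()) {
                return Optional.of(url.group(1));
            }
        }
        return Optional.empty();
    }
}
```

The idea would be to call MetaRefreshSniffer.findRefreshTarget(pageText) 
on the downloaded page text whenever the parser reports only an implied 
html tag, and to queue the returned URL for inspection.  A relative 
target would still need to be resolved against the page's own URL.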

Geoff Howard
