I'm working on a utility that will scan the large number of web pages
listed in a directory application for problems: 404s, abandoned domains,
etc. I'm using the javax.swing.text.html.parser.ParserDelegator class,
following the Spider example in ch.24 of Professional Java Server
Programming as a starting point. I've noticed that a fair number of the
listed sites redirect to another page, which is itself failing. As a
workaround, I'm attempting to recognize redirects and follow them to
inspect the final page. I have two questions:
1) The javax parser does not seem to work in the "implied html" case.
That is, if an index page consists only of
<meta http-equiv="Refresh" content="0; url=http://newpage.forexample.com">
and nothing else (no <html> tag, for instance), the parser reports back
an implied html tag, but does not recognize the meta tag. Is there a way
around this besides treating it as a special case and attempting to
match a meta tag in otherwise ok pages?
2) The meta refresh covers many situations like this, but some sites use
JavaScript for the same purpose. Parsing all the possible ways this might
be coded in JavaScript would be a daunting task. Are there any
suggestions for how to follow redirects in these instances?
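
To make question 1 concrete, the special-case fallback I have in mind is
roughly the sketch below. It scans the raw page text for a meta refresh
tag instead of relying on the parser. The class and method names
(MetaRefreshSniffer, findRefreshTarget) are made up for illustration,
and the regular expressions are only a guess at common markup, not a
real HTML parse, so unusual quoting or attribute layouts may slip
through.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the target URL of a meta refresh by pattern
// matching on the raw page text, for pages the parser does not handle.
public class MetaRefreshSniffer {

    // Any <meta ...> tag, regardless of attribute order or case.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // http-equiv="refresh", with or without quotes.
    private static final Pattern HTTP_EQUIV_REFRESH =
            Pattern.compile("http-equiv\\s*=\\s*[\"']?refresh[\"']?",
                    Pattern.CASE_INSENSITIVE);

    // content="<seconds>; url=<target>", capturing the target.
    private static final Pattern CONTENT_URL =
            Pattern.compile(
                    "content\\s*=\\s*[\"']?\\s*\\d+\\s*;\\s*url\\s*=\\s*[\"']?([^\"'>\\s]+)",
                    Pattern.CASE_INSENSITIVE);

    // Returns the refresh target of the first matching meta tag, if any.
    public static Optional<String> findRefreshTarget(String html) {
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            if (!HTTP_EQUIV_REFRESH.matcher(tag).find()) {
                continue; // some other meta tag, e.g. keywords
            }
            Matcher url = CONTENT_URL.matcher(tag);
            if (url.find()) {
                return Optional.of(url.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String page = "<meta http-equiv=\"Refresh\" "
                + "content=\"0; url=http://newpage.forexample.com\">";
        System.out.println(findRefreshTarget(page).orElse("(no refresh found)"));
    }
}
```

This would have to run on every page in addition to the normal parse,
which is the duplication I was hoping to avoid.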
Geoff Howard