extract data from one big text element

JohnBampton · August 21st, 2009, 08:08 AM

I have the following HTML file: http://www.sec.gov/Archives/edgar/da...66204e10vk.htm

What I do with it is:

replace all < with [[
replace all > with ]]
replace all &nbsp; with ' '
then wrap the whole text file in a <root> element

My task is then to find the Executive Officers.

If you scroll down or search through the HTML file for "Executive Officers of Dell", you will see they are in a colored table not too far in.

How would I extract these rows as XML elements with their details as attributes?

Can there be a general solution to this?

By the way, Mike, my employer bought me your book today.

Regards

mhkay · August 21st, 2009, 08:26 AM

Looks like you are planning to parse the HTML "by hand". That's not how I would do it. I would parse it using TagSoup (available via Saxon's parse-html() extension function), and then access it as structured XML.

Code:

<xsl:apply-templates select="//text()[normalize-space() = 'Executive Officers of Dell']/following::*:table[1]"/>

JohnBampton · August 21st, 2009, 08:41 AM

Thanks for the quick response.

I can't seem to find any information about Saxon's parse-html().

How do you use it?

Martin Honnen · August 21st, 2009, 08:45 AM

See http://www.saxonica.com/documentatio...arse-html.html. I think it is new in 9.2 and is not available in the home edition of 9.2.

JohnBampton · August 21st, 2009, 08:58 AM

Will this function work if it's HTML and not XHTML?

mhkay · August 21st, 2009, 09:06 AM

>Will this function work if it's HTML and not XHTML?

Yes, that's its whole purpose. If it's XHTML, you can just load it as a standard document.

Note: if the HTML is external, use saxon:parse-html(unparsed-text(...)). The function works on a string rather than a URI, so you can also parse HTML held in CDATA sections within XML.
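
A minimal sketch of how the two suggestions above might be combined, not a tested solution. It assumes a Saxon edition that provides saxon:parse-html(), TagSoup on the classpath, and the usual Saxon extension namespace http://saxon.sf.net/. The file name filing.htm and the ISO-8859-1 encoding are placeholders to adjust; the stylesheet is meant to be started at the named template main rather than with a source document.

```xml
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:saxon="http://saxon.sf.net/"
  exclude-result-prefixes="saxon"
  version="2.0">

  <xsl:output indent="yes"/>

  <!-- Hypothetical local copy of the filing; change the name as needed -->
  <xsl:param name="f" select="'filing.htm'"/>

  <xsl:template name="main">
    <!-- Read the HTML as a plain string (encoding is an assumption),
         then let the extension function build a tree from it -->
    <xsl:variable name="html-doc"
      select="saxon:parse-html(unparsed-text($f, 'ISO-8859-1'))"/>
    <results>
      <!-- First table following the heading text, whatever its namespace -->
      <xsl:apply-templates
        select="$html-doc//text()[normalize-space() = 'Executive Officers of Dell']
                /following::*:table[1]"/>
    </results>
  </xsl:template>

  <!-- Copy the table unchanged so its row and cell layout can be inspected -->
  <xsl:template match="*:table">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

Copying the table first makes it easier to see which cells hold the name, age and title before writing a template that turns each row into an element with attributes.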

Martin Honnen · August 21st, 2009, 09:19 AM

As an alternative to Saxon's extension function, you could try David Carlisle's HTML parser, which is implemented in pure XSLT 2.0. Against the large document you have, it is slow and needs a lot of memory, but I was able to make it work:

Code:

<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:d="data:,dpc"
  exclude-result-prefixes="d"
  xpath-default-namespace="http://www.w3.org/1999/xhtml"
  version="2.0">
  
  <xsl:include href="htmlparse.xsl"/>
  <!--
  <xsl:include href="http://www.dcarlisle.demon.co.uk/htmlparse.xsl"/>
  -->
  
  <xsl:output indent="yes"/>
  
  <xsl:param name="f" select="'test20009082101Saved.html'"/>
  <!--
  <xsl:param name="f" select="'http://www.sec.gov/Archives/edgar/data/826083/000095013409006106/d66204e10vk.htm'"/>
  -->
  
  <xsl:template name="main">
    <xsl:variable name="html-doc" select="d:htmlparse(unparsed-text($f, 'ISO-8859-1'))"/>
   
    <results>
      <xsl:apply-templates select="$html-doc/document/type/sequence/filename/description/text/html/body//div[contains(normalize-space(.), 'Executive Officers of Dell')]/following-sibling::table[1]/tr[position() gt 3]"/>
    </results>
  </xsl:template>
  
  <xsl:template match="tr">
    <person name="{normalize-space(td[1])}" age="{normalize-space(td[4])}" title="{normalize-space(td[7])}"/>
  </xsl:template>

</xsl:stylesheet>

The output is then:

Code:

<results>
   <person name="Michael S. Dell" age="44"
           title="Chairman of the Board and Chief Executive Officer"/>
   <person name="Bradley R. Anderson" age="49"
           title="Senior Vice President, Enterprise Product Group"/>
   <person name="Paul D. Bell" age="48" title="President, Global Public"/>
   <person name="Jeffrey W. Clarke" age="46"
           title="Vice Chairman, Operations and Technology"/>
   <person name="Andrew C. Esparza" age="50"
           title="Senior Vice President, Human Resources"/>
   <person name="Stephen J. Felice" age="51"
           title="President, Global Small and Medium Business"/>
   <person name="Ronald G. Garriques" age="45" title="President, Global Consumer"/>
   <person name="Brian T. Gladden" age="44"
           title="Senior Vice President and Chief Financial Officer"/>
   <person name="Erin Nelson" age="39" title="Vice President, Chief Marketing Officer"/>
   <person name="Stephen F. Schuckenbrock" age="48"
           title="President, Global Large Enterprise"/>
   <person name="Lawrence P. Tu" age="54"
           title="Senior Vice President, General Counsel and Secretary"/>
</results>

JohnBampton · August 21st, 2009, 10:10 PM

Hi all, thanks for your responses.

Martin - every time I try to transform the HTML document with the XSLT you provided I get an error. It says:

The element type "BR" must be terminated by the matching end-tag "</BR>".

I must be doing something wrong!?

Regards,

John.

mhkay · August 22nd, 2009, 07:59 AM

Sounds like you are submitting the HTML as the principal input of the stylesheet, so it's being parsed by an XML parser. The code is designed to take the HTML as a secondary input.

Martin Honnen · August 22nd, 2009, 09:00 AM

The HTML parser implemented by David Carlisle emits a lot of messages with xsl:message so when I run the stylesheet I get a lot of messages in the form
htmlparse: Not well formed (ignoring /div)
but at the end it will output the XML I posted earlier.

My command line with Saxon 9.2 Home Edition looks as follows:

java -Xmx128m -jar C:\path\saxon9he.jar -it:main -xsl:test2009082101Xsl.xml

This starts Saxon at the named template 'main' rather than with an XML input document.

test2009082101Xsl.xml is the stylesheet I posted. It includes David Carlisle's htmlparse.xsl and uses unparsed-text to load the HTML document, here the local file test20009082101Saved.html. Either change that name in the stylesheet to match your file or set the parameter named f.
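
Before running the full parser, it may help to confirm that the stylesheet can read the HTML file as text at all. The following is a small sketch for that purpose, not part of the stylesheet above: it keeps the parameter name f, the default file name and the ISO-8859-1 encoding from the earlier post, while the check element is made up purely for illustration.

```xml
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">

  <xsl:output indent="yes"/>

  <!-- Same parameter as in the stylesheet posted earlier -->
  <xsl:param name="f" select="'test20009082101Saved.html'"/>

  <!-- Entry point: start here with -it:main, no source document needed -->
  <xsl:template name="main">
    <xsl:choose>
      <xsl:when test="unparsed-text-available($f, 'ISO-8859-1')">
        <!-- Illustrative element reporting how many characters were read -->
        <check file="{$f}"
               characters="{string-length(unparsed-text($f, 'ISO-8859-1'))}"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:message terminate="yes">Cannot read <xsl:value-of select="$f"/> as text</xsl:message>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

A relative file name is resolved against the location of the stylesheet. To point the stylesheet at a different file without editing it, the parameter can be supplied on the Saxon command line after the options, for example by appending f=other.html (a hypothetical file name).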