extract data from one big text element

JohnBampton · August 22nd, 2009, 11:10 PM

That works, thanks.

JohnBampton · August 23rd, 2009, 04:22 AM

Hello,

I now have to find the executive officers of the Kellogg Company, and they are no longer in one nice table. They are spread out over many tables and divs in a repeating pattern.

If you search for "James M. Jenness" you will see what I mean.

I was able to take the code and advice that you provided and adapt it for two other documents: one for Microsoft and the other for 3M.

This is the Kellogg link:

http://www.sec.gov/Archives/edgar/da...47381e10vk.htm

Any help is always appreciated.

Regards.

Martin Honnen · August 23rd, 2009, 08:43 AM

You will need to look at the document structure and find the XPath expressions to select the elements containing the data you are looking for. I am afraid with such an irregular structure there is not much you can automate.
The code

Code:

  <xsl:template name="main">
    <xsl:variable name="html-doc" select="d:htmlparse(unparsed-text($f, 'ISO-8859-1'))"/>
   
    <results>    
      <xsl:apply-templates select="$html-doc/document/type/sequence/filename/description/text/html/body//div[div[contains(normalize-space(), 'Executive Officers.')]]/table/tr[2]"/>
    </results>
  </xsl:template>
  
  <xsl:template match="tr">
    <person name="{normalize-space(td[1])}" age="{normalize-space(td[2])}" title="{normalize-space(parent::table/following-sibling::div[1])}"/>
  </xsl:template>

finds only two items:

Code:

<results>
   <person name="James M. Jenness" age="62" title="Chairman of the Board"/>
   <person name="A. D. David Mackay" age="53"
           title="President and Chief Executive Officer"/>
</results>
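
To make the selection logic above easier to follow, here is a minimal, self-contained sketch of the same XPath pattern applied to a small inline sample instead of the parsed filing. The `$sample` variable, its element layout, and the names and titles in it are hypothetical stand-ins invented for illustration; the real document is far less regular, so treat this as a way to experiment with the predicates, not as a solution for the whole filing.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical, simplified stand-in for the parsed HTML:
       a marker div, then repeating table + div pairs. -->
  <xsl:variable name="sample">
    <body>
      <div>
        <div>Executive Officers.</div>
        <table>
          <tr><td>Name</td><td>Age</td></tr>
          <tr><td>Jane Doe</td><td>50</td></tr>
        </table>
        <div>Example Title One</div>
        <table>
          <tr><td>Name</td><td>Age</td></tr>
          <tr><td>John Roe</td><td>45</td></tr>
        </table>
        <div>Example Title Two</div>
      </div>
    </body>
  </xsl:variable>

  <xsl:template name="main">
    <results>
      <!-- Outer div: the one that has a child div containing the marker text.
           table/tr[2]: the second row of each table directly inside it. -->
      <xsl:apply-templates
          select="$sample/body//div[div[contains(normalize-space(), 'Executive Officers.')]]/table/tr[2]"/>
    </results>
  </xsl:template>

  <xsl:template match="tr">
    <!-- The title is read from the first div that follows the row's table. -->
    <person name="{normalize-space(td[1])}"
            age="{normalize-space(td[2])}"
            title="{normalize-space(parent::table/following-sibling::div[1])}"/>
  </xsl:template>

</xsl:stylesheet>
```

With an XSLT 2.0 processor such as Saxon, this can be run by naming `main` as the initial template (for Saxon 9 that should be the `-it:main` option). On the sample it should produce one `person` element per table that is a direct child of the marker div. Tables nested in other divs, or laid out differently, are not reached by this path, which is why each irregular section of the real document needs its own expression.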