Wrox Programmer Forums
 
August 21st, 2009, 08:08 AM
Extract data from one big text element

I have the following HTML file: http://www.sec.gov/Archives/edgar/da...66204e10vk.htm

What I do with it is:

replace all < with [[
replace all > with ]]
replace all &nbsp; with ' '
then wrap the whole text file in a <root> element

My task is then to find the Executive Officers.

If you scroll down or search through the HTML file for "Executive Officers of Dell", you will see they are in a colored table not too far in.

How would I extract these rows as XML elements, with their details as attributes?

Can there be a general solution to this?

By the way, Mike, my employer bought me your book today.

Regards
 
August 21st, 2009, 08:26 AM
mhkay

Looks like you are planning to parse the HTML "by hand". That's not how I would do it. I would parse it using TagSoup (available via Saxon's parse-html() extension function), and then access it as structured XML.

Code:
<xsl:apply-templates select="//text()[normalize-space() = 'Executive Officers of Dell']/following::*:table[1]"/>
__________________
Michael Kay
http://www.saxonica.com/
Author, XSLT 2.0 and XPath 2.0 Programmer's Reference
 
August 21st, 2009, 08:41 AM

Thanks for the quick response.

I can't seem to find any information about Saxon's parse-html().

How do you use it?
 
August 21st, 2009, 08:45 AM

See http://www.saxonica.com/documentatio...arse-html.html. I think it is new in Saxon 9.2 and is not available in the Home Edition of 9.2.
__________________
Martin Honnen
Microsoft MVP (XML, Data Platform Development) 2005/04 - 2013/03
 
August 21st, 2009, 08:58 AM

Will this function work if the input is HTML and not XHTML?
 
August 21st, 2009, 09:06 AM
mhkay

>Will this function work if the input is HTML and not XHTML?

Yes, that's its whole purpose. If it's XHTML, you can just load it as a standard document.

Note: if the HTML is external, use saxon:parse-html(unparsed-text(...)). It works on a string rather than a URI, so you can also parse HTML held in CDATA sections within XML.
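
As a rough illustration of that call, here is a minimal sketch that has not been tested against a particular Saxon release. It assumes a Saxon edition that provides saxon:parse-html with TagSoup on the classpath, the usual Saxon extension namespace http://saxon.sf.net/, and an ISO-8859-1 encoded local copy of the page. The file name saved.html, the parameter name f and the results wrapper element are placeholders invented for the example.

```xslt
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:saxon="http://saxon.sf.net/"
  exclude-result-prefixes="saxon"
  version="2.0">

  <xsl:output indent="yes"/>

  <!-- Hypothetical file name: point this at your own saved copy of the HTML page. -->
  <xsl:param name="f" select="'saved.html'"/>

  <!-- Named entry template, so the transformation needs no XML source document. -->
  <xsl:template name="main">
    <!-- Read the HTML as plain text (encoding is an assumption), then let the
         extension function build a tree from that string. -->
    <xsl:variable name="html-doc"
                  select="saxon:parse-html(unparsed-text($f, 'ISO-8859-1'))"/>
    <results>
      <!-- Copy the first table that follows the heading text.
           *:table matches a table element in any namespace. -->
      <xsl:copy-of
        select="$html-doc//text()[normalize-space() = 'Executive Officers of Dell']/following::*:table[1]"/>
    </results>
  </xsl:template>

</xsl:stylesheet>
```

The path starts from $html-doc because a named entry template has no context document. If the heading text in the page is not an exact match, a contains() test may be needed instead of the equality comparison.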
__________________
Michael Kay
 
August 21st, 2009, 09:19 AM

As an alternative to Saxon's extension function, you could try David Carlisle's HTML parser, which is implemented in pure XSLT 2.0. Against a document as large as yours it is slow and needs a lot of memory, but I was able to make it work:
Code:
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:d="data:,dpc"
  exclude-result-prefixes="d"
  xpath-default-namespace="http://www.w3.org/1999/xhtml"
  version="2.0">
  
  <xsl:include href="htmlparse.xsl"/>
  <!--
  <xsl:include href="http://www.dcarlisle.demon.co.uk/htmlparse.xsl"/>
  -->
  
  <xsl:output indent="yes"/>
  
  <xsl:param name="f" select="'test20009082101Saved.html'"/>
  <!--
  <xsl:param name="f" select="'http://www.sec.gov/Archives/edgar/data/826083/000095013409006106/d66204e10vk.htm'"/>
  -->
  
  <xsl:template name="main">
    <xsl:variable name="html-doc" select="d:htmlparse(unparsed-text($f, 'ISO-8859-1'))"/>
   
    <results>
      <xsl:apply-templates select="$html-doc/document/type/sequence/filename/description/text/html/body//div[contains(normalize-space(.), 'Executive Officers of Dell')]/following-sibling::table[1]/tr[position() gt 3]"/>
    </results>
  </xsl:template>
  
  <xsl:template match="tr">
    <person name="{normalize-space(td[1])}" age="{normalize-space(td[4])}" title="{normalize-space(td[7])}"/>
  </xsl:template>

</xsl:stylesheet>
The output is then:
Code:
<results>
   <person name="Michael S. Dell" age="44"
           title="Chairman of the Board and Chief Executive Officer"/>
   <person name="Bradley R. Anderson" age="49"
           title="Senior Vice President, Enterprise Product Group"/>
   <person name="Paul D. Bell" age="48" title="President, Global Public"/>
   <person name="Jeffrey W. Clarke" age="46"
           title="Vice Chairman, Operations and Technology"/>
   <person name="Andrew C. Esparza" age="50"
           title="Senior Vice President, Human Resources"/>
   <person name="Stephen J. Felice" age="51"
           title="President, Global Small and Medium Business"/>
   <person name="Ronald G. Garriques" age="45" title="President, Global Consumer"/>
   <person name="Brian T. Gladden" age="44"
           title="Senior Vice President and Chief Financial Officer"/>
   <person name="Erin Nelson" age="39" title="Vice President, Chief Marketing Officer"/>
   <person name="Stephen F. Schuckenbrock" age="48"
           title="President, Global Large Enterprise"/>
   <person name="Lawrence P. Tu" age="54"
           title="Senior Vice President, General Counsel and Secretary"/>
</results>
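
The positions td[1], td[4] and td[7] are tied to the spacer cells in this particular table. A more tolerant variant of the tr template might pick the non-empty cells instead. This is only a sketch: it assumes that spacer cells contain nothing but whitespace or non-breaking spaces (U+00A0), and that the first three non-empty cells in each row are name, age and title. Neither assumption has been checked against other filings.

```xslt
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xpath-default-namespace="http://www.w3.org/1999/xhtml"
  version="2.0">

  <!-- Assumption: spacer cells hold only whitespace or non-breaking spaces. -->
  <xsl:template match="tr">
    <!-- Keep only the cells that still contain text once U+00A0 is treated
         as an ordinary space and surrounding whitespace is collapsed. -->
    <xsl:variable name="cells"
                  select="td[normalize-space(translate(., '&#160;', ' '))]"/>
    <!-- Assumption: the non-empty cells appear in the order name, age, title. -->
    <person name="{normalize-space(translate($cells[1], '&#160;', ' '))}"
            age="{normalize-space(translate($cells[2], '&#160;', ' '))}"
            title="{normalize-space(translate($cells[3], '&#160;', ' '))}"/>
  </xsl:template>

</xsl:stylesheet>
```

Only the xsl:template element would replace the tr template in the stylesheet above; the xsl:stylesheet wrapper is there to keep the fragment well-formed on its own. A row with fewer than three non-empty cells simply produces empty attribute values.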
__________________
Martin Honnen
 
August 21st, 2009, 10:10 PM

Hi all, thanks for your responses.

Martin, every time I try to transform the HTML document with the XSLT you provided, I get an error. It says:

The element type "BR" must be terminated by the matching end-tag "</BR>".

I must be doing something wrong!?

Regards,

John.
 
August 22nd, 2009, 07:59 AM
mhkay

Sounds like you are submitting the HTML as the principal input of the stylesheet, so it's being parsed by an XML parser. The code is designed to take the HTML as a secondary input.
__________________
Michael Kay
 
August 22nd, 2009, 09:00 AM

The HTML parser implemented by David Carlisle emits a lot of messages with xsl:message, so when I run the stylesheet I get many messages of the form
htmlparse: Not well formed (ignoring /div)
but at the end it outputs the XML I posted earlier.

My command line with Saxon 9.2 Home Edition looks as follows:

java -Xmx128m -jar C:\path\saxon9he.jar -it:main -xsl:test2009082101Xsl.xml

So you run Saxon to start with the named template 'main' and not with an XML input document.

test2009082101Xsl.xml is the stylesheet I posted. It includes David Carlisle's htmlparse.xsl and uses unparsed-text to load the HTML document from the local file test20009082101Saved.html. Change or rename that file as needed in the stylesheet, or override it by setting the parameter named f.
__________________
Martin Honnen