XSLT: General questions and answers about XSLT. For issues strictly specific to the book XSLT 1.1 Programmers Reference, please post to that forum instead.
Welcome to the p2p.wrox.com Forums.
You are currently viewing the XSLT section of the Wrox Programmer to Programmer discussions. This is a community of software programmers and website developers including Wrox book authors and readers. New member registration was closed in 2019. New posts were shut off and the site was archived into this static format as of October 1, 2020. If you require technical support for a Wrox book please contact http://hub.wiley.com

August 21st, 2009, 08:08 AM
Friend of Wrox
Join Date: Feb 2009
Posts: 119
Thanks: 25
Thanked 3 Times in 3 Posts
extract data from one big text element
I have the following HTML file: http://www.sec.gov/Archives/edgar/da...66204e10vk.htm
What I do with it is:
replace all < with [[
replace all > with ]]
replace all &nbsp; with ' '
then wrap the whole text file in a <root> element
My task is then to find the Executive Officers.
If you scroll down or search through the HTML file for "Executive Officers of Dell", you will see they are in a colored table not too far in.
How would I extract these rows as XML elements, with their details as attributes?
Can there be a general solution to this?
By the way, Mike, my employer bought me your book today.
Regards

August 21st, 2009, 08:26 AM
Wrox Author
Join Date: Apr 2004
Posts: 4,962
Thanks: 0
Thanked 292 Times in 287 Posts
Looks like you are planning to parse the HTML "by hand". That's not how I would do it. I would parse it using TagSoup (available via Saxon's parse-html() extension function), and then access it as structured XML.
Code:
<xsl:apply-templates select="//text()[normalize-space() = 'Executive Officers of Dell']/following::*:table[1]"/>
__________________
Michael Kay
http://www.saxonica.com/
Author, XSLT 2.0 and XPath 2.0 Programmer's Reference

August 21st, 2009, 08:41 AM
Friend of Wrox
Join Date: Feb 2009
Posts: 119
Thanks: 25
Thanked 3 Times in 3 Posts
Thanks for the quick response.
I can't seem to find any information about Saxon's parse-html() function.
How do you use it?

August 21st, 2009, 08:45 AM
Friend of Wrox
Join Date: Nov 2007
Posts: 1,243
Thanks: 0
Thanked 245 Times in 244 Posts
See http://www.saxonica.com/documentatio...arse-html.html. I think it is new in Saxon 9.2 and is not available in the Home Edition of 9.2.
__________________
Martin Honnen
Microsoft MVP (XML, Data Platform Development) 2005/04 - 2013/03
My blog

August 21st, 2009, 08:58 AM
Friend of Wrox
Join Date: Feb 2009
Posts: 119
Thanks: 25
Thanked 3 Times in 3 Posts
Will this function work if it's HTML and not XHTML?

August 21st, 2009, 09:06 AM
Wrox Author
Join Date: Apr 2004
Posts: 4,962
Thanks: 0
Thanked 292 Times in 287 Posts
> Will this function work if it's HTML and not XHTML?
Yes, that's its whole purpose. If it's XHTML, you can just load it as a standard document.
Note, if the HTML is external, use saxon:parse-html(unparsed-text(...)). It works on a string rather than a URI, so you can also parse HTML held in CDATA sections within XML.
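As a rough, untested sketch of how that call might sit in a complete stylesheet: the saxon prefix is bound to http://saxon.sf.net/, the file name filing.html and the ISO-8859-1 encoding are placeholders, and the processor is assumed to be a Saxon edition that actually provides saxon:parse-html (per the earlier reply, it may not be in the Home Edition of 9.2). The *: wildcards are there because the namespace of the parsed elements is not assumed.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">

  <xsl:output indent="yes"/>

  <!-- Hypothetical file name; point this at the saved HTML page. -->
  <xsl:param name="f" select="'filing.html'"/>

  <!-- Named entry point, so no XML source document is needed. -->
  <xsl:template name="main">
    <!-- Read the HTML as a plain string, then hand it to the extension function. -->
    <xsl:variable name="html-doc"
        select="saxon:parse-html(unparsed-text($f, 'ISO-8859-1'))"/>
    <results>
      <xsl:apply-templates
          select="$html-doc//text()[normalize-space() = 'Executive Officers of Dell']/following::*:table[1]"/>
    </results>
  </xsl:template>

  <!-- Placeholder rule: copy the table that was found; replace with real extraction logic. -->
  <xsl:template match="*:table">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

Because the function takes a string, the same pattern should apply to HTML held in a CDATA section: pass the string value of the containing element instead of the result of unparsed-text().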
__________________
Michael Kay
http://www.saxonica.com/
Author, XSLT 2.0 and XPath 2.0 Programmer's Reference
Last edited by mhkay; August 21st, 2009 at 09:06 AM. Reason: spelling

August 21st, 2009, 09:19 AM
Friend of Wrox
Join Date: Nov 2007
Posts: 1,243
Thanks: 0
Thanked 245 Times in 244 Posts
As an alternative to Saxon's extension function, you could try David Carlisle's HTML parser, which is implemented in pure XSLT 2.0. Against a document as large as yours it is slow and needs a lot of memory, but I was able to make it work:
Code:
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:d="data:,dpc"
    exclude-result-prefixes="d"
    xpath-default-namespace="http://www.w3.org/1999/xhtml"
    version="2.0">

  <!-- David Carlisle's HTML parser, which supplies d:htmlparse() -->
  <xsl:include href="htmlparse.xsl"/>
  <!--
  <xsl:include href="http://www.dcarlisle.demon.co.uk/htmlparse.xsl"/>
  -->

  <xsl:output indent="yes"/>

  <!-- Location of the HTML document (a local copy by default) -->
  <xsl:param name="f" select="'test20009082101Saved.html'"/>
  <!--
  <xsl:param name="f" select="'http://www.sec.gov/Archives/edgar/data/826083/000095013409006106/d66204e10vk.htm'"/>
  -->

  <!-- Entry point: started as a named template, without an XML input document -->
  <xsl:template name="main">
    <xsl:variable name="html-doc" select="d:htmlparse(unparsed-text($f, 'ISO-8859-1'))"/>
    <results>
      <!-- Process the table rows after the first three -->
      <xsl:apply-templates select="$html-doc/document/type/sequence/filename/description/text/html/body//div[contains(normalize-space(.), 'Executive Officers of Dell')]/following-sibling::table[1]/tr[position() gt 3]"/>
    </results>
  </xsl:template>

  <!-- One person element per row: cells 1, 4 and 7 hold name, age and title -->
  <xsl:template match="tr">
    <person name="{normalize-space(td[1])}" age="{normalize-space(td[4])}" title="{normalize-space(td[7])}"/>
  </xsl:template>

</xsl:stylesheet>
The output is then:
Code:
<results>
   <person name="Michael S. Dell" age="44"
           title="Chairman of the Board and Chief Executive Officer"/>
   <person name="Bradley R. Anderson" age="49"
           title="Senior Vice President, Enterprise Product Group"/>
   <person name="Paul D. Bell" age="48" title="President, Global Public"/>
   <person name="Jeffrey W. Clarke" age="46"
           title="Vice Chairman, Operations and Technology"/>
   <person name="Andrew C. Esparza" age="50"
           title="Senior Vice President, Human Resources"/>
   <person name="Stephen J. Felice" age="51"
           title="President, Global Small and Medium Business"/>
   <person name="Ronald G. Garriques" age="45" title="President, Global Consumer"/>
   <person name="Brian T. Gladden" age="44"
           title="Senior Vice President and Chief Financial Officer"/>
   <person name="Erin Nelson" age="39" title="Vice President, Chief Marketing Officer"/>
   <person name="Stephen F. Schuckenbrock" age="48"
           title="President, Global Large Enterprise"/>
   <person name="Lawrence P. Tu" age="54"
           title="Senior Vice President, General Counsel and Secretary"/>
</results>
__________________
Martin Honnen
Microsoft MVP (XML, Data Platform Development) 2005/04 - 2013/03
My blog

August 21st, 2009, 10:10 PM
Friend of Wrox
Join Date: Feb 2009
Posts: 119
Thanks: 25
Thanked 3 Times in 3 Posts
Hi all, thanks for your responses.
Martin, every time I try to transform the HTML document with the XSLT you provided, I get an error. It says:
The element type "BR" must be terminated by the matching end-tag "</BR>".
I must be doing something wrong!?
Regards,
John.

August 22nd, 2009, 07:59 AM
Wrox Author
Join Date: Apr 2004
Posts: 4,962
Thanks: 0
Thanked 292 Times in 287 Posts
Sounds like you are submitting the HTML as the principal input of the stylesheet, so it's being parsed by an XML parser. The code is designed to take the HTML as a secondary input.
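To illustrate the difference, here is a minimal sketch of a stylesheet that takes no principal input at all: it starts at a named template and pulls the HTML in as text with unparsed-text(). The file name page.html and the info result element are made up for the example.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output indent="yes"/>

  <!-- Hypothetical file name, resolved relative to the stylesheet. -->
  <xsl:param name="f" select="'page.html'"/>

  <!-- Named entry point: invoked directly, so no XML parser ever sees the HTML. -->
  <xsl:template name="main">
    <!-- The HTML arrives as a string; unclosed tags such as <BR> are just characters here. -->
    <xsl:variable name="raw" select="unparsed-text($f, 'ISO-8859-1')"/>
    <info chars="{string-length($raw)}"
          mentions-br="{contains(upper-case($raw), '&lt;BR')}"/>
  </xsl:template>

</xsl:stylesheet>
```

With Saxon this means starting the transformation at the named template (the -it:main option) instead of supplying the HTML file as the source document.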
__________________
Michael Kay
http://www.saxonica.com/
Author, XSLT 2.0 and XPath 2.0 Programmer's Reference

August 22nd, 2009, 09:00 AM
Friend of Wrox
Join Date: Nov 2007
Posts: 1,243
Thanks: 0
Thanked 245 Times in 244 Posts
Quote:
Originally Posted by JohnBampton
Hi all, thanks for your responses.
Martin, every time I try to transform the HTML document with the XSLT you provided, I get an error. It says:
The element type "BR" must be terminated by the matching end-tag "</BR>".
I must be doing something wrong!?
Regards,
John.
The HTML parser implemented by David Carlisle emits a lot of messages with xsl:message, so when I run the stylesheet I get many messages of the form
htmlparse: Not well formed (ignoring /div)
but at the end it outputs the XML I posted earlier.
My command line with Saxon 9.2 Home Edition looks as follows:
java -Xmx128m -jar C:\path\saxon9he.jar -it:main -xsl:test2009082101Xsl.xml
That is, you tell Saxon to start with the named template 'main' (the -it:main option) rather than with an XML input document.
test2009082101Xsl.xml is the stylesheet I posted. It includes David Carlisle's htmlparse.xsl and uses unparsed-text to load the HTML document, by default the local file test20009082101Saved.html. Change or rename that file in the stylesheet as needed, or override it by setting the parameter named f.
__________________
Martin Honnen
Microsoft MVP (XML, Data Platform Development) 2005/04 - 2013/03
My blog