XSLT: General questions and answers about XSLT. For issues strictly specific to the book XSLT 1.1 Programmers Reference, please post to that forum instead.
Welcome to the p2p.wrox.com Forums.

August 21st, 2009, 09:08 AM
extract data from one big text element
I have the following html file http://www.sec.gov/Archives/edgar/da...66204e10vk.htm
What I do with it is:
replace all < with [[
replace all > with ]]
replace all &nbsp; with ' '
then wrap the whole text file in a <root> element
My task is then to find the Executive Officers.
If you scroll down or search the HTML file for "Executive Officers of Dell", you will see they are listed in a colored table not too far in.
How would I extract these rows as XML elements, with their details as attributes?
Can there be a general solution to this?
By the way, Mike, my employer bought me your book today.
Regards

August 21st, 2009, 09:26 AM
Looks like you are planning to parse the HTML "by hand". That's not how I would do it. I would parse it using TagSoup (available via Saxon's parse-html() extension function), and then access it as structured XML.
Code:
<xsl:apply-templates select="//text()[normalize-space() = 'Executive Officers of Dell']/following::*:table[1]"/>
__________________
Michael Kay
http://www.saxonica.com/
Author, XSLT 2.0 and XPath 2.0 Programmer's Reference

August 21st, 2009, 09:41 AM
Thanks for the quick response.
I can't seem to find any information about Saxon's parse-html().
How do you use it?

August 21st, 2009, 09:45 AM
See http://www.saxonica.com/documentatio...arse-html.html. I think it is new in 9.2 and is not available in the home edition of 9.2.
__________________
Martin Honnen
Microsoft MVP - XML

August 21st, 2009, 09:58 AM
Will this function work if the input is HTML and not XHTML?

August 21st, 2009, 10:06 AM
>Will this function work if the input is HTML and not XHTML?
Yes, that's its whole purpose. If it's XHTML, you can just load it as a standard document.
Note: if the HTML is external, use saxon:parse-html(unparsed-text(...)). The function works on a string rather than a URI, so you can also parse HTML held in CDATA sections within XML.
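For illustration, the pieces mentioned so far could be combined roughly as follows. This is only a sketch under several assumptions: it needs a Saxon 9.2 edition that provides saxon:parse-html() with TagSoup on the classpath; the parameter name html-uri, the file name input.html, the ISO-8859-1 encoding and the row/cell output elements are made up for the example; and the *:table and *:td wildcards are used so the paths match whichever namespace the parsed elements end up in.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">

  <xsl:output indent="yes"/>

  <!-- Hypothetical parameter: location of the HTML file (an assumption, not from the thread) -->
  <xsl:param name="html-uri" select="'input.html'"/>

  <!-- Entry point: start here with -it:main, so no XML source document is required -->
  <xsl:template name="main">
    <!-- Read the HTML as a plain string, then let the extension function parse it into a tree -->
    <xsl:variable name="doc"
        select="saxon:parse-html(unparsed-text($html-uri, 'ISO-8859-1'))"/>
    <results>
      <!-- Same idea as the path above: the first table following the heading text -->
      <xsl:apply-templates
          select="$doc//text()[normalize-space() = 'Executive Officers of Dell']/following::*:table[1]"/>
    </results>
  </xsl:template>

  <!-- Dump every row of the selected table as a row element with one cell per td -->
  <xsl:template match="*:table">
    <xsl:for-each select=".//*:tr">
      <row>
        <xsl:for-each select="*:td">
          <cell><xsl:value-of select="normalize-space(.)"/></cell>
        </xsl:for-each>
      </row>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Dumping every cell first makes it easier to see which td positions hold the name, age and title before writing a more specific template.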
__________________
Michael Kay
http://www.saxonica.com/
Author, XSLT 2.0 and XPath 2.0 Programmer's Reference
Last edited by mhkay : August 21st, 2009 at 10:06 AM.
Reason: spelling

August 21st, 2009, 10:19 AM
As an alternative to Saxon's extension function, you could try David Carlisle's HTML parser, which is implemented in pure XSLT 2.0. On a document as large as yours it is slow and needs a lot of memory, but I was able to make it work:
Code:
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:d="data:,dpc"
    exclude-result-prefixes="d"
    xpath-default-namespace="http://www.w3.org/1999/xhtml"
    version="2.0">

  <!-- Local copy of David Carlisle's HTML parser -->
  <xsl:include href="htmlparse.xsl"/>
  <!--
  <xsl:include href="http://www.dcarlisle.demon.co.uk/htmlparse.xsl"/>
  -->

  <xsl:output indent="yes"/>

  <!-- HTML file to parse; override by setting the parameter f -->
  <xsl:param name="f" select="'test20009082101Saved.html'"/>
  <!--
  <xsl:param name="f" select="'http://www.sec.gov/Archives/edgar/data/826083/000095013409006106/d66204e10vk.htm'"/>
  -->

  <!-- Entry point: a named template, so no XML source document is needed -->
  <xsl:template name="main">
    <!-- Load the HTML as text and parse it into a node tree -->
    <xsl:variable name="html-doc" select="d:htmlparse(unparsed-text($f, 'ISO-8859-1'))"/>
    <results>
      <!-- First table after the heading div; rows 1 to 3 are skipped -->
      <xsl:apply-templates select="$html-doc/document/type/sequence/filename/description/text/html/body//div[contains(normalize-space(.), 'Executive Officers of Dell')]/following-sibling::table[1]/tr[position() gt 3]"/>
    </results>
  </xsl:template>

  <!-- One person element per row; cells 1, 4 and 7 are used for name, age and title -->
  <xsl:template match="tr">
    <person name="{normalize-space(td[1])}" age="{normalize-space(td[4])}" title="{normalize-space(td[7])}"/>
  </xsl:template>

</xsl:stylesheet>
The output is then:
Code:
<results>
  <person name="Michael S. Dell" age="44"
          title="Chairman of the Board and Chief Executive Officer"/>
  <person name="Bradley R. Anderson" age="49"
          title="Senior Vice President, Enterprise Product Group"/>
  <person name="Paul D. Bell" age="48" title="President, Global Public"/>
  <person name="Jeffrey W. Clarke" age="46"
          title="Vice Chairman, Operations and Technology"/>
  <person name="Andrew C. Esparza" age="50"
          title="Senior Vice President, Human Resources"/>
  <person name="Stephen J. Felice" age="51"
          title="President, Global Small and Medium Business"/>
  <person name="Ronald G. Garriques" age="45" title="President, Global Consumer"/>
  <person name="Brian T. Gladden" age="44"
          title="Senior Vice President and Chief Financial Officer"/>
  <person name="Erin Nelson" age="39" title="Vice President, Chief Marketing Officer"/>
  <person name="Stephen F. Schuckenbrock" age="48"
          title="President, Global Large Enterprise"/>
  <person name="Lawrence P. Tu" age="54"
          title="Senior Vice President, General Counsel and Secretary"/>
</results>
__________________
Martin Honnen
Microsoft MVP - XML

August 21st, 2009, 11:10 PM
Hi all, thanks for your responses.
Martin, every time I try to transform the HTML document with the XSLT you provided, I get an error. It says:
The element type "BR" must be terminated by the matching end-tag "</BR>".
I must be doing something wrong!?
Regards,
John.

August 22nd, 2009, 08:59 AM
Sounds like you are submitting the HTML as the principal input of the stylesheet, so it's being parsed by an XML parser. The code is designed to take the HTML as a secondary input.
__________________
Michael Kay
http://www.saxonica.com/
Author, XSLT 2.0 and XPath 2.0 Programmer's Reference

August 22nd, 2009, 10:00 AM
Quote:
Originally Posted by JohnBampton
Hi all, thanks for your responses.
Martin, every time I try to transform the HTML document with the XSLT you provided, I get an error. It says:
The element type "BR" must be terminated by the matching end-tag "</BR>".
I must be doing something wrong!?
Regards,
John.
The HTML parser implemented by David Carlisle emits a lot of messages with xsl:message, so when I run the stylesheet I get many messages of the form
htmlparse: Not well formed (ignoring /div)
but at the end it outputs the XML I posted earlier.
My command line with Saxon 9.2 Home Edition looks as follows:
java -Xmx128m -jar C:\path\saxon9he.jar -it:main -xsl:test2009082101Xsl.xml
That is, you run Saxon starting at the named template 'main' (the -it:main option) and not with an XML input document.
test2009082101Xsl.xml is the stylesheet I posted. It includes David Carlisle's htmlparse.xsl and uses unparsed-text to load the HTML document from the local file test20009082101Saved.html. Change that file name in the stylesheet as needed, or override it by setting the parameter named f.
__________________
Martin Honnen
Microsoft MVP - XML