p2p.wrox.com Forums

  #1 (permalink)  
Old August 21st, 2009, 09:08 AM
Authorized User
Join Date: Feb 2009
Posts: 92
Thanks: 24
Thanked 1 Time in 1 Post
extract data from one big text element

I have the following HTML file: http://www.sec.gov/Archives/edgar/da...66204e10vk.htm

What I do with it is:

replace all < with [[
replace all > with ]]
replace all &nbsp; with ' '
then wrap the whole text file in a <root> element

My task is then to find the Executive Officers.

If you scroll down or search through the HTML file for "Executive Officers of Dell", you will see they are in a colored table not too far in.

How would I extract these rows as XML elements with their details as attributes?

Can there be a general solution to this?

By the way, Mike, my employer bought me your book today.

Regards
  #2 (permalink)  
Old August 21st, 2009, 09:26 AM
mhkay
Wrox Author
Join Date: Apr 2004
Location: Reading, Berks, United Kingdom.
Posts: 3,923
Thanks: 0
Thanked 82 Times in 80 Posts

Looks like you are planning to parse the HTML "by hand". That's not how I would do it. I would parse it using TagSoup (available via Saxon's parse-html() extension function), and then access it as structured XML.

Code:
<xsl:apply-templates select="//text()[normalize-space() = 'Executive Officers of Dell']/following::*:table[1]"/>
__________________
Michael Kay
http://www.saxonica.com/
Author, XSLT 2.0 and XPath 2.0 Programmer's Reference
  #3 (permalink)  
Old August 21st, 2009, 09:41 AM

Thanks for the quick response.

I can't seem to find any information about Saxon's parse-html().

How do you use it?
  #4 (permalink)  
Old August 21st, 2009, 09:45 AM
Friend of Wrox
Join Date: Nov 2007
Location: Germany
Posts: 655
Thanks: 0
Thanked 98 Times in 97 Posts

See http://www.saxonica.com/documentatio...arse-html.html. I think it is new in Saxon 9.2 and is not available in the Home Edition of 9.2.
__________________
Martin Honnen
Microsoft MVP - XML
My blog
  #5 (permalink)  
Old August 21st, 2009, 09:58 AM

Will this function work if it's HTML and not XHTML?
  #6 (permalink)  
Old August 21st, 2009, 10:06 AM

>Will this function work if the its html and not xhtml?

Yes, that's its whole purpose. If it's XHTML, you can just load it as a standard document.

Note, if the HTML is external, use saxon:parse-html(unparsed-text(...)). It works on a string rather than a URI, so you can also parse HTML held in CDATA sections within XML.
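
As a rough sketch of that pattern (not tested against the Dell filing), the stylesheet below reads the file as text and hands the string to saxon:parse-html(). It assumes a Saxon edition that provides the extension, TagSoup on the classpath, and the Saxon extension namespace http://saxon.sf.net/. The parameter name f, its default value input.html, and the ISO-8859-1 encoding are placeholders, and the exact-match test on the heading text may need loosening to contains() for the real document.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">

  <xsl:output indent="yes"/>

  <!-- Hypothetical parameter: location of the HTML file (placeholder default) -->
  <xsl:param name="f" select="'input.html'"/>

  <xsl:template name="main">
    <!-- Read the HTML as a plain string, then let the extension turn it into a tree -->
    <xsl:variable name="html-doc"
        select="saxon:parse-html(unparsed-text($f, 'ISO-8859-1'))"/>

    <results>
      <!-- *:table matches the table element whatever namespace the parser puts it in -->
      <xsl:copy-of
          select="$html-doc//text()[normalize-space() = 'Executive Officers of Dell']/following::*:table[1]"/>
    </results>
  </xsl:template>

</xsl:stylesheet>
```

A stylesheet like this has no XML source document, so it would be started at the named template main rather than by passing the HTML file as the input.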
__________________
Michael Kay
http://www.saxonica.com/
Author, XSLT 2.0 and XPath 2.0 Programmer's Reference

Last edited by mhkay : August 21st, 2009 at 10:06 AM. Reason: spelling
  #7 (permalink)  
Old August 21st, 2009, 10:19 AM

As an alternative to Saxon's extension function, you could try to use David Carlisle's HTML parser implemented in pure XSLT 2.0. Against the large document you have it is slow and needs a lot of memory but I was able to make it work:
Code:
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:d="data:,dpc"
  exclude-result-prefixes="d"
  xpath-default-namespace="http://www.w3.org/1999/xhtml"
  version="2.0">
  
  <xsl:include href="htmlparse.xsl"/>
  <!--
  <xsl:include href="http://www.dcarlisle.demon.co.uk/htmlparse.xsl"/>
  -->
  
  <xsl:output indent="yes"/>
  
  <xsl:param name="f" select="'test20009082101Saved.html'"/>
  <!--
  <xsl:param name="f" select="'http://www.sec.gov/Archives/edgar/data/826083/000095013409006106/d66204e10vk.htm'"/>
  -->
  
  <xsl:template name="main">
    <xsl:variable name="html-doc" select="d:htmlparse(unparsed-text($f, 'ISO-8859-1'))"/>
   
    <results>
      <xsl:apply-templates select="$html-doc/document/type/sequence/filename/description/text/html/body//div[contains(normalize-space(.), 'Executive Officers of Dell')]/following-sibling::table[1]/tr[position() gt 3]"/>
    </results>
  </xsl:template>
  
  <xsl:template match="tr">
    <person name="{normalize-space(td[1])}" age="{normalize-space(td[4])}" title="{normalize-space(td[7])}"/>
  </xsl:template>

</xsl:stylesheet>
Output then is
Code:
<results>
   <person name="Michael S. Dell" age="44"
           title="Chairman of the Board and Chief Executive Officer"/>
   <person name="Bradley R. Anderson" age="49"
           title="Senior Vice President, Enterprise Product Group"/>
   <person name="Paul D. Bell" age="48" title="President, Global Public"/>
   <person name="Jeffrey W. Clarke" age="46"
           title="Vice Chairman, Operations and Technology"/>
   <person name="Andrew C. Esparza" age="50"
           title="Senior Vice President, Human Resources"/>
   <person name="Stephen J. Felice" age="51"
           title="President, Global Small and Medium Business"/>
   <person name="Ronald G. Garriques" age="45" title="President, Global Consumer"/>
   <person name="Brian T. Gladden" age="44"
           title="Senior Vice President and Chief Financial Officer"/>
   <person name="Erin Nelson" age="39" title="Vice President, Chief Marketing Officer"/>
   <person name="Stephen F. Schuckenbrock" age="48"
           title="President, Global Large Enterprise"/>
   <person name="Lawrence P. Tu" age="54"
           title="Senior Vice President, General Counsel and Secretary"/>
</results>
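
The cell positions td[1], td[4] and td[7] are tied to the layout of this particular table. As a sketch of a more general variant, the template below could replace the tr template in the stylesheet above: it emits one attribute per non-empty cell, whatever the number of columns. The element name row and the attribute names col1, col2, and so on are made up for illustration, and the translate() call rests on the assumption that spacer cells may hold only non-breaking spaces.

```xml
<!-- Hypothetical replacement for the tr template in the stylesheet above -->
<xsl:template match="tr">
  <row>
    <!-- Keep only cells that contain something other than whitespace or non-breaking spaces -->
    <xsl:for-each select="td[normalize-space(translate(., '&#160;', ' '))]">
      <!-- position() counts the surviving cells, giving col1, col2, col3, ... -->
      <xsl:attribute name="col{position()}"
                     select="normalize-space(translate(., '&#160;', ' '))"/>
    </xsl:for-each>
  </row>
</xsl:template>
```

The trade-off is that the attribute names no longer say what each value means, so this suits a first extraction pass more than a final output format.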
__________________
Martin Honnen
Microsoft MVP - XML
My blog
  #8 (permalink)  
Old August 21st, 2009, 11:10 PM

Hi all, thanks for your responses.

Martin - every time I try to transform the HTML document with the XSLT you provided, I get an error. It says:

The element type "BR" must be terminated by the matching end-tag "</BR>".

I must be doing something wrong!?

Regards,

John.
  #9 (permalink)  
Old August 22nd, 2009, 08:59 AM

Sounds like you are submitting the HTML as the principal input of the stylesheet, so it's being parsed by an XML parser. The code is designed to take the HTML as a secondary input.
__________________
Michael Kay
http://www.saxonica.com/
Author, XSLT 2.0 and XPath 2.0 Programmer's Reference
  #10 (permalink)  
Old August 22nd, 2009, 10:00 AM

Quote:
Originally Posted by JohnBampton View Post
Hi all, thanks for your responses.

Martin - every time I try to transform the HTML document with the XSLT you provided, I get an error. It says:

The element type "BR" must be terminated by the matching end-tag "</BR>".

I must be doing something wrong!?

Regards,

John.
The HTML parser implemented by David Carlisle emits a lot of messages with xsl:message so when I run the stylesheet I get a lot of messages in the form
htmlparse: Not well formed (ignoring /div)
but at the end it will output the XML I posted earlier.

My command line with Saxon 9.2 Home Edition looks as follows:

java -Xmx128m -jar C:\path\saxon9he.jar -it:main -xsl:test2009082101Xsl.xml

so you run Saxon to start with the named template 'main' and not with an XML input document.

test2009082101Xsl.xml is the stylesheet I posted and that stylesheet includes David Carlisle's htmlparse.xsl and uses unparsed-text to load the HTML document (locally as the file test20009082101Saved.html which you would need to change/rename as needed in the stylesheet or change by setting the parameter named f).
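
Some tools insist on a source document and cannot start at a named template. A possible workaround, sketched here as an addition to the stylesheet posted earlier, is a template for the document node that simply calls main. Any small well-formed XML file can then serve as a dummy input, and its content is ignored.

```xml
<!-- Added to the stylesheet: lets a dummy XML source trigger the same processing as -it:main -->
<xsl:template match="/">
  <xsl:call-template name="main"/>
</xsl:template>
```

The HTML file is still loaded through unparsed-text(), so it must not be passed as the source document itself.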
__________________
Martin Honnen
Microsoft MVP - XML
My blog
All times are GMT -4.

