Wrox Programmer Forums
Welcome to the p2p.wrox.com Forums.

You are currently viewing the XSLT section of the Wrox Programmer to Programmer discussions. This is a community of software programmers and website developers including Wrox book authors and readers. New member registration was closed in 2019. New posts were shut off and the site was archived into this static format as of October 1, 2020. If you require technical support for a Wrox book please contact http://hub.wiley.com
 
September 21st, 2011, 10:40 AM
Registered User
How to parse the HTML contents inside a CDATA section

Hello,

We have a CDATA section inside the "rss/channel/item/description" element of an RSS feed file.
The content of the CDATA section is a set of HTML fragments without a root html element or any declarations. The text contains many HTML entity references such as &mdash;

We need to extract the HTML markup and save it as valid HTML.

Possible solutions:

1. Use xsl:value-of on the CDATA content to produce the article file and store it as a temp file, then read the article back in through another set of XSL templates.

2. Preprocess the RSS file by extracting the CDATA content into a namespaced XML file, then apply Tidy and XSLT only to the generated XML file.

I would like to do most of it using XSL.

Are there any pros/cons to these approaches? Is there a better approach?

Attached are the XSL and an input XML file:

Code:
<?xml version="1.0" encoding="UTF-8"?>

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
    xmlns:html="http://www.w3.org/1999/xhtml"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="xsl html xs saxon">

    <xsl:output method="xml" indent="yes" cdata-section-elements="description" encoding="UTF-8"/>


    <xsl:template match="/item">
         <xsl:text disable-output-escaping='yes'>&lt;!DOCTYPE HTML [&lt;!ENTITY mdash "—"&gt;] &gt; 
         </xsl:text>
        <html>
            <head></head>
            <body>
                <article>
                    <div><xsl:apply-templates select="description"/></div>
                </article>
            </body>
        </html>
    </xsl:template>

    <xsl:template match="description">
        <xsl:value-of select="." disable-output-escaping="yes"/>
        <!-- ideally, we don't want to have to pass in a fakeroot element to enable parsing -->
        <xsl:apply-templates select="saxon:parse(concat('&lt;fakeroot&gt;', . , '&lt;/fakeroot&gt;'))"/>
    </xsl:template>

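    <!-- Copy the listed HTML elements through; anything else falls to the built-in rules. -->
    <xsl:template match="b|i|p|br|strong|em|span|div|h1|h2|h3|h4|h5|h6|ul|li|ol|dd|dl|dt|hr|table|th|tr|td|img|figure|figcaption|sub|sup|pre|a|blockquote">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <!-- Copy attributes as attributes; the built-in rule would emit only their values as text. -->
    <xsl:template match="@*">
        <xsl:copy/>
    </xsl:template>

</xsl:stylesheet>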


This is the input XML file
----------------------------

Code:
<?xml version="1.0" encoding="UTF-8"?>
<item>
    <description><![CDATA[<p></p><p>The Associated Press</p><p>HONG KONG &mdash; Hong Kong police have made their biggest ever cocaine bust, seizing more than 1,200 pounds (560 kilograms) of the drug and arresting eight people.</p><p>Police say the five men and three women arrested included five Mexican nationals, an American and a Colombian. They are to appear in court later Monday.</p><p>Police say narcotics bureau officers acting on a tip carried out raids at a warehouse and other locations across the city starting Friday. Police say the cocaine seized in the raids was worth about $600 million Hong Kong dollars ($77 million).</p><p>Police say the warehouse was believed to be a drug packaging and storage center.</p><p>___</p><p>September 18, 2011 11:40 PM EDT </p><p>Copyright 2011, The Associated Press. All rights reserved. This material may not be published, broadcast, rewritten or redistributed.</p>]]>
    </description>
    <summary>HONG KONG - Hong Kong police have made their biggest ever cocaine bust, seizing more
than 1,200 pounds (560 kilograms) of the drug and arresting eight people.</summary>
</item>

Last edited by vijaychhipa; September 21st, 2011 at 04:15 PM..
 
September 22nd, 2011, 12:49 PM
Friend of Wrox

You seem to be using XSLT 2.0, so one option for parsing HTML tag soup (i.e. HTML found on the web that is not well-formed X(HT)ML) is http://web-xslt.googlecode.com/svn/t.../htmlparse.xsl, a parser written in pure XSLT 2.0.
Another option, with the commercial versions of Saxon 9, is the parse-html extension function: http://www.saxonica.com/documentatio...parse-html.xml.
__________________
Martin Honnen
Microsoft MVP (XML, Data Platform Development) 2005/04 - 2013/03
My blog
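
A minimal sketch of the htmlparse.xsl option, under some assumptions that are not confirmed in this thread: a local copy of htmlparse.xsl sits next to the stylesheet, and it exposes its parser as d:htmlparse in the namespace data:,dpc, with a three-argument form taking the string, the namespace for the generated elements, and an HTML-mode flag. The signature is quoted from memory, so check the comments at the top of the file you download.

Code:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:d="data:,dpc"
    exclude-result-prefixes="d">

    <!-- Assumption: htmlparse.xsl has been saved in the same directory. -->
    <xsl:import href="htmlparse.xsl"/>

    <xsl:output method="html" indent="yes" encoding="UTF-8"/>

    <xsl:template match="/item">
        <html>
            <head><title>Article</title></head>
            <body>
                <article>
                    <!-- Assumed call: '' asks for elements in no namespace,
                         true() asks for HTML (tag soup) parsing rules. -->
                    <div>
                        <xsl:copy-of select="d:htmlparse(string(description), '', true())"/>
                    </div>
                </article>
            </body>
        </html>
    </xsl:template>

</xsl:stylesheet>
```

If the call works as assumed, no fakeroot wrapper is needed, because the function takes the fragment string directly and returns the parsed nodes.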
 
September 22nd, 2011, 02:47 PM
Registered User
Unable to parse the named entities

Martin,
Thanks for your response. Yes, I am using XSLT 2.0.
The main issue is really that named entities such as &mdash; cannot be handled: the parse() method complains that the input is invalid.
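
One possible workaround for the failure described above, offered as a sketch and not a tested fix: saxon:parse() hands the string to an XML parser, and an XML parser only knows an entity such as &mdash; if a DTD declares it. Prepending an internal DTD subset to the string supplies that declaration. The variable name entity-decls is made up for illustration, and only mdash is declared here; every other named entity used in the feed would need its own declaration.

Code:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">

    <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

    <!-- Hypothetical helper: a DOCTYPE whose internal subset declares the
         entities the feed uses. &#8212; is the character mdash stands for. -->
    <xsl:variable name="entity-decls"
        select="'&lt;!DOCTYPE fakeroot [&lt;!ENTITY mdash &quot;&#8212;&quot;&gt;]&gt;'"/>

    <xsl:template match="/item">
        <div><xsl:apply-templates select="description"/></div>
    </xsl:template>

    <xsl:template match="description">
        <!-- Wrap the fragment in fakeroot, parse it, and copy only the
             children of fakeroot so the wrapper does not reach the output. -->
        <xsl:copy-of
            select="saxon:parse(concat($entity-decls,
                                       '&lt;fakeroot&gt;', ., '&lt;/fakeroot&gt;'))/fakeroot/node()"/>
    </xsl:template>

</xsl:stylesheet>
```

This only helps while the fragments are well-formed XML apart from the entities. An unclosed tag such as a bare br would still make the parse fail, which is where a tag soup parser becomes the better fit.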
 
September 23rd, 2011, 06:01 AM
Friend of Wrox

With Saxon, if you want to parse HTML, the parse-html function should be able to deal with references to all of the entities that HTML defines. David Carlisle's htmlparse.xsl is also able to deal with them.
__________________
Martin Honnen
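
A minimal sketch of the Saxon route, assuming a commercial Saxon 9 edition with the TagSoup parser available on the classpath. saxon:parse-html takes the HTML as a string and returns a document node. The parser wraps the fragment in html and body elements, and as far as I know it puts them in the XHTML namespace, so the *:body wildcard test is used here to avoid depending on that detail.

Code:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">

    <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

    <xsl:template match="/item">
        <div><xsl:apply-templates select="description"/></div>
    </xsl:template>

    <xsl:template match="description">
        <!-- Parse the CDATA text as HTML and copy what ended up inside body.
             No fakeroot wrapper and no entity declarations are supplied. -->
        <xsl:copy-of select="saxon:parse-html(string(.))//*:body/node()"/>
    </xsl:template>

</xsl:stylesheet>
```

If the copied elements do turn out to be in the XHTML namespace, templates that match them by name would need xpath-default-namespace="http://www.w3.org/1999/xhtml" or a prefixed name such as html:p.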




