Hello,
We have a CDATA section inside the "rss/channel/item/description" element of an RSS feed file.
The content of the CDATA section is a set of HTML fragments without any root HTML element or declarations. The text has lots of HTML entities such as &mdash;
We need to extract the HTML markup and save it as valid HTML.
Possible solutions:
1. Use xsl:value-of on the CDATA element to produce the article file and store it as a temp file. Then read the article through another set of XSL templates.
2. Preprocess the RSS file by extracting the CDATA into a namespaced XML file. Then apply Tidy and XSLT only on the generated XML file.
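To make approach 2 more concrete, here is a rough sketch of what the preprocessing step might look like in Python (standard library only). The function names are made up for illustration. It assumes the fragment inside the CDATA is already well-formed apart from the named HTML entities, so the Tidy pass is not shown, and it still wraps the fragment in a single root element (a div here) so that an XML parser will accept it.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that XML predefines; these must be left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def html_entities_to_numeric(fragment):
    """Replace named HTML entities (e.g. &mdash;) with numeric references."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep XML entities and unknown names as-is
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, fragment)

def description_to_xml(item_xml):
    """Extract the description text of an <item> and parse it as XML.

    Hypothetical helper: the fragment is wrapped in a <div> root because
    the CDATA content has no single root element of its own.
    """
    item = ET.fromstring(item_xml)
    fragment = item.findtext("description", default="")
    wrapped = "<div>" + html_entities_to_numeric(fragment) + "</div>"
    return ET.fromstring(wrapped)

if __name__ == "__main__":
    sample = (
        "<item><description><![CDATA[<p>HONG KONG &mdash; Hong Kong</p>"
        "<p>A &amp; B</p>]]></description></item>"
    )
    root = description_to_xml(sample)
    # Serialize the parsed fragment; this file could then be fed to XSLT.
    print(ET.tostring(root, encoding="unicode"))
```

The resulting XML no longer depends on a DOCTYPE with entity declarations, so the stylesheet would only have to copy the allowed elements. If the fragments contain unclosed tags such as a bare br, the parse will fail and Tidy (or an HTML parser) would be needed first.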
I would like to do most of it using XSL.
Any pros/cons of these approaches? Is there a better approach?
Attached are the XSL and an input XML file:
Code:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:html="http://www.w3.org/1999/xhtml"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:saxon="http://saxon.sf.net/"
exclude-result-prefixes="xsl html xs saxon">
<xsl:output method="xml" indent="yes" cdata-section-elements="description" encoding="UTF-8"/>
<xsl:template match="/item">
<xsl:text disable-output-escaping='yes'>&lt;!DOCTYPE HTML [&lt;!ENTITY mdash "—">] >
</xsl:text>
<html>
<head></head>
<body>
<article>
<div><xsl:apply-templates select="description"/></div>
</article>
</body>
</html>
</xsl:template>
<xsl:template match="description">
<xsl:value-of select="." disable-output-escaping="yes"/>
<!-- ideally, we don't want to have to pass in a fakeroot element to enable parsing -->
<xsl:apply-templates select="saxon:parse(concat('&lt;fakeroot>', . , '&lt;/fakeroot>'))"/>
</xsl:template>
<xsl:template match="b|i|p|br|strong|em|span|div|h1|h2|h3|h4|h5|h6|ul|li|ol|dd|dl|dt|hr|table|th|tr|td|img|figure|figcaption|sub|sup|pre|a|blockquote">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
This is the input XML file
----------------------------
Code:
<?xml version="1.0" encoding="UTF-8"?>
<item>
<description><![CDATA[<p></p><p>The Associated Press</p><p>HONG KONG — Hong Kong police have made their biggest ever cocaine bust, seizing more than 1,200 pounds (560 kilograms) of the drug and arresting eight people.</p><p>Police say the five men and three women arrested included five Mexican nationals, an American and a Colombian. They are to appear in court later Monday.</p><p>Police say narcotics bureau officers acting on a tip carried out raids at a warehouse and other locations across the city starting Friday. Police say the cocaine seized in the raids was worth about $600 million Hong Kong dollars ($77 million).</p><p>Police say the warehouse was believed to be a drug packaging and storage center.</p><p>___</p><p>September 18, 2011 11:40 PM EDT </p><p>Copyright 2011, The Associated Press. All rights reserved. This material may not be published, broadcast, rewritten or redistributed.</p>]]>
</description>
<summary>HONG KONG - Hong Kong police have made their biggest ever cocaine bust, seizing more
than 1,200 pounds (560 kilograms) of the drug and arresting eight people.</summary>
</item>