Escaped CDATA

mattisimo · February 3rd, 2010, 07:16 AM

Hi, I am trying to consume the XML output from a webservice and extract the XML and translate it to HTML using XSL.

I have run into an issue where the content of the nodes is surrounded with CDATA which is escaped, as is the content of the CDATA. I would like to be able to remove the CDATA and output the contents as unescaped HTML.

Here's a sample of the XML, any suggestions gratefully received.

Code:

<?xml version="1.0" encoding="UTF-8"?>
<Articles>
	<Article>
		<ID>{GUID}</ID>
		<Title>Article TItle</Title>
		<DatePublished>2008-11-26 12:19:00</DatePublished>
		<Intro>&lt;![CDATA[&lt;strong&gt;Intro text that is surrounded by a strong tag. &lt;/strong&gt;]]&gt;</Intro>
		<Summary>&lt;![CDATA[&lt;![CDATA[Summary text that sometimes has a double CDATA tag for some reason.  May also have strong or p tags also. ]]&gt;]]&gt;</Summary>
		<Url></Url>
	</Article>
</Articles>

Thanks,

Matt

samjudson · February 3rd, 2010, 07:30 AM

I'm fairly sure that nested CDATA sections are invalid XML, unless the inner CDATA markup is escaped as well.

Saxon has an extension function called saxon:parse, which can be used to parse an element's text as if it were XML. You can then output the result using xsl:copy-of, for example:

<xsl:copy-of select="saxon:parse(//Intro)" xmlns:saxon="http://saxon.sf.net/"/>

Martin Honnen · February 3rd, 2010, 07:33 AM

That is not easy to solve, as the CDATA section markup is also escaped.
If you had e.g.

Code:

<Intro><![CDATA[&lt;strong&gt;foobar&lt;/strong&gt;]]></Intro>

you could simply use disable-output-escaping as in

Code:

<xsl:template match="Intro">
  <xsl:value-of select="." disable-output-escaping="yes"/>
</xsl:template>

But as your CDATA section markup is also escaped that approach does not work. Which XSLT processor do you use? In case of Saxon 9 you might be able to parse the contents of the 'Intro' or 'Summary' elements with an extension function.

mattisimo · February 3rd, 2010, 07:50 AM

Hi, thanks for your responses.

I wasn't using any XSLT processor at the moment, just hand coding the transform with basic templates and <xsl:value-of type stuff.

Perhaps I should investigate Saxon? The Parse solution looks good.

All the examples I have found so far online seem to have the CDATA unescaped which makes a lot more sense to me than to escape the lot.

I was wondering if I could do a simple replace on the starting and ending CDATA, replacing with an empty string and then unescape the rest? Does that sound reasonable? It might get around the illogical inclusion of double CDATAs.

Thanks again,

Matt

samjudson · February 3rd, 2010, 08:06 AM

Apologies, since Martin posted I've double checked - and the parse method doesn't work I'm afraid - at least not with the CData escaped as well.

Martin Honnen · February 3rd, 2010, 08:44 AM

If you use XSLT 2.0 and you know those elements like 'Intro' or 'Summary' start and end with that escaped CDATA section markup then you could remove it as follows and then output the escaped HTML markup:

Code:

  <xsl:template match="Intro | Summary">
       <xsl:value-of select="replace(., '^(&lt;!\[CDATA\[)+|(\]\]&gt;)+$', '')" disable-output-escaping="yes"/>
  </xsl:template>

mattisimo · February 3rd, 2010, 10:48 AM

Hi again, thanks for your suggestion.

I've tried this approach in Altova XML Spy 2004 and it said that the replace function wasn't valid.

I tried writing an ASP.NET page to run it instead, in case this version of the software didn't support it, and I get the same message:

'replace()' is an unknown XSLT function.

Here's the sample stylesheet:

Code:

<?xml version="1.0" encoding="UTF-16"?>
<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:f="http://fxsl.sf.net/">
	<xsl:output method="xml" encoding="utf-16" omit-xml-declaration="yes"/>
	<xsl:template match="/">
		<xsl:apply-templates/>
	</xsl:template>
	
	<xsl:template match="LiveWellArticle">
		<xsl:apply-templates select="Intro"/>
	</xsl:template>
	
	<xsl:template match="Intro">
		<xsl:param name="text" select="."/>
        <br/><xsl:value-of select="replace($text, '^(&lt;!\[CDATA\[)+|(\]\]&gt;)+$', '')" disable-output-escaping="yes"/>
        <br/><xsl:value-of select="$text" disable-output-escaping="yes"/>
        <br/><xsl:value-of select="." disable-output-escaping="yes"></xsl:value-of>
	</xsl:template>

</xsl:stylesheet>

In XML Pad it just seems to ignore the line with the replace as it outputs nothing for that line and outputs the full contents including the CDATA for the other two lines.

Am I missing something with the implementation of <xsl:value-of select="replace(.......)">?

Thanks again for your help with this. As you may have realised I'm new to XSL.

Martin Honnen · February 3rd, 2010, 10:54 AM

replace is defined in XPath 2.0, so you need an XSLT 2.0 processor and an XSLT 2.0 stylesheet to use that function. As XSLT 2.0 and XPath 2.0 have only existed as W3C Recommendations since the beginning of 2007, an editor with 2004 in its name is unlikely to support that function, or XSLT/XPath 2.0 at all.
Try Saxon 9 or the free AltovaXML tools 2010.
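
If switching processors is not an option, the same stripping can probably be done in XSLT 1.0 with a recursive named template, since XPath 1.0 has starts-with, substring and string-length but no replace. The following is only a sketch and has not been tested against your real data. The template name strip-cdata is made up for illustration, and it assumes the escaped CDATA wrappers sit at the very start and very end of the element text, with no whitespace around them.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <!-- Pass the element text to the recursive helper below. -->
  <xsl:template match="Intro | Summary">
    <xsl:call-template name="strip-cdata">
      <xsl:with-param name="text" select="string(.)"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Hypothetical helper: peels off one wrapper per call until none is left. -->
  <xsl:template name="strip-cdata">
    <xsl:param name="text"/>
    <xsl:variable name="len" select="string-length($text)"/>
    <xsl:choose>
      <!-- The leading '<![CDATA[' is 9 characters, so keep from character 10 on. -->
      <xsl:when test="starts-with($text, '&lt;![CDATA[')">
        <xsl:call-template name="strip-cdata">
          <xsl:with-param name="text" select="substring($text, 10)"/>
        </xsl:call-template>
      </xsl:when>
      <!-- The trailing ']]>' is 3 characters, so drop the last 3. -->
      <xsl:when test="$len &gt;= 3 and substring($text, $len - 2) = ']]&gt;'">
        <xsl:call-template name="strip-cdata">
          <xsl:with-param name="text" select="substring($text, 1, $len - 3)"/>
        </xsl:call-template>
      </xsl:when>
      <!-- Nothing left to strip: write the remaining markup unescaped. -->
      <xsl:otherwise>
        <xsl:value-of select="$text" disable-output-escaping="yes"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

Because each recursive call shortens the string, the doubled CDATA wrappers in 'Summary' are handled the same way as the single one in 'Intro'. Note that the built-in templates will still copy the text of the other elements (ID, Title and so on) to the output unless you add templates for them. Also keep in mind that disable-output-escaping only has an effect when the processor itself serializes the result, so it may be ignored if the output goes to a DOM tree instead.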

mattisimo · February 3rd, 2010, 11:01 AM

Great, thanks for your help - I'll investigate these further.