Unicode translation using [xsl:output-character]

ROCXY · May 12th, 2006, 12:38 AM

Hi All,

I have a problem converting XML to XML with entities.

Following is my XML.

<para>This is MAYA xml α δ <iemph>α δ</iemph>.</para>

Following is my expected XML.

<para>This is MAYA xml &agr; &dgr; α δ.</para>

Specifically, I would like to convert Unicode characters that are NOT inside <iemph> to &agr; &dgr;, and output Unicode characters that are inside <iemph> as α δ. I would like to do this using "xsl:output-character".

Any help would be appreciated.
Thanks,
ROCXY

mhkay · May 12th, 2006, 03:35 AM

You can't achieve this with XSLT 2.0 character maps, as they are not sensitive to context. You could do it with disable-output-escaping provided your XSLT processor supports it, or you could write your own serialization post-processing code.

Michael Kay
http://www.saxonica.com/
Author, XSLT Programmer's Reference and XPath 2.0 Programmer's Reference
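
The "serialization post-processing" option mentioned in the reply above could look roughly like the following. This is only a minimal Python sketch, not something posted in the thread. It assumes the stylesheet output is available as a string, that <iemph> elements are never nested and carry no attributes, and that the <iemph> wrapper is dropped as in the expected output. The names ENTITY_MAP, to_entities and postprocess are hypothetical.

```python
import re

# Hypothetical mapping from Greek characters to the entity names used in the
# question (&agr; for alpha, &dgr; for delta). Extend as needed.
ENTITY_MAP = {"\u03b1": "&agr;", "\u03b4": "&dgr;"}

# Assumption: <iemph> has no attributes and is never nested.
IEMPH = re.compile(r"<iemph>(.*?)</iemph>", re.DOTALL)


def to_entities(text):
    """Replace every mapped character with its entity reference."""
    return "".join(ENTITY_MAP.get(ch, ch) for ch in text)


def postprocess(serialized):
    """Map characters to entities only outside <iemph>; unwrap <iemph>."""
    out = []
    pos = 0
    for match in IEMPH.finditer(serialized):
        # Text before this <iemph>: apply the entity mapping.
        out.append(to_entities(serialized[pos:match.start()]))
        # Text inside <iemph>: keep the characters, drop the wrapper tags.
        out.append(match.group(1))
        pos = match.end()
    # Remaining text after the last <iemph>.
    out.append(to_entities(serialized[pos:]))
    return "".join(out)


if __name__ == "__main__":
    sample = "<para>This is MAYA xml \u03b1 \u03b4 <iemph>\u03b1 \u03b4</iemph>.</para>"
    print(postprocess(sample))
```

A regular expression is enough for this flat case, but it is fragile on real XML (nesting, attributes, CDATA sections), so a parser-based pass would be safer for anything beyond a quick fix.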

rwilkerson · May 15th, 2006, 11:13 AM

I'd like to piggyback on this question with a variation of it. Like ROCXY, I'm converting XML to XML and having special character issues. I have the following original XML:

<title>François Hollande : "La droite a pris l'Etat en otage"</title>

When converted, though, I get this:

<title><![CDATA[Fran?ois Hollande : "La droite a pris l'Etat en otage"]]></title>

Within the conversion, I'm specifying cdata elements and encoding in the xsl:output tag via the cdata-section-elements and encoding attributes, respectively. I'm specifying UTF-8 encoding.

I should note that the source XML specifies an encoding value of "iso-8859-1". Since the original XML loads and displays fine, I'm not sure what significance, if any, this has, but it seems worth mentioning.

Surely there's a way to do this...right? What am I missing?

Any guidance would be /greatly/ appreciated.

Thanks.

Rob Wilkerson

mhkay · May 15th, 2006, 11:43 AM

What XSLT processor are you using?

The serializer should never output numeric character references (like & #65535; [I think ampersands are getting lost in this forum]) within a CDATA section, because XML doesn't recognize them there. This looks like a bug in your processor.

Secondly, Unicode 65533 is a substitute character for use when a character is found that can't be output in the selected encoding. If the selected encoding is UTF-8, I can't see any reason why it would be used.

The first thing to check is that your input XML is correctly encoded. What is the actual encoding of the c-with-cedilla (use a hex editor to find out), and what is the encoding specified in the XML declaration of the input file? Do they match?

Michael Kay
http://www.saxonica.com/
Author, XSLT Programmer's Reference and XPath 2.0 Programmer's Reference
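
One way to see how a mismatch between the declared and the actual encoding produces the substitute character is the small Python sketch below. It is an illustration under assumptions, not the processor's actual behaviour: it takes the title from the earlier post, encodes it as ISO-8859-1, and then decodes those bytes as if they were UTF-8. The helper name try_decodings is hypothetical.

```python
def try_decodings(raw):
    """Return the decoded text per encoding, or None where decoding fails."""
    results = {}
    for encoding in ("utf-8", "iso-8859-1"):
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None
    return results


# The c-with-cedilla is the single byte 0xE7 in ISO-8859-1.
latin1_bytes = "Fran\u00e7ois".encode("iso-8859-1")

# Strict decoding shows which encoding the bytes are actually valid in.
strict = try_decodings(latin1_bytes)

# A lenient UTF-8 decode substitutes U+FFFD (decimal 65533) for the bad byte.
lenient = latin1_bytes.decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(strict)
    print(lenient, [hex(ord(ch)) for ch in lenient])
```

If the bytes only decode cleanly under the encoding named in the XML declaration, the input is probably fine and the problem lies in how it is being read or written downstream.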

rwilkerson · May 15th, 2006, 11:58 AM

Hey Michael -

I noticed the same thing when I posted and updated my original post. What it actually does is replace the Unicode character in the original XML with the replacement character. The forums may have made that substitution.

My CDATA block actually contains the "unfound replacement character" (for want of a better term). When rendered, either in a database or a web page, the character is translated as the question mark or as the square character.

I also updated the original post with the original XML encoding. It's iso-8859-1.

I'll check the hex encoding now...

Thanks for the quick response.

Rob Wilkerson

rwilkerson · May 15th, 2006, 12:19 PM

Well, the original XML I was using is gone, but the updated file (it's an RSS feed) contains another entry that includes the "c-with-cedilla" character so I tested that in the hex editor.

Original XML:
<description>Le juge a été longuement entendu, lundi, par sa hiérarchie concernant ses liens avec Jean-Louis Gergorin, soupçonné d&#38;#39;être le &#38;#34;corbeau&#38;#34;. </description>

The c-with-cedilla appears to be rendering in the hex editor as E7. The accented e at the end of the same word is encoded as E9 and also won't render properly.

Thanks again for your help.

Rob Wilkerson
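
For reference, the two byte values seen in the hex editor can be interpreted with a few lines of Python. This is a hedged sketch: it only shows that E7 and E9 are the ISO-8859-1 codes for c-with-cedilla and e-acute, and what the same characters look like once transcoded to UTF-8. The function name transcode_latin1_to_utf8 is hypothetical.

```python
def transcode_latin1_to_utf8(raw):
    """Decode bytes as ISO-8859-1 and re-encode the text as UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")


# The two bytes reported from the hex editor.
observed = bytes.fromhex("E7 E9")

# Interpreted as ISO-8859-1 they are the expected French characters.
as_text = observed.decode("iso-8859-1")

# In UTF-8 each of these characters takes two bytes instead of one.
as_utf8 = transcode_latin1_to_utf8(observed)

if __name__ == "__main__":
    print(as_text)
    print(as_utf8.hex(" "))
```

Single bytes E7 and E9 are what a file genuinely encoded as ISO-8859-1 would contain, which matches the declared encoding of the feed.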