Converting Wiki-like text into XML

igraham · February 28th, 2008, 02:56 AM

In an XSLT 2.0 transform I'm trying to convert some wiki-like text, which may embed some matched XML tags, into XML as in the following example:

"

This should be a paragraph that includes tagged
stuff like <mytag>this</mytag> and <mytag>that</mytag>
for which the initial line breaks are irrelevant.

 but this is pre-formatted text
 that includes <mytag>tags</mytag>
 and that lines up
 exactly the way I want

"

The result I want from processing should be:

<p>This should be a paragraph that includes tagged
stuff like <mytag>this</mytag> and <mytag>that</mytag>
for which the initial line breaks are irrelevant.</p>

<pre>but this is pre-formatted text
 that includes <mytag>tags</mytag>
 and that lines up
exactly the way I want</pre>

Right now I'm almost succeeding, but using the ugliest approach you can imagine. You will notice that the <pre> section's indent was reduced by the minimum indent of its original lines. I couldn't figure out how to do this with complex content, so I did one pass using one template mode to convert the tags into plain text, with "<mytag>" becoming "~L~mytag~R~" and "</mytag>" becoming "~L~/mytag~R~". Once I did that I was able to process the text to create the <p>s and <pre>s exactly as I needed.
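For readers following along, the flattening pass described above might look roughly like the sketch below. It is not the poster's actual code: the wrapper element name `wikitext` and the mode name `flatten` are assumptions made for illustration, and attributes are simply dropped.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- "wikitext" is an assumed wrapper element; the post does not name it -->
  <xsl:template match="wikitext">
    <xsl:variable name="flat">
      <xsl:apply-templates mode="flatten"/>
    </xsl:variable>
    <!-- string($flat) is plain text that string functions can now process -->
    <xsl:value-of select="string($flat)"/>
  </xsl:template>

  <!-- Any embedded element becomes ~L~name~R~ content ~L~/name~R~ -->
  <xsl:template match="*" mode="flatten">
    <xsl:value-of select="concat('~L~', name(), '~R~')"/>
    <xsl:apply-templates mode="flatten"/>
    <xsl:value-of select="concat('~L~/', name(), '~R~')"/>
  </xsl:template>

  <!-- Text nodes are copied through by the built-in template rule,
       which applies in every mode -->
</xsl:stylesheet>
```

Because the markers are ordinary characters, any source text that already contains "~L~" or "~R~" would be misread on the way back, which is the main weakness of this approach.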

The last part that I'm having trouble with is using a named template to convert the text in the paragraphs back into XML. I'm using something like that below, although mine is hairier because I'm supporting several tags and attributes:

<xsl:analyze-string select="$text" regex="~L~(\w*)~R~(.*)~L~/\w*~R~">
  <xsl:matching-substring>
    <xsl:element name="mytag">
      <xsl:value-of select="regex-group(2)"/>
    </xsl:element>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
    <xsl:value-of select="."/>
  </xsl:non-matching-substring>
</xsl:analyze-string>

I'm having two problems: one is that I can't parameterize an element name so that I can support arbitrary tags, and the other is that the expression for regex-group(2) is greedy and swallows too much if I have two tagged elements on a single line.

For the first issue, in my case I only need now to support two different allowed tags with no more than one attribute each, so I've been able to explicitly code all the cases. I suppose I could get around it by not generating XML at all, but instead generating text with actual '<' and '>' characters and then reprocessing that text.

For the second issue, I need something fancier than (.*) for the tag content so that it can't include any string that contains "~[LR]~". I'm not an extensive user of regular expressions - is there anyone who could tell me how to express that?

But apart from the smaller question is the big one: how should I have approached this whole problem in the first place? I wouldn't choose to be doing this with XSLT except that the text I'm dealing with is embedded in a larger XML document that I'm already processing with XSLT.

Thanks,
Ian

samjudson · February 28th, 2008, 04:38 AM

To make a quantifier in a regular expression non-greedy, append a '?' to the end of the quantifier:

.* = greedy
.*? = non-greedy
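A small sketch of the difference, using the XPath 2.0 replace() function on a made-up string that holds two marked-up spans on one line (the tag names `a` and `b` are purely illustrative):

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:variable name="s"
        select="'~L~a~R~x~L~/a~R~ and ~L~b~R~y~L~/b~R~'"/>
    <!-- Greedy: (.*) runs on to the LAST closing marker on the line -->
    <xsl:value-of select="replace($s, '~L~\w+~R~(.*)~L~/\w+~R~', '[$1]')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Non-greedy: (.*?) stops at the FIRST closing marker it can -->
    <xsl:value-of select="replace($s, '~L~\w+~R~(.*?)~L~/\w+~R~', '[$1]')"/>
  </xsl:template>
</xsl:stylesheet>
```

The greedy form should give a single bracketed span, `[x~L~/a~R~ and ~L~b~R~y]`, while the non-greedy form should give `[x] and [y]`.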

As for your first problem - are you aware of this method:

Code:

<xsl:element name="{regex-group(1)}">
  ...

As to whether there is an easier way of doing the whole thing - I'm not entirely sure.

/- Sam Judson : Wrox Technical Editor -/

mhkay · February 28th, 2008, 04:50 AM

I've seen this kind of problem a number of times; there seem to be two separate ways to tackle it and no clear consensus on which is better. One is your approach: reduce everything to text, then put the markup back by analyze-string processing. The other approach is to add element structure to each text node, then regroup the hierarchy using for-each-group and similar transformations. I tend to favour the latter approach, but that doesn't mean yours is wrong.

The problems you're having seem to be small ones so I would stick with the general approach:

>I'm having two problems: one is that I can't parameterize an element name so that I can support arbitrary tags,

Just use <xsl:element name="{...some expression...}">

> and the other is that the expression for regex-group(2) is greedy and swallows too much if I have two tagged elements on a single line.

Just use non-greedy quantifiers: .*? instead of .*
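Putting both suggestions together, the original snippet could be reworked along the lines below. This is a sketch only: the template name `restore-tags` and the sample string are assumptions, attributes are not handled, and `.` does not match a newline unless flags="s" is added.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical named template: turns ~L~name~R~ content ~L~/name~R~
       back into an element called "name" -->
  <xsl:template name="restore-tags">
    <xsl:param name="text"/>
    <xsl:analyze-string select="$text"
        regex="~L~(\w+)~R~(.*?)~L~/\w+~R~">
      <xsl:matching-substring>
        <!-- Attribute value template: the element name comes from group 1 -->
        <xsl:element name="{regex-group(1)}">
          <xsl:value-of select="regex-group(2)"/>
        </xsl:element>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Demonstration call on a made-up string; any input document will do -->
  <xsl:template match="/">
    <p>
      <xsl:call-template name="restore-tags">
        <xsl:with-param name="text"
            select="'stuff like ~L~mytag~R~this~L~/mytag~R~ and ~L~mytag~R~that~L~/mytag~R~'"/>
      </xsl:call-template>
    </p>
  </xsl:template>
</xsl:stylesheet>
```

With the non-greedy group, the two spans on the same line should come back as two separate mytag elements rather than one element swallowing everything between the first opening marker and the last closing one.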

Michael Kay
http://www.saxonica.com/
Author, XSLT Programmer's Reference and XPath 2.0 Programmer's Reference

igraham · February 28th, 2008, 03:40 PM

Thanks so much, both of you!

Michael, I'll take your advice and stick with my current approach on this one for now, but I'd still like to improve my understanding of grouping, especially in the context of processing marked up text, as my solution is vulnerable to sequences of characters in the source text that happen to match those of my text substitutions.

It seems I would first need to group the mixed content back into representations of the original source lines - I suppose I would be able to do this by grouping all the items between \n characters, using group-ending-with="ends-with(text(),'\n')"? Is that the right idea?

mhkay · February 28th, 2008, 04:04 PM

group-ending-with must be a pattern, not an expression.
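As an illustration of the distinction, the condition from the question can be recast as a match pattern by moving the test into a predicate on a node test. The sketch below assumes a wrapper element named `wikitext` and an output element named `line`, neither of which comes from the thread. Note also that '\n' inside an XPath string literal is a backslash followed by "n", not a newline, so the character reference &#10; is used instead.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- "wikitext" is an assumed element holding the mixed content -->
  <xsl:template match="wikitext">
    <xsl:copy>
      <!-- The attribute holds a match pattern (node test plus predicate),
           not a boolean expression such as ends-with(text(), ...) -->
      <xsl:for-each-group select="node()"
          group-ending-with="text()[ends-with(., '&#10;')]">
        <line>
          <xsl:copy-of select="current-group()"/>
        </line>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

This only closes a group where a text node happens to end with a newline. Newlines in the middle of a text node are not seen by the pattern, so on its own this would not reconstruct the source lines.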

To be honest, I'm not really sure what rules you are applying. In your example all the newlines appear in top-level text nodes, but presumably one can't rely on that. You could, for example, put the whole tree through a modified-copy process in which any "\n\n" sequence within any text node is replaced by an empty <break/> element; you could then process the top-level children using <xsl:for-each-group group-starting-with="break"> to change the structure from

<break/>mixedcontent<break/>mixedcontent<break/>

to

<p>mixedcontent</p><p>mixedcontent</p>

You might have to repeat the process recursively if <break/> elements have been inserted as children or descendants of elements in the mixedcontent.

Then you can presumably detect which elements should be turned into <pre> elements based on the fact that they contain newlines followed by tabs, or some such rule.
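A rough sketch of the two passes described above, under several assumptions: the wrapper element is called `wikitext`, the mode is called `mark`, a paragraph break is exactly "\n\n", and only top-level <break/> elements are regrouped (the recursive case and the <pre> detection are left out).

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Pass 1: modified copy of elements and their attributes -->
  <xsl:template match="*" mode="mark">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates mode="mark"/>
    </xsl:copy>
  </xsl:template>

  <!-- Pass 1: each blank-line sequence in a text node becomes <break/> -->
  <xsl:template match="text()" mode="mark">
    <xsl:analyze-string select="." regex="\n\n">
      <xsl:matching-substring>
        <break/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Pass 2: regroup the top-level children around the breaks.
       "wikitext" is an assumed wrapper element name. -->
  <xsl:template match="wikitext">
    <xsl:variable name="marked">
      <xsl:apply-templates mode="mark"/>
    </xsl:variable>
    <xsl:copy>
      <xsl:for-each-group select="$marked/node()"
          group-starting-with="break">
        <xsl:variable name="content"
            select="current-group()[not(self::break)]"/>
        <!-- Skip groups that hold nothing but whitespace -->
        <xsl:if test="normalize-space(string-join($content, '')) != ''">
          <p>
            <xsl:copy-of select="$content"/>
          </p>
        </xsl:if>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Because the embedded elements are never turned into text, this route avoids the risk of source text colliding with substitution markers.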

igraham · February 28th, 2008, 05:14 PM

Yow, I hadn't thought about the issues of \n within the children. But I get the idea. The initial <break/> insertion would be just a bit more complicated because replacing \n\n isn't quite sufficient - the paragraph breaks for Wiki occur just based on whether there is indentation.

Converting Wiki to xhtml elegantly would make a wonderful example in the next edition of your excellent XSLT 2.0 book. Just the paragraph stuff like I've described, leaving it to enthusiastic readers to come up with all the list support and more.

Thanks again for all your help!

Ian

samjudson · February 28th, 2008, 06:09 PM

Quote:

Converting Wiki to xhtml elegantly would make a wonderful example in the next edition of your excellent XSLT 2.0 book.

... too late, he's already written it. Due out beginning of May.

http://www.amazon.co.uk/dp/047019274...ackylabsnet-21

mhkay · February 28th, 2008, 06:18 PM

>too late, he's already written it

Actually, right now I'm working silly hours proof-reading the appendices, but it amounts to the same thing.
