Wrox Programmer Forums
XSLT General questions and answers about XSLT. For issues strictly specific to the book XSLT 1.1 Programmers Reference, please post to that forum instead.
Welcome to the p2p.wrox.com Forums.

You are currently viewing the XSLT section of the Wrox Programmer to Programmer discussions. This is a community of software programmers and website developers including Wrox book authors and readers. New member registration was closed in 2019. New posts were shut off and the site was archived into this static format as of October 1, 2020. If you require technical support for a Wrox book please contact http://hub.wiley.com
 
February 28th, 2008, 02:56 AM
Authorized User
 
Join Date: Jul 2007
Posts: 14
Thanks: 0
Thanked 0 Times in 0 Posts
Converting Wiki-like text into XML

In an XSLT 2.0 transform I'm trying to convert some wiki-like text, which may embed some matched XML tags, into XML as in the following example:

"

This should be a paragraph that includes tagged
stuff like <mytag>this</mytag> and <mytag>that</mytag>
for which the initial line breaks are irrelevant.

     but this is pre-formatted text
         that includes <mytag>tags</mytag>
         and that lines up
     exactly the way I want


"

The result I want from processing should be:

<p>This should be a paragraph that includes tagged
stuff like <mytag>this</mytag> and <mytag>that</mytag>
for which the initial line breaks are irrelevant.</p>

<pre>but this is pre-formatted text
    that includes <mytag>tags</mytag>
    and that lines up
exactly the way I want</pre>


Right now I'm almost succeeding, but using the ugliest approach you can imagine. You will notice that the <pre> section's indent was reduced by the minimum indent of its original lines. I couldn't figure out how to do this with complex content, so I did one pass using one template mode to convert the tags into plain text, something like "<mytag>" becoming "~L~mytag~R~" and "</mytag>" becoming "~L~/mytag~R~". Once I did that I was able to process the text to create the <p>s and <pre>s exactly as I needed.
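
For readers following along, the first pass described above might look roughly like the sketch below. It is only an illustration under assumptions: the wrapper element name "wiki" and the mode name "flatten" are hypothetical, and attributes are ignored entirely.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- "wiki" is a hypothetical element holding the wiki-like mixed content -->
  <xsl:template match="wiki">
    <xsl:apply-templates mode="flatten"/>
  </xsl:template>

  <!-- Replace each embedded element with ~L~name~R~ ... ~L~/name~R~ text tokens.
       Attributes are not carried over in this sketch. -->
  <xsl:template match="*" mode="flatten">
    <xsl:value-of select="concat('~L~', name(), '~R~')"/>
    <xsl:apply-templates mode="flatten"/>
    <xsl:value-of select="concat('~L~/', name(), '~R~')"/>
  </xsl:template>

  <!-- Text nodes are copied by the built-in template rule, which applies in every mode. -->
</xsl:stylesheet>
```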

The last part that I'm having trouble with is using a named template to convert the text in the paragraphs back into XML. I'm using something like that below, although mine is hairier because I'm supporting several tags and attributes:

<xsl:analyze-string select="$text" regex="~L~(\w*)~R~(.*)~L~/\w*~R~">
    <xsl:matching-substring>
     <xsl:element name="mytag">
        <xsl:value-of select="regex-group(2)"/>
     </xsl:element>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
     <xsl:value-of select="."/>
    </xsl:non-matching-substring>
</xsl:analyze-string>


I'm having two problems: one is that I can't parameterize an element name so that I can support arbitrary tags, and the other is that the expression for regex-group(2) is greedy and swallows too much if I have two tagged elements on a single line.

For the first issue, in my case I only need now to support two different allowed tags with no more than one attribute each, so I've been able to explicitly code all the cases. I suppose I could get around it by not generating XML at all, but instead generating text with actual '<' and '>' characters and then reprocessing that text.

For the second issue, I need something fancier than (.*) for the tag content so that it can't include any string that contains "~[LR]~". I'm not an extensive user of regular expressions - is there anyone who could tell me how to express that?

But apart from the smaller question is the big one: how should I have approached this whole problem in the first place? I wouldn't choose to be doing this with XSLT except that the text I'm dealing with is embedded in a larger XML document that I'm already processing with XSLT.

Thanks,
Ian
 
February 28th, 2008, 04:38 AM
samjudson
Friend of Wrox
 
Join Date: Aug 2007
Posts: 2,128
Thanks: 1
Thanked 189 Times in 188 Posts

To make a regular expression quantifier non-greedy, append a '?' to the end of the quantifier:

.* = greedy
.*? = non-greedy

As for your first problem - are you aware of this method:

Code:
<xsl:element name="{regex-group(1)}">
  ...
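
Putting the two suggestions together, a minimal sketch could look like the following. The template name "restore-tags" and the sample string are made up for illustration, and the back-reference \1 (which XPath 2.0 regular expressions allow) is an addition of this sketch so that the closing token has to name the same tag as the opening one. It does not handle attributes or nested tags.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- Hypothetical named template: turns ~L~name~R~ ... ~L~/name~R~ back into elements -->
  <xsl:template name="restore-tags">
    <xsl:param name="text"/>
    <!-- .*? is non-greedy, so each match stops at the first closing token.
         The regex attribute is an attribute value template, so any literal
         curly braces in a regex would have to be doubled. -->
    <xsl:analyze-string select="$text"
        regex="~L~(\w+)~R~(.*?)~L~/\1~R~">
      <xsl:matching-substring>
        <!-- The element name is computed with an attribute value template -->
        <xsl:element name="{regex-group(1)}">
          <xsl:value-of select="regex-group(2)"/>
        </xsl:element>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Usage example with a made-up input string containing two tagged runs on one line -->
  <xsl:template match="/">
    <p>
      <xsl:call-template name="restore-tags">
        <xsl:with-param name="text"
            select="'stuff like ~L~mytag~R~this~L~/mytag~R~ and ~L~mytag~R~that~L~/mytag~R~'"/>
      </xsl:call-template>
    </p>
  </xsl:template>
</xsl:stylesheet>
```

With the greedy (.*) the two tagged runs on that line would be swallowed as one match; with (.*?) each one should become its own element.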


As to whether there is an easier way of doing the whole thing - I'm not entirely sure.

/- Sam Judson : Wrox Technical Editor -/
 
February 28th, 2008, 04:50 AM
mhkay
Wrox Author
 
Join Date: Apr 2004
Posts: 4,962
Thanks: 0
Thanked 292 Times in 287 Posts

I've seen this kind of problem a number of times. There seem to be two separate ways to tackle it and no clear consensus on which is better. One is your approach: reduce everything to text, then put the markup back by analyze-string processing. The other approach is to add element structure to each text node, then regroup the hierarchy using for-each-group and similar transformations. I tend to favour the latter approach, but that doesn't mean yours is wrong.

The problems you're having seem to be small ones so I would stick with the general approach:

>I'm having two problems: one is that I can't parameterize an element name so that I can support arbitrary tags,

Just use <xsl:element name="{...some expression...}">

> and the other is that the expression for regex-group(2) is greedy and swallows too much if I have two tagged elements on a single line.

Just use non-greedy quantifiers: .*? instead of .*





Michael Kay
http://www.saxonica.com/
Author, XSLT Programmer's Reference and XPath 2.0 Programmer's Reference
 
February 28th, 2008, 03:40 PM
Authorized User
 
Join Date: Jul 2007
Posts: 14
Thanks: 0
Thanked 0 Times in 0 Posts

Thanks so much, both of you!

Michael, I'll take your advice and stick with my current approach on this one for now, but I'd still like to improve my understanding of grouping, especially in the context of processing marked up text, as my solution is vulnerable to sequences of characters in the source text that happen to match those of my text substitutions.

It seems I would first need to group the mixed content back into representations of the original source lines - I suppose I would be able to do this by grouping all the items between \n characters, using group-ending-with="ends-with(text(),'\n')"? Is that the right idea?

 
February 28th, 2008, 04:04 PM
mhkay
Wrox Author
 
Join Date: Apr 2004
Posts: 4,962
Thanks: 0
Thanked 292 Times in 287 Posts

group-ending-with must be a pattern, not an expression.
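
As a rough illustration of the difference, the condition can be written as a pattern by moving the test into a predicate on a node test. The element names "wiki" and "line" below are hypothetical. Note also that '\n' inside an XPath string literal is a backslash followed by the letter n, not a newline, so the sketch uses the character reference &#10; instead.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- "wiki" and "line" are hypothetical element names used only for this sketch -->
  <xsl:template match="wiki">
    <!-- The pattern matches text nodes whose value ends with a newline;
         each such node closes the current group. -->
    <xsl:for-each-group select="node()"
        group-ending-with="text()[ends-with(., '&#10;')]">
      <line><xsl:copy-of select="current-group()"/></line>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>
```

This only ends a group where a text node happens to finish with a newline; a newline in the middle of a text node does not split anything, so it is not a complete answer on its own.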

To be honest, I'm not really sure what rules you are applying. In your example all the newlines appear in top-level text nodes, but presumably one can't rely on that. You could, for example, put the whole tree through a modified-copy process in which any "\n\n" sequence within any text node is replaced by an empty <break/> element; you could then process the top-level children using <xsl:for-each-group group-starting-with="break"> to change the structure from

<break/>mixedcontent<break/>mixedcontent<break/>

to

<p>mixedcontent</p><p>mixedcontent</p>

You might have to repeat the process recursively if <break/> elements have been inserted as children or descendants of elements in the mixedcontent.

Then you can presumably detect which <p> elements should be turned into <pre> elements based on the fact that they contain newlines followed by tabs, or some such rule.
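
A minimal sketch of the two passes described above, under some assumptions: the wrapper element "wiki" and the mode name "mark" are hypothetical, only top-level <break/> elements are regrouped (the recursive case is left out), and the <p> versus <pre> decision is not attempted.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Pass 1: modified identity copy -->
  <xsl:template match="@*|node()" mode="mark">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="mark"/>
    </xsl:copy>
  </xsl:template>

  <!-- In every text node, replace each run of two or more newlines with <break/> -->
  <xsl:template match="text()" mode="mark">
    <xsl:analyze-string select="." regex="\n\n+">
      <xsl:matching-substring><break/></xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Pass 2: regroup the top-level children; each <break/> starts a new group -->
  <xsl:template match="wiki">
    <xsl:variable name="marked">
      <xsl:apply-templates select="node()" mode="mark"/>
    </xsl:variable>
    <xsl:for-each-group select="$marked/node()" group-starting-with="break">
      <!-- Drop the <break/> marker itself and skip groups that are only whitespace -->
      <xsl:variable name="content" select="current-group()[not(self::break)]"/>
      <xsl:if test="normalize-space(string-join($content, '')) != ''">
        <p><xsl:copy-of select="$content"/></p>
      </xsl:if>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>
```

Because the markers are real elements rather than text tokens, this route cannot be confused by character sequences in the source text that happen to look like markers.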

Michael Kay
http://www.saxonica.com/
Author, XSLT Programmer's Reference and XPath 2.0 Programmer's Reference
 
February 28th, 2008, 05:14 PM
Authorized User
 
Join Date: Jul 2007
Posts: 14
Thanks: 0
Thanked 0 Times in 0 Posts

Yow, I hadn't thought about the issues of \n within the children. But I get the idea. The initial <break/> insertion would be just a bit more complicated because replacing \n\n isn't quite sufficient - the paragraph breaks for Wiki occur just based on whether there is indentation.

Converting Wiki to xhtml elegantly would make a wonderful example in the next edition of your excellent XSLT 2.0 book. Just the paragraph stuff like I've described, leaving it to enthusiastic readers to come up with all the list support and more.

Thanks again for all your help!

Ian

 
February 28th, 2008, 06:09 PM
samjudson
Friend of Wrox
 
Join Date: Aug 2007
Posts: 2,128
Thanks: 1
Thanked 189 Times in 188 Posts

Quote:
Converting Wiki to xhtml elegantly would make a wonderful example in the next edition of your excellent XSLT 2.0 book.
... too late, he's already written it. Due out beginning of May.

http://www.amazon.co.uk/dp/047019274...ackylabsnet-21

/- Sam Judson : Wrox Technical Editor -/
 
February 28th, 2008, 06:18 PM
mhkay
Wrox Author
 
Join Date: Apr 2004
Posts: 4,962
Thanks: 0
Thanked 292 Times in 287 Posts

>too late, he's already written it

Actually, right now I'm working silly hours proof-reading the appendices, but it amounts to the same thing.

Michael Kay
http://www.saxonica.com/
Author, XSLT Programmer's Reference and XPath 2.0 Programmer's Reference





Copyright (c) 2020 John Wiley & Sons, Inc.