Tokenize

JohnBampton · August 25th, 2009, 04:59 AM

Hello,

I want to use the tokenize function and, for each string in the resulting sequence, create an element. How can I get the value of the matched regular expression to use as an attribute of this element?

Regards,

John

mhkay · August 25th, 2009, 05:02 AM

The tokenize() function doesn't tell you anything about what separators were found, or in what way they matched the regular expression. If you need that information, you need to use xsl:analyze-string.
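
As an illustration of the difference (in Python rather than XSLT, and only as a sketch): a plain regex split discards the separators, much as tokenize() does, while a split on a capturing pattern keeps them, so both the matched and the unmatched parts stay visible. The sample text and the separator pattern below are assumptions made up for this example.

```python
import re

# Hypothetical sample text; a real document would be far longer.
text = "Preamble. Item 1. Business overview. Item 1A. Risk factors."

# Assumed separator pattern: "Item", a number, an optional letter, a period.
SEPARATOR = r"Item\s+\d+[A-Z]?\."

# Plain split: the separators are thrown away, as with tokenize().
pieces = re.split(SEPARATOR, text)

# Split on a capturing group: the separators are kept in the result,
# alternating with the text between them.
parts = re.split("(" + SEPARATOR + ")", text)

# Pair each separator with the text that follows it.
labelled = [(sep, body.strip()) for sep, body in zip(parts[1::2], parts[2::2])]
```

Here `pieces` holds only the text between separators, while `labelled` pairs each "Item ..." heading with the text that follows it.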

JohnBampton · August 25th, 2009, 05:16 AM

But xsl:analyze-string won't do what I am after, as all the matches of the regex go into the matching-substring part. I can create the elements in the non-matching-substring part, but the attributes that are needed will be in the matching-substring part.

So what do I do?

mhkay · August 25th, 2009, 05:27 AM

>So what do I do?

Start by explaining the requirement. What's the input, what's the desired output?

JohnBampton · August 25th, 2009, 05:38 AM

This is the input:

http://www.sec.gov/Archives/edgar/da...66204e10vk.htm

You may have seen this before.

Now my boss told me to parse this thing, replacing every < with [[ and every > with ]], so that there is one wrapper root XML element and the contents are basically all text.

Then the task is to split the document up into sections, one for each item.

Starting with:
<root>
<preamble>text here</preamble>
<tableofcontents>table of contents data here</tableofcontents>
<section label="item1">text here of item one</section>
<section label="item 1A">etc</section>
....
</root>

So I was going to tokenize the text based on the "item number" as the regex, then loop through the resulting sequence and build section elements, but then I can't get the labels.

Is there a better way to do this?

Regards,

John.
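
A minimal sketch of the regex-driven sectioning described in this post, written in Python rather than XSLT and using only the standard library. The function name `build_sections`, the heading pattern, and the sample text are all hypothetical; the table of contents element is left out because the post does not say how it would be recognised.

```python
import re
import xml.etree.ElementTree as ET

def build_sections(text, pattern=r"Item\s+\d+[A-Z]?\."):
    """Wrap flat text in <root>, with one <section label="..."> per heading."""
    root = ET.Element("root")
    matches = list(re.finditer(pattern, text))

    # Everything before the first heading becomes the preamble.
    first = matches[0].start() if matches else len(text)
    ET.SubElement(root, "preamble").text = text[:first].strip()

    for i, m in enumerate(matches):
        # A section runs from the end of its heading to the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        section = ET.SubElement(root, "section", label=m.group(0).rstrip("."))
        section.text = text[m.end():end].strip()
    return root

# Made-up sample input, not taken from the real filing.
sample = "Annual report. Item 1. Business. Item 1A. Risk Factors."
xml_out = ET.tostring(build_sections(sample), encoding="unicode")
```

The matched heading is available at the point where each section element is created, which is what makes the label attribute possible.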

mhkay · August 25th, 2009, 05:51 AM

There's a mythical character in XML folklore known as the Desperate Perl Hacker or DPH. He's known for attempting amazing feats of transformation using regular expressions as his only weapon. I'm not sure I've ever met one before (I thought they were mythical), but this comes close.

I don't think this is the right design approach. You want to create a tree representation of the structure, and then do a transformation. Certainly if you're going down the pure regex route then you're using the wrong language - you'd be much better off with Perl.

JohnBampton · August 25th, 2009, 06:00 AM

You're right on the money. He is a Perl programmer. LOL.

So how would I go about doing the tree structure transformation?

Martin Honnen · August 25th, 2009, 07:34 AM

Tree structure transformation means you parse the HTML document with a parser that lets you create a tree suitable as the input tree for an XSLT transformation. With XSLT 2.0 you can use the HTML parser implementation done by David Carlisle in pure XSLT 2.0, or, if you use the Java version of Saxon, you can plug in the TagSoup parser from http://home.ccil.org/~cowan/XML/tagsoup/. Alternatively, you can use the HTML Tidy tool to convert the HTML to XHTML, and then feed that XHTML document to any XSLT processor.
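
To make "parse the HTML into a tree" concrete, here is a rough sketch in Python using only the standard library. It is a stand-in for the idea, not TagSoup, HTML Tidy, or David Carlisle's parser, and it is far less forgiving than any of them. The class name `TreeParser`, the helper `parse_html`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}

class TreeParser(HTMLParser):
    """Build an ElementTree from loosely structured HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("root")
        self.stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        # Attributes without a value arrive as None; store them as "".
        elem = ET.SubElement(self.stack[-1], tag,
                             {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this name;
        # a stray end tag with no matching open element is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):
            # Text after a child element belongs in that child's tail.
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data

def parse_html(source):
    parser = TreeParser()
    parser.feed(source)
    parser.close()
    return parser.root

# Made-up markup with an unclosed <br> and a stray </i>.
tree = parse_html("<html><body><h2>Item 1.</h2>"
                  "<p>Business<br>overview</p></i></body></html>")
heading = tree.find("html/body/h2").text
```

Once the document is a tree, headings such as the `h2` above can be selected by structure instead of by regex, which is the point of the approach suggested in this reply.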