This is the input:
http://www.sec.gov/Archives/edgar/da...66204e10vk.htm
You may have seen this before.
My boss has asked me to parse this document, replacing every < with [[ and every > with ]], so that there is a single wrapper root XML element whose contents are essentially all text.
The task is then to split the document up into one section per item.
The result should look something like this:
<root>
<preamble>text here</preamble>
<tableofcontents>table of contents data here</tableofcontents>
<section label="item1">text here of item one</section>
<section label="item 1A">etc</section>
....
</root>
My plan was to tokenize the text using the "Item N" heading as the regex, then loop through the resulting sequence and build the section elements. The problem is that the headings are consumed as separators, so I can't get the values for the label attributes.
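For illustration only, here is a minimal sketch of how the tokenizing idea could keep the labels. It is written in Python because the post does not name a language. It relies on the fact that `re.split` returns the text of a capturing group along with the pieces, so each heading comes back next to its body. The heading pattern, the function names, and the assumption that every "Item N" heading starts a line are all hypothetical. The table of contents is not handled separately, so "Item" lines inside it would also start sections.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed heading shape: "Item 1", "ITEM 1A.", "Item 7:" at the start of a line.
# The capturing group makes re.split return the heading text as well.
ITEM_RE = re.compile(r"^\s*(item\s+\d+[A-Za-z]?)\b[.:]?",
                     re.IGNORECASE | re.MULTILINE)


def neutralize(text):
    """Replace < and > with [[ and ]] as described above, then escape '&'."""
    return escape(text.replace("<", "[[").replace(">", "]]"))


def split_into_sections(text):
    """Return (preamble, [(label, body), ...]) for the given plain text."""
    parts = ITEM_RE.split(text)
    # parts looks like [preamble, label1, body1, label2, body2, ...]
    preamble = parts[0]
    labels = [" ".join(label.split()) for label in parts[1::2]]
    return preamble, list(zip(labels, parts[2::2]))


def to_xml(text):
    """Build the <root> wrapper with one <section> per detected item."""
    preamble, sections = split_into_sections(text)
    out = ["<root>", "<preamble>%s</preamble>" % neutralize(preamble.strip())]
    for label, body in sections:
        out.append("<section label=%s>%s</section>"
                   % (quoteattr(label), neutralize(body.strip())))
    out.append("</root>")
    return "\n".join(out)


sample = ("Annual report\n"
          "Item 1. Business\nWe make things.\n"
          "Item 1A. Risk Factors\nRisks & <b>more</b>.\n")
print(to_xml(sample))
```

The same idea should carry over to other regex engines that can return the matched separators, but that is an assumption to check against whatever tool is actually in use.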
Is there a better way to do this?
Regards,
John.