Handling repeated, missing, and out-of-order elements in a transform

CAMc · October 18th, 2011, 12:26 PM

I can't solve some complications that are coming up in my XML to CSV conversion. I have looked at the existing threads on XML to CSV conversion, on this site and all over the web! But I still need help.

I am brand new to actual programming in any language. However, I know XML and CSS, so I am the logical person to solve this problem at my non-profit (the only person available). I need to take some very messy XML and convert it to a very neat CSV file for upload to a website database. My code so far is so far off the mark that I am not even going to bother posting it. I have throughput, but it only does a quarter of what I need.

I don't really need a finished solution, but I need help with understanding the process I should follow to solve my problem in XSLT. I won't ask you all to code for me, just tell me the elements and template structure I need. I would also love if the community could explain the logic behind the process, so that I can modify it as needed.

I have xml that has records in all orders and numbers:

Code:

    <record-list>
    <record>
	<title>Title One</title>
	<author>Author One</author>
	<subject>Subject One A
		Subject One B
		Subject One C</subject>
	<subject>Subject Two</subject>
	<subject>Subject Three</subject>
	<subject>Subject Four</subject>
    </record>
    <record>
	<subject>Subject Five</subject>
	<title>Title Two</title>
	<useless-element>Extra Stuff One</useless-element>
    </record><record>
	<title>Title Three</title>
	<subject>Subject Six</subject>
	<author/>
    </record>
    </record-list>

So I have multiple numbers of repeated elements, some missing elements, some empty elements, elements out of order, and some elements with extra line breaks.

I need a CSV file which reads as below, or with a different number of subject repeats (see requirements below)

Code:

    "Title","Subject","Subject","Subject","Author"
    "Title One","Subject One A ; Subject One B ; Subject One C","Subject Two","Subject Three","Author One"
    "Title Two","Subject Five","","",""
    "Title Three","Subject Six","","",""

Requirements for the final output

-The number of columns of any repeated elements either needs to match the record with the most repeats of that element, or the program needs to chop off any repeats past a certain number.
-Each new record needs a line break and no other line breaks can exist in the files (only as record delimiters).
-The elements each need to be in the same order for each record.
-Each element text needs quotes around it (to handle intrinsic commas).
-Missing or empty elements need blank, comma surrounded quotes.
-Extra elements can't be sent through to the output

What I have done:

I have figured out how to get rid of the extra line breaks within the elements using the replace function. I can get the quotes, commas, and line breaks in the output with text elements and strip-space.

However, I don't know how to straighten out the order of the elements, handle the element repeats, or put through only some elements while still using the <record> element as the cue for the line-break.

Right now, I just need a solution that works, even if all sorts of manual manipulation or multiple stylesheets are required. I can even do a find and replace in a text editor, as long as the output is good. Please help with an XSLT solution; I don't know any other suitable programming languages (college MATLAB many years ago is not helping).

I think I need to run two transforms. I looked at the XSLT Cookbook, where two transforms are used sequentially for a similar problem. However, this solution is so generalized, I can't understand it. If I can't figure out how it works, I can't modify it for my needs. Sorry, but without a programming background, the explanations on this site, the web, and in the text are challenging at best. However, I think I am presenting a problem with some novel features, compared to others asked on this forum.

Any help, be it non-generalized code, or even just a suggested schematic procedure for multiple runs through my processor would be wonderful. I have been struggling with this for over a week and have made very little progress.

Thanks
CAMc

Martin Honnen · October 18th, 2011, 12:45 PM

As you mention the replace function I assume you want to solve that with XSLT 2.0.
However what I don't understand is why the "record" with "Title One" has four "subject" child elements, yet the CSV only seems to have three "subject columns". What determines the number of "subject" columns you want in the CSV?
Are the elements you want to map to columns in the CSV known i.e. do you simply want a solution for that particular XML document type with "title", "author" and "subject" elements?

CAMc · October 18th, 2011, 01:16 PM

Quote:

Originally Posted by Martin Honnen

As you mention the replace function I assume you want to solve that with XSLT 2.0.
However what I don't understand is why the "record" with "Title One" has four "subject" child elements, yet the CSV only seems to have three "subject columns". What determines the number of "subject" columns you want in the CSV?
Are the elements you want to map to columns in the CSV known i.e. do you simply want a solution for that particular XML document type with "title", "author" and "subject" elements?

Hi Martin,
Thanks for your interest!

Sorry I wasn't clear.

To answer the second question first, Yes, I know all the particular elements I want to map from the document to my CSV. The tag in the original will NOT necessarily be the same as the column header in the CSV. There will be other elements in the document I don't want to map, and some unique elements that will be mapped to repeated columns with the same title, but I know all the tags around the material I want in the original and how they relate to the new column headings.

For the repeated "subject" fields, I either need a stylesheet that truncates repeated elements (for example, only 10 subjects will be allowed in the final CSV) OR one which adjusts the CSV to have enough columns to fit whatever record has the largest number of repeats of an element. I do not know the largest number of repeated "subject" elements in the record set. I have tried the truncating option, b/c I think it might be simpler, but I can't get it to work.

thanks
-CAMc
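
A minimal sketch of the truncating option described above, assuming XSLT 2.0, the sample element names (title, subject, author), and a cap of three subject columns to match the sample CSV. The variable name max-subjects and the " ; " replacement for line breaks inside a value are illustrative choices, not something fixed by the thread.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Assumption: keep at most this many subject columns (e.g. 3 or 10) -->
  <xsl:variable name="max-subjects" select="3"/>

  <xsl:template match="/">
    <!-- Header row: the column titles need not match the source tag names -->
    <xsl:text>"Title"</xsl:text>
    <xsl:for-each select="1 to $max-subjects">,"Subject"</xsl:for-each>
    <xsl:text>,"Author"&#xa;</xsl:text>
    <!-- Only record elements are processed; each one becomes one line -->
    <xsl:apply-templates select="record-list/record"/>
  </xsl:template>

  <xsl:template match="record">
    <xsl:variable name="r" select="."/>
    <!-- Pull the wanted elements by name, in a fixed order.
         string() of a missing element is '', so gaps become empty cells,
         subjects past the cap are never selected, and elements that are
         not named here never reach the output. -->
    <xsl:variable name="cells" select="
      (string($r/title[1]),
       for $i in 1 to $max-subjects return string($r/subject[$i]),
       string($r/author[1]))"/>
    <!-- Quote every cell and turn internal line breaks into ' ; ' -->
    <xsl:value-of separator=","
      select="for $c in $cells
              return concat('&quot;', replace($c, '\s*\n\s*', ' ; '), '&quot;')"/>
    <xsl:text>&#xa;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Because each cell is fetched by name, the order of elements in the source record does not matter. The sketch does not escape double quotes that occur inside a value; if the data can contain them, they would need to be doubled before quoting.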

mhkay · October 18th, 2011, 01:55 PM

I think this is a significantly difficult problem even for an experienced coder. The difficulty, in fact, isn't in writing the code, it's in deciding what the code should do in all possible input situations. Part of that is defining exactly what is the range of inputs that it needs to handle.

Assuming I haven't misunderstood the requirement, here are some suggestions (using XSLT 2.0, which I would strongly recommend).

1. Determine the set of distinct column names

Code:

<xsl:variable name="names" select="distinct-values(/record-list/record/*/name())"/>

2. For each name in the list, replicate it to the maximum number of occurrences in any record

Code:

<xsl:variable name="columns" select="
  for $name in $names,
       $count in max(for $r in /record-list/record return count($r/*[name() = $name])),
       $i in 1 to $count 
  return $name"/>

3. Write the rows

Code:

<xsl:for-each select="/record-list/record">
  <xsl:variable name="this" select="."/>
  <xsl:for-each select="1 to count($columns)">
    <!-- the context item here is the integer column position -->
    <xsl:variable name="i" select="."/>
    <xsl:variable name="name" select="$columns[$i]"/>
    <!-- which occurrence of $name this column stands for (1st, 2nd, ...) -->
    <xsl:variable name="index" select="count(subsequence($columns, 1, $i)[. = $name])"/>
    <xsl:text>"</xsl:text>
    <xsl:value-of select="$this/*[name()=$name][$index]"/>
    <xsl:text>"</xsl:text>
    <xsl:value-of select="if (position()=last()) then '&#xa;' else ','"/>
  </xsl:for-each>
</xsl:for-each>
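
For reference, here is one way the three steps could be assembled into a complete stylesheet. This is a sketch, not a tested solution: it assumes an XSLT 2.0 processor, binds the column position to a variable explicitly, counts children relative to each record, and adds a header row that the steps above do not produce. The variable name records is introduced only for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text" encoding="UTF-8"/>

  <!-- All input records, so later expressions do not depend on the context item -->
  <xsl:variable name="records" select="/record-list/record"/>

  <!-- Step 1: distinct element names (their order is implementation-dependent) -->
  <xsl:variable name="names" select="distinct-values($records/*/name())"/>

  <!-- Step 2: repeat each name as many times as the record with the most
       occurrences of that element requires -->
  <xsl:variable name="columns" select="
    for $name in $names,
        $count in max(for $r in $records return count($r/*[name() = $name])),
        $i in 1 to $count
    return $name"/>

  <xsl:template match="/">
    <!-- Header row built from the column names themselves -->
    <xsl:value-of separator=","
      select="for $c in $columns return concat('&quot;', $c, '&quot;')"/>
    <xsl:text>&#xa;</xsl:text>

    <!-- Step 3: one line per record -->
    <xsl:for-each select="$records">
      <xsl:variable name="this" select="."/>
      <xsl:for-each select="1 to count($columns)">
        <xsl:variable name="i" select="."/>
        <xsl:variable name="name" select="$columns[$i]"/>
        <!-- occurrence number of $name among the first $i columns -->
        <xsl:variable name="index"
          select="count(subsequence($columns, 1, $i)[. = $name])"/>
        <xsl:text>"</xsl:text>
        <xsl:value-of select="$this/*[name() = $name][$index]"/>
        <xsl:text>"</xsl:text>
        <xsl:value-of select="if ($i = count($columns)) then '&#xa;' else ','"/>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

As written, every element name found in the input becomes a column, so unwanted elements such as useless-element would have to be filtered out of $names, and line breaks inside a value are passed through unchanged.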

CAMc · October 18th, 2011, 02:32 PM

Thanks Michael,

I think I understand the structure of what you did. I will give it a try, play around with things and get back to you all with a report
-Much appreciated,
Christine