Hi,
I have a requirement to extract the contents, including attributes, of 3 separate elements in approximately 3000 SGML files.
To explain further, here is an example file:
Code:
<!DOCTYPE DMODULE PUBLIC "-//AECMA Change 6 Legacy//DTD Air Vehicle Engines Equipment Description 19981030//EN">
<dmodule><idstatus>
<dmaddres>
<dmc><avee><modelic>BR84121AXXXXX</modelic><sdc>1AX</sdc><chapnum>AG3</chapnum>
<section>0</section><subsect>0</subsect><subject>00</subject><discode>01</discode>
<discodev>00</discodev><incode>G10</incode><incodev>A</incodev><itemloc>A
</itemloc></avee></dmc>
<dmtitle><techname> SUMMARY OF DATA AND LIST OF REFERENCES </techname>
<infoname>Fig 1 Sonar Type 2093 - Location of Major Units</infoname>
</dmtitle>
<issno issno="004" type="changed"></dmaddres>
<status>
<issdate year="2008" month="03" day="22">
<security class="2">
<rpc> </rpc>
<orig> </orig>
<applic></applic>
<techstd>
<autandtp>
<authblk>Cat 1A Chap 1</authblk>
<tpbase>BR 8412(1A)</tpbase>
</autandtp>
<authex></authex>
<notes></notes>
</techstd>
<qa>
<firstver type="tabtop"></qa>
<rfu>Amendment Issue 2</rfu>
<remarks>Stage 2</remarks>
</status>
</idstatus><content>
<refs>
<norefs></refs>
<descript>
<para0>
<figure id="f0011">
<title>Fig 1 Sonar Type 2093 - Location of Major Units</title>
<graphic boardno="00110001.tif"></figure>
</para0>
</descript>
</content></dmodule>
What I need to extract is everything contained within the 'dmc' element near the top, including the contents of its child elements. I also need the 'id' attribute of the 'figure' element, so that I capture the f0011 value (in this instance), and the 'boardno' attribute of the 'graphic' element, so that I get the .tif file names.
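To make the target concrete, here is a rough Python sketch of the extraction described above. It is only an illustration under some assumptions: it uses regular expressions directly on the SGML text instead of XSLT, it assumes attribute values are always quoted as in the example, and the names `SAMPLE` and `extract_fields` are made up for this sketch.

```python
import re

# Trimmed copy of the relevant parts of the example file (illustrative only).
SAMPLE = """<dmc><avee><modelic>BR84121AXXXXX</modelic><sdc>1AX</sdc><chapnum>AG3</chapnum>
<section>0</section><subsect>0</subsect><subject>00</subject><discode>01</discode>
<discodev>00</discodev><incode>G10</incode><incodev>A</incodev><itemloc>A
</itemloc></avee></dmc>
<figure id="f0011">
<graphic boardno="00110001.tif"></figure>"""


def extract_fields(text):
    """Return (dmc_pairs, figure_ids, boardnos) found in one file's text.

    Hypothetical helper: assumes quoted attribute values and that the
    children of <dmc> that carry data are simple <name>text</name> leaves.
    """
    dmc_pairs = []
    dmc_match = re.search(r"<dmc>(.*?)</dmc>", text, re.DOTALL | re.IGNORECASE)
    if dmc_match:
        # Collect each leaf element inside <dmc> as (element name, trimmed text).
        for name, value in re.findall(r"<(\w+)>([^<]*)</\1>", dmc_match.group(1)):
            dmc_pairs.append((name, value.strip()))
    # Attribute values of every <figure id="..."> and <graphic boardno="...">.
    figure_ids = re.findall(r'<figure\b[^>]*\bid="([^"]*)"', text, re.IGNORECASE)
    boardnos = re.findall(r'<graphic\b[^>]*\bboardno="([^"]*)"', text, re.IGNORECASE)
    return dmc_pairs, figure_ids, boardnos


dmc_pairs, figure_ids, boardnos = extract_fields(SAMPLE)
print(dmc_pairs)
print(figure_ids, boardnos)
```

A regex approach like this would sidestep the SGML-to-XML conversion, but it would miss unquoted attributes or unusual markup, so a proper SGML parser may still be the safer route.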
As I say, I need to do this for approximately 3000 files, which are currently in SGML (see the example above). I'm assuming I would first have to convert these files to XML, and that this is straightforward enough, though perhaps that is naive.
The biggest problem is then the XSLT part. What I ultimately want is a tidy list at the end, ideally in Excel, although a plain list is fine, with 3 columns: DMC, figure id, and graphic boardno, populated with the data extracted from the 3000 or so files.
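As an illustration of the output I am after, here is a minimal Python sketch that writes such a three-column list as CSV, which Excel can open. The function name `rows_to_csv`, the column headings, and the hyphen-joined form of the DMC value are assumptions made for this example, not an established format.

```python
import csv
import io


def rows_to_csv(rows):
    """Render (dmc, figure_id, boardno) tuples as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["DMC", "figure id", "graphic boardno"])
    writer.writerows(rows)
    return buffer.getvalue()


# One row built by hand from the example file; joining the DMC parts with
# hyphens is just one possible way to show them in a single column.
example_rows = [
    ("BR84121AXXXXX-1AX-AG3-0-0-00-01-00-G10-A-A", "f0011", "00110001.tif"),
]
print(rows_to_csv(example_rows))
```

In practice the rows would come from looping over all the files, and the text could be written to a .csv file instead of printed.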
Is this possible?
Any solutions or tips would be gratefully received. I'm even willing to offload this task and pay a fee to have the work done, as it could save us considerable time compared with creating the Excel spreadsheet manually. It is quite an urgent task, though.
Thanks,
Jake