I have several XML files larger than 2 GB (from openstreetmap.org, in .osm format) which I want to 'shred' into SQL Server. For that, the maximum size I can handle is 2 GB minus 1 byte.
The structure is like this:
Code:
<?xml version='1.0' encoding='UTF-8'?>
<osm version="" generator="">
  <node id="" lat="" lon="" version="" changeset="" user="" uid="" timestamp=""/>
  <node id="" lat="" lon="" version="" changeset="" user="" uid="" timestamp="">
    <tag k="" v="" />
    <tag k="" v="" />
  </node>
  <way id="" version="" changeset="" user="" uid="" timestamp="">
    <nd ref=""/>
    <nd ref=""/>
    <tag k="" v="" />
    <tag k="" v="" />
  </way>
  <relation id="" version="" changeset="" user="" uid="" timestamp="">
    <member type="" ref="" role=""/>
    <member type="" ref="" role=""/>
    <tag k="" v="" />
    <tag k="" v="" />
  </relation>
</osm>
In short: millions of nodes, followed by millions of ways, followed by millions of relations, so splitting into one file per node/way/relation is not really an option.
I don't know much about XML, and I don't even know whether current XSLT processors can handle files larger than 2 GB.
I managed to read the file line by line into a table and use a T-SQL cursor to combine the lines into well-formed XML chunks for further processing. It does the trick, but it takes 'forever' (several hours) even for far smaller files (60 MB). So I was wondering whether an XSLT 2.0 transform could speed up the chunking.
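To make clear what I mean by chunking, here is a rough Python sketch of the idea. It is not what I run today and not XSLT; it only illustrates streaming through the file and emitting smaller well-formed <osm> documents. The names iter_osm_chunks and max_elements are made up for this example, and I have not tried it on files of this size.

```python
import xml.etree.ElementTree as ET


def iter_osm_chunks(source, max_elements=100_000):
    """Yield well-formed <osm> documents as strings.

    Each document holds at most max_elements top-level elements
    (node, way or relation), in their original order.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)              # "start" event of the <osm> root
    root_tag = root.tag
    root_attrs = dict(root.attrib)       # saved, because root.clear() wipes them

    depth = 0                            # nesting depth below the root
    batch = []
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth != 0:
            continue                     # nested tag/nd/member, or the root itself
        elem.tail = None                 # drop whitespace between elements
        batch.append(elem)
        root.clear()                     # detach parsed children to keep memory flat
        if len(batch) >= max_elements:
            yield _serialize(root_tag, root_attrs, batch)
            batch = []
    if batch:
        yield _serialize(root_tag, root_attrs, batch)


def _serialize(tag, attrs, children):
    """Wrap the collected elements in a copy of the original root element."""
    chunk = ET.Element(tag, attrs)
    chunk.extend(children)
    return ET.tostring(chunk, encoding="unicode")
```

Each yielded string could then be written to its own file or passed on as one XML value, with max_elements chosen so that a chunk stays well under the size limit.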
Cheers