Subject: Strategies for large XML files
Posted By: asearle
Post Date: 9/26/2006 3:06:12 AM
Hi Everyone,
As my users find their way around the XML/XSL reports that we have built, they are asking for more and more detailed data. This is great but it means that our XML files are getting bigger and bigger and the access time for retrieving data is getting longer and longer.
I am currently investigating switching the data sources (i.e. splitting up the data into various XML files that would, for example, hold distinct time periods) and hope that this will speed things up.
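One way the split-by-period idea could look, as a rough sketch in Python: given a requested date range, work out which monthly files need to be opened. The monthly granularity and the `report_YYYY_MM.xml` naming pattern are assumptions made for illustration; nothing in the thread specifies them.

```python
from datetime import date

def months_between(start, end):
    """Yield (year, month) pairs from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1

def files_for_range(start, end, pattern="report_{year:04d}_{month:02d}.xml"):
    """Return the (hypothetical) file names covering the requested period."""
    return [pattern.format(year=y, month=m) for y, m in months_between(start, end)]

needed = files_for_range(date(2006, 8, 15), date(2006, 10, 3))
print(needed)
```

Only the files in `needed` would then be parsed, so the cost of a request grows with the period asked for rather than with the total amount of data held.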
But I also wanted to ask if anyone out there has any ideas. I suppose that, at some point, I will be forced to resort to a database back-end (with indexing), but I am curious to know where the practical limits of accessing large XML files lie.
Any tips and any anecdotes about your experience would be very interesting for me.
Regards and thanks, Alan Searle.

Reply By: joefawcett
Reply Date: 9/26/2006 3:35:27 AM
Large XML files are difficult unless you are prepared to invest in XML-specific hardware, a solution that one company I worked with tried out. Most of our data is held in a relational database, and the XML emerges after the filters are applied. It is then transformed. We do store such things as invoices as XML, as it helps recreate them for audits etc., but they are not huge.
You can also use SAX or .NET's XmlReader for linear processing. This is often a good start for breaking down large documents that contain repetitive data, for example multiple invoices. You can then process each section using XSLT.
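A minimal sketch of this kind of linear processing, in Python rather than .NET: the standard library's `xml.etree.ElementTree.iterparse` plays the role that SAX or XmlReader would play. The element names (`invoices`, `invoice`, `customer`) and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def iter_records(source, tag):
    """Yield each completed <tag> element, then free it to keep memory flat."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            root.clear()  # discard children that have already been processed

# Hypothetical sample document; a real file would be opened from disk instead.
SAMPLE = b"""<invoices>
  <invoice id="1"><customer>Acme</customer><total>10.50</total></invoice>
  <invoice id="2"><customer>Bolt</customer><total>4.25</total></invoice>
  <invoice id="3"><customer>Acme</customer><total>7.00</total></invoice>
</invoices>"""

def distinct_customers(source):
    """Collect distinct customer names in document order (a simple picklist)."""
    seen = []
    for invoice in iter_records(source, "invoice"):
        name = invoice.findtext("customer")
        if name not in seen:
            seen.append(name)
    return seen

customers = distinct_customers(io.BytesIO(SAMPLE))
print(customers)
```

Because each `invoice` is released as soon as it has been handled, memory use stays roughly constant however long the file is. Each yielded element could equally be handed to an XSLT processor instead of being read directly.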
--
Joe (Microsoft MVP - XML)

Reply By: asearle
Reply Date: 9/26/2006 9:58:29 AM
Thanks very much for this tip: I googled the key words and there seem to be some good sources I can use.
Cheers, Alan.

Reply By: mhkay
Reply Date: 9/26/2006 12:37:17 PM
I've come across users who asked about processing "large" files and then discovered they meant 1Mb. The first thing is to provide some numbers.
A lot depends on the access pattern. If you're loading the document into memory in order to get one piece of information out of it, then the parsing time is the dominant factor; or rather, the relationship of the parsing time to your required response time. Otherwise it may be memory that's the limiting factor. Or it might be that you're doing complex joins and the queries are showing O(n^2) performance, in which case you can probably solve the problem using keys.
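In XSLT the relevant mechanism is `xsl:key` with the `key()` function. The sketch below is not XSLT; it is a Python analogy showing why an index built once avoids the quadratic rescanning. The `customer` and `order` elements are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical data: orders refer to customers by id.
DOC = ET.fromstring("""<data>
  <customer id="c1" name="Acme"/>
  <customer id="c2" name="Bolt"/>
  <order id="o1" customer="c1"/>
  <order id="o2" customer="c2"/>
  <order id="o3" customer="c1"/>
</data>""")

def join_by_scan(doc):
    """Quadratic: rescan every order for every customer."""
    return {
        c.get("name"): [o.get("id") for o in doc.iter("order")
                        if o.get("customer") == c.get("id")]
        for c in doc.iter("customer")
    }

def join_by_key(doc):
    """Linear: index the orders by customer id once, then look each one up."""
    index = defaultdict(list)
    for o in doc.iter("order"):
        index[o.get("customer")].append(o.get("id"))
    return {c.get("name"): index.get(c.get("id"), [])
            for c in doc.iter("customer")}

print(join_by_key(DOC))
```

Both functions return the same result; the difference only shows up as the document grows, because the second visits each order once instead of once per customer.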
Michael Kay http://www.saxonica.com/ Author, XSLT Programmer's Reference and XPath 2.0 Programmer's Reference

Reply By: asearle
Reply Date: 9/27/2006 2:46:30 AM
Hi Michael,
...
[quote]Originally posted by mhkay
I've come across users who asked about processing "large" files and then discovered they meant 1Mb. The first thing is to provide some numbers.[/quote]
I seem to start having problems over 5Mb, but this is because I 'graze' the XML file to get the contents of the picklists that I display in the header.
My plan is to generate the picklists separately and access them without reading the XML file.
I also want to split my data files (XMLs) and then connect them 'on-demand'. This would mean that I could probably reduce the size of each file to about 2Mb.
[quote]A lot depends on the access pattern. If you're loading the document into memory in order to get one piece of information out of it, then the parsing time is the dominant factor; or rather, the relationship of the parsing time to your required response time. Otherwise it may be memory that's the limiting factor. Or it might be that you're doing complex joins and the queries are showing O(n^2) performance, in which case you can probably solve the problem using keys.[/quote]
I do all the joining in an Oracle DB before I export to XML.
It's interesting that keys can help with speed: I currently use them for generating my picklists and for grouping. I will investigate how I can use them more.
Many thanks for your tips.
Cheers, Alan.

Reply By: joefawcett
Reply Date: 9/27/2006 3:31:16 AM
I've never had problems on a server with files smaller than 100Mb. In my experience a DOM takes about 3 to 4 times the file size in memory, so 100Mb ~= about 350Mb of RAM. For creating picklists I'd use XmlReader/SAX rather than complex XSLT. If you are using .NET and can show an example of the file structure, I'll try to come up with an example if you need it.
--
Joe (Microsoft MVP - XML)

Reply By: mhkay
Reply Date: 9/27/2006 9:33:20 AM
If the size is 5Mb, that should be quite manageable. What kind of performance are you seeing, and what performance do you require?
There will be two aspects to the cost: XML parsing time and transformation time. The parsing time will be fixed, and there's no way of getting this down other than reducing the file size. The transformation time depends on your code, and it might be possible to get it down considerably. Try to measure the two components separately so you can see where the costs are being incurred.
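A rough way to take the two measurements separately, sketched in Python. The standard library has no XSLT processor, so `transform` here is only a stand-in for the stylesheet, and the `report`/`row` structure is invented; the point is the timing harness, not the work being timed.

```python
import time
import xml.etree.ElementTree as ET

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def build_sample(rows):
    """Generate a hypothetical report document with the given number of rows."""
    parts = ["<report>"]
    parts += [f'<row region="r{i % 5}" value="{i}"/>' for i in range(rows)]
    parts.append("</report>")
    return "".join(parts)

def transform(root):
    """Stand-in for the stylesheet: total the values per region."""
    totals = {}
    for row in root.iter("row"):
        region = row.get("region")
        totals[region] = totals.get(region, 0) + int(row.get("value"))
    return totals

xml_text = build_sample(10_000)
root, parse_seconds = timed(ET.fromstring, xml_text)
totals, transform_seconds = timed(transform, root)
print(f"parse: {parse_seconds:.4f}s  transform: {transform_seconds:.4f}s")
```

If the parse figure dominates, only a smaller file will help; if the transform figure dominates, the stylesheet itself is the place to look.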
Michael Kay http://www.saxonica.com/ Author, XSLT Programmer's Reference and XPath 2.0 Programmer's Reference

Reply By: asearle
Reply Date: 9/28/2006 2:38:50 AM
Hi Michael, hi Joe,
I am in the middle of a redesign phase at the moment and plan to move all 'picklist' generation to an external file, which will mean that my code doesn't have to 'graze' the source XML file any more.
I will see what speed this brings and will then come back to you.
It is very encouraging that you say that larger files should be no problem. They will also be accessed over a network so I will see how that performs.
I'll give more feedback as soon as I have implemented my changes.
Many thanks, Alan.