Parse/Load/Search an XML file of about 1 GB

sandeep_akhare · September 18th, 2006, 09:35 AM

Hi All
I have been given a problem set in which I need to develop a .NET application that can parse, load, and search an XML document of about 1 GB. It is also specified that I should not use a database. Please help me solve this problem. How can I achieve this?

mhkay · September 18th, 2006, 01:00 PM

It depends very much on the nature of the "search". You either need to allocate a fairly large amount of memory, or you need to search using a low-level technology such as SAX, StAX, or STX. There are some XSLT and XQuery products that can handle a limited range of searches using serial processing: for example in XSLT, Saxon-SA has a serial processing mode for a very restricted class of XPath expressions. Some products such as DataDirect XQuery have an option to do "document projection" in which the parts of the document that aren't accessed by the query aren't loaded into memory.
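
As a rough illustration of the serial approach, here is a minimal StAX sketch in Java that scans a document once and collects the IDs of records whose subject contains a phrase, without ever holding the whole document in memory. The element names (`etext`, `subject`), the `id` attribute, and the class and method names are assumptions made up for this example; they are not the actual Gutenberg schema, and the sketch assumes each subject holds plain text only. On .NET, the forward-only `XmlReader` plays a comparable role.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: one forward pass over the document, constant memory per record.
class StreamingSubjectSearch {

    // Returns the IDs of all records that have a subject containing the phrase
    // (case-insensitive), in document order.
    static List<String> searchBySubject(Reader xml, String phrase) {
        String needle = phrase.toLowerCase(Locale.ROOT);
        List<String> ids = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            String currentId = null;   // ID of the record being read, or null outside a record
            boolean matched = false;   // avoids adding the same record twice
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("etext")) {
                        // Assumed layout: <etext id="..."> with an un-namespaced attribute.
                        currentId = reader.getAttributeValue(null, "id");
                        matched = false;
                    } else if (name.equals("subject") && currentId != null && !matched) {
                        // getElementText() requires a text-only element (an assumption here).
                        String subject = reader.getElementText().toLowerCase(Locale.ROOT);
                        if (subject.contains(needle)) {
                            ids.add(currentId);
                            matched = true;
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals("etext")) {
                    currentId = null;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse catalog", e);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Tiny in-memory stand-in for the real file.
        String xml = "<catalog>"
            + "<etext id=\"e1\"><title>A</title><subject>Naval history</subject></etext>"
            + "<etext id=\"e2\"><title>B</title><subject>Cookery</subject></etext>"
            + "</catalog>";
        System.out.println(searchBySubject(new StringReader(xml), "history"));
    }
}
```

For a real 1 GB file the `Reader` would wrap a buffered file stream instead of a string. Each search still costs one full pass over the file, which is the price of not loading or indexing it.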

When I see constraints like "I should not use a database", my question is always "Why?". What are the real requirements that make a database an unacceptable solution?

Michael Kay
http://www.saxonica.com/
Author, XSLT Programmer's Reference and XPath 2.0 Programmer's Reference

sandeep_akhare · September 19th, 2006, 12:21 AM

Thank you, Michael Sir, for your reply. This is the problem set I have been given to solve. They want it done without a database; perhaps they are thinking, why store the data again in a database when you already have it in XML? :):)

mhkay · September 19th, 2006, 02:43 AM

I think it's always a good idea to question requirements. If "they" don't want a database, there could be any number of reasons: cost of purchase, cost of administration, performance of database loading. If you discover the real reasons you may find that they also rule out some non-database solutions - and you may find that they don't rule out some solutions that do use a database. Users, managers, and customers have a right to define the requirements, but they don't have a right to make design decisions - that's the job of the engineer.


sandeep_akhare · September 19th, 2006, 04:47 AM

Hi Michael Sir
   Yes, it's true. I am taking part in a Tech Fest and this problem set comes from that Tech Fest, so I can't ask them about the requirements.
The whole problem reads as follows:
Objective
    Define a development approach to Parse/Load/Search an XML document of size ~1GB
Description
    Project Gutenberg (www.gutenberg.org) maintains a list of books in RDF format. An offline version of the same is available at \\ht-dynapps\gutenberg
    You need to provide the following APIs that will allow you to use the contents:
    // Given a start and end index, allows the books to be fetched incrementally from the list (à la Google)
    public List<Book> GutenbergBookManager.getBooks(int start, int end)
    // Given the ID of a book, searches the document and returns the book details
    public Book GutenbergBookManager.getBook(String id)
    // Given a search phrase, returns the list of books with a matching subject (word occurring anywhere in the subject line)
    public List<Book> GutenbergBookManager.searchBook(String subject)
* Assume that you do not have the luxury to dump the data into a relational database.
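
One possible shape for the first two calls, sketched in Java to match the signatures above: each call streams the document from the beginning and stops as soon as it has what it needs, so memory use stays small whatever the file size. This is only a sketch under assumptions. The `etext` and `title` element names, the `id` attribute, the two-field `Book` class, the constructor taking a `Supplier<Reader>`, and the reading of `end` as exclusive are all invented for illustration and are not given by the problem statement.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class GutenbergBookManager {
    // Callback for each record; returning false stops the scan early.
    private interface Visitor { boolean visit(int index, Book book); }

    private final Supplier<Reader> source; // opens a fresh stream for every call

    GutenbergBookManager(Supplier<Reader> source) { this.source = source; }

    // Books at positions start (inclusive) to end (exclusive), in document order.
    List<Book> getBooks(int start, int end) {
        List<Book> page = new ArrayList<>();
        scan((index, book) -> {
            if (index >= start && index < end) page.add(book);
            return index + 1 < end; // nothing past 'end' is ever parsed
        });
        return page;
    }

    // The first book with the given ID, or null if the document has none.
    Book getBook(String id) {
        Book[] found = new Book[1];
        scan((index, book) -> {
            if (id.equals(book.id)) { found[0] = book; return false; }
            return true;
        });
        return found[0];
    }

    // Single forward pass; only one record's fields are held at a time.
    private void scan(Visitor visitor) {
        try (Reader in = source.get()) {
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
            int index = 0;
            boolean inRecord = false;
            String id = null;
            String title = null;
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (r.getLocalName().equals("etext")) {
                        inRecord = true;
                        id = r.getAttributeValue(null, "id");
                        title = null;
                    } else if (inRecord && r.getLocalName().equals("title")) {
                        title = r.getElementText(); // assumes a text-only title
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && r.getLocalName().equals("etext")) {
                    inRecord = false;
                    if (!visitor.visit(index++, new Book(id, title))) return;
                }
            }
        } catch (IOException | XMLStreamException e) {
            throw new IllegalStateException("Could not read catalog", e);
        }
    }

    public static void main(String[] args) {
        // Tiny in-memory stand-in for the real file.
        String xml = "<catalog>"
            + "<etext id=\"e1\"><title>A</title></etext>"
            + "<etext id=\"e2\"><title>B</title></etext>"
            + "<etext id=\"e3\"><title>C</title></etext>"
            + "</catalog>";
        GutenbergBookManager manager = new GutenbergBookManager(() -> new StringReader(xml));
        for (Book b : manager.getBooks(1, 3)) System.out.println(b.id + ": " + b.title);
    }
}

// Hypothetical record type holding only the fields this sketch reads.
class Book {
    final String id;
    final String title;
    Book(String id, String title) { this.id = id; this.title = title; }
}
```

The trade-off is that pages far into the list and lookups of late IDs still require parsing everything before them. If that proves too slow, a single initial pass that keeps only the ID and position of each record would be one way to avoid repeated scans, still without a database.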