Search a binary file of a million records

David_0223 · July 15th, 2008, 02:33 PM

I think Alan8 was offering an advertisement ... ;)

And, personally, I would make each record the same size with multiple events per record.

But I'm interested in your earlier comment on Objectivity/DB and OODBMSs. I have never used an OODBMS and I was wondering how they store data. Like most of the community I have done OO programming with persistent data usually to an RDBMS. And you usually structure your data and classes to work together, often one class per table.

If you, or anyone on the forum, has any experience with OODBMSs, it would be interesting to see how they differ, programmatically, from RDBMSs.

What you don't know can hurt you!

Old Pedant · July 15th, 2008, 03:01 PM

Typically, an OODBMS really *does* store objects. That is, no transformation takes place between the in-memory form of an object and the on-disk form thereof.

The exception, of course, is pointers ("object references" in the Java/C# worlds). Obviously, when the object is saved to disk, the pointers to other objects can't just stay as memory addresses. Instead, each pointer has to be converted into an Object IDentifier ("OID", pronounced as in the last syllable of "paranoid"). This is done via a process known as "swizzling".

Then, when the objects are needed back in memory, the OIDs have to be re-converted into pointers (via, what else?, "de-swizzling").
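To make the two conversions concrete, here is a minimal sketch in Python. It is not how Objectivity/DB or any real engine does it: the `Person` class, the `save`/`load` helpers, the dict standing in for the disk, and the sequential OIDs are all assumptions made for illustration. It uses this thread's names for the two directions; other sources often apply "swizzling" to the OID-to-pointer direction instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    name: str
    parent: Optional["Person"] = None  # in-memory reference (a "pointer")


def save(people, store):
    """Pointer -> OID: write each object, replacing references with OIDs.

    Assumes every referenced parent is itself in `people`.
    """
    oid_of = {id(p): oid for oid, p in enumerate(people, start=1)}
    for p in people:
        parent_oid = oid_of[id(p.parent)] if p.parent is not None else None
        store[oid_of[id(p)]] = {"name": p.name, "parent": parent_oid}


def load(store):
    """OID -> pointer: rebuild the objects, then reconnect the references."""
    objects = {oid: Person(rec["name"]) for oid, rec in store.items()}
    for oid, rec in store.items():
        if rec["parent"] is not None:
            objects[oid].parent = objects[rec["parent"]]
    return objects
```

Saving a grandparent, parent and child and loading them again gives back a graph where `child.parent.parent` is an ordinary object reference once more.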

I *am* familiar with how this is done for C++ in Objectivity/DB and in Object Design's OODBMS (no longer in business), two very different ways, but I'm not sure what the code looks like for Java/C# interfaces. Clearly, the database engine's swizzling/deswizzling code has to be able to "get to" the actual contents of object references (that is, to the actual addresses in memory), so I'd assume that some kind of "native level" interface is needed.

For efficiency, you want to be able to pull an object from disk to memory *WITHOUT* pulling in all the objects it is connected to. That is, you want to be able to do "lazy loading" of the objects. This is pretty easy to do with Windows memory management primitives. (I created a simple but very highly efficient OODBMS engine in about 6 months, way back in 1993-1994, for example. Wouldn't have been as simple on various other operating systems.)
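A rough sketch of the lazy-loading idea, in Python. Note that this swaps in a different technique from the one described above: instead of memory management primitives, it keeps the reference as an OID and resolves it in a property the first time it is followed. `LazyStore`, `LazyPerson` and the `reads` counter (standing in for disk reads) are hypothetical names.

```python
class LazyStore:
    """Hypothetical object store: maps OID -> record and counts 'disk' reads."""

    def __init__(self, records):
        self.records = records
        self.cache = {}
        self.reads = 0

    def get(self, oid):
        if oid not in self.cache:
            self.reads += 1  # stands in for one read from disk
            self.cache[oid] = LazyPerson(self, self.records[oid])
        return self.cache[oid]


class LazyPerson:
    def __init__(self, store, record):
        self._store = store
        self.name = record["name"]
        self._parent_oid = record["parent"]  # stays an OID until needed

    @property
    def parent(self):
        # Resolve the OID only when the reference is actually followed.
        if self._parent_oid is None:
            return None
        return self._store.get(self._parent_oid)
```

Fetching one object costs one read; the parent and grandparent are only read when `.parent` is actually followed, and a second access comes from the cache.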

Anyway, the huge advantage these systems have is the speed with which they can move data to and from the disk. And if lazy de-swizzling is used, the speed *from* the disk can be truly impressive. But, again, they depend on having knowledge of and manipulation of the actual in-memory forms of the objects. [This isn't a necessary requirement...you can use Reflection in Java/C# to convert from in-memory form to canonical form and back. I did that once, too, as an experiment. But that is so much less efficient that it's simply not a good idea except as a "proof of concept" system, perhaps. Now watch somebody make a liar out of me and find a way to make it fast enough.]

Old Pedant · July 15th, 2008, 08:49 PM

By the by...

"Like most of the community I have done OO programming with persistent data usually to an RDBMS. And you usually structure your data and classes to work together, often one class per table."

Hmmm...not so in my own experience. Typically, one table per class *EXCEPT* when the class has members that are themselves collections or arrays, in which case you have to add another table per collection or array. And even if the same collection class is used in different basic classes, you can't use the same table for its instances, because the relationship linking each one needs implies different tables.

So more often than not, my own Object-Relational mapping ends up with two or more tables per class. [And if you use something like Hibernate, it does the same thing, of course.]
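As an illustration of the "two or more tables per class" point, here is a small hand-rolled mapping using Python's built-in `sqlite3` module. The `person` class with a `phone_numbers` collection, the table names and the helper functions are all made up for the example; an ORM would generate something of a similar shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- The extra table needed for the person's phone_numbers collection.
    CREATE TABLE person_phone (
        person_id INTEGER NOT NULL REFERENCES person(id),
        number    TEXT NOT NULL
    );
""")


def save_person(conn, person_id, name, phone_numbers):
    """One object in memory becomes rows in two tables."""
    conn.execute("INSERT INTO person (id, name) VALUES (?, ?)", (person_id, name))
    conn.executemany(
        "INSERT INTO person_phone (person_id, number) VALUES (?, ?)",
        [(person_id, number) for number in phone_numbers],
    )


def load_person(conn, person_id):
    """Reassemble the object from both tables."""
    (name,) = conn.execute(
        "SELECT name FROM person WHERE id = ?", (person_id,)
    ).fetchone()
    numbers = [row[0] for row in conn.execute(
        "SELECT number FROM person_phone WHERE person_id = ? ORDER BY rowid",
        (person_id,),
    )]
    return {"name": name, "phone_numbers": numbers}
```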

With an OODBMS, you don't even need to create a "mapping". The object structures in memory *ARE* the object structures on disk.

Quite frankly, I'm not sure why OODBMS products aren't used more. Granted, most of them have had a poor history of query performance. But that's changed a lot in the last few years. And with the added benefit of being able to make path-based queries, I can't help but think that there are many many Java/C#/VB.NET apps out there that would be better off talking to an OODBMS. Just imagine being able to do a query like this:

SELECT DISTINCT * FROM people
WHERE people.parent.parent.lastname = 'Smith'

["find all people who have a grandparent named 'Smith'"]
Look, ma, no joins!
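For comparison, a sketch of what the same grandparent lookup tends to look like against a relational table, using Python's `sqlite3`. The `people` table layout with a `parent_id` column and the sample rows are assumptions for the example; the point is only that each step along the path becomes a self-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE people (
        id        INTEGER PRIMARY KEY,
        lastname  TEXT NOT NULL,
        parent_id INTEGER REFERENCES people(id)
    )
""")
# Made-up sample data: 3 is the grandchild of a Smith, 5 is not.
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", [
    (1, "Smith", None),
    (2, "Jones", 1),
    (3, "Lee", 2),
    (4, "Brown", None),
    (5, "Brown", 4),
])

# people.parent.parent.lastname = 'Smith', spelled out as two self-joins.
GRANDPARENT_SMITH = """
    SELECT DISTINCT child.id, child.lastname
    FROM people AS child
    JOIN people AS parent      ON parent.id = child.parent_id
    JOIN people AS grandparent ON grandparent.id = parent.parent_id
    WHERE grandparent.lastname = 'Smith'
"""
rows = conn.execute(GRANDPARENT_SMITH).fetchall()
```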

planoie · July 15th, 2008, 11:10 PM

Quote:

quote:Originally posted by Old Pedant

SELECT DISTINCT * FROM people
WHERE people.parent.parent.lastname = 'Smith'

That smells like LINQ:

var peopleList = from people in myDataContext.People
where people.parent.parent.lastname == "Smith"
select people;

-Peter
compiledthoughts.com

Old Pedant · July 16th, 2008, 12:34 AM

Haven't used LINQ, but as I understand it that query in LINQ will be turned into a relational join and/or client-side [client of the DB, not HTML client] processing. Given how ADO.NET works with datasets and datatables, that all makes sense. So it's a cute trick, but it will never perform as well as a "native" query in an OODBMS will. Assuming the same level of index help, etc., of course.

Be fun to run benchmarks, but I certainly don't have time to do it. Hmmm...I still know a couple of people at Objectivity. Wonder if we could get them to try it? <grin style="evil"/>

DineshGirij008 · July 16th, 2008, 05:52 AM

Quote:

quote:Originally posted by David_0223

The system I manage stores its data in multiple archive files where each file is the same size and no two files store overlapping data by time. So each archive has a start and end date. When retrieving data you retrieve data for a specific point over a specific date range. The database engine knows which files have which date ranges.

So it doesn't matter whether you are asking for data from yesterday or ten years ago, it takes the same time to find it. Also it doesn't matter whether you have a year's worth of data or 20 years' worth, it takes the same time to find the data.

This portion of your reply really interested me. It is what I was looking for: "multiple archive files". I would like to know how you managed to store data in those files.
I would like some more information on this, such as how you applied indexing in those files and how the search time stayed the same under all the conditions you mentioned.

In my case, it is employee records that run into the thousands. These records have fields such as EmpID, EmpName, EmpJoinDate, Salary.
I would like to search the file on each of these criteria. Would you shed some light on this?

David_0223 · July 16th, 2008, 09:27 AM

Dinesh,

I did not develop the application I manage. I have been working with it for over 9 years so I know the basics of how it works, but duplicating it in an application of my own would be quite challenging. I would think that if it's employee records you want to manage, any competent RDBMS would be better than anything you could develop yourself, unless you work for a database software company and have that kind of expertise in house.

I can't see where you are likely to exceed 100,000 records, which would be nothing to any decent RDBMS or OODBMS, and response to any query would be in the milliseconds. I do consulting for an oil company now; one of the systems I manage has over 80,000 tags. I average collecting one event every 2 minutes on each of the tags. That's nearly 58 million events per day, with 14 years of events online. That's not in the same ballpark as an employee database.
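As a rough idea of how little is needed for the employee case, here is a sketch using Python's built-in `sqlite3`. The column names come from the question above; the table name, index names, helper functions and sample rows are invented for illustration, and any other RDBMS would do the same job.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (
        EmpID       INTEGER PRIMARY KEY,
        EmpName     TEXT NOT NULL,
        EmpJoinDate TEXT NOT NULL,   -- ISO 8601 date, e.g. '2008-07-16'
        Salary      REAL NOT NULL
    );
    -- One index per field you want to search on.
    CREATE INDEX idx_employee_name   ON employee (EmpName);
    CREATE INDEX idx_employee_joined ON employee (EmpJoinDate);
    CREATE INDEX idx_employee_salary ON employee (Salary);
""")

# Made-up sample rows, for illustration only.
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", [
    (1, "Asha", "2005-03-01", 42000.0),
    (2, "Ben", "2007-11-15", 38000.0),
    (3, "Chitra", "2008-01-20", 51000.0),
])


def joined_between(conn, start, end):
    """Names of employees who joined in [start, end], oldest first."""
    rows = conn.execute(
        "SELECT EmpName FROM employee "
        "WHERE EmpJoinDate BETWEEN ? AND ? ORDER BY EmpJoinDate",
        (start, end),
    )
    return [row[0] for row in rows]


def salary_at_least(conn, minimum):
    """Names of employees earning at least `minimum`, lowest salary first."""
    rows = conn.execute(
        "SELECT EmpName FROM employee WHERE Salary >= ? ORDER BY Salary",
        (minimum,),
    )
    return [row[0] for row in rows]
```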

However, for an enterprise scale accounting system you could easily collect millions of financial events per day. These would typically be stored in a general ledger and would need to be available for years. The application I manage was developed by OSI Software and is called PI (Plant Information). You can visit their website at http://www.osisoft.com.

In a nutshell, the application has a "snapshot" module that collects data from a variety of sources and holds the most recent data in memory while sending it on to the archive module. The archive module stores the data by date in the current archive and is responsible for retrieving data from the archives by date. When the current archive is full, the archive module creates a new archive. All of the archive files are the same size. When an archive is created a record of a given length is created for each tag, this is the primary record for the tag. All of the records in an archive are the same size. When events are stored in each record they are stored sequentially with the timestamp, value and status of the event. When the primary record is filled up, an overflow record is created AND the primary record is converted to an index record giving the location within the archive, by date, of the overflow records for this tag.
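The layout described above can be sketched in a few lines of Python. This is only a toy model of the description, not the PI format, which is proprietary: the capacities, the class names, keeping everything in memory, and using a sorted list of first timestamps in place of a real index record are all assumptions. It also assumes events arrive in time order.

```python
from bisect import bisect_left, bisect_right

EVENTS_PER_RECORD = 3    # hypothetical fixed record capacity
RECORDS_PER_ARCHIVE = 4  # hypothetical fixed archive capacity, in records


class Archive:
    def __init__(self, start_time, tags):
        self.start_time = start_time
        # Every tag gets its primary record when the archive is created.
        self.free_records = RECORDS_PER_ARCHIVE - len(tags)
        self.records = {tag: [[]] for tag in tags}

    def try_store(self, tag, event):
        """Append an event; return False if the archive has no room left."""
        current = self.records[tag][-1]
        if len(current) == EVENTS_PER_RECORD:
            if self.free_records == 0:
                return False  # archive is full
            self.free_records -= 1
            current = []
            self.records[tag].append(current)  # new overflow record
        current.append(event)
        return True

    def read(self, tag, start, end):
        # Stand-in for the index record: locate records by first timestamp.
        chain = [rec for rec in self.records[tag] if rec]
        firsts = [rec[0][0] for rec in chain]
        i = max(bisect_left(firsts, start) - 1, 0)
        return [e for rec in chain[i:] for e in rec if start <= e[0] <= end]


class ArchiveSet:
    def __init__(self, tags):
        self.tags = tags
        self.archives = []  # ordered by start_time

    def store(self, tag, timestamp, value, status):
        event = (timestamp, value, status)
        if not self.archives or not self.archives[-1].try_store(tag, event):
            # Current archive is full (or none exists yet): start a new one.
            self.archives.append(Archive(timestamp, self.tags))
            self.archives[-1].try_store(tag, event)

    def read(self, tag, start, end):
        # Pick only the archives whose date range can overlap the query.
        starts = [a.start_time for a in self.archives]
        first = max(bisect_left(starts, start) - 1, 0)
        last = bisect_right(starts, end)
        return [e for a in self.archives[first:last]
                for e in a.read(tag, start, end)]
```

Because the lookup is a binary search over archive start times, and then over record start times within an archive, the cost of finding a date range grows very slowly with the amount of history kept, which fits the "same time whether yesterday or ten years ago" behaviour described earlier in the thread.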

That was a big nutshell, but as succinct as I could make it. The details of how all of this is done are proprietary. Hope this helps.

What you don't know can hurt you!

planoie · July 17th, 2008, 08:09 AM

Quote:

quote:Originally posted by DineshGirij008
My task here is to help my Project Manager with some tips as to how we can develop a flat file...

This smells like the typical problem of a non-technical person making technical decisions.

-Peter
compiledthoughts.com

robzyc · July 17th, 2008, 09:38 AM

Quote:

quote:Originally posted by planoie

Quote:

quote:Originally posted by DineshGirij008

My task here is to help my Project Manager with some tips as to how we can develop a flat file...

This smells like the typical problem of a non-technical person making technical decisions.

-Peter
compiledthoughts.com

Agreed, smells bad.
Future TDWTF post in the making? ;)

I strongly urge you to go back to your PM and discuss the issues raised here with them, Dinesh. I tell you this because if it all hits the fan (which, quite frankly, sounds VERY likely), it will be you guys that take the fall, not the MD.

Only trying to help you :)

Rob
http://cantgrokwontgrok.blogspot.com