Wrox Programmer Forums

Need to download code?

View our list of code downloads.

Go Back   Wrox Programmer Forums > C# and C > C# 1.0 > C#
Password Reminder
Register
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read
C# Programming questions specific to the Microsoft C# language. See also the forum Beginning Visual C# to discuss that specific Wrox book and code.
Welcome to the p2p.wrox.com Forums.

You are currently viewing the C# section of the Wrox Programmer to Programmer discussions. This is a community of tens of thousands of software programmers and website developers including Wrox book authors and readers. As a guest, you can read any forum posting. By joining today you can post your own programming questions, respond to other developers’ questions, and eliminate the ads that are displayed to guests. Registration is fast, simple and absolutely free .
DRM-free e-books 300x50
Reply
 
Thread Tools Search this Thread Display Modes
#1
July 14th, 2008, 05:47 AM
Registered User
 
Join Date: Jul 2008
Location: Thiruvananthapuram, Kerala, India.
Posts: 4
Thanks: 0
Thanked 0 Times in 0 Posts
Search a binary file of a million records

My company is about to start a new project in which the back end is supposed to be a file system instead of a database server, and this file is going to contain millions of records.

Since a flat file has to be searched sequentially, it might take a lot of time to retrieve a specific record. Some of my colleagues told me that using indexing would make the search much faster, but as of now I don't have much of an idea about how to implement indexing.

I would like to have some of your suggestions and solutions regarding this matter. Any kind of help would be much appreciated.

Thank you. :)

#2
July 14th, 2008, 05:54 AM
samjudson
Friend of Wrox
 
Join Date: Aug 2007
Location: Newcastle, United Kingdom.
Posts: 2,128
Thanks: 1
Thanked 189 Times in 188 Posts

http://en.wikipedia.org/wiki/Index_(database)

/- Sam Judson : Wrox Technical Editor -/
#3
July 14th, 2008, 10:23 AM
planoie
Friend of Wrox
 
Join Date: Aug 2003
Location: Clifton Park, New York, USA.
Posts: 5,407
Thanks: 0
Thanked 16 Times in 16 Posts

What is the justification for using a flat file for something so large?
Why write your own data indexing when you could use a real database engine that already has all of that functionality?

-Peter
compiledthoughts.com
#4
July 15th, 2008, 12:43 AM
Registered User
 
Join Date: Jul 2008
Location: Thiruvananthapuram, Kerala, India.
Posts: 4
Thanks: 0
Thanked 0 Times in 0 Posts

Quote: Originally posted by planoie
What is the justification for using a flat file for something so large?
Why write your own data indexing when you could use a real database engine that already has all of that functionality?

We are not planning to use a database server for this project. We are planning to do something like Peachtree, which is the #1 accounting software in the US. It uses the file system as its back end.

Ours is also a web-based application and needs a lot of security. I would like to have some ideas regarding storing data in files, using an index for that data, and how to make the search faster in such a scenario.

So please give suggestions regarding this matter; they would be highly appreciated.

Thanks. :-)

#5
July 15th, 2008, 03:14 AM
Friend of Wrox
 
Join Date: Mar 2007
Location: Hampshire, United Kingdom.
Posts: 432
Thanks: 0
Thanked 1 Time in 1 Post

Wow, that's crazy.

Just a couple of points to reiterate what the other guys have said here:

- Just because something is "#1" doesn't mean it is built "right". MS Office is one of the most widely used office packages out there. Does that mean it's put together right (or as well as it could be, based on modern software practices)? Probably not. Peachtree is really old; just because it has a following doesn't mean you should duplicate its architecture.

- Based on the above, we know that a database could outperform this by far, so why settle for second best?

- Also, there will be major development time needed to "fill the void" that not using a DB will incur.

- As for security, if you are using a file-based system, the only security you really have is control over the ACL. If using a DB, you have an additional layer of security (both domain access and database access).

- There are also the points of disaster recovery, transactional processing, and performance, which will be "worse off" with a file system.

I am pretty sure I don't speak alone here, but I think you're mad. :D

Rob
http://cantgrokwontgrok.blogspot.com
#6
July 15th, 2008, 03:46 AM
samjudson
Friend of Wrox
 
Join Date: Aug 2007
Location: Newcastle, United Kingdom.
Posts: 2,128
Thanks: 1
Thanked 189 Times in 188 Posts

And just because PeachTree stores its data in a 'file' doesn't mean that file isn't actually a database. It could be an Access Database, a SQL Server Compact edition file or one of many other database file formats.
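For instance (just an untested sketch; the table name, columns, and .sdf file here are invented for illustration), a SQL Server Compact file sitting on disk is still queried with ordinary ADO.NET code:

Code:

// Hypothetical example: one .sdf file on disk, but queried like a real database.
// Assumes a reference to System.Data.SqlServerCe (SQL Server Compact 3.5);
// the "Ledger" table and its columns are made up.
using System;
using System.Data.SqlServerCe;

class Demo
{
    static void Main()
    {
        using (var conn = new SqlCeConnection("Data Source=accounts.sdf"))
        {
            conn.Open();
            using (var cmd = new SqlCeCommand(
                "SELECT Amount FROM Ledger WHERE AccountId = @id", conn))
            {
                cmd.Parameters.AddWithValue("@id", 42);
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        Console.WriteLine(reader.GetDecimal(0));
                }
            }
        }
    }
}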

As for the basics of 'indexing', the link I provided above should have taken you to Wikipedia, where there are whole articles on the topic, both at a conceptual level and in terms of what it usually means in a database. Indexing is a huge topic of much academic debate, and if you don't even know what an index is, then I can guarantee you that writing your own is completely the wrong thing to be doing.

And thirdly (if you really needed any more arguments) the locking and contention issues you would have trying to run a web site off a single flat file would be horrendous.

/- Sam Judson : Wrox Technical Editor -/
#7
July 15th, 2008, 10:55 AM
Authorized User
 
Join Date: Nov 2006
Location: Valparaiso, IN, USA.
Posts: 93
Thanks: 0
Thanked 1 Time in 1 Post

I'd just like to point out that there are times when a standard relational database such as MS SQL Server may not be the best solution. I manage a proprietary database system that stores process data, and it's not unusual to find one of these systems with years of data online. The problem with a relational database is that there is a limit to the size of a table beyond which it becomes very cumbersome to retrieve data.

The type of system he is looking for would be such a system: basically, you would want to keep a running general ledger for years. The system I manage stores its data in multiple archive files, where each file is the same size and no two files store overlapping time ranges, so each archive has a start and end date. When retrieving data, you ask for a specific point over a specific date range, and the database engine knows which files cover which date ranges. So it doesn't matter whether you are asking for data from yesterday or ten years ago, and it doesn't matter whether you have a year's worth of data or 20 years' worth; it takes the same time to find it.
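In code, the catalog of archive files boils down to something like this (an untested sketch; the ArchiveInfo type, field names, and paths are invented for illustration):

Code:

// Hypothetical sketch of the "one archive file per date range" layout described above.
using System;
using System.Collections.Generic;
using System.Linq;

class ArchiveInfo
{
    public DateTime Start;   // first timestamp stored in this archive file
    public DateTime End;     // last timestamp stored in this archive file
    public string Path;      // location of the archive file on disk
}

class ArchiveCatalog
{
    private readonly List<ArchiveInfo> archives = new List<ArchiveInfo>();

    public void Register(DateTime start, DateTime end, string path)
    {
        archives.Add(new ArchiveInfo { Start = start, End = end, Path = path });
    }

    // Only the archives whose range overlaps the query are opened, so a query
    // for yesterday's data never touches ten-year-old files.
    public IEnumerable<ArchiveInfo> FilesFor(DateTime from, DateTime to)
    {
        return archives.Where(a => a.Start <= to && a.End >= from);
    }
}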

I googled "Transactional Database" and "Financial Database" (you can try some different combinations), but I didn't see any generic data storage solutions that looked like they would do what this process database I manage does.

I have to agree with the others that what you are talking about here is a huge undertaking. You will probably want to use a commercially available RDBMS for much of your data, but your core transactional store you will probably have to develop yourself. The only others I could find were part of financial software packages, which is what you want to write, so ...

As to your actual question: I would look at the System.IO namespace, particularly at the BinaryReader and BinaryWriter classes. BinaryReader has a Read(Byte[], Int32, Int32) method that reads a block of bytes into a buffer, and BinaryWriter has a Write(Byte[], Int32, Int32) method that writes a block of bytes; combined with seeking the underlying stream to a computed offset, these let you read or overwrite a record at a specific location in the file.
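With fixed-length records, the offset arithmetic and the seek look roughly like this (an untested sketch; the 128-byte record size and file name are assumptions for illustration):

Code:

// Hypothetical sketch: fixed-length records in a flat binary file.
using System;
using System.IO;

class FixedRecordFile
{
    const int RecordSize = 128;            // every record occupies exactly 128 bytes
    const string FileName = "records.dat"; // made-up data file name

    // Reads record number n (0-based) by seeking to n * RecordSize.
    static byte[] ReadRecord(int n)
    {
        using (var stream = new FileStream(FileName, FileMode.Open, FileAccess.Read))
        using (var reader = new BinaryReader(stream))
        {
            stream.Seek((long)n * RecordSize, SeekOrigin.Begin);
            return reader.ReadBytes(RecordSize);
        }
    }

    // Overwrites record number n in place.
    static void WriteRecord(int n, byte[] record)
    {
        if (record.Length != RecordSize)
            throw new ArgumentException("Record must be exactly RecordSize bytes.");

        using (var stream = new FileStream(FileName, FileMode.Open, FileAccess.Write))
        using (var writer = new BinaryWriter(stream))
        {
            stream.Seek((long)n * RecordSize, SeekOrigin.Begin);
            writer.Write(record, 0, record.Length);
        }
    }
}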

Structuring the file will be up to you.



What you don't know can hurt you!
#8
July 15th, 2008, 01:20 PM
Friend of Wrox
 
Join Date: Jun 2008
Location: Snohomish, WA, USA
Posts: 1,649
Thanks: 3
Thanked 141 Times in 140 Posts

And there is always the possibility of using an object-oriented DBMS ("OODBMS") such as Objectivity/DB. I know it is capable of *adding* terabytes of data per day, not to mention accessing many, many terabytes of data. And there are other OODBMS products out there.

If it's only capacity that is the concern, OODBMSs were designed for much higher capacity than most RDBMS products.
#9
July 15th, 2008, 01:41 PM
Registered User
 
Join Date: Jul 2008
Location: Ann Arbor, MI, USA.
Posts: 1
Thanks: 1
Thanked 0 Times in 0 Posts

Hi DineshGirij008. A few suggestions:

1. If it's a flat file, it's easy to compute the offset in the file where record N starts: (N - 1) * recordLength, assuming the first record is #1.

2. If you need to be able to handle many millions of records in seconds, see www.patternscope.com.

It's a data-mining tool, but it can handle huge amounts of data very quickly, since it processes the patterns that make up the data, rather than the raw data itself.

It can do queries as well as find patterns in your data.

#10
July 15th, 2008, 02:09 PM
Friend of Wrox
 
Join Date: Jun 2008
Location: Snohomish, WA, USA
Posts: 1,649
Thanks: 3
Thanked 141 Times in 140 Posts

Alan8 wrote:
"1. If it's a flat file, it's easy to compute the offset in the file where record N starts: (N - 1) * recordLength, assuming the first record is #1."

Ummm...and what if the "records" are not all the same length????

You're making a huge assumption there.

If they aren't the same length, it's still not impossible; you just have to first make a pre-scan through the file, finding the start position of each record and creating an index. And, of course, if you are doing that anyway, then you could create an index of key values. Or multiple indexes of multiple keys. And what, pray tell, have you now done? AHA! You've created the beginnings of a relational database engine. You're maybe 10% to 20% of the way there, aren't you?
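In C# terms, that pre-scan index is just one pass over the file building a key-to-offset map, roughly like this (an untested sketch; it assumes ASCII text records delimited by newlines with the first comma-separated field as the key, all of which is invented for illustration):

Code:

// Hypothetical sketch of the pre-scan index: map each record's key to the byte
// offset where the record starts, then seek straight to it on later lookups.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class OffsetIndex
{
    static Dictionary<string, long> BuildIndex(string path)
    {
        var index = new Dictionary<string, long>();
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            long recordStart = 0;
            var current = new List<byte>();
            int b;
            while ((b = stream.ReadByte()) != -1)
            {
                if (b == '\n')
                {
                    string line = Encoding.ASCII.GetString(current.ToArray()).TrimEnd('\r');
                    string key = line.Split(',')[0];   // first field is the record key
                    index[key] = recordStart;          // byte offset where this record begins
                    current.Clear();
                    recordStart = stream.Position;
                }
                else
                {
                    current.Add((byte)b);
                }
            }
        }
        return index;
    }
}

A lookup then becomes a Seek to index[key] followed by reading a single record, instead of a sequential scan of the whole file.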