The Alias Problem

byterbit · July 3rd, 2003, 06:26 PM

By Alias, I mean "otherwise called; otherwise known as" (thanks Merriam-Webster online) for a person.

My problem is this. I am designing a database, and I am unsure of which way to model the data.

The Real World: A person is arrested, and he is "booked" or processed into a jail or released. As is often the case, that person uses a fake name or alias. Eventually, if that person is a habitual offender (a modern term for career criminal) he or she will have a list of known fake names or aliases.

Now comes the database design part.

I have approached this two ways. Both ways seperate the person (at booking time) from their name used. In conceptual database modeling, the relationship appears as such:

+--------+ +-------+
| person |-|----0<| alias |
+--------+ +-------+

I am thinking of doing the following for the alias

+--------+ +-----------+ +------+
| person |-|----0<| aliasList |>0----|-| name |
+--------+ +-----------+ +------+

This would minimize the replication of common names such as "Edward" or "Jennifer", at the cost of more complex queries and joins. The AliasList table would contain the two primary keys (both artifical keys) to link the Person to their Names. For the name table, I am wondering if it would have:
(1) LastName, FirstName, MiddleName, Suffix
or should it have:
(2) Name, NameType [Last|First|Middle|Nickname]

The First solution was the first design, and one which a Sr. DBA did. The second design could handle "George 'Machine Gun' Kelly" better and result in much smaller list of names.

I ask you, forum members, if you have any better suggestions on how to implement this. I am wary of complicated joins, having been burnt in the past by database vendor performance verses database design theory.

Oh, and of course the project has no specs, no explicit performance requirements, and no database vendor decided on. The usual development process. :-)

--Matthew

Jeff Mason · July 3rd, 2003, 07:11 PM

If I were doing it, I think I would opt for solution number 1. It's simpler, and results in far simpler queries. Furthermore, if you think about the code you'll have to write to maintain the many-to-many join your second solution requires, I think you will agree that it is far more complex. The user interface for such things can get convoluted, too.

I base my opinion mostly on the following: Disk space is cheap, user time waiting for joins to finish isn't.

Joins are a necessary part of life, but let's not go overboard.

How many times, really, is "Edward" likely to be repeated in your database? And even if it is repeated 100,000 times, you've used up a bit over a half a meg of space to store all of them. Big deal. If you consider the entire database, how many names are likely to repeated overall? What is the actual probability that any given name is likely to be repeated?

Let's do a bit of math. Suppose you have 100 million names in your database. Let's be generous and say each name on average requires 30 characters. You need only 3 Gigs to store those names. How many duplicates are in those 100 million? 30%? I doubt its anywhere near that much, but let's say 1/2 the names are duplicated. So, the space required to store the artificial keys in the many-to-many table would be 4 bytes times 100 million on one "side" of the table and 4 bytes times 50 million on the other, and the space reduction by removing the duplicates would be 1.5 gigs. So we have 1.5 Gigs to store the unique names plus 600MB for the many-to-many table plus some overhead in the table itself which I'll ignore. So fully representing the names takes 3 Gigs, and the many-to-many approach avoiding duplicates takes a bit over 2 Gigs. Now, things are a good deal more complex than I've represented, as presumably some names would result in a higher savings. Thus, if there a lot of criminals using the alias "John Smith", then the savings would be higher. But it would only be higher on the 'alias' side of the many-to-many table. You still have the fixed "overhead" of 100 million people on the 'people' side.

But all in all we're still talking a few gigabytes one way or the other. I think that's a worthwhile tradeoff for the complexity of the queries and the code to maintain the data in the structure.

I'm not sure how the second solution handles Mr. Kelley "better" than the first.

If you insist on using the many-to-many table, I would recommend staying away from your alternative 2. Constructing somebody's fullname would require four self-joins, possibly 3 of them outer joins, for each name you want to present. Yuck.

Listen to your Senior DBA. He knows what he's talking about.

Jeff Mason
Custom Apps, Inc.
www.custom-apps.com

Kon Tiki on a Blue Sea · July 3rd, 2003, 07:25 PM

Well, if I have what you are asking, my first say would be that it really comes down to how often you are likely to search for bits of names. Is there statistical significance in searching for all the criminals named George, for instance? Probably not. There is only merit in splitting fields up if there is merit in searching the individual subsections separately.

However, since criminals have quite limited imaginations, and tend to name one another after their modus operandi ,is there perhaps not merit in being able to pull out all the "Machineguns" in the list together: so that one might view George "Machinegun" Kelly alongside his more deadly and infamous compatriot Oonagh "Machinegun" Molloy - in such a way that we are not hampered with results concerning the far less threatening Oonagh "Inflatableclawhammer" Molloy? That way, we could distinguish whether the recently rumored Maxwell "Machinegun" Hosenbaum, said to be touring eastside making holes in people, is not in fact, more likely to be an alias for the recently-escaped George "Machinegun" Kelly, than the newly incarcerated Oonagh "Machinegun" Molloy (she being a woman, and all - and behind bars, too boot).

Oomoo

--------------
Feline Tiik Design
To err is human... Only, some of us are more human than others.

David Cameron · July 3rd, 2003, 07:28 PM

I concur with Jeff. This would unnecessarily complicate things.

regards
David Cameron