Identity attribute

Jeff Mason · June 6th, 2003, 09:10 AM

This was pretty much my favorite forum in the old P2P world. It saddens me to see this new one empty and devoid of any content, so I thought I'd post something just so the forum doesn't look so forlorn.

Just so the post is on topic, how about I try to start trouble and stir up a discussion about IDENTITY columns? Why would anybody use these as primary keys when tables should have a perfectly good 'natural' key?

Jeff Mason
Custom Apps, Inc.
www.custom-apps.com

David Cameron · June 9th, 2003, 06:37 PM

Well just to play devil's advocate, I can give a couple of reasons.

The first is that you must be certain that the natural key is unique. This is harder than it sounds. I remember in a earlier discussion that you suggested postcodes as a good natural key. The problem is that in Sydney, Australia at times postcodes are shared by more than one suburb and suburbs have more than one postcode. This is acceptable to the post office and to the people living there. Certainly you could create a composite key, however there has to be a limit to the number of columns you want to use as the key. You could end up in the situation that the entire table is a key. Which would be the case in the example above as the above would have columns postcode & suburb. Composite keys also make JOINS more complicated. In this case you would need to JOIN on a integer column and a VarChar column. Performance may suffer and more space would be needed.

The second problem is one of understanding. I think people may find it easier to understand using an identity to always identify a row better than they understand using a natural key, particularly as I suspect that a natural key will often result in a composite key. This avoids the issue of trying to work out whether there is an good natural key.

regards
David Cameron

Jeff Mason · June 11th, 2003, 07:06 AM

Well, we can't get anybody else to participate - I'll take a stab at debating your points.

Of course the natural key must be unique. It is a prime tenet of relational theory that rows in a table must be distingushed from one another - if you had two rows with identical data, how would you know which row you were refering to in a given situation?

I define a 'natural key' as that column, or set of columns, which allow a human being to distingush one row from another.

You state that finding the natural key can be "...harder than it sounds". I submit that if you cannot find a natural key, then you are (1) not looking hard enough, or (2) your database model is incorrect.

As I recall, I never suggested that postcodes would be a good natural key (they might be, though - depending upon what they would be a key of - and that is, er, a key point :)). I was referring to a design I saw of a States code table. This table contained US state codes. The table had three columns: an integer identity column as the primary key, the 2 character state code abbreviation, and a varchar state long name.

My issue with this design is the presence of the identity column as the primary key. What purpose does it serve? It actually makes things less efficient, as an extra JOIN would be required to pick up the state code when displaying or printing addresses, for example. Thus, some address row in another table contains, among other columns, the 'StateID'. You have to JOIN to the States table to include the state abbreviation in the address and this, in my opinion, is wasteful and silly. Why not simply include the 2 char state code in the address, along with, of course, a foreign key constraint to the States table on this value.

What makes this worse is that the design of the States table has to include a unique constraint on the 2 character state code. By definition every state code is unique. So, we have a table with 2 unique columns and thus either of them is a candidate for the primary key. Why use an artificial one that would always (I hope) be hidden from any user? I'm guessing that the original designer of this table added the identity ID column as a reflex action - the inclusion of an identity column as the primary key of any table became dogma and it was added in this case without any thought.

A row in a table models a business entity or process; as such there is always a natural key and I submit that people find it far easier to interact with that model using what they know and understand as the attributes of the model that distingush one entity from another rather than some artificial key that has no meaning in the real world. Any sane application must insure that the natural key is unique anyway, since that is the value that people use to distingush rows. So, since you have to go through that effort anyway, why not make use of it and simply use the unique natural key as the primary key?

The fact that many applications don't bother to insure uniqueness of the natural key is evidenced by the many posts on this and other boards asking for help writing a query to find all the duplicate "records" that somehow snuck into somebody's database :D

Jeff Mason
Custom Apps, Inc.
www.custom-apps.com

owain · June 11th, 2003, 07:44 AM

I am relatively new to the world of databases (specifically from a design point of view) and I used to say what Jeff is saying, that if there is a 'natural' key then why use an identity key as the primary key. Then someone told me that it was a good idea to always have a separate identity key because this way you always know what the primary key is in any table, and what data type it is.

Also I would have thought that joining tables together based on a numeric key is far faster than using a string key (of course there are exceptions to this). In the example Jeff made above regarding the state abbreviation I agree with him, however it is not always the case.

A database that I work with contains a list of patients with their details. There is a PatientID field which is varchar and can be entered by the client, depending on the type of ID's there particular site uses. There is also a RecNo column which is an identity column and acts as the primary key.

Now you may say that it would be just as easy to use the PatientID as the primary key as this also has to be unique, and you would be right. However a while ago a nasty little bug burrowed its way into the software and altered some of the links to the child tables. The child rows would have been permanently orphaned if it had not been for this extra check, the identity column.

I admit that this is not the best database design in the world (far from it in fact), however in this particular case it saved my bacon. Of course you could say that this was a weakness in the software, not the database. Jeff does have a point about natural keys, but there is always an exception to any rule as David has pointed out.

Regards
Owain Williams

Jeff Mason · June 11th, 2003, 08:41 AM

This is, after all, an academic debate. ;)

What I object to is dogma, and I frequently see rampant cases of it in the use of identity columns as primary keys.

That having been said, I'll be the first one to admit that I use identity columns as the primary key in many of my tables. But when I do so, it's almost always because of one of two reasons:

1. The natural key is composite, or otherwise lengthy, and the table participates in foreign key relationships or is otherwise used extensively in JOIN operations.

2. The natural key has some level of volatility. Changing the primary key can be annoying, so if there is a chance this may happen I'll opt for a stable artificial key.

Item 1 is a performance issue and a programmer laziness issue - having to specify a composite key in a JOIN condition over and over again gets boring. I'm not convinced that there is all that much of a performance issue these days in using, say, a 50 byte varchar key versus a 4 byte integer key, but I'll accept there there is some difference on faith, though I can't resist noting that one misplaced cursor will consume any such performance gain tenfold.

Convincing yourself that Item 2 is always a valid reason can be dangerous. On more than one occasion I have seen a claimed need to change an account number primary key, say, only to discover that the real reason the account number needs to change is because there is some information encoded in the account number, and what really needed to change was some attribute of the account itself, and the system design has actually violated 2nd normal form. Of course, by then it was too late to properly define the account number, so it as a key indeed had to change - but we grumbled a lot, and reveled in righteous indignation that the design was poor. We also implemented an identity key. ;)

No disrespect intended, but I think the reason you gave for always using an identity column because you know what the key is and it's datatype, is a bit silly. You can only know what the primary key of a table is by looking to see if the column participates in a primary key constraint. I think what you may be describing here is instead some sort of naming convention, and as such it may have some merit. People tend to use something like <tablename>ID as the name of an identity column and by convention recognize that as the primary key, but there are no guarantees and no way to enforce that.

You need to know what the datatype is of any column in the table to do any meaningful manipulation, so why pick on the primary key for distingushed treatment here? In fact, since the identity column is artificial, it is rarely exposed to any user, so in fact it needs less manipulation than other columns. I've seen tables where the column names use a form of Hungarian notation to explicitly describe the datatype and IMO, all it really does is make queries harder to read.

As to your patient ID example, it sounds like your database design was flawed, in that you failed to set up foreign key constraints on this very important column. It is indeed the natural key and must be treated with the 'respect' that such an attribute deserves. Had you done so, you would have started receiving integrity errors when your bug creeped in. You might have found the bug sooner, and in any case you would have prevented any data damage, the same way your artificial key did.

This is one of my main complaints with identity keys. People define the artificial key as the primary key and think they are done. The natural key is what really distingushes one entity from another, at least to the people that must use the data, so when attention to the natural key is given short shrift, data integrity issues start to accumulate. Frequently they lie in wait until some poor slob has to create a data warehouse from this and all the crap raises it's ugly head.

I'm amused by the use of RecNo as a primary key - it sounds like it is a record number. If you keep thinking in terms of record numbers you'll keep thinking in terms of records in a file, and that is not what the rows in a relational database are. You'll keep thinking of such things as sequential files and you'll never grok the concept of sets...

Jeff Mason
Custom Apps, Inc.
www.custom-apps.com

David Cameron · June 11th, 2003, 07:23 PM

In the time that had passed I had forgotten about the fact that you were talking about state codes rather than postcodes, so sorry about that.

I'm not utterly convinced that every row necessarily has a natural key unless a primary key is added. I can't think of an example that is not at least partially artificial, but consider a situation where variations are being added to a cost. Table (Variations) has columns Amount, DateVaried, CostID (FK). Imagine then that two people are updating a cost at the same time and enter the same variation amount (eg 1000), the natural key is lost. This is a contrived example and is unlikely to happen, but is possible. If not possible right now with SQL Server, then in the future.

Quote:

quote:
1. The natural key is composite, or otherwise lengthy, and the table participates in foreign key relationships or is otherwise used extensively in JOIN operations.

...

Item 1 is a performance issue and a programmer laziness issue - having to specify a composite key in a JOIN condition over and over again gets boring.

This shouldn't be underestimated as an issue. Code grows larger -> longer to type -> more typing errors -> harder to manage -> harder to maintain. This is almost an issue of modularity. If you use an identity, the "join" from the composite key to the artificial key occurs once (in the table), ie the code needs to be "written" once and only needs to be maintained in one place. On the other hand a composite key that is used for joins means that the code is duplicated for each join to the PK.

Like all such issues it is a good question as to the correct time to convert the code to a "module"

regards
David Cameron

owain · June 13th, 2003, 06:42 AM

Jeff (and David), you clearly know what you are talking about, and I can only offer my suggestions as a humble SQL novice. In Jeff's first post he said:

Quote:

quote:Why would anybody use these [IDENTITY columns] as primary keys when tables should have a perfectly good 'natural' key?

Isn't this incredibly sweeping? I am sure there are plenty of times when IDENTITY columns can serve as (sometimes the only) primary key. For example an Invoice number or Order number. These are ideal candidates for IDENTITY columns.

Although I agree that most tables should have a perfectly good natural key, there are times when an IDENTITY column is the most appropriate.

Regards
Owain Williams

Jeff Mason · June 13th, 2003, 07:56 AM

Quote:

quote:Originally posted by owain
Isn't this incredibly sweeping?

Of course it is. :D

This is, after all, a debate about the dogmatic use of IDENTITY columns as primary keys. In order to stimulate such a debate, the use of incredibly sweeping assertions is encouraged.;)

I maintain that a properly constructed table, which appropriately models a real-world entity or process always has a natural key. It is this natural key which distingushes one entity from another. If you can't find such a key, you are not looking hard enough. Now, it may very well be that this key is not easily usable. Consider a table representing employees, for example. As we all know, uniquely distingushing one human from another can be difficult. Names don't work real well; they are not unique enough. Names combined with date of birth are better but still not guaranteed unique. I've heard proposals that exact date and time of birth combined with the exact latitude, longitude, and altitude of the place of birth would make a good key for a person. That would probably work, given sufficient precision of the altitude so that the case of two maternity wards on consecutive floors is distingushed. ;)

It's also probably not practical for something as simple as an Employees table.

Thus, in some cases the 'natural' key, while it surely exists, is not convenient, so we construct an artificial key which in effect acts as a surrogate for the natural (unwieldy) 'real' key.

As you point out, an order or invoice number might be another such a case.

The question then is whether an IDENTITY column is suitable mechanism for generating these artificial keys.

One big problem is that IDENTITY columns can leave 'holes' - rows which are abandoned by the user before the entry operation is completed. Thus there exist gaps in the sequence numbers which for some uses can be disconcerting.

Order Numbers or Invoice numbers are frequently typed in by people. It might be appropriate to detect input errors by including a check digit in the order number to guard against transpositions or other typing errors. IDENTITY values can't do this.

IDENTITY values can make you think you are imposing some physical ordering of the data - for example people select the minimum identity value in order to find the "first record" when neither the concept of "first" nor "record" really apply.

If you read my posts above, you'll see that I do not object to IDENTITY columns per se. What I do object to is the all too often scenario where a programmer sits down to design a table and starts by entering the first column as blah_ID integer Identity primary key, then stops, and says to himself, ok - now what is the rest of the data that belongs in here?

Jeff Mason
Custom Apps, Inc.
www.custom-apps.com

David Cameron · June 19th, 2003, 12:32 AM

Jeff,

I'd be interested to hear what you think about the idea of modularity as I applied it to identities.

Also I recently found some data that doesn't have a nice natural key. IIS Logs. Even a combination of date/time, page requested and client IP doesn't guarentee uniqueness. With the use of NAT is certainly possible that a two people in the same network could request the same page in a time close to each other. It may be possible that IIS may not be able to return two pages at identical times, however the possibility exists. None of the other logged values are good candidates for part of a natural key.

regards
David Cameron

Jeff Mason · June 19th, 2003, 06:48 AM

Quote:

quote:This is almost an issue of modularity. If you use an identity, the "join" from the composite key to the artificial key occurs once (in the table), ie the code needs to be "written" once and only needs to be maintained in one place. On the other hand a composite key that is used for joins means that the code is duplicated for each join to the PK.

I prefer the use of the term "surrogate" in this instance. Here, the IDENTITY key 'stands in' for the unwieldy composite, and acts as a synonym for it. This is, I think, a good use of the IDENTITY attribute. From that perspective, I suppose this is a form of modularity, although I think applying that term in this case is a bit of a stretch. To me, modularity is a form of encapsulation, and I'm not sure that introducing an extra level of indirection is sufficient justification on its own merits for the substitution. It probably is worth it for a composite key consisting of many columns. As the composite key becomes less complex, though, the tradeoff becomes harder to justify.

Consider my favorite example of the the state code table and its use as a foreign key in some addresses table. The addresses table would have a 'StateID' column as opposed to a 'State' column. I submit that the presence of the surrogate ID column makes the relationship between the two tables harder to understand. I cannot view an addresses row and have any idea what state it is in without also having the States table to refer to. This is not clearer nor easier to maintain in my view.

Modularity also implies some sort of isolation from change. Thus, if a primary key is volitile, that might be a good use of a artificial key - to protect the relationship from changes in the underlying data which describes the entity being referred to.

As to your discovery that the IIS logs do not appear to have a good natural key, I say consider the source. :D

You are correct, though - such examples abound, and we have to make do. The fact that they exist, though, is no excuse for you to design a table that way.

Jeff Mason
Custom Apps, Inc.
www.custom-apps.com