Monday, 19 November 2007
If your data has a clear "master" field or combination of fields, which can uniquely identify each row (such as customer name in
a table of customers or ISBN for a table of books), it is tempting to use that as the primary key of the table. However, my advice
is: don't do that, use a separate, numeric, artificial primary key instead. Yes, it's an extra column on the table,
requiring extra space, and you will have to generate it somehow, but that's not a big deal. Every database vendor provides some way
of auto-generating unique key values (e.g.
SEQUENCEs in Oracle, and
AUTOINCREMENT fields in SQLite), so
populating it is easy, and the complications it saves are more than worth the trade-off. You can still maintain the uniqueness of
the master columns by applying a unique index to those columns.
Save Space on Foreign Keys
Firstly, if you have any tables that are associated with the master table, and therefore have foreign key columns that refer to
the primary key of your master, then having a separate primary key can actually save space overall, as the data for the
master columns doesn't have to be duplicated across all the linked tables. This is especially important if there is more than one
"master column", such as
customer_last_name, or if the data for these columns is
Changing the master data
Secondly, if the "master columns" are actually the primary key of your table, changing the data in them is potentially
problematic, especially if they are used as a foreign key in other tables. Many online services use a customer's email address as
their "master column": each customer has one email address, and one email address refers to one customer. That's fine until a
customer changes their email address. Obviously, you don't want to lose all data associated with a given customer just because they
changed their email address, so you need to update the row rather than delete the old one and insert a new one. If the
email_address column is the primary key of the table, and therefore used as the foreign key in other tables, then
you've got to update the data not just in the master table, but in each dependent table too.
This is not impossible, but it's certainly more complex and time consuming. If you miss a table, the transaction may not complete
due to foreign key constraint violations, or (worse) the transaction may complete, but some of the data may be orphaned. Also, in
some database engines, the constraint violation will fire when you change either the master table or the dependent table, so you
need to execute a special SQL statement to defer the constraint checking until
COMMIT time. If you use an auto-generated
primary key, then only the data in the master table needs changing.
Changing the master columns
Finally, if the primary key is auto-generated, then not only is it easy to change the data in the master columns, but you
can actually change what the master columns are. For example, if you initially decide that
customer_last_name make an ideal primary key, then you're stuck if you then get another customer with the same
name. OK, so you make
primary key. Oops — now you've got to duplicate the address information across all the dependent tables. Now you encounter two
people with the same name at the same address (e.g. father and son), so you need to add a new designator to the key (e.g. Henry
Jones Junior, Richard Wilkins III). Again, you need to update all the dependent tables. If the primary key is
auto-generated, there's no problem — just update the unique constraint on the master table to include the appropriate columns,
and all is well, with the minimum of fuss.
Simplify your code
It's not going to simplify your code much, but using an auto-generated numeric key means that this is all you need to store as an identifier inside your program to refer to a particular row: much easier than storing the data from a combination of columns. Also, it's much easier to write code to update the data on one table than on multiple tables.
Don't use real table data as the primary key for a table: instead, use a separate, numeric, auto-generated column as the primary key. This will simplify the connections between tables, and make your life easier if the structure of the database or the data in the key columns changes.
In previous posts on Database Design, I've talked about:
- Using parameterized queries,
- Creating appropriate indexes,
- Using transactions, and
- Delaying the use of a database