In my work over the years that I have continually strived for the ideal structure of a lookup table. I must have designed hundreds of them by now in at least four different RDBMS engines, sometimes using a modeling tool, sometimes using a graphical table designer, sometimes writing the DDL by hand. I may have finally arrived at a structure I really like, so I thought I would share it, with the hope that other people may benefit from adopting it, or even better, suggest improvements to the design.
At first I thought that the meaning of "lookup table" might be obvious enough not to require much explanation, but later I changed my mind. I started where lots of people start these days and looked up lookup table in Wikipedia [1]; I realized that the term does not necessarily have a universal meaning. The Wikipedia entry (at the time I am writing this) defines the term in more general terms:
In computer science, a lookup table is a data structure, usually an array or associative array, used to replace a runtime computation with a simpler lookup operation. The speed gain can be significant, since retrieving a value from memory is often faster than undergoing an expensive computation.
This definition hints at my usage of the term, but Wikipedia's context assumption is different, whereas every time I've used this term it has been to describe a table in a relational database that is designed to provide a set of possible values for a column, which serves to limit the possible values for a column, standardize input, and reduce the amount of data stored in a table.
Another term I have encountered for this same concept is reference table (not to be confused with cross-reference table, which is used to define a many-to-many relationship between two other tables; lookup/reference tables are typically used to define one-to-many relationships).
Lookup tables are often used to populate dropdown lists and are also often used as a sorting/grouping mechanism for a query or report.
Just be clear about what I mean, this table structure...
| Customer Name | Customer Type Name |
| ACME Products | Retail |
| Some Store, Inc. | Retail |
| Pity the Fool Enterprises | Wholesale |
| Crazy Time Industries | Wholesale |
...becomes this when we add a lookup table to the design:
| Customer Name | Customer Type ID |
| ACME Products | 1001 |
| Some Store, Inc. | 1001 |
| Pity the Fool Enterprises | 1002 |
| Crazy Time Industries | 1002 |
| Customer Type ID | Customer Type Name |
| 1001 | Retail |
| 1002 | Wholesale |
With the above definition in mind, I have described below my (current) preferred design for a lookup table.
| Column |
| Surrogate Primary Key |
| Short Name |
| Description |
| Active Flag |
| Display Order |
| Created Stamp |
| Last Modified Stamp |
A description of each column follows:
Surrogate Primary Key
Depending on your stance on the endless debate over surrogate vs. natural keys [2], you may prefer a natural key, but I'm a surrogate key person myself. Naturally, I'm sure you'll agree with me and design your database using surrogate keys. :-) Regardless of what kind of key you choose, though, if you're using anything like the lookup table pattern I'm describing here, your lookup table will need a primary key, and that primary key will be used as a foreign key in other tables, along the lines of the "Customer Type ID" example above.
Surrogate primary keys have different implementation mechanisms in different database engines. Oracle has its SEQUENCE object, SQL Server has IDENTITY columns, and Access has the AutoNumber. Usually these are numeric keys, but I have started to notice a trend in the SQL Server world towards using GUIDs for all keys; often the explanation for this is that database replication in SQL Server requires GUID-based keys, though I'm not so sure that's as much of a concern for lookup tables.
Short Name
The "Short Name" column might also be labeled just "Name." If you imagine your lookup table showing up in a dropdown list or in a column on a report or query output, you can probably imagine as well that a longer name could make things awkward; you dropdown list or report column might need to be exceedingly wide in order to accommodate longer names. My preferred method is to use a "Name" column in conjunction with a "Description" column, which can be used to help users and administrators understand the meaning conveyed by the Name. I can imagine in some circumstances where instead of, or in addition to, a Description column a "Long Name" column might make sense. Along the same lines, it is sometimes also appropriate to have more than one "Name" column for different purposes--perhaps one for display, and one for sorting (see also "Display Order," below).
Description
As described above under "Short Name," the "Description" column is intended to convey additional meaning. For a simple lookup table, a "Description" column may be overkill. However, I have found this column very handy in the presentation tier; most commonly, this has come up when there is doubt about whether the user will be able to figure out the meaning of a lookup table entry based on just the "Name." If there is a possibility that the Name could be confusing, or that entries might have similar names that are difficult to distinguish, then you can display the Description in the user interface to help the user out, perhaps in a "tooltip" that floats over the field in the UI or some similar mechanism.
Active Flag
This is one of those I had to learn about the hard way. Here's what happens: when you design and build your application for your client, they have exactly six Customer Types. So you design a lookup table, put those six entries in it, and proceed to design the rest of the database, using the CustomerType lookup table key as a foreign key in various tables. Most likely, you are using referential integrity (RI), so once you start adding "parent" rows that utilize your "child" lookup table rows, the database engine will prevent you from removing any of your lookup table rows because they are locked in by the RI. You also write code that retrieves the rows for the lookup table for dropdown lists, reports, etc.
Six months later, your client wants to drop one of those Customer Types because it's not relevant anymore. You can't just remove it from the table because of the RI, and you wouldn't want to do that anyway because all that old data is still valid. But you have a Catch-22 because the now defunct Customer Type is still showing up in various dropdown lists, report parameter prompts, etc.
If you had designed your lookup table with an "Active" flag column *and* written all of your queries to filter on the Active flag, then "deactivating" the defunct Customer Type would be as simple as flipping the Active flag. You get the best of both worlds: the old Customer Type will still show up as a label when querying old data, but it won't be available as an option any more for the creation of new data.
Display Order
You may have noticed that people (especially your client) do not think about the order in which they want lookup table values to be sorted. It's almost always after the fact that this comes up (unless you're savvy enough to ask the question in advance and your client client is forward thinking enough to have the right answer). Regardless, it sure is nice to be able to manipulate the sort order of a dropdown list or other query/report without having to write new code or (as is often the case) implement some kind of hack in the presentation tier. Save yourself the trouble and build a Display Order column into all of your lookup tables. I generally use a numeric type for this column that is the same width as the numeric surrogate primary key. As you write any queries that select from this table, include an ORDER BY clause that uses the Display Order column. When requirements change relative to the sorting of your lookup table, all you have to do is manipulate this value. (In a SQL Server environment, you might also consider using the DisplayOrder column for a, clustered index [3].)
There is one exception to this rule, I think: some lookup tables might so obviously lend themselves to a pure alphanumeric sort on the "Name" column that a Display Order column seems silly.
Created Stamp
This column is a simple date/time column that reflects when the lookup table row was first created. Some might argue that Created and Last Modified audit columns are overkill for lookup tables, but I have found them useful and almost always use them. They are especially handy in an application that includes an administrative interface that enables users (not just database administrators and developers) to manage the data in lookup tables.
Last Modified Stamp
This column is a simple date/time column that reflects when the lookup table was last modified. In some applications it might also make sense to have "Created By" and "Last Modified By" columns to reflect which user made a change.
A lookup table is also much like a dimension table [4] in a dimensional data warehouse [5] design, a context in which the fascinating topic of the slowly changing dimension [6] also comes into play. The connections between lookup tables in a transactional database design and corresponding slowly changing dimensions in a downstream data warehouse is an area for additional interesting discussion, I think. For example, typically a lookup table in an OLTP system does not support changes over time (though with the suggested Active Flag, I believe this could be done), but Kimball Group consultant Joy Mundy advocates tracking "attribute changes" over time [7]:
The key flaw Kimball Group typically sees in existing enterprise data warehouses is a lack of historically accurate attributes. Almost all data warehouses track transaction history--what was sold, to whom and for how much--but many do a poor job of associating the transaction with the attributes as they were at the time of the transaction. This attribute history is very important for data mining and other smart enterprise applications. Suppose, for example, that you want to build an application that will present product choices to a potential customer based on where he lives. If the data warehouse only keeps the customer's current address, then for the purposes of queries, analysis and prediction, it looks as if the customer has always lived where he does now. We lose the vitally important information about how the customer's behavior changed when he moved from a cold climate to a warm climate. And once that history is gone from the data warehouse, it's difficult or impossible to reconstruct it.
Another thing I've wondered is whether OLTP database designers could do certain things with their table designs to make it easier for OLAP database designers and ETL coders to come along later and do their thing.
I'm sure my quest for the "ideal" lookup table design will continue. Your thoughts are most welcome.
Thanks for reading,
Dan
P.S.
A quick Google search turned up an opinion piece by Don Peterson called Lookup Table Madness [8]. It appears that the author makes a case against the whole practice of using lookup tables, which sounds very interesting to me--but unfortunately that particular site locks their content behind passwords, so I have not read it.
(Irrelevant rant: Granted, the account I would need to read that article is free, but I always find myself annoyed when a page comes up at the top of the search results and when I click through I find that the real content is hidden. When that happens, I make a mental note to avoid that site in the future. If people want to hide their content, that's fine by me, but I think search engines should reciprocate by penalizing these pages rather than keeping them at the top of the results).