Ideal Design/Structure for Lookup Tables
In my work over the years that I have continually strived for the ideal structure of a lookup table. I must have designed hundreds of them by now in at least four different RDBMS engines, sometimes using a modeling tool, sometimes using a graphical table designer, sometimes writing the DDL by hand. I may have finally arrived at a structure I really like, so I thought I would share it, with the hope that other people may benefit from adopting it, or even better, suggest improvements to the design.
At first I thought that the meaning of "lookup table" might be obvious enough not to require much explanation, but later I changed my mind. I started where lots of people start these days and looked up lookup table in Wikipedia; I realized that the term does not necessarily have a universal meaning. The Wikipedia entry (at the time I am writing this) defines the term in more general terms:
In computer science, a lookup table is a data structure, usually an array or associative array, used to replace a runtime computation with a simpler lookup operation. The speed gain can be significant, since retrieving a value from memory is often faster than undergoing an expensive computation.
This definition hints at my usage of the term, but Wikipedia's context assumption is different, whereas every time I've used this term it has been to describe a table in a relational database that is designed to provide a set of possible values for a column, which serves to limit the possible values for a column, standardize input, and reduce the amount of data stored in a table.
Another term I have encountered for this same concept is reference table (not to be confused with cross-reference table, which is used to define a many-to-many relationship between two other tables; lookup/reference tables are typically used to define one-to-many relationships).
Lookup tables are often used to populate dropdown lists and are also often used as a sorting/grouping mechanism for a query or report.
Just be clear about what I mean, this table structure...
| Customer Name | Customer Type Name |
| ACME Products | Retail |
| Some Store, Inc. | Retail |
| Pity the Fool Enterprises | Wholesale |
| Crazy Time Industries | Wholesale |
...becomes this when we add a lookup table to the design:
| Customer Name | Customer Type ID |
| ACME Products | 1001 |
| Some Store, Inc. | 1001 |
| Pity the Fool Enterprises | 1002 |
| Crazy Time Industries | 1002 |
| Customer Type ID | Customer Type Name |
| 1001 | Retail |
| 1002 | Wholesale |
With the above definition in mind, I have described below my (current) preferred design for a lookup table.
| Column |
| Surrogate Primary Key |
| Short Name |
| Description |
| Active Flag |
| Display Order |
| Created Stamp |
| Last Modified Stamp |
A description of each column follows:
Surrogate Primary Key
Depending on your stance on the endless debate over surrogate vs. natural keys, you may prefer a natural key, but I'm a surrogate key person myself. Naturally, I'm sure you'll agree with me and design your database using surrogate keys. :-) Regardless of what kind of key you choose, though, if you're using anything like the lookup table pattern I'm describing here, your lookup table will need a primary key, and that primary key will be used as a foreign key in other tables, along the lines of the "Customer Type ID" example above.
Surrogate primary keys have different implementation mechanisms in different database engines. Oracle has its SEQUENCE object, SQL Server has IDENTITY columns, and Access has the AutoNumber. Usually these are numeric keys, but I have started to notice a trend in the SQL Server world towards using GUIDs for all keys; often the explanation for this is that database replication in SQL Server requires GUID-based keys, though I'm not so sure that's as much of a concern for lookup tables.
Short Name
The "Short Name" column might also be labeled just "Name." If you imagine your lookup table showing up in a dropdown list or in a column on a report or query output, you can probably imagine as well that a longer name could make things awkward; you dropdown list or report column might need to be exceedingly wide in order to accommodate longer names. My preferred method is to use a "Name" column in conjunction with a "Description" column, which can be used to help users and administrators understand the meaning conveyed by the Name. I can imagine in some circumstances where instead of, or in addition to, a Description column a "Long Name" column might make sense. Along the same lines, it is sometimes also appropriate to have more than one "Name" column for different purposes--perhaps one for display, and one for sorting (see also "Display Order," below).
Description
As described above under "Short Name," the "Description" column is intended to convey additional meaning. For a simple lookup table, a "Description" column may be overkill. However, I have found this column very handy in the presentation tier; most commonly, this has come up when there is doubt about whether the user will be able to figure out the meaning of a lookup table entry based on just the "Name." If there is a possibility that the Name could be confusing, or that entries might have similar names that are difficult to distinguish, then you can display the Description in the user interface to help the user out, perhaps in a "tooltip" that floats over the field in the UI or some similar mechanism.
Active Flag
This is one of those I had to learn about the hard way. Here's what happens: when you design and build your application for your client, they have exactly six Customer Types. So you design a lookup table, put those six entries in it, and proceed to design the rest of the database, using the CustomerType lookup table key as a foreign key in various tables. Most likely, you are using referential integrity (RI), so once you start adding "parent" rows that utilize your "child" lookup table rows, the database engine will prevent you from removing any of your lookup table rows because they are locked in by the RI. You also write code that retrieves the rows for the lookup table for dropdown lists, reports, etc.
Six months later, your client wants to drop one of those Customer Types because it's not relevant anymore. You can't just remove it from the table because of the RI, and you wouldn't want to do that anyway because all that old data is still valid. But you have a Catch-22 because the now defunct Customer Type is still showing up in various dropdown lists, report parameter prompts, etc.
If you had designed your lookup table with an "Active" flag column *and* written all of your queries to filter on the Active flag, then "deactivating" the defunct Customer Type would be as simple as flipping the Active flag. You get the best of both worlds: the old Customer Type will still show up as a label when querying old data, but it won't be available as an option any more for the creation of new data.
Display Order
You may have noticed that people (especially your client) do not think about the order in which they want lookup table values to be sorted. It's almost always after the fact that this comes up (unless you're savvy enough to ask the question in advance and your client client is forward thinking enough to have the right answer). Regardless, it sure is nice to be able to manipulate the sort order of a dropdown list or other query/report without having to write new code or (as is often the case) implement some kind of hack in the presentation tier. Save yourself the trouble and build a Display Order column into all of your lookup tables. I generally use a numeric type for this column that is the same width as the numeric surrogate primary key. As you write any queries that select from this table, include an ORDER BY clause that uses the Display Order column. When requirements change relative to the sorting of your lookup table, all you have to do is manipulate this value. (In a SQL Server environment, you might also consider using the DisplayOrder column for a, clustered index.)
There is one exception to this rule, I think: some lookup tables might so obviously lend themselves to a pure alphanumeric sort on the "Name" column that a Display Order column seems silly.
Created Stamp
This column is a simple date/time column that reflects when the lookup table row was first created. Some might argue that Created and Last Modified audit columns are overkill for lookup tables, but I have found them useful and almost always use them. They are especially handy in an application that includes an administrative interface that enables users (not just database administrators and developers) to manage the data in lookup tables.
Last Modified Stamp
This column is a simple date/time column that reflects when the lookup table was last modified. In some applications it might also make sense to have "Created By" and "Last Modified By" columns to reflect which user made a change.
A lookup table is also much like a dimension table in a dimensional data warehouse design, a context in which the fascinating topic of the slowly changing dimension also comes into play. The connections between lookup tables in a transactional database design and corresponding slowly changing dimensions in a downstream data warehouse is an area for additional interesting discussion, I think. For example, typically a lookup table in an OLTP system does not support changes over time (though with the suggested Active Flag, I believe this could be done), but Kimball Group consultant Joy Mundy advocates tracking "attribute changes" over time:
The key flaw Kimball Group typically sees in existing enterprise data warehouses is a lack of historically accurate attributes. Almost all data warehouses track transaction history--what was sold, to whom and for how much--but many do a poor job of associating the transaction with the attributes as they were at the time of the transaction. This attribute history is very important for data mining and other smart enterprise applications. Suppose, for example, that you want to build an application that will present product choices to a potential customer based on where he lives. If the data warehouse only keeps the customer's current address, then for the purposes of queries, analysis and prediction, it looks as if the customer has always lived where he does now. We lose the vitally important information about how the customer's behavior changed when he moved from a cold climate to a warm climate. And once that history is gone from the data warehouse, it's difficult or impossible to reconstruct it.
Another thing I've wondered is whether OLTP database designers could do certain things with their table designs to make it easier for OLAP database designers and ETL coders to come along later and do their thing.
I'm sure my quest for the "ideal" lookup table design will continue. Your thoughts are most welcome.
Thanks for reading,
Dan
P.S.
A quick Google search turned up an opinion piece by Don Peterson called Lookup Table Madness. It appears that the author makes a case against the whole practice of using lookup tables, which sounds very interesting to me--but unfortunately that particular site locks their content behind passwords, so I have not read it.
(Irrelevant rant: Granted, the account I would need to read that article is free, but I always find myself annoyed when a page comes up at the top of the search results and when I click through I find that the real content is hidden. When that happens, I make a mental note to avoid that site in the future. If people want to hide their content, that's fine by me, but I think search engines should reciprocate by penalizing these pages rather than keeping them at the top of the results).
MUCK Tables
Thank you, Ron, for that summary! I have encountered MUCK tables before in applications I have been tasked to maintain or rewrite (I've also heard them called "code tables"). In fact, I've seen some *huge* MUCK tables. The worst part of the evil is that (in the cases I've seen anyway) you have no referential integrity at all.
I'm glad to know that I would probably agree with Don Peterson's article, though the term "lookup table" has always seemed a fine one to me.
Dan
Table of Tables
I also found the term Table of Tables commonly used for what you describe as a MUCK table.
-Rich
GUID PKs
Coincidentally, the very next day after I wrote about using GUIDs as primary keys, Jeff Atwood added a (per usual) good article on the topic to the Coding Horror blog.
Dan
Display Logic
What happens when your client wants to support multiple languages? I generally do not like having display logic fields (name, sort order) in the database. These are view attributes and should probably be evaluated in the view layer.
Multiple Language Requirements
Useful article. Any insight into best approach for handling multiple language requirements?
Localization Strings
Support for multiple languages is an excellent suggestion, and one I will have to work into the main post at some point.
My first reaction is that is depends on what kind of localization scheme you are using (assuming that is the context you have in mind). For example, I have seen many schemes where there is a localization token associated with a "resource," which might be something that needs to appear in a label, or a descriptive field, or a dropdown list or listbox (all which could apply to lookup tables). So instead of, or in addition to, the Name and Description columns in the lookup table, you could have NameToken and DescriptionToken, and the middle or presentation tier in your system would translate that token to the appropriate language at runtime. Something like this:
CustomerType
CustomerTypeID
CustomerTypeName
DefaultName
NameToken
DefaultDescription
DescriptionToken
Active
DisplayOrder
...etc...
If instead you need your language support to be at the database level, you might create a Language lookup table with a row for each language you need; then you would have child table to the lookup table that had the lookup table's primary key as a foreign key. Then you could have an entry in the child table for translated value corresponding to the parent lookup table row, with of course a foreign key from the Language lookup table. Something like this:
Language
LanguageID
LanguageName
CharacterSet
CustomerType
CustomerTypeID
CustomerTypeName
DefaultName
DefaultDescription
Active
DisplayOrder
...etc...
CustomerTypeLanguage
CustomerTypeLanguageID
CustomerTypeID
LanguageID
TranslatedName
TranslatedDescription
I introduced the "Default" idea because I encountered a high traffic web site once that served 98% of its pages in English, yet every single page view did a full localization lookup on every resource.
Anyone else have thoughts on this one?
Dan


Lookup Table Madness
Thanks for the good article, Dan. About the only difference between your structure and the one I've evolved is that you have a field called Active, whereas I have one called Obsolete. It feels good to have somebody agree with my decision!
As it happens, I already had an account at the site with article on Lookup Table Madness, so I've taken the liberty of summarizing it here. The only direct quote is indicated as such and the summary and examples are my own.
The author, Don Peterson, refers to something called "Massively Unified Code-Key tables, or MUCK tables." A MUCK table is basically a structure that allows for the inclusion of multiple 'lookup types' in one table. Thus, instead of separate tables for DocumentStatus and DocumentType, you end up with a table called LookupType with 2 entries (DocStatus and DocType) and a second table with the actual lookup values, keyed by LookupType.
I'm sure that we've all been down that road before! At least I hope I'm not alone in that particular folly. While a MUCK table certainly makes it easy to add new lookups, the reality is that the application needs to be recoded anyway if the new data is to be utilized. In practice, I've seen first hand how quickly this concept leads to attempts to use the RDBMS to create a program that very poorly simulates an RDBMS.
Peterson also dislikes the term 'lookup table' because there is nothing particularly special about the table in the first place. It contains data, it's related to other tables, it's used for reference and validation, etc., in the same way that things like Customer and InventoryItem are used.