There are a couple of rules to follow when choosing a primary key for a table: all records in a primary key must be unique, a primary key cannot contain NULL values, a composite primary key cannot exceed 16 columns and a key length of 900 bytes, primary key values shouldn’t be changed
There are two types of a primary key – a natural key and a surrogate key and there is a lot of debating whether to choose one or another as a primary key
A natural key
A natural key, otherwise called an intelligent or a domain key, is a key with a business value and logically related to a table, meaning that the data of a natural key exists in nature. If a more than one column is defined as a primary key on a table it is called a composite primary key. For example, the Customer table has a composite primary key combined from the FirstName, the LastName, and the Email columns:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
CREATE TABLE Customer ( FirstName varchar(20) NOT NULL, LastName varchar(20) NOT NULL, Email varchar(20) NOT NULL, PhoneNumber int NULL, PRIMARY KEY CLUSTERED ( FirstName, LastName, Email )) |
Pros for using a natural key as a primary key
- A natural candidate key for a primary key already exists in a table – there is no need for adding additional column
- A natural key can be used in a client’s code as a search criteria
Cons for using a natural key as a primary key
- If a primary key is a combination of the several varchar columns it becomes large. SQL Server will automatically add a clustered index on a primary key, if a table already doesn’t have one. In this case an index also becomes big (much bigger on varchar data type columns than on an integer data type column) and the number of index pages which are used to store the index keys is increased. This increases the number of reads required to read the index and degrades overall index performance
- If a primary key column(s) have varchar data type the JOIN statements are slower as compared to the integer data-type joins
In the following example a primary key is combined from three columns to achieve uniqueness of a primary key
If the business logic changes, you’ll need to change all references to an existing primary key. Some examples are the change in ISBN (International Standard Book Number) from the 10 digit universal identifier into the 13 digit identifier. In the case of using SSN (Social Security Number) as a primary key you need to consider a possibility that SSNs may be reused after a person’s death and also that in cases of fraud, or an identity thefts an individual will get a new SSN.In the following example if a Customer’s SSN changes that change will have to be reflected in the two tables that reference the Customer table
The record cannot be entered into a table until the value of a primary key is known.
A surrogate key
A surrogate key is a unique number generated by SQL Server or a database itself, and has no business logic. In SQL Server it is most often used as an IDENTITY column or a Globally Unique Identifiers (GUID) column – a globally unique 128 bit long data type.
Pros for using a surrogate key as a primary key
- If the business logic changes a surrogate key will not change since it has no business value
- Surrogate keys are typically integers, which only require 4 bytes to store, so the primary key index structure will be smaller
- A naming system for a surrogate key is easier to create
Cons for using a surrogate key as a primary key
- When a surrogate primary key is used an extra indexes may be required on the columns that used to be a part of a natural primary key. These indexes may be necessary to preserve uniqueness of those columns, and they may make an update of a table slower
- Having a surrogate key may require an additional joins when searching a record. For example, the user can enter a known natural primary key (e.g. username) and retrieve wanted information (e.g. real name) based on a foreign key relationship with a natural key without accessing the primary key table. In a case of a surrogate key as a primary key the user would have to look up in the primary key table to retrieve information stored in a table with a foreign key relationship
- A surrogate key cannot be used in a search
Performance test
The CPU time needed to parse and compile the JOIN statement with varchar data type columns and a composite primary key:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
SET STATISTICS TIME ON GO SELECT c.FirstName , c.LastName , i.Quantity FROM dbo.Invoice i INNER JOIN dbo.Customer c ON c.FirstName=i.CustomerFName AND c.LastName=i.CustomerLName AND c.Email=i.Email GO SET STATISTICS TIME OFF GO SQL Server parse and compile time: CPU time = 4 ms, elapsed time = 4 ms. (5 row(s) affected) SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms. SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 0 ms. |
The parse and compile time of the JOIN statement with an integer data type primary key:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
SET STATISTICS TIME ON GO SELECT c.FirstName , c.LastName , i.Quantity FROM dbo.Invoice i INNER JOIN dbo.Customer c ON c.CustomerID = i.CustomerID GO SET STATISTICS TIME OFF GO SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 2 ms. (5 row(s) affected) SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms. SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 0 ms. |
Refactoring an existing primary key
In a situation when one key is already implemented on a table, but another type of a primary key is more suitable you can implement database refactorings and introduce a surrogate key to replace an existing natural key with, or replace an existing surrogate key with a natural key
Choosing a primary key by replacing a natural key with a surrogate key is a method mostly used to reduce coupling between a database schema and external applications in cases when an existing natural key may change. Also, a large natural key may reduce performance
The opposite database refactoring method of introducing a surrogate key to a table is the replacing an existing surrogate key with a natural key. Motivation for using this refactoring is mostly to maintain a key strategy or to remove unnecessary keys – sometimes a surrogate key column is introduced to a table when it actually wasn’t needed
Regardless of a chosen primary key type for your database tables you should choose a single strategy and be consistent in applying it throughout your database. By refactoring your database to consolidate key strategy you can achieve code consistency because having a variety of keys your code to access a database is also implemented in various ways which increases the maintenance of code having your developers to follow all different approaches. Also, if your company has a corporate standard for a preferred key strategy you may discover that your schema doesn’t comply with the rules, so you’ll need to refactor your primary keys and implement the consolidate key strategy refactoring
Useful resources:
SQL by Design: How to Choose a Primary Key
Surrogate vs Natural Primary Keys – Data Modeling Mistake 2 of 10
Primary Keys: IDs versus GUIDs
- Creating and using CRUD stored procedures - April 7, 2014
- Database table partitioning in SQL Server - April 4, 2014
- SQL Database design: Choosing a primary key - March 16, 2014