Effective Database Indexing
What is a Normalised Database?
In layman's terms, normalisation is the process of structuring a relational database to reduce data redundancy by breaking the data out into smaller, linked tables of updateable data.
This article focuses primarily on databases with a normalised structure, and will explore an area most people are familiar with (or can at least imagine): financial transactions, clients and contacts.
Why Normalised?
Some level of normalisation can bring a huge amount of enhancement to most datasets, and while data lakes and unnormalised data processing are gaining traction in some areas of business use, most businesses would probably benefit from having their main data stored in some sort of normal form, as it can:
- Speed up updates (see below)
- Make data interrogation easier
- Normally provide a smaller data footprint
- Conform to industry norms
Our Approach
Our standard approach is to look at data as if it is stored in three different ways, and when building out new SQL Server based systems we try to keep them in different schemas.
This approach has worked well for previous clients of ours, and we have even passed substantial speed improvements back to their system providers.
We are going to aim to add a separate sub-article for each section in due course, and add a section to explore concepts around system neutral reporting between multiple databases.
Index Overview
While this article is SQL Server focused, the same principles apply to many other systems. The number and types of indexes on a table can independently improve or degrade read and write performance.
Clustered
You are limited to one per table, and this defines how the data is stored on disk.
Tables that have an index of this type are called a Clustered Table, and those that don't are referred to as a Heap.
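As a sketch of the difference (the staging table and index names here are illustrative, not part of the schema built later in this article), a heap becomes a clustered table the moment a clustered index is added:

```sql
-- A heap: no clustered index, rows are stored in no particular order.
CREATE TABLE dbo.ImportStaging
(
    ImportID INT NOT NULL,
    ImportValue NVARCHAR(100)
)

-- Adding a clustered index converts the heap into a clustered table;
-- the rows are rewritten on disk in ImportID order.
CREATE CLUSTERED INDEX CDX_ImportStaging ON dbo.ImportStaging (ImportID)
```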
Non-clustered
You can almost think of this as a separate table that references each row; however, in SQL Server the actual storage changes depending on the table type (clustered or heap).
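For example, a non-clustered index on the client table defined later in this article would store a copy of the name values plus a locator back to each full row (the clustering key on a clustered table, or a row identifier on a heap):

```sql
-- Non-clustered index on a single name column; supports fast lookups
-- by ClientName without reordering the base table.
CREATE NONCLUSTERED INDEX IX_Client_ClientName
    ON Bus.Client (ClientName)
```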
Uniqueness
Both of these indexes can be unique, and when used properly, this can bring some real enhancements into how you store your data.
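For instance, a unique non-clustered index can enforce a business rule directly in storage. The index below (an illustrative addition, not part of the schema as written later) would stop two reference rows sharing the same name:

```sql
-- Duplicate AddressTypeName values are rejected at insert time.
CREATE UNIQUE NONCLUSTERED INDEX UQ_Ref_AddressType_Name
    ON Ref.AddressType (AddressTypeName)
```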
Compound Indexes
All indexes can use one or more columns; however, a clustered index key must be under 900 bytes.
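As an illustration (using the transaction table defined later in this article), a compound index keys on several columns in a fixed order, so only queries filtering on the leading column(s) can seek into it efficiently:

```sql
-- Compound non-clustered index: rows ordered by ClientID first,
-- then by TransactionDate within each client.
CREATE NONCLUSTERED INDEX IX_Transaction_Client_Date
    ON Tra.[Transaction] (ClientID, TransactionDate)
```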
Hang on, what about the Primary Key?
When people refer to a "Primary Key", they are quite often talking about a "Unique Clustered Index". Quite a few people automatically implement this as an integer-based identity field that increments by one each time a new record is created; this can then be referenced by another table using a Foreign Key.
A Foreign Key can in fact reference any unique index, and even reference multiple columns.
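A sketch of a multi-column foreign key (both tables here are hypothetical, invented purely to illustrate the point): the child table references a two-column unique constraint rather than a primary key.

```sql
-- A natural, two-column unique key on a rates table.
CREATE TABLE App.CurrencyRate
(
    CurrencyCode CHAR(3) NOT NULL,
    RateDate DATE NOT NULL,
    Rate DECIMAL(18,6),
    CONSTRAINT UQ_CurrencyRate UNIQUE (CurrencyCode, RateDate)
)

-- The foreign key references the unique constraint, not a primary key,
-- and spans both columns.
CREATE TABLE App.InvoiceLine
(
    InvoiceLineID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
    CurrencyCode CHAR(3),
    RateDate DATE,
    CONSTRAINT FK_InvoiceLine_CurrencyRate
        FOREIGN KEY (CurrencyCode, RateDate)
        REFERENCES App.CurrencyRate (CurrencyCode, RateDate)
)
```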
Reference Data
This area should include all of the top-level information, such as Account Types and Payment Types, that is then referenced by other tables further down the chain. The benefit is that in a normalised database a single update can change the value seen by multiple rows, while an unnormalised structure would need every row updated.
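A sketch of that benefit (the payment-type table and flat table here are hypothetical, used only to contrast the two structures): renaming a payment type in a normalised database is a single-row update, whereas the unnormalised equivalent must touch every matching row.

```sql
-- Normalised: one row changes; every referencing row sees the new name.
UPDATE Ref.PaymentType
SET PaymentTypeName = 'Bank Transfer'
WHERE PaymentTypeID = 3

-- Unnormalised equivalent: every transaction row must be rewritten.
UPDATE FlatTransactions
SET PaymentTypeName = 'Bank Transfer'
WHERE PaymentTypeName = 'BACS'
```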
Standard Usage
In general we use an identity column as a unique clustered index. Below we create a schema and four tables.
Reference Tables
CREATE SCHEMA Ref
GO

CREATE TABLE Ref.AddressType
(
    AddressTypeID INT IDENTITY(1,1) CONSTRAINT PK_Ref_AddressType PRIMARY KEY CLUSTERED,
    AddressTypeName NVARCHAR(100)
)

CREATE TABLE Ref.ClientType
(
    ClientTypeID INT IDENTITY(1,1) CONSTRAINT PK_Ref_ClientType PRIMARY KEY CLUSTERED,
    ClientTypeName NVARCHAR(100)
)

CREATE TABLE Ref.ContactType
(
    ContactTypeID INT IDENTITY(1,1) CONSTRAINT PK_Ref_ContactType PRIMARY KEY CLUSTERED,
    ContactTypeName NVARCHAR(100)
)

CREATE TABLE Ref.TransactionType
(
    TransactionTypeID INT IDENTITY(1,1) CONSTRAINT PK_Ref_TransactionType PRIMARY KEY CLUSTERED,
    TransactionTypeName NVARCHAR(100)
)
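The reference rows can then be seeded with plain inserts; the identity column assigns the IDs automatically (the values below are illustrative):

```sql
-- AddressTypeID values 1, 2, 3 are generated by the identity column.
INSERT INTO Ref.AddressType (AddressTypeName)
VALUES ('Registered'), ('Billing'), ('Delivery')
```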
Business Data
This mid-level area would include Accounts, Clients and Contacts, or other entities that may be referenced by something else while also referencing the type information.
This level is normally the hardest to work with in terms of deciding where to put your main index, as it will probably be a mix of different approaches.
Below is the code to create the Address, Client and Contact tables. There is also an additional joining table that links the Client, Address and Address Type fields, and here we have created a clustered index that differs from the other tables. This is because in most applications this would be a read-intensive table, and we can accept a small cost to insert performance. If this were an application built by us, we would probably separate out client contact details in a similar fashion.
Business Tables
CREATE SCHEMA Bus
GO

CREATE TABLE Bus.[Address]
(
    AddressID INT IDENTITY(1,1) CONSTRAINT PK_Bus_Address PRIMARY KEY CLUSTERED,
    AddressName NVARCHAR(100),
    AddressTypeID INT CONSTRAINT FK_Bus_Address_AddressTypeID FOREIGN KEY REFERENCES Ref.AddressType(AddressTypeID),
    AddressLine1 NVARCHAR(MAX)
    --Use more detail as required...
)

CREATE TABLE Bus.Client
(
    ClientID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
    ClientName NVARCHAR(100),
    ClientTypeID INT CONSTRAINT FK_Bus_Client_ClientTypeID FOREIGN KEY REFERENCES Ref.ClientType(ClientTypeID)
)

--Use one table to handle all client addresses
CREATE TABLE Bus.ClientAddress
(
    ClientAddressID INT IDENTITY(1,1) CONSTRAINT PK_Bus_ClientAddressID PRIMARY KEY NONCLUSTERED,
    AddressTypeID INT CONSTRAINT FK_Bus_ClientAddress_AddressTypeID FOREIGN KEY REFERENCES Ref.AddressType(AddressTypeID),
    ClientID INT CONSTRAINT FK_Bus_ClientAddress_ClientID FOREIGN KEY REFERENCES Bus.Client(ClientID),
    AddressID INT CONSTRAINT FK_Bus_ClientAddress_AddressID FOREIGN KEY REFERENCES Bus.[Address](AddressID),
    CONSTRAINT UQ_Bus_ClientAddress UNIQUE NONCLUSTERED (ClientID, AddressTypeID)
    --This ensures one type per client, can slow down inserts slightly
)

CREATE UNIQUE CLUSTERED INDEX CDX_Bus_ClientAddress ON Bus.ClientAddress (ClientID, AddressTypeID, AddressID)

CREATE TABLE Bus.Contact
(
    ContactID INT IDENTITY(1,1) CONSTRAINT PK_Bus_Contact PRIMARY KEY CLUSTERED,
    ContactName NVARCHAR(100),
    ContactTypeID INT CONSTRAINT FK_Bus_Contact_ContactTypeID FOREIGN KEY REFERENCES Ref.ContactType(ContactTypeID)
    --Could be broken out into a joining table if desired
    --Use more detail as required...
)
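A quick sketch of what the UQ_Bus_ClientAddress constraint enforces (the ID values below are illustrative): a second address of the same type for the same client is rejected.

```sql
-- First billing address for client 1: accepted.
INSERT INTO Bus.ClientAddress (ClientID, AddressTypeID, AddressID)
VALUES (1, 2, 10)

-- Second billing address for the same client: violates
-- UQ_Bus_ClientAddress and fails with a duplicate key error.
INSERT INTO Bus.ClientAddress (ClientID, AddressTypeID, AddressID)
VALUES (1, 2, 11)
```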
Transactional Data
This area includes things like notes, payments and orders, which generally point to both the business and reference areas.
While unique keys are good for identification, in general usage they are probably not how you want to order the data on disk, as read times would suffer. There is only one table created below, but it should give you an idea.
Transactional Tables
CREATE SCHEMA Tra
GO

CREATE TABLE Tra.[Transaction]
(
    TransactionID INT IDENTITY(1,1) CONSTRAINT PK_Tra_TransactionID PRIMARY KEY NONCLUSTERED,
    TransactionDate DATETIME CONSTRAINT DF_Tra_Transaction_TransactionDate DEFAULT GETUTCDATE(), --Use GETDATE() for local time.
    TransactionTypeID INT CONSTRAINT FK_Tra_Transaction_TransactionTypeID FOREIGN KEY REFERENCES Ref.TransactionType(TransactionTypeID),
    ClientID INT CONSTRAINT FK_Tra_Transaction_ClientID FOREIGN KEY REFERENCES Bus.Client(ClientID),
    ContactID INT CONSTRAINT FK_Tra_Transaction_ContactID FOREIGN KEY REFERENCES Bus.Contact(ContactID),
    TransactionAmount DECIMAL(18,2)
    --Use more detail as required...
)

CREATE CLUSTERED INDEX CDX_Tra_Transaction ON Tra.[Transaction] (TransactionDate, TransactionTypeID, ClientID, ContactID)
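With the clustered index leading on TransactionDate, a typical read can range-scan straight off the leaf pages; a sketch of such a query (dates and type ID illustrative):

```sql
-- One month of transactions of a given type, returned in date order
-- directly from the clustered index, no separate sort needed.
SELECT TransactionID, TransactionDate, TransactionAmount
FROM Tra.[Transaction]
WHERE TransactionDate >= '2024-01-01'
  AND TransactionDate <  '2024-02-01'
  AND TransactionTypeID = 1
```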
Joins and Reporting
In the fictional database above, we have tried to represent real life as closely as possible. This is by no means the only approach that can be taken, and you are ultimately responsible for how you use the information above.
As the data moves into the third tier, the indexing focus shifts to how the data will be read by an application or report. This invariably involves the joins between tables, and any columns that could or would appear in the WHERE clauses.
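As an illustration against the tables above, a typical report join touches all three tiers and benefits from indexes covering both the join columns and the WHERE predicates (the date filter is illustrative):

```sql
-- Client totals by transaction type for a period: joins cross the
-- transactional, business and reference tiers.
SELECT c.ClientName,
       tt.TransactionTypeName,
       SUM(t.TransactionAmount) AS TotalAmount
FROM Tra.[Transaction] AS t
JOIN Bus.Client AS c
    ON c.ClientID = t.ClientID
JOIN Ref.TransactionType AS tt
    ON tt.TransactionTypeID = t.TransactionTypeID
WHERE t.TransactionDate >= '2024-01-01'
GROUP BY c.ClientName, tt.TransactionTypeName
```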
Further Reading
Rebuild or reorganise indexes