Normalization is a technique for organizing data in a database. It is important that a database is normalized to minimize redundancy (duplicate data) and to ensure only related data is stored in each table. It also prevents any issues stemming from database modifications such as insertions, deletions, and updates.
The stages of organization are called normal forms. In this tutorial we will be redesigning a database for a construction company and ensuring that it satisfies the three normal forms:
First Normal Form (1NF):
- Data is stored in tables with rows uniquely identified by a primary key
- Data within each table is stored in individual columns in its most reduced form
- There are no repeating groups
Second Normal Form (2NF):
- Everything from 1NF
- Only data that relates to a table’s primary key is stored in each table
Third Normal Form (3NF):
- Everything from 2NF
- There are no in-table dependencies between the columns in each table
Note that there are actually six levels of normalization; however, the third normal form is considered the highest level necessary for most applications so we will only be discussing the first three forms.
Let's get started!
This tutorial is adapted from Next Tech's Database Fundamentals course which comes with an in-browser MySQL database and interactive tasks and projects to complete. You can get started here for free!
Our Database: Codey's Construction
The database we will be working with in this tutorial is for Codey's Construction company (Codey is a helpful coding bot that works with you in the course mentioned earlier). As you can see from the schema above, the database contains the tables projects
, job_orders
, employees
, project_employees
. Recently, they have decided to add the table customers
to store customer data.
Unfortunately, this table has not designed in a way that satisfies the three forms of normalization... Let's fix that!
First Normal Form
First normal form relates to the duplication and over-grouping of data in tables and columns.
Codey’s Construction's table customers
violates all three rules of 1NF.
- There is no primary key! A user of the database would be forced to look up companies by their name, which is not guaranteed to be unique (since unique company names are registered on a state-by-state basis).
- The data is not in its most reduced form. The column
contact_person_and_role
can be further divided into two columns, such ascontact_person
andcontact_role
. - There are two repeating groups of columns - (
project1_id
,project1_feedback
) and (project2_id
,project2_feedback
).
The following SQL statement was used to create the customers
table:
CREATE TABLE customers (
name VARCHAR(255),
industry VARCHAR(255),
project1_id INT(6),
project1_feedback TEXT,
project2_id INT(6),
project2_feedback TEXT,
contact_person_id INT(6),
contact_person_and_role VARCHAR(300),
phone_number VARCHAR(12),
address VARCHAR(255),
city VARCHAR(255),
zip VARCHAR(5)
);
By modifying some columns, we can help redesign this table so that it satisfies 1NF.
First, we need to add a primary key column called id
with data type INT(6)
:
ALTER TABLE customers
ADD COLUMN id INT(6) AUTO_INCREMENT PRIMARY KEY FIRST;
With this statement, we added an automatically incrementing primary key as the first column in the table.
To satisfy the second condition, we need to split the contact_person_and_role
column:
ALTER TABLE customers
CHANGE COLUMN contact_person_and_role contact_person VARCHAR(300);
ALTER TABLE customers
ADD COLUMN contact_person_role VARCHAR(300) AFTER contact_person;
Here, we simply renamed it as contact_person
, and added a column contact_person_role
immediately after it.
To satisfy the third condition, we need to move the columns containing project IDs and project feedback to a new table called project_feedbacks
. First, let's drop these columns from the customers table
:
ALTER TABLE customers
DROP COLUMN project1_id,
DROP COLUMN project1_feedback,
DROP COLUMN project2_id,
DROP COLUMN project2_feedback;
And then create the project_feedbacks
table:
CREATE TABLE project_feedbacks (
id INT(6) AUTO_INCREMENT PRIMARY KEY,
project_id INT(6),
customer_id INT(6),
project_feedback TEXT
);
As you can see, there are no more repeating groups in either the project_feedbacks
table or the customers
table. We still know which customer said what since project_feedbacks.customer_id
refers back to the customers
table.
Now our customers
table satisfies 1NF! Let's move on to second normal form.
Second Normal Form
To achieve second normal form, a database must first satisfy all the conditions for 1NF. After this, satisfying 2NF requires that all data in each table relates directly to the record that the primary key of the table identifies.
We are in violation of 2NF because the contact_person
, contact_person_role
and phone_number
columns track data that relate to the contact person, not the customer. If the contact person for a customer changes, we would have to edit all of these columns, running the risk that we will change the values in one of the columns but forget to change another.
To help Codey's Construction fix this table to satisfy 2NF, these columns should be moved to a table containing data on the contact person. First, let's remove the columns in customers
that are not related to our primary key:
ALTER TABLE customers
DROP COLUMN contact_person,
DROP COLUMN contact_person_role,
DROP COLUMN phone_number;
Note that we kept the contact_person_id
so we still know who to contact. Now, let's create our new table contact_persons
so we have somewhere to store data about each contact.
CREATE TABLE contact_persons (
id INT(6) PRIMARY KEY,
name VARCHAR(300),
role VARCHAR(300),
phone_number VARCHAR(15)
);
Now, if the contact person for a customer changes, the construction company just has to insert a record into the contact_persons
table and change the contact_person_id
in the customers
table.
Third Normal Form
For a database to be in third normal form, it must first satisfy all the criteria for 2NF (and therefore, also 1NF).
Then, each column must be non-transitively dependent on the table’s primary key. This means that all columns in a table should rely on the primary key and no other column. If column_a
relies on the primary key and also on column_b
then column_a
is transitively dependent on the primary key so the table does not satisfy 3NF.
Does your brain hurt from reading that? Don't worry! It's explained more below.
The table currently has transitively dependent columns. The transitively dependent relationship is between city
and zip
. The city in which a customer is located relies on the customer, so this satisfies 2NF; however, the city also depends on the zip code. If a customer relocates, there may be a chance we update one column but not the other. Because this relationship exists, the database is not in 3NF.
To fix our database to satisfy 3NF, we need to drop the city
column from customers
, and create a new table zips
to store this data:
ALTER TABLE customers
DROP COLUMN city;
CREATE TABLE zips (
zip VARCHAR(5) PRIMARY KEY,
city VARCHAR(255)
);
That's it! Finding issues that violate 3NF can be difficult, but it's worth it to ensure that your database is resilient to errors caused by only partially updating data.
I hope you enjoyed this tutorial on database normalization! Codey's Construction's database now satisfies the three forms of normalization.
If you'd like to continue learning about databases, Next Tech's Database Fundamentals course covers all you need to know to get started with databases and SQL. By helping an interactive coding bot named Codey, you will learn how to create and design databases, modify data, and how to write SQL queries to answer business problems. You can get started for free here!
Top comments (4)
I whish we teach this in database class, everything makes perfect sense and is clean with NF! My former boss taught me 3NF principles, I did not know it was formalized in this form, very instructive. I am gonna seek for the other forms 🤓
Glad you liked the post!
Thanks for writing this article, this really helped me wrap my head around normalization.
No problem! :)