Skip to main content

DAT 601 - Robin Le Couteur - Normalization Explained - 4/6/2018



Friday, April 6, 2018

Normalization Explained

Update - Warning, this post has a number of errors lol. The BCNF example is not entirely correct, but it conveys some useful concepts to think about ;)

I have written this blog as a way to concrete normalization concepts into my own mind, as well as to help as many other people as I can to understand normalization themselves. The information is not guaranteed to be correct, but I have done my best to make each form of normalization as simple as possible to understand using visualizations and examples.
Before I start, it’s important to understand some concepts.

Cardinality

I find thinking about attributes as having cardinality within the table.
In a Person table, PersonID has a 1 to 1 relationship with DoB. Think of it like the person only has one DoB, and the DoB is of that one person.
For a multivalued attribute, it’s like one to many. One person can have many phone numbers, and each phone number only belongs to one person.
For many to many, you see that in join tables. In a database, you cannot have a many to many relationship between tables BUT you CAN between ATTRIBUTES.
Here’s an example: as you can see, an SID can appear next to several CID’s cause a student is enrolled in many courses, and a particular CID will appear next to multiple SID’s. You get rid of many to many between tables by making it between attributes in a join table.


Dependency

In tables, attributes have dependencies. If you do not understand dependency I HIGHLY recommend that you do some research and get some understanding because normalizing is basically making the database only have functional dependency for attributes, and functional or multivalued dependency between keys.

Dependency means is something cannot exist without something else, like arm cannot exist without person, so arm is dependent on person. There are different types on dependency as shown below.

Functional dependency. DoB is functionally dependant on PersonID because it is directly related. Functional dependency is really simply just when something has a one of something else, so Person has one of DoB. Basically a one to one relationship between 2 attributes except one depends on the other and cannot exist without it.

Multivalued dependency. This is like Functional dependency except something has multiple of something. One person has multiple of limbs. It’s a dependency because each of the limbs cannot exist without the person, and it’s multivalued because there is more than one limb.

Partial dependency. This is explained more in the 2NF section as that is where this dependency is removed, but basically, it’s when you have two or more PK’s in one table. This excludes FK’s that are PK’s, so if the there is 2 PK’s and neither of them are FK’s as well in that table, you have an issue. Remember that a PK can mean a composite PK made from 2 columns. A crude example that isn’t perfect is firstname and lastname. Together they identify one person therefore they are a single PK.

Transitive functional dependency.  This occurs when one thing is dependent on another, but only through something else like a chain. An example is a person, their arm, and their hand. Hand depends on arm, and arm depends on person, so hand has a transitive dependency with person because although hand depends on person, it only does because of its dependency to arm. This dependency is removed in 3NF

UNF

UNF is simple in that there is just a table with a bunch of attributes. this is the data we are working with.

Below is an example of a table in UNF. 



1NF

For a table to be in the First Normal Form, it should follow the following 4 rules:
if we look at the UNF table above vs the 1NF table below, you can see the rules being applied.
1.    It should only have single(atomic) valued attributes/columns.
This means that you cannot have multiple of an attribute in a cell, you cant have multiple movies in one cell like in the UNF example. "Action, Action" in Category is invalid because there is 2 atomic values in the one cell.

2.      Values stored in a column should be of the same domain.
Common sense, just store categories in the categories column and names in the names column, don't put any names in the category column.

3.      All the columns in a table should have unique names. 
Common sense again, don't have two columns in a table and call them both 'Name'.

4.      And the order in which data is stored, does not matter. 
It shouldn't matter what order you have rows or data in cells. Things like having multiple values in a cell can cause issues here. Adjusting the table to comply with rule 1 generally makes it compliant with this rule as well




2NF

For a table to be in the Second Normal Form,

1.    It should be in the First Normal form;

2.    And, it should not have Partial Dependency. Partial dependency is when there are multiple keys in a table and not all attributes depend on all the keys. Simply speaking its just when MovieRented and PersonID are both PK's and both in one table, FirstName depends on PersonID and Category depends on MovieName. You shouldn't have these 2 PK's in a table so you put the PK's in different tables with their attributes. Also, if you have a composite key made from 2 columns that Identify one entity and all the attributes are related to that CK then the table is 2NF as well.

Here, you need to identify the primary keys. You just look at what different things there are, like here we are storing data on two things, the person renting the movie, and the movie itself. The person entity doesn't have a naturally occurring PK that is valid for the database, so we will add PersonID as a simple PK, but movie name suffices at this stage. You could argue that movie name could repeat for different movies, so it isn't unique, so you could add a PK MovieID; I won't bother at this stage. In the Movie table the PersonID is added as an FK and a PK because it’s the combination of person and movie that identify an individual instance of a movie being rented.






3NF

For a table to be in the Second Normal Form,
  • It should be in 2NF;
  • And have no transitive functional dependencies. As explained in the dependencies section, a transitive dependency is when there is a chain of dependencies. Specifically, a transitive dependency can be seen in a table if a non-key attribute changes and it causes another attribute to change as well.
In our case the tables are already in 3rd normal form in my opinion, so I shall add extra columns called Category and CategoryDescription. Now CategoryDescription is like the hand, Category is the arm, and MovieRented is the person.

If we change the movie, the category and its description will both change meaning that they are both functionally dependent on MovieRented, but since Category has one of CategoryDescription, CategoryDescription has a transitive relationship with MovieRented. 

The illustration below shows this:


Also, since Category isn’t functionally dependent to PersonID, Category is partially dependent to the composite key of PersonID, MovieRented. This doesn’t comply with 2NF so we will put MovieRented in a new table with category, then put category in a new table with CategoryDesctription. 

This Database is now 3NF.(updated image as of 1/06/2018, fixed cardinality  )

BCNF

Now for the interesting one. BCNF is actually really simple. In 2NF and 3NF you ignore the FKs and FK’s that are also PKs. BCNF is literally just doing 2NF and 3NF in the key.

In our example, we add 2 more tables, Store and Staff, they are related to this table of rental records, so their PK’s are put here as FK PK’s. Now this table shows who rented the movie, the movie itself, the staff that issued the movie, and the store it was issued at.

Now the problem here is that you will notice that a staff member only works at one store, so if you mention that the movie was issued by Staff2 or Staff3, you know that it could only be issued at the South branch.

StoreBranch has a functional dependency to Issuing Staff, but it’s in a transitive relationship with the rest of the key. You wouldn’t resolve this in 2NF or 3NF because they only deal with attributes that aren’t part of the key. In BCNF you resolve this anomaly in the key. A way to spot it a Boyce Codd anomaly is:

1.    There will be at least 3 Columns in the key of a table or 3 columns of candidate key(May have 2 PK columns and a attribute which could be paired with one of the 2 columns to form a different candidate key), and;

2.    you will see 2 or more columns with more than one recurring pattern(North and Staff1 appear in a pair more than once)

3.    Data from one of the columns will only appear with a particular piece of data from the other column(Staff1 only appears with North, and Staff2 & 3 only appear with South)
You can also just think of the business rules. Here one store must have many staff, and each staff must work for one store. All the relationships between columns in a composite key should be many to many, like Person can rent many movies, and a movie can be rented by many people.

Once you see the anomaly, you separate the two columns out as shown in the example. (update - note that the IssuingStaff column could just be an attribute that could be part of another candidate key. A candidate key is just a potential PK and doesn't necessarily have to be identified as a PK. In BCNF the point is that one column in the PK(minimum 2 columns in the PK for this to occur) might actually depend on an attribute.)

3NF

BCNF

Comments