Friday, April 6, 2018
Normalization Explained
Update - Warning, this post has a number of errors lol. The BCNF example is not entirely correct, but it conveys some useful concepts to think about ;)
I have written this blog as a way to concrete normalization
concepts into my own mind, as well as to help as many other people as I can to
understand normalization themselves. The information is not guaranteed to be
correct, but I have done my best to make each form of normalization as simple
as possible to understand using visualizations and examples.
Before I start, it’s important to understand some concepts.
Cardinality
I find thinking about attributes as having cardinality within
the table.
In a Person table, PersonID has a 1 to 1 relationship with DoB.
Think of it like the person only has one DoB, and the DoB is of that one person.
For a multivalued attribute, it’s like one to many. One person
can have many phone numbers, and each phone number only belongs to one person.
For many to many, you see that in join tables. In a database,
you cannot have a many to many relationship between tables BUT you CAN between
ATTRIBUTES.
Here’s an example: as you can see, an SID can appear next to
several CID’s cause a student is enrolled in many courses, and a particular CID
will appear next to multiple SID’s. You get rid of many to many between tables
by making it between attributes in a join table.
Dependency
In tables, attributes have dependencies. If you do not
understand dependency I HIGHLY recommend that you do some research and get some
understanding because normalizing is basically making the database only have
functional dependency for attributes, and functional or multivalued dependency
between keys.
Dependency means is
something cannot exist without something else, like arm cannot exist without person,
so arm is dependent on person. There are different types on dependency as shown
below.
Functional dependency.
DoB is functionally dependant on PersonID because it is directly related.
Functional dependency is really simply just when something has a one of something else, so Person has one of DoB. Basically a one to one relationship
between 2 attributes except one depends on the other and cannot exist without
it.
Multivalued
dependency. This is like Functional dependency except something has multiple of something. One person has multiple of limbs. It’s a dependency
because each of the limbs cannot exist without the person, and it’s multivalued
because there is more than one limb.
Partial dependency.
This is explained more in the 2NF section as that is where this dependency is
removed, but basically, it’s when you have two or more PK’s in one table. This
excludes FK’s that are PK’s, so if the there is 2 PK’s and neither of them are FK’s
as well in that table, you have an issue. Remember that a PK can mean a
composite PK made from 2 columns. A crude example that isn’t perfect is firstname
and lastname. Together they identify one person therefore they are a single PK.
Transitive functional
dependency. This occurs when one
thing is dependent on another, but only through something else like a chain. An
example is a person, their arm, and their hand. Hand depends on arm, and arm
depends on person, so hand has a transitive dependency with person because although
hand depends on person, it only does because of its dependency to arm. This dependency
is removed in 3NF
UNF
UNF is simple in that there is just a table with a bunch of
attributes. this is the data we are working with.
Below is an example of a table in UNF.
1NF
For a table to be in the First
Normal Form, it should follow the following 4 rules:
if we look at the UNF table
above vs the 1NF table below, you can see the rules being applied.
1.
It should only have single(atomic) valued
attributes/columns.
This means that you cannot have multiple of an attribute in
a cell, you cant have multiple movies in one cell like in the UNF example.
"Action, Action" in Category is invalid because there is 2 atomic
values in the one cell.
2. Values
stored in a column should be of the same domain.
Common sense, just store categories in the categories column
and names in the names column, don't put any names in the category column.
3. All
the columns in a table should have unique names.
Common sense again, don't have two columns in a table and
call them both 'Name'.
4. And
the order in which data is stored, does not matter.
It shouldn't matter what order you have rows or data in
cells. Things like having multiple values in a cell can cause issues here.
Adjusting the table to comply with rule 1 generally makes it compliant with
this rule as well
2NF
For a table to be in the
Second Normal Form,
1. It should be in the First Normal form;
2. And,
it should not have Partial Dependency. Partial dependency is when
there are multiple keys in a table and not all attributes depend on all the
keys. Simply speaking its just when MovieRented and PersonID are both PK's and
both in one table, FirstName depends on PersonID and Category depends on
MovieName. You shouldn't have these 2 PK's in a table so you put the PK's in
different tables with their attributes. Also, if you have a composite key made
from 2 columns that Identify one entity and all the attributes are related to that
CK then the table is 2NF as well.
Here,
you need to identify the primary keys. You just look at what different things
there are, like here we are storing data on two things, the person renting the
movie, and the movie itself. The person entity doesn't have a naturally
occurring PK that is valid for the database, so we will add PersonID as a
simple PK, but movie name suffices at this stage. You could argue that movie
name could repeat for different movies, so it isn't unique, so you could add a
PK MovieID; I won't bother at this stage. In the Movie table the PersonID is
added as an FK and a PK because it’s the combination of person and movie that
identify an individual instance of a movie being rented.
3NF
For a table to be in the
Second Normal Form,
- It should be in 2NF;
- And
have no transitive functional dependencies. As explained
in the dependencies section, a transitive dependency is when there is a
chain of dependencies. Specifically, a transitive dependency can be seen
in a table if a non-key attribute changes and it causes another attribute
to change as well.
In our case the tables are already in 3rd
normal form in my opinion, so I shall add extra columns called Category and CategoryDescription.
Now CategoryDescription is like the hand, Category is the arm, and MovieRented
is the person.
If we change the movie, the category and its
description will both change meaning that they are both functionally dependent on MovieRented, but since Category has one
of CategoryDescription, CategoryDescription has a transitive relationship with
MovieRented.
The illustration below shows this:
Also, since Category isn’t functionally dependent to PersonID,
Category is partially dependent to the composite key of PersonID, MovieRented.
This doesn’t comply with 2NF so we will put MovieRented in a new table with
category, then put category in a new table with CategoryDesctription.
This Database
is now 3NF.(updated image as of 1/06/2018, fixed cardinality )
BCNF
Now for the interesting one. BCNF is actually really simple.
In 2NF and 3NF you ignore the FKs and FK’s that are also PKs. BCNF is literally
just doing 2NF and 3NF in the key.
In our example, we add 2 more tables, Store and Staff, they are
related to this table of rental records, so their PK’s are put here as FK PK’s.
Now this table shows who rented the movie, the movie itself, the staff that issued
the movie, and the store it was issued at.
Now the problem here is that you will notice that a staff
member only works at one store, so if you mention that the movie was issued by Staff2
or Staff3, you know that it could only be issued at the South branch.
StoreBranch has a functional dependency to Issuing Staff,
but it’s in a transitive relationship with the rest of the key. You wouldn’t resolve
this in 2NF or 3NF because they only deal with attributes that aren’t part of
the key. In BCNF you resolve this anomaly in the key. A way to spot it a Boyce
Codd anomaly is:
1. There
will be at least 3 Columns in the key of a table or 3 columns of candidate key(May have 2 PK columns and a attribute which could be paired with one of the 2 columns to form a different candidate key), and;
2. you
will see 2 or more columns with more than one recurring pattern(North and
Staff1 appear in a pair more than once)
3. Data
from one of the columns will only appear with a particular piece of data from
the other column(Staff1 only appears with North, and Staff2 & 3 only appear
with South)
You can also just think of the business rules. Here one
store must have many staff, and each staff must work for one store. All the
relationships between columns in a composite key should be many to many, like
Person can rent many movies, and a movie can be rented by many people.
Once you see the anomaly, you separate the two columns out as
shown in the example. (update - note that the IssuingStaff column could just be an attribute that could be part of another candidate key. A candidate key is just a potential PK and doesn't necessarily have to be identified as a PK. In BCNF the point is that one column in the PK(minimum 2 columns in the PK for this to occur) might actually depend on an attribute.)
3NF
BCNF




Comments
Post a Comment