Data modeling is as important as knowing your way around relational databases using SQL, because it is one of the main ways where you can scale your work where you can offer reusability and integrity of your data tables if they are modeled well.

Data modeling is difficult to practice without a context. Unless you’re actively building models in a database for scalability, it is often a neglected skill for practice as it always feels theoretical until you meet data problems on the users’ end.

I have met teams who modeled and serviced data through APIs where the data would change every time we do a backfill, because they did not model their entity-relationship schemas correctly.

Data Normalization

Let’s open up the file(s) in the 01-Ins_Data_Normalization folder to get started.

Data normalization is the process of organizing and structuring data to reduce redundancy and dependency issues.

  • Because if duplicate columns exist everywhere in your data model, it will be difficult to ensure consistency.
  • Also, data integrity is a key part of data analysis. Your business stakeholders need to trust your data and report, or else, they will not use your work.

If you’re involved with data collection and starting from scratch for a project, data normalization is going to be a key element in your work.

Outcomes of data normalization

  • Each table has a well-defined purpose.
  • Data is stored efficiently.

Normalization Process

  • First Normal Form (1NF): This form requires that each column contains only atomic values. It eliminates repeated groups and ensures that each row has a unique identifier.
  • Second Normal Form (2NF): This form extends the previous form. Also, it states that non-key column should be fully functionally dependent on the primary key.
  • Third Normal Form (3NF): This form builds upon 2NF and states that no non-key column should be transitively dependent on the primary key. It eliminates transtive dependencies.

First Normal Form (1NF)

Why do we need do we need each field to have a single value?
  • We can leverage SQL queries to count the number of items in a table, or to filter out distinct values.
    • If a record contains multiple values, we cannot calculate against the records within the table accurately.

Second Normal Form (2NF)

Why do we need primary keys to identify the rows uniquely?
  • If a human-readable name is used, it will be difficult to distinguish the difference when another record with the same or similar name appears.
    • It is never best practice to leave it to intuition to interpret a record.
    • To truly ensure unique records, we use Universally Unique Identifier (UUID) libraries when we generate primary keys.

Third Normal Form (3NF)

Why do we need to remove transitive dependencies?
  • Transitive dependencies lead to redundant data
    • Redundant data adds an overhead to storage, especially when we deal with enterprise system with petabytes of data.
    • Redundant data has the propensity to cause deletion, updation and insertion anomalies. Example:
      • If store address appears in 2 different tables, and someone moved, you have to update 2 tables.
      • However, by eliminating transitive dependence, we should be able to update the address once, even if the data has references in multiple tables.

Example of transitive dependency

StudentCourseInstructor
AliceMathSmith
BobPhysicsJohnson
CarolMathSmith
DaveChemistryJohnson
EvePhysicsSmith

In the above table, the primary key is a combination of the “Student” and “Course” column.

Suppose we want to track the department to each instructor. Let’s see how it looks like if we do that:

StudentCourseInstructorDept
AliceMathSmithMathematics
BobPhysicsJohnsonPhysics
CarolMathSmithMathematics
DaveChemistryJohnsonChemistry
EvePhysicsSmithPhysics

In the above table, “Dept” depends on the “Instructor” column. However, it introduces a transitive dependency issue because “Dept” is not directly dependent on the primary key.

  • You will not be able to reuse the information on the relationship between “Dept” and “Instructor”, since it is tied to the “Student” and “Course” primary key.
  • It is also tedious and error prone to repeat “Dept” and “Instructor” in other tables that might need the same information.

Thus, we separate the “Instructor” and “Department” from the rest of the table into a new table:

InstructorDepartment
SmithMathematics
JohnsonPhysics
JohnsonChemistry

This will prevent transitive dependency, and allow other tables to use the same data by using their primary keys.

Everyone Do: Pet Normalizer

Let’s open up the file(s) in the 02-Evr_Data_Normalization folder to get started.

Intro to Foreign Keys

Let’s open up the file in the 03-Ins_Foreign_Keys folder to get started.

A foreign key is a column or a set of columns in a table that refers to the primary key or a unique key in another table.

  • Basically, this is the key you use to do joins on.

Students Do: Foreign Keys

Intro to Data Relationships

Let’s open up the file(s) in the 05-Ins_Data_Relationships folder to get started.

It can be very expensive to use a data schema before designing the database, and thus we design our data models first before ingesting the data on a typical basis.

  • This can be true for transactional data where data is sensitive and there must be as little error as possible to suit the business needs.
  • There are cases where we store the data first before modeling, because data modeling takes time where we need to understand the data before we can model the data to suit our business needs. That’s usually in the realm of Big Data.

A junction table, also known as an association table, bridge table, or linking table, is used in a database when you have a many-to-many relationship between two entities.

A junction table is required to resolve this type of relationship because it allows you to represent and manage the associations between the two entities.

Students Do: Data Relationships

Entity Relationship Diagrams (ERD)

Let’s open up the file(s) in the 07-Ins_ERD folder to get started.

Typically, you’re not asked to build an entire ERD from scratch unless you’re a tech lead who is designing a new system. However, you may be asked to extend an existing ERD because you have created metrics that can be shared and reused across the organization.

Tech interviews do ask questions on data modeling.

Ref: Quick Database Diagrams (Quick DBD) – https://app.quickdatabasediagrams.com/#/

  • The above website cannot be used for your corporate work, unless you pay for it. The free version will expose your ERD to the world.

Everyone Do: Designing an ERD, Part 1

Let’s open up the file(s) in the 08-Evr_Desining_ERD folder to get started.

Everyone Do: Designing an ERD, Part 2

Let’s open up the file(s) in the 09-Evr_ERD folder to get started.

Introduction to Unions

Let’s open up the file(s) in the 10-Ins_Unions folder to get started.

The UNION operator is used in a query to combine multiple SELECT statements into a single result set.

  • The datasets do not need to have any relationship between each other.
  • UNION automatically removes dups.
  • Very rarely used practically, especially when the data analysis requires data integrity.

If you want to include duplicate records, use UNION ALL. It is pretty rare you would want duplicate records unless you’re on a project that is trying to surface them for quality control.

Everyone Do: Unions

Let’s open up the file(s) in the 11-Evr_Unions folder to get started.