I often let folks know that in a data analytics course, if there is anything you need to excel in, it will be databases and SQL. That’s where employers assess your capabilities and skills.

SQL (often pronounced as “sequel”) stands for Structured Query Language. Just like Python, it has its syntax to learn.

Just to reiterate, many different types of relational databases function similarly. You will be able to apply your learnings from one relational database engine to another.

Everyone Do: Create a Database

Let’s open up the file(s) in the 01-Evr_Creating_a_Database folder to get started.

We are going to create a database together.

  • Open pgAdmin4 in your local machine and ensure that you’re seeing the pgAdmin dashboard.
  • In the pgAdmin editor, right-click the newly established server to create a new database.
    • From the menu, select Create, and then select Database.
    • Enter animals_db as the database name. Make sure the owner is set as postgres (default setting), and then click Save
  • At this point, your new database, animals_db, is not connected to the server. Clicking on the database will create a connection to Postgres.

A database is a collection of data stored within an engine. Within database, there are tables, where we have relational data for analysis.

Additional Notes

Using pgAdmin4 isn’t the only User Interface (UI) that is able to access a relational database. I do use DBeaver Community Edition (for Mac and Windows), and my tech lead uses SquirrelSQL in his Mac setup

  • We need to access different database engines and thus we can’t only be using pgAdmin4. Thus, a more generic UI access suits our use case.
  • This is out of scope from our class material.

Create a Table

Let’s open up the file(s) in the 02-Ins_Creating_Tables folder to get started.

We will be creating a table within the database.

  • From the left-hand menu in pgAdmin, right-click animals_db and select Query Tool.
    • This is a code editor, where you type in commands for your database.
  • Type the following lines of code (this is basically a table schema):
	
		CREATE TABLE people (
			name VARCHAR(30) NOT NULL,
			has_pet BOOLEAN DEFAULT false,
			pet_type VARCHAR(10) NOT NULL,
			pet_name VARCHAR(30),
			pet_age INT
		);
	
  • Click on the play icon to run the statement.

Notes on creating a table:

  • The above is a default syntax style, but there can variants to how to create a table. This is where the documentation will show different ways of how to create a table.
  • Each engine may have slightly different nuances in terms of SQL style. Microsoft SQL is different from Postgres SQL, and those are different from MySQL. We’ll have to refer to documentation to get over the differences.
  • Data stored in a database/table is persistent. This means data or schema created within the database will stay unless it is explicitly removed or overwritten.

Inserting data into your table

Try inserting records into your table by using the following code here:

	
		INSERT INTO people (name, has_pet, pet_type, pet_name, pet_age)
		VALUES ('Jacob', true, 'dog', 'Misty', 10),
			('Ahmed', true, 'rock', 'Rockington', 100),
			('Peter', true, 'cat', 'Franklin', 2),
			('Dave', true, 'dog', 'Queso', 1);

		SELECT * FROM people;
	
  • The above creates 4 records, according to the table schema that we defined earlier.
  • Single quotation marks must be included if the data is a string. Numbers don’t need any quotation marks, avoid it. In fact, the engine will recognize your number is a string if you used them.

Querying data

SELECT * indicates that it is a wildcard. We are extracting every column we can find in a table.

  • For optimization purposes, we seldom use a wildcard. Because if the data is massive (multiple columns or huge volume of data), it can be painful trying to extrack all the columns.

You can extract a single column by specifying the column within your query:

SELECT pet_name FROM people;

You can even filter the data by using comparison operators to specify what you need:

SELECT pet_type, pet_name FROM people WHERE pet_type = 'dog' AND pet_age < 5;

Notes on querying for data:

  • SELECT command extracts data from your table(s). It does not affect the actual data within your tables.
    • Your filters do not affect the data tables within.
    • Using operators such as AND and OR, you can shape how you want to extract the data.
    • Comparisons are done with a single symbol, i.e. in Python (“==”) vs SQL (“=”).

Students Do: Creating Tables

The Value of Unique Values

Let’s open up the file(s) in the 04-Ins_Values_of_Uniques folder to get started.

Unique values are important for data integrity and reports. How we design unique values within a database/table is very important to ensure that your analysis is accurate.

To illustrate, we will insert duplicate data here:

INSERT INTO people (name, has_pet, pet_type, pet_name, pet_age) VALUES ('Ahmed', true, 'rock', 'Rockington', 100); 
SELECT * FROM people;

Duplicate records cause all sorts of problems. It makes filtering difficult, and remove the record can be a painful process:

  • Your database doesn’t know which record is true, and only you know. Thus, if you try to delete inaccurately, it can have unintended consequences on other rows.

Deleting tables

This is the syntax to delete tables: DROP TABLE people;

How to ensure if a table has unique records?

This is an innate feature of relational databases. If we define a primary key, the database will ensure that your records within the table will be unique to the key you specified.

Let’s try it with the following code:

		-- Re-create the table "people" within animals_db
		CREATE TABLE people (
			id SERIAL PRIMARY KEY,
			name VARCHAR(30) NOT NULL,
			has_pet BOOLEAN DEFAULT false,
			pet_type VARCHAR(10) NOT NULL,
			pet_name VARCHAR(30),
			pet_age INT
		);

		-- Insert data into the table
		INSERT INTO people (name, has_pet, pet_type, pet_name, pet_age)
		VALUES ('Jacob', true, 'dog', 'Misty', 10),
			('Ahmed', true, 'rock', 'Rockington', 100),
			('Ahmed', true, 'rock', 'Rockington', 100),
			('Peter', true, 'cat', 'Franklin', 2),
			('Dave', true, 'dog', 'Queso', 1),
			('Dave', true, 'dog', 'Pringles', 7);

		-- Query all fields from the table
		SELECT *
		FROM people;
  • PRIMARY KEY is to define the column to be a unique identifier.
  • SERIAL means the database is going to create the primary key for you as you ingest
    • It auto increments the integer as you ingest the data.
    • Typically, we don’t use SERIAL in real world situation much (I can’t say it is never used, but it depends on context). We would use another unique key library to ensure that the keys are truly unique.
  • It is best practice to only filter and query through primary keys and not other columns to specify a certain record, unless you know what you’re doing.

Updating a record

This is the syntax for updating a record: UPDATE people SET has_pet = true, pet_name = 'Rocket', pet_age = 8 WHERE id = 6;

  • The WHERE command is a filter to specify your change within your table.

Deleting a record

This is the syntax for deleting a record: DELETE FROM people WHERE id = 3;

  • For best practice, we always create backups of the database/tables if we are doing a bulk delete. Because there is no “Recycle Bin” concept within databases typically.

Students Do: Making and Using an ID

Import Data

Let’s open up the file(s) in the 06-Ins_Importing_Data folder to get started.

Importing data is something that you might do often, especially you’re collecting data from another source.

We’ll do the following to try to import data:

  • Create a new database called: Miscellaneous_DB
    • SQL doesn’t appreciate spaces. We typically use underscores to denote space.
  • Create a table named: fauna_vertabrate
  • Create columns from the given importing_data_solution.sql so that the data within the columns can be matched while import is in progress.
    • Some tools allow you to map different columns to the table columns within your database. However, on a typical setting, you’ll need equal columns in name and data type to make sure that the import is successful.
    • There is a difference in importing sql files versus CSV files direct. I personally prefer sql files because they tend to be more accurate.
  • Right-click on Miscellaneous_DB, and click on Refresh to reflect the changes on your GUI.
  • Scroll to Schemas and expand the menu. Then expand the Tables menu.
  • Right-click on the table, and select Import/Export from the menu.
    • There are command-line imports as well. If the data volume is small, using a GUI may suffice. However, if the data volume is huge, we use command-line because it can handle the load better.
    • Select the appropriate settings so that the import will proceed smoothly.
  • Run SELECT * FROM fauna_vertabrate; to see if the data import is done correctly.
    • If the data volume is huge, we might just want to list the first 5 rows just like in pandas:
	
		SELECT *
		FROM fauna_vertabrate
		LIMIT 5;
	

Students Do: Hide and Seek

Students Do: Using CRUD

CRUD is an acronym that stands for:

  • Create
  • Read
  • Update
  • Delete

These are all the actions that you have done previously, and this is a technical term to let people know how to interact with a relational database.

Joins

Let’s open up the file(s) in the 09-Ins_Joins folder to get started.

Joins are particularly important when you need to reference data from two or more tables.

Remember how we did Pandas joins earlier by merging two dataframes into one (Lesson 4.3, activity 1)? The concept is the same in terms of left, right, inner and outer joins.

We are going to create two new tables, names and commodity so that we can simulate the behavior.

  • Copying code from joins.sql to create the tables, and then we import the corresponding data from names.csv and commodity.csv.
    • Refresh your database because your GUI doesn’t always reflect the changes immediately on your screen, although it should already happened in the background.

There are 5 primary types of joins that can be used in PostgresSQL:

  • INNER JOIN returns records that have matching values in both tables.
  • LEFT JOIN returns all records from the left table and the matched records from the right table.
  • RIGHT JOIN returns all records from the right table and the matched records from the left table.
  • CROSS JOIN returns records that match every row of the left table with every row of the right table. This type of join has the potential to make very large tables.
  • FULL OUTER JOIN places null values within the columns that do not match between the two tables, after an inner join is performed.
  • Joins typically need exact keys/columns to match so that they can sync and return relevant data. That’s why it is called a relational database.

Students Do: Joining Bird Bands

Let’s open up the file(s) in the 10-Par_Joins folder to get started.