I often let folks know that in a data analytics course, if there is anything you need to excel in, it will be databases and SQL. That’s where employers assess your capabilities and skills.
SQL (often pronounced as “sequel”) stands for Structured Query Language. Just like Python, it has its syntax to learn.
Just to reiterate, many different types of relational databases function similarly. You will be able to apply your learnings from one relational database engine to another.
Everyone Do: Create a Database
Let’s open up the file(s) in the 01-Evr_Creating_a_Database
folder to get started.
We are going to create a database together.
- Open pgAdmin4 in your local machine and ensure that you’re seeing the pgAdmin dashboard.
- In the pgAdmin editor, right-click the newly established server to create a new database.
- From the menu, select Create, and then select Database.
- Enter
animals_db
as the database name. Make sure the owner is set as postgres (default setting), and then click Save
- At this point, your new database,
animals_db
, is not connected to the server. Clicking on the database will create a connection to Postgres.
A database is a collection of data stored within an engine. Within database, there are tables, where we have relational data for analysis.
Additional Notes
Using pgAdmin4
isn’t the only User Interface (UI) that is able to access a relational database. I do use DBeaver Community Edition (for Mac and Windows), and my tech lead uses SquirrelSQL in his Mac setup
- We need to access different database engines and thus we can’t only be using
pgAdmin4
. Thus, a more generic UI access suits our use case. - This is out of scope from our class material.
Create a Table
Let’s open up the file(s) in the 02-Ins_Creating_Tables
folder to get started.
We will be creating a table within the database.
- From the left-hand menu in pgAdmin, right-click animals_db and select Query Tool.
- This is a code editor, where you type in commands for your database.
- Type the following lines of code (this is basically a table schema):
CREATE TABLE people (
name VARCHAR(30) NOT NULL,
has_pet BOOLEAN DEFAULT false,
pet_type VARCHAR(10) NOT NULL,
pet_name VARCHAR(30),
pet_age INT
);
- Click on the play icon to run the statement.
Notes on creating a table:
- The above is a default syntax style, but there can variants to how to create a table. This is where the documentation will show different ways of how to create a table.
- Each engine may have slightly different nuances in terms of SQL style. Microsoft SQL is different from Postgres SQL, and those are different from MySQL. We’ll have to refer to documentation to get over the differences.
- Data stored in a database/table is persistent. This means data or schema created within the database will stay unless it is explicitly removed or overwritten.
Inserting data into your table
Try inserting records into your table by using the following code here:
INSERT INTO people (name, has_pet, pet_type, pet_name, pet_age)
VALUES ('Jacob', true, 'dog', 'Misty', 10),
('Ahmed', true, 'rock', 'Rockington', 100),
('Peter', true, 'cat', 'Franklin', 2),
('Dave', true, 'dog', 'Queso', 1);
SELECT * FROM people;
- The above creates 4 records, according to the table schema that we defined earlier.
- Single quotation marks must be included if the data is a string. Numbers don’t need any quotation marks, avoid it. In fact, the engine will recognize your number is a string if you used them.
Querying data
SELECT *
indicates that it is a wildcard. We are extracting every column we can find in a table.
- For optimization purposes, we seldom use a wildcard. Because if the data is massive (multiple columns or huge volume of data), it can be painful trying to extrack all the columns.
You can extract a single column by specifying the column within your query:
SELECT pet_name FROM people;
You can even filter the data by using comparison operators to specify what you need:
SELECT pet_type, pet_name FROM people WHERE pet_type = 'dog' AND pet_age < 5;
Notes on querying for data:
SELECT
command extracts data from your table(s). It does not affect the actual data within your tables.- Your filters do not affect the data tables within.
- Using operators such as
AND
andOR
, you can shape how you want to extract the data. - Comparisons are done with a single symbol, i.e. in Python (“==”) vs SQL (“=”).
Students Do: Creating Tables
Let’s open up the file in the 03-Stu_Creating_Tables
folder to get started.
The Value of Unique Values
Let’s open up the file(s) in the 04-Ins_Values_of_Uniques
folder to get started.
Unique values are important for data integrity and reports. How we design unique values within a database/table is very important to ensure that your analysis is accurate.
To illustrate, we will insert duplicate data here:
INSERT INTO people (name, has_pet, pet_type, pet_name, pet_age) VALUES ('Ahmed', true, 'rock', 'Rockington', 100);
SELECT * FROM people;
Duplicate records cause all sorts of problems. It makes filtering difficult, and remove the record can be a painful process:
- Your database doesn’t know which record is true, and only you know. Thus, if you try to delete inaccurately, it can have unintended consequences on other rows.
Deleting tables
This is the syntax to delete tables: DROP TABLE people;
How to ensure if a table has unique records?
This is an innate feature of relational databases. If we define a primary key, the database will ensure that your records within the table will be unique to the key you specified.
Let’s try it with the following code:
-- Re-create the table "people" within animals_db
CREATE TABLE people (
id SERIAL PRIMARY KEY,
name VARCHAR(30) NOT NULL,
has_pet BOOLEAN DEFAULT false,
pet_type VARCHAR(10) NOT NULL,
pet_name VARCHAR(30),
pet_age INT
);
-- Insert data into the table
INSERT INTO people (name, has_pet, pet_type, pet_name, pet_age)
VALUES ('Jacob', true, 'dog', 'Misty', 10),
('Ahmed', true, 'rock', 'Rockington', 100),
('Ahmed', true, 'rock', 'Rockington', 100),
('Peter', true, 'cat', 'Franklin', 2),
('Dave', true, 'dog', 'Queso', 1),
('Dave', true, 'dog', 'Pringles', 7);
-- Query all fields from the table
SELECT *
FROM people;
PRIMARY KEY
is to define the column to be a unique identifier.SERIAL
means the database is going to create the primary key for you as you ingest- It auto increments the integer as you ingest the data.
- Typically, we don’t use
SERIAL
in real world situation much (I can’t say it is never used, but it depends on context). We would use another unique key library to ensure that the keys are truly unique.
- It is best practice to only filter and query through primary keys and not other columns to specify a certain record, unless you know what you’re doing.
Updating a record
This is the syntax for updating a record: UPDATE people SET has_pet = true, pet_name = 'Rocket', pet_age = 8 WHERE id = 6;
- The
WHERE
command is a filter to specify your change within your table.
Deleting a record
This is the syntax for deleting a record: DELETE FROM people WHERE id = 3;
- For best practice, we always create backups of the database/tables if we are doing a bulk delete. Because there is no “Recycle Bin” concept within databases typically.
Students Do: Making and Using an ID
Let’s open up the file(s) in the 05-Stu_Making_IDs
folder to get started.
Import Data
Let’s open up the file(s) in the 06-Ins_Importing_Data
folder to get started.
Importing data is something that you might do often, especially you’re collecting data from another source.
We’ll do the following to try to import data:
- Create a new database called:
Miscellaneous_DB
- SQL doesn’t appreciate spaces. We typically use underscores to denote space.
- Create a table named:
fauna_vertabrate
- Create columns from the given
importing_data_solution.sql
so that the data within the columns can be matched while import is in progress.- Some tools allow you to map different columns to the table columns within your database. However, on a typical setting, you’ll need equal columns in name and data type to make sure that the import is successful.
- There is a difference in importing sql files versus CSV files direct. I personally prefer sql files because they tend to be more accurate.
- Right-click on
Miscellaneous_DB
, and click on Refresh to reflect the changes on your GUI. - Scroll to Schemas and expand the menu. Then expand the Tables menu.
- Right-click on the table, and select Import/Export from the menu.
- There are command-line imports as well. If the data volume is small, using a GUI may suffice. However, if the data volume is huge, we use command-line because it can handle the load better.
- Select the appropriate settings so that the import will proceed smoothly.
- Run
SELECT * FROM fauna_vertabrate;
to see if the data import is done correctly.- If the data volume is huge, we might just want to list the first 5 rows just like in pandas:
SELECT *
FROM fauna_vertabrate
LIMIT 5;
Students Do: Hide and Seek
Let’s open up the file(s) in the 07-Stu_Hide_and_Seek
folder to get started.
Students Do: Using CRUD
Let’s open up the file(s) in the 08-Stu_CRUD
folder to get started.
CRUD is an acronym that stands for:
- Create
- Read
- Update
- Delete
These are all the actions that you have done previously, and this is a technical term to let people know how to interact with a relational database.
Joins
Let’s open up the file(s) in the 09-Ins_Joins
folder to get started.
Joins are particularly important when you need to reference data from two or more tables.
Remember how we did Pandas joins earlier by merging two dataframes into one (Lesson 4.3, activity 1)? The concept is the same in terms of left, right, inner and outer joins.
We are going to create two new tables, names and commodity so that we can simulate the behavior.
- Copying code from
joins.sql
to create the tables, and then we import the corresponding data fromnames.csv
andcommodity.csv
.- Refresh your database because your GUI doesn’t always reflect the changes immediately on your screen, although it should already happened in the background.
There are 5 primary types of joins that can be used in PostgresSQL:
- INNER JOIN returns records that have matching values in both tables.
- LEFT JOIN returns all records from the left table and the matched records from the right table.
- RIGHT JOIN returns all records from the right table and the matched records from the left table.
- CROSS JOIN returns records that match every row of the left table with every row of the right table. This type of join has the potential to make very large tables.
- FULL OUTER JOIN places null values within the columns that do not match between the two tables, after an inner join is performed.
- Joins typically need exact keys/columns to match so that they can sync and return relevant data. That’s why it is called a relational database.
Students Do: Joining Bird Bands
Let’s open up the file(s) in the 10-Par_Joins
folder to get started.