Companies are willing to spend on big data, but they aren’t willing to be inefficient about it. Being inefficient could mean millions of dollars.

At some point as a data analyst or scientist, you want your work to persist and be reusable across the organization. Reuse is proof of your work's value.

We typically do not optimize while experimenting or prototyping, because optimizing too early dilutes focus and hinders innovation.

However, once your prototype is proven and you want others to build on your work, optimization becomes important.

Introducing Parquet

Let’s open up the file(s) in the 01-Ins_Data_Storage folder to get started.

All the data must play its “part”

Let’s open up the file(s) in the 03-Ins_Partitioning folder to get started.

Caching

Let’s open up the file(s) in the 05-Ins_Cache folder to get started.

Common Table Expressions (CTE): https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-cte.html#common-table-expression-cte

Caching Flight Delays

Let’s open up the file(s) in the 06-Evr_Cache folder to get started.