Companies are willing to spend on big data, but they are not willing to be inefficient about it. At scale, inefficiency can cost millions of dollars.
At some point as a data analyst or scientist, you want your work to persist and be reusable across the organization; making it reusable is how you prove its value.
We typically do not optimize while experimenting or prototyping, because optimization dilutes focus and slows innovation.
However, once your prototype is proven and you want others to use your work, optimization becomes important.
Introducing Parquet
Let’s open up the file(s) in the 01-Ins_Data_Storage folder to get started.
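For reference, here is a minimal sketch of writing and reading Parquet with PySpark. The SparkSession setup, the people.csv input, and the output path are illustrative assumptions, not the demo's actual files.

```python
# A minimal sketch, assuming a local SparkSession and a hypothetical people.csv
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet_demo").getOrCreate()

# Read a plain CSV into a DataFrame (path and schema are illustrative)
csv_df = spark.read.csv("people.csv", header=True, inferSchema=True)

# Write the same data as Parquet: a compressed, columnar format that
# stores the schema alongside the data
csv_df.write.parquet("people.parquet", mode="overwrite")

# Reading it back requires no header handling or schema inference
parquet_df = spark.read.parquet("people.parquet")
parquet_df.show(5)
```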
Students Do: Practicing Parquet
Let’s open up the file(s) in the 02-Stu_Practicing_Parquet folder to get started.
All the data must play its “part”
Let’s open up the file(s) in the 03-Ins_Partitioning folder to get started.
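As a rough sketch of what a partitioned write looks like, the snippet below partitions output files by a column; the year column and the file paths are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition_demo").getOrCreate()

# Hypothetical input carried over from the previous demo
df = spark.read.parquet("people.parquet")

# partitionBy writes one sub-folder per distinct value of the column,
# e.g. people_partitioned.parquet/year=2020/part-*.parquet
df.write.partitionBy("year").parquet("people_partitioned.parquet", mode="overwrite")

# A query that filters on the partition column only reads the matching folders
spark.read.parquet("people_partitioned.parquet").filter("year = 2020").show(5)
```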
Students Do: Writing to Parquet
Let’s open up the file(s) in the 04-Stu_Partitioning folder to get started.
Caching
Let’s open up the file(s) in the 05-Ins_Cache folder to get started.
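A minimal caching sketch, again assuming the hypothetical Parquet file and year column from the earlier examples:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache_demo").getOrCreate()
df = spark.read.parquet("people.parquet")  # hypothetical path

# cache() is lazy: the data is materialized in memory on the first action
df.cache()
df.count()            # action that populates the cache
print(df.is_cached)   # True

# Subsequent queries against df read from memory instead of disk
df.groupBy("year").count().show()  # "year" column is an assumption

# Release the memory when you are done
df.unpersist()
```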
Common Table Expressions (CTE): https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-cte.html#common-table-expression-cte
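The linked docs cover the syntax; as a hedged sketch, a CTE in Spark SQL looks like the following. The delays view and its carrier/delay columns are made-up names for illustration.

```python
# Assumes an existing SparkSession `spark` and a DataFrame `df` of flight data
df.createOrReplaceTempView("delays")

result = spark.sql("""
    WITH late_flights AS (
        SELECT carrier, delay
        FROM delays
        WHERE delay > 15
    )
    SELECT carrier, AVG(delay) AS avg_delay
    FROM late_flights
    GROUP BY carrier
    ORDER BY avg_delay DESC
""")
result.show()
```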
Caching Flight Delays
Let’s open up the file(s) in the 06-Evr_Cache folder to get started.
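One way to see the effect during this activity is to time the same query before and after caching. This is a sketch under the assumption of a hypothetical flight_delays.csv, not the activity's actual dataset.

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flight_cache_demo").getOrCreate()

delays_df = spark.read.csv("flight_delays.csv", header=True, inferSchema=True)
delays_df.createOrReplaceTempView("flight_delays")

def timed_count():
    """Run a full count against the view and return how long it took."""
    start = time.time()
    spark.sql("SELECT COUNT(*) FROM flight_delays").show()
    return time.time() - start

uncached_seconds = timed_count()

# CACHE TABLE is eager: the view is materialized in memory immediately
spark.sql("CACHE TABLE flight_delays")
cached_seconds = timed_count()

print(f"Uncached: {uncached_seconds:.2f}s, cached: {cached_seconds:.2f}s")

# Free the memory when the comparison is done
spark.sql("UNCACHE TABLE flight_delays")
```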