Companies are willing to spend on big data, but they are not willing to be inefficient about it. At scale, inefficiency can cost millions of dollars.
At some point as a data analyst or scientist, you want your work to persist and be reusable across the organization; making it reusable is how you prove its value.
We typically do not optimize while experimenting or prototyping, because optimization dilutes focus and slows innovation.
However, once your prototype is proven and you want others to use your work, optimization becomes important.
Introducing Parquet
Let’s open up the file(s) in the 01-Ins_Data_Storage folder to get started.
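For reference, here is a minimal sketch of writing and reading Parquet with PySpark. The SparkSession setup, the people.csv input, and the output path are illustrative assumptions, not the demo's actual files.

```python
# A minimal sketch, assuming a local SparkSession and a hypothetical people.csv
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet_demo").getOrCreate()

# Read a plain CSV into a DataFrame (path and schema are illustrative)
csv_df = spark.read.csv("people.csv", header=True, inferSchema=True)

# Write the same data as Parquet: a compressed, columnar format that
# stores the schema alongside the data
csv_df.write.parquet("people.parquet", mode="overwrite")

# Reading it back requires no header handling or schema inference
parquet_df = spark.read.parquet("people.parquet")
parquet_df.show(5)
```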
Students Do: Practicing Parquet
Let’s open up the file(s) in the 02-Stu_Practicing_Parquet folder to get started.
All the data must play its “part”
Let’s open up the file(s) in the 03-Ins_Partitioning folder to get started.
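As a rough sketch of what a partitioned write looks like, the snippet below partitions output files by a column; the year column and the file paths are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition_demo").getOrCreate()

# Hypothetical input carried over from the previous demo
df = spark.read.parquet("people.parquet")

# partitionBy writes one sub-folder per distinct value of the column,
# e.g. people_partitioned.parquet/year=2020/part-*.parquet
df.write.partitionBy("year").parquet("people_partitioned.parquet", mode="overwrite")

# A query that filters on the partition column only reads the matching folders
spark.read.parquet("people_partitioned.parquet").filter("year = 2020").show(5)
```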
Students Do: Writing to Parquet
Let’s open up the file(s) in the 04-Stu_Partitioning folder to get started.
Caching
Let’s open up the file(s) in the 05-Ins_Cache folder to get started.
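A minimal caching sketch, again assuming the hypothetical Parquet file and year column from the earlier examples:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache_demo").getOrCreate()
df = spark.read.parquet("people.parquet")  # hypothetical path

# cache() is lazy: the data is materialized in memory on the first action
df.cache()
df.count()            # action that populates the cache
print(df.is_cached)   # True

# Subsequent queries against df read from memory instead of disk
df.groupBy("year").count().show()  # "year" column is an assumption

# Release the memory when you are done
df.unpersist()
```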
Common Table Expressions (CTE): https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-cte.html#common-table-expression-cte
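The linked docs cover the syntax; as a hedged sketch, a CTE in Spark SQL looks like the following. The delays view and its carrier/delay columns are made-up names for illustration.

```python
# Assumes an existing SparkSession `spark` and a DataFrame `df` of flight data
df.createOrReplaceTempView("delays")

result = spark.sql("""
    WITH late_flights AS (
        SELECT carrier, delay
        FROM delays
        WHERE delay > 15
    )
    SELECT carrier, AVG(delay) AS avg_delay
    FROM late_flights
    GROUP BY carrier
    ORDER BY avg_delay DESC
""")
result.show()
```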
Caching Flight Delays
Let’s open up the file(s) in the 06-Evr_Cache folder to get started.
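One way to see the effect during this activity is to time the same query before and after caching. This is a sketch under the assumption of a hypothetical flight_delays.csv, not the activity's actual dataset.

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flight_cache_demo").getOrCreate()

delays_df = spark.read.csv("flight_delays.csv", header=True, inferSchema=True)
delays_df.createOrReplaceTempView("flight_delays")

def timed_count():
    """Run a full count against the view and return how long it took."""
    start = time.time()
    spark.sql("SELECT COUNT(*) FROM flight_delays").show()
    return time.time() - start

uncached_seconds = timed_count()

# CACHE TABLE is eager: the view is materialized in memory immediately
spark.sql("CACHE TABLE flight_delays")
cached_seconds = timed_count()

print(f"Uncached: {uncached_seconds:.2f}s, cached: {cached_seconds:.2f}s")

# Free the memory when the comparison is done
spark.sql("UNCACHE TABLE flight_delays")
```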