We are wrapping up the techniques for preparing data for analysis.
This list is by no means exhaustive. Your repertoire will grow as you practice and build your skills, knowledge, and experience.
Merging DataFrames
Let’s open up the file in the 01-Ins_Merging folder to get started.
Merging datasets is a staple activity, especially when we want to get insights across multiple datasets.
Data modeling will be crucial because we want to ensure data integrity when we merge data, but that will be part of your future coursework.
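As a quick reference before opening the activity, here is a minimal sketch of a merge. The DataFrames, column names, and values are made up for illustration and are not taken from the class files.

```python
import pandas as pd

# Hypothetical data: names and values are invented for this sketch
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ana", "Ben", "Cy"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3, 4],
    "amount": [20.0, 35.5, 12.0, 50.0],
})

# Inner join: keep only customer_id values that appear in both DataFrames
inner_df = pd.merge(customers, orders, on="customer_id", how="inner")

# Outer join: keep every row from both sides; unmatched cells become NaN
outer_df = pd.merge(customers, orders, on="customer_id", how="outer")

print(inner_df)
print(outer_df)
```

The how argument decides which rows survive the merge, which is where data integrity questions usually show up: an inner join silently drops unmatched rows, while an outer join keeps them and fills the gaps with NaN.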
Students Do: Census Merging
Let’s open up the file in the 02-Stu_Census_Merging folder to get started.
Binning Data
Let’s open up the file in the 03-Ins_Binning folder to get started.
Categorizing data with specific conditions is necessary for all types of data analysis. We used to do it manually, but as Pandas has evolved, it has lowered the barrier to doing it in code.
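One common way to bin values in Pandas is pd.cut. The sketch below assumes a made-up score column with hypothetical bin edges and labels; the class activity may use different data.

```python
import pandas as pd

# Hypothetical scores, bin edges, and labels for illustration only
scores_df = pd.DataFrame({"score": [55, 68, 74, 89, 95]})
bins = [0, 59, 69, 79, 89, 100]
labels = ["F", "D", "C", "B", "A"]

# By default each bin includes its right edge: (0, 59], (59, 69], ...
scores_df["grade"] = pd.cut(scores_df["score"], bins=bins, labels=labels)

print(scores_df)
```

There must be one fewer label than bin edges, because each label names the interval between two neighboring edges.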
Students Do: Binning Movies
Let’s open up the README.md file in the 04-Stu_MovieRatings_Binning folder to get started.
Mapping
Let’s open up the file in the 05-Ins_Mapping folder to get started.
According to the official documentation, map is a function that applies transformation logic to each value in a column: https://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.Series.map.html
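To make this concrete, here is a small sketch of Series.map used two ways: with a dictionary lookup and with a formatting function. The DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only
sales_df = pd.DataFrame({
    "region": ["N", "S", "N"],
    "revenue": [1234.5, 99.999, 10.0],
})

# Map with a dictionary: each value is replaced by its lookup result
sales_df["region_name"] = sales_df["region"].map({"N": "North", "S": "South"})

# Map with a function: each value is passed to str.format and becomes a string
sales_df["revenue_fmt"] = sales_df["revenue"].map("${:.2f}".format)

print(sales_df)
```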
Notes
- Notice that when we apply formatting to NaN values, they become objects (strings) within Pandas.
- You would want to remove NaN values first before applying formatting.
- ${:.2f} means we want to format the data to 2 decimal places.
  - This is not the only way to round data. I typically use the numpy library, which we will cover in future coursework.
    - If your work requires high-precision values, such as architecture and buildings, you will use numpy to ensure accuracy.
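The notes above can be sketched as follows. The prices Series is made up for illustration, and np.round is shown only as one possible way to round while keeping the values numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical values, including one missing entry
prices = pd.Series([19.999, np.nan, 5.0])

# Formatting everything turns NaN into the string "$nan"
formatted_all = prices.map("${:.2f}".format)

# Dropping NaN first avoids the stray "$nan" string
formatted_clean = prices.dropna().map("${:.2f}".format)

# Rounding keeps the column numeric, so it can still be used in calculations
rounded = np.round(prices, 2)

print(formatted_all.tolist())
print(formatted_clean.tolist())
print(rounded.tolist())
```

Formatting is best left as the last step, since the formatted column holds strings and can no longer be summed or averaged.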
What you have learned is not exhaustive
There is more depth to column-level transformation than this class can cover. However, it is useful to know what exists so that you can research it later.
- Using lambda x to run a function over the values when your transformation logic is too complex for a simple mapping: https://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.Series.map.html
  - Example: file_df['INCOME_BOOL'] = file_df['INCOME'].map(lambda x: False if x == 0 else True)
    - We are creating a new column based on each value within the ‘INCOME’ column.
    - x is the variable that holds the value of each row as map iterates.
- Using apply to access other columns’ values for transformation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
  - Example: file_df['NET_INCOME'] = file_df.apply(lambda x: x['INCOME'] - x['COSTS'], axis=1)
    - Using apply with axis=1, x is the whole row, so it can return any column’s value by name as though it is a dictionary.
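The two examples above can be run end to end with a small DataFrame. The values below are made up for illustration; only the column names come from the examples.

```python
import pandas as pd

# Hypothetical values; column names follow the examples above
file_df = pd.DataFrame({
    "INCOME": [0, 500, 1200],
    "COSTS": [0, 200, 1500],
})

# map: the lambda receives one value from the INCOME column at a time
file_df["INCOME_BOOL"] = file_df["INCOME"].map(lambda x: x != 0)

# apply with axis=1: the lambda receives one whole row at a time
file_df["NET_INCOME"] = file_df.apply(
    lambda row: row["INCOME"] - row["COSTS"], axis=1
)

# For simple arithmetic, plain column subtraction gives the same result
file_df["NET_INCOME_VEC"] = file_df["INCOME"] - file_df["COSTS"]

print(file_df)
```

Row-wise apply is flexible but slow on large DataFrames, so it is worth checking whether plain column arithmetic can express the same logic first.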
Crowdfunding Cleaning
Let’s open up the file in the 06-Evr_Crowdfunding_Cleaning folder to get started.
Introduction to Bug Fixing
Look at the file(s) in the 07-Ins_Intro_to_Bugfixing folder.
Bug fixing is something that is caught, not only taught. It is like riding a bicycle. You can’t learn to ride just by reading about it. You have to actually do it to get better at it.
As you do more bug fixing, your troubleshooting skills will grow as well.
Being able to debug code is key to excellence, especially when you’re working in a team. You will need to ensure the quality of your teammates’ work as well as your own in order to produce good products.
Bug Fixing Bonanza
Let’s open up the file in the 08-Evr_Bugfixing_Bonanza folder to get started.