Web scraping is part of data collection, where you’re harvesting public data from websites for your analysis.
In practice, scraping data from the web is uncommon outside specific use cases, because its value depends heavily on the business context.
- For example, you wouldn’t typically expect a logistics or an oil-and-gas company to scrape data from the web unless a specific need calls for it.
However, the techniques are good to learn, and they are very useful for entrepreneurship.
Tools to Scrape Websites
We are going to use the following tools to automate behavior in a browser. This is akin to having a robot move your mouse after you program a few repeated steps.
- Splinter: A tool that automates our web browser actions, which allows us to automatically scan and repeat interactions on websites.
- ChromeDriver: The driver program that lets automation tools such as Splinter (through Selenium) control the Chrome browser.
- Beautiful Soup: A Python library that allows you to extract information out of web pages, and parse them into variables.
- html5lib and lxml: Complementary parser libraries that Beautiful Soup can use to read HTML.
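To show how these tools fit together, here is a minimal sketch: Splinter opens the page in Chrome and hands back its HTML, and Beautiful Soup parses that HTML into Python variables. The function names, the sample HTML, and the choice to collect link text are assumptions made for illustration, not part of the class files. The sketch uses Python's built-in "html.parser"; "lxml" or "html5lib" could be passed instead once they are installed.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url):
    """Open `url` in Chrome via Splinter and return the page's HTML."""
    # Imported here so the parsing helper below can be used without a browser.
    from splinter import Browser

    # Assumes Chrome and a matching ChromeDriver are available on this machine.
    browser = Browser("chrome")
    try:
        browser.visit(url)
        return browser.html
    finally:
        browser.quit()  # always close the automated browser window


def extract_link_texts(html):
    """Parse HTML with Beautiful Soup and return the text of every link."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.find_all("a")]


# Hypothetical HTML standing in for what fetch_rendered_html() would return.
sample_html = """
<html><body>
  <a href="/news/1">First headline</a>
  <a href="/news/2">Second headline</a>
</body></html>
"""
link_texts = extract_link_texts(sample_html)
print(link_texts)
```

To try it on a live page, pass a real URL to fetch_rendered_html() and feed the result to extract_link_texts().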
Installing The Web Automation Tools
You’ll need to install ChromeDriver by following the instructions here: https://splinter.readthedocs.io/en/latest/install/external.html
pip install "splinter[selenium4]"
pip install beautifulsoup4
pip install html5lib
pip install lxml
Hello HTML
Let’s open up the file(s) in the 01-Ins_Hello_HTML folder to get started.
To scrape information from websites, we need to understand the basic structure of a web page.
HTML stands for HyperText Markup Language, where:
- Your browser interprets the tags and structure of a text page, and formats and lays it out accordingly.
CSS stands for Cascading Style Sheets, and it adds styling that refines the look of a web page.
The goal isn’t to become a web designer, but because we rely on the structure of these languages to extract data, it is important to know what they are.
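To see the tag structure a browser works from, here is a small sketch that walks a hand-written page using only Python's built-in html.parser module and prints each opening tag indented by its nesting depth. The page content and the TagLister class are made up for illustration.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((self.depth, tag))
        self.depth += 1  # anything that follows is nested inside this tag

    def handle_endtag(self, tag):
        self.depth -= 1  # closing tag: step back out one level


# A hypothetical minimal page: HTML for structure, a <style> block for CSS.
page = """<html>
<head><title>Hello HTML</title><style>h1 { color: navy; }</style></head>
<body><h1>Hello</h1><p>My first page.</p></body>
</html>"""

lister = TagLister()
lister.feed(page)
for depth, tag in lister.tags:
    print("  " * depth + tag)
```

The indentation in the output mirrors how elements nest inside one another, which is the same structure we navigate later when extracting data.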
Students Do: My First HTML
Let’s open up the file(s) in the 02-Stu_first_html folder to get started.
Introduction to Beautiful Soup
Let’s open up the file in the 03-Ins_BeautifulSoup folder to get started.
Students Do: From Soup to Nuts
Let’s open up the file in the 04-Stu_Soup_to_Nuts folder to get started.
Styling HTML Elements with CSS
Let’s open up the file(s) in the 05-Ins_CSS_Intro folder to get started.
Styling can be important for a data analyst, especially if your product or visualization is web-based.
Example: Highcharts
Introduction to CSS Selectors
Let’s open up the file(s) in the 06-Ins_CSS folder to get started.
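CSS selectors matter for scraping because the same patterns that style elements can also be used to find them. As a rough sketch, Beautiful Soup's select() and select_one() methods accept CSS selectors. The HTML snippet, ids, and class names below are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical list markup with an id on the list and classes on the items.
html = """
<ul id="fruits">
  <li class="citrus">Orange</li>
  <li class="citrus">Lemon</li>
  <li class="berry">Strawberry</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag selector: every <li> element.
all_items = [li.get_text() for li in soup.select("li")]

# Class selector: only <li> elements with class="citrus".
citrus = [li.get_text() for li in soup.select("li.citrus")]

# ID plus descendant selector: the first .berry element inside #fruits.
berry = soup.select_one("#fruits .berry").get_text()

print(all_items)
print(citrus)
print(berry)
```

select() always returns a list (possibly empty), while select_one() returns the first match or None, so check for None before calling get_text() on real pages.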
Students Do: CSS My List
Let’s open up the file(s) in the 07-Stu_CSS folder to get started.