Web scraping is part of data collection, where you harvest public data for your analysis.

Scraping data from the web isn’t too common outside specific use cases, because whether you need it depends heavily on the business context.

  • For example, you wouldn’t typically expect a logistics or an oil-and-gas company to scrape data from the web unless it had a specific reason to.

However, the techniques are good to learn, and they are very useful for entrepreneurship.

Tools to scrape websites

We are going to use the following tools to automate behavior in a browser. This is akin to having a robot move your mouse after you program a few repeated steps.

  • Splinter: A tool that automates web browser actions, letting us scan pages and repeat interactions on websites automatically.
  • ChromeDriver: A standalone driver program that lets automation tools such as Splinter (through Selenium) control the Chrome browser.
  • Beautiful Soup: A Python library that lets you parse web pages and extract information from them into variables.
  • html5lib and lxml: Parser libraries that complement Beautiful Soup, which uses them to read HTML.

Installing The Web Automation Tools

You’ll need to install ChromeDriver by following the instructions here: https://splinter.readthedocs.io/en/latest/install/external.html

		pip install "splinter[selenium4]"
		pip install beautifulsoup4
		pip install html5lib
		pip install lxml
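
Once everything is installed, the tools fit together roughly as follows. This is a minimal sketch, not the course's own code: the helper names fetch_rendered_html and page_title are made up for illustration, it assumes Chrome and a matching ChromeDriver are available, and it uses Python's built-in "html.parser" so the parsing step works even without lxml or html5lib.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url):
    """Open url in an automated Chrome window and return the page's HTML."""
    # Imported here so the parsing helper below still works without a browser.
    from splinter import Browser

    browser = Browser("chrome")  # assumes Chrome and ChromeDriver are installed
    try:
        browser.visit(url)
        return browser.html
    finally:
        browser.quit()  # always close the automated browser window


def page_title(html):
    """Parse an HTML string and return the text of its <title> tag, if any."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None
```

Calling page_title(fetch_rendered_html(some_url)) would chain the two steps: Splinter drives the browser to get the HTML, and Beautiful Soup pulls a value out of it.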

Hello HTML

Let’s open up the file(s) in the 01-Ins_Hello_HTML folder to get started.

To scrape information from websites, we need to understand the basic construct of a website.

HTML stands for HyperText Markup Language, where:

  • Your browser interprets the tags and structure of a text page, then formats and lays out the content accordingly.

CSS stands for Cascading Style Sheets, and it adds styling on top of the HTML to shape the aesthetics of a web page.

The goal isn’t to become a web designer, but because we rely on HTML tags and CSS selectors to locate and extract data, it is important to know what these things are.
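
To make the idea of tags and structure concrete, here is a small sketch that walks a made-up page and prints each opening tag indented by how deeply it is nested. The page content and the TagOutline class are invented for illustration; only Python's standard library is used, and the depth counter assumes every tag in the sample has a matching closing tag.

```python
from html.parser import HTMLParser

# A tiny, made-up page used only for illustration.
SAMPLE_PAGE = """
<html>
  <head><title>Hello HTML</title></head>
  <body>
    <h1 class="headline">My First Page</h1>
    <p>Scraping starts with reading tags.</p>
  </body>
</html>
"""


class TagOutline(HTMLParser):
    """Record each opening tag together with how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Store the nesting depth, the tag name, and its attributes.
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


parser = TagOutline()
parser.feed(SAMPLE_PAGE)
for depth, tag, attrs in parser.outline:
    print("  " * depth + tag, attrs)
```

The indented output mirrors the tree that a browser builds from the same text, which is the structure a scraper navigates to find data.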

Students Do: My First HTML

Introduction to Beautiful Soup

Let’s open up the file in the 03-Ins_BeautifulSoup folder to get started.
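
As a rough preview of what Beautiful Soup does, the sketch below parses a made-up HTML string and pulls a few values into Python variables. The HTML is invented for illustration and is not the file from the class folder; the built-in "html.parser" is assumed here, though "lxml" or "html5lib" could be passed instead if installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
html = """
<html>
  <head><title>Daily News</title></head>
  <body>
    <h1>Top Stories</h1>
    <ul>
      <li><a href="/story/1">Markets rally</a></li>
      <li><a href="/story/2">Local team wins</a></li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.title.text          # text inside the <title> tag
heading = soup.find("h1").text   # first <h1> on the page
# Every <a> tag, as (link text, href attribute) pairs.
links = [(a.text, a["href"]) for a in soup.find_all("a")]

print(title)
print(heading)
print(links)
```

find returns the first matching tag and find_all returns every match, so the usual pattern is to locate a container first and then loop over the items inside it.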

Students Do: From Soup to Nuts

Styling HTML Elements with CSS

Let’s open up the file(s) in the 05-Ins_CSS_Intro folder to get started.

Styling can be important for a data analyst, especially if your product or visualization is web-based.

Example: Highcharts

Introduction to CSS Selectors

Let’s open up the file(s) in the 06-Ins_CSS folder to get started.
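
The same selectors that CSS uses to style elements can also be used to pick elements out for scraping. The sketch below is an illustration on made-up HTML, not the class file: it uses Beautiful Soup's select and select_one methods with a tag selector, a class selector, an id with a descendant, and a direct-child selector.

```python
from bs4 import BeautifulSoup

# Made-up lists used only to demonstrate selectors.
html = """
<ul id="groceries">
  <li class="fruit">Apple</li>
  <li class="vegetable">Carrot</li>
  <li class="fruit">Banana</li>
</ul>
<ul id="chores">
  <li class="urgent">Laundry</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag selector: every <li> on the page.
all_items = [li.text for li in soup.select("li")]
# Class selector: only <li> elements with class="fruit".
fruits = [li.text for li in soup.select("li.fruit")]
# Id plus descendant selector: <li> elements inside the element with id="chores".
chores = [li.text for li in soup.select("#chores li")]
# Direct-child selector; select_one returns only the first match.
first_grocery = soup.select_one("#groceries > li").text

print(all_items, fruits, chores, first_grocery)
```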

Students Do: CSS My List