COSC203 Web, Databases, and Networks

Lab 15: Web Scraping

🎯 Lab Objective

In this lab you will learn to scrape data from websites.

  • robots.txt contains a website's scraping policy.
  • A web scraper is a program that extracts data from websites.
  • A web crawler is a web scraper that also performs link discovery.


1. Web Scraping

Web scraping is the process of extracting data from websites.

✅ Tip
Some people call web scraping data mining, but data mining is a much broader term that covers many other techniques.

Web scraping should be conducted ethically and legally, respecting website terms of service and intellectual property. If you are not careful you could get your IP address banned, or face legal action. OpenAI, for example, has been sued for allegedly scraping the training data for ChatGPT without consent.

Nevertheless, web scraping is often perfectly legal and can be a powerful tool in many situations. Websites that really dislike their data being scraped will prevent access with paywalls, logins, or human verification systems like CAPTCHA.


You can check a website's scraping policy by looking at its robots.txt file. For example, https://www.facebook.com/robots.txt prohibits almost all web scraping, and so does https://trademe.co.nz/robots.txt

πŸ“ Task 1: robots.txt

Read the robots.txt for a website.

  1. Choose a website
  2. Take their domain name e.g. www.example.com
  3. Add /robots.txt to the end.
  4. Visit the URL in your browser.

Is your chosen website friendly to web scrapers?
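
You can also check a policy from code. Python's standard-library urllib.robotparser understands the robots.txt format; below is a minimal sketch that parses a made-up policy (the rules and URLs are hypothetical, not taken from any real site). Normally you would call set_url(...) and read() to fetch a live file instead.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt policy (not from any real site).
rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # for a live site: parser.set_url(...), parser.read()

# can_fetch(useragent, url) asks whether the policy permits a fetch
print(parser.can_fetch("*", "https://www.example.com/"))                 # True
print(parser.can_fetch("*", "https://www.example.com/private/page.html"))  # False
```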


2. Selenium

We are going to use Selenium, a popular tool for web scraping. It can render JavaScript-heavy sites, supports multiple browsers, and can emulate user behavior. However, for simpler tasks or static websites, alternatives like BeautifulSoup or Requests may be more efficient.

The website we will scrape data from is Zyte's Web Scraping Sandbox: www.toscrape.com.

You may use either Python or Java for this lab.

πŸ“ Task 2: Selenium Setup
Select either 🐍 Python or β˜•οΈ Java below and follow the steps to get Selenium running.
βœ… Tip
Python has a more simple syntax imho, but the choice is yours.
  1. On the lab machines run this script:
    • K:\setup\enable-python.ps1
    • Right click > Run with PowerShell
  2. Install the Selenium package: pip install selenium
  3. Make sure the code below works.
    • it should first open a browser,
    • visit Google, search for "COSC203",
    • wait 5 seconds, then close the browser.
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

driver = webdriver.Firefox()
# driver = webdriver.Chrome()

# visit a URL
driver.get("http://www.google.com/")

sleep(1)

# find the search box, type in a query, then submit
search_box = driver.find_element(By.NAME, "q")
search_box.send_keys("COSC203")
search_box.submit()

sleep(5)

# close the browser
driver.quit()
  1. If working on the lab machines, run this script first:
    • K:\setup\enable-java.ps1
    • Right click > Run with PowerShell
  2. Download this repo
package cosc203;

// import
import java.util.Scanner;
import org.openqa.selenium.*;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.safari.SafariDriver;

public class App {
    public static void main(String[] args) throws InterruptedException {
        WebDriver driver = new FirefoxDriver();
        // WebDriver driver = new SafariDriver();

        // visit a URL
        driver.get("http://www.google.com/");
        
        Thread.sleep(1000);
        
        // find the search box, type in a query, then submit
        WebElement searchBox = driver.findElement(By.name("q"));
        searchBox.sendKeys("COSC203");
        searchBox.submit();

        System.out.println("\n\nPress Enter to close browser.");
        new Scanner(System.in).nextLine();
        driver.close();
    }

}

Java code troubleshooting tips.

Can't find JDK?

  • In a terminal run
    • java --version
    • javac --version
  • Update pom.xml to match your Java version
    • probably 17 or 18.
    <maven.compiler.source>18</maven.compiler.source>
    <maven.compiler.target>18</maven.compiler.target>

Browser Driver Crashes?

We found Firefox to be the most reliable, but you can use either Chrome or Safari.

Safari on MacOS

  1. Open Safari
    • Safari > Preferences > Advanced > Show Develop menu in menu bar
    • Develop > Allow Remote Automation
  2. Update App.java to use Safari
import org.openqa.selenium.safari.*;

WebDriver driver = new SafariDriver();

Google Chrome

  1. Update App.java to use Chrome
import org.openqa.selenium.chrome.*;

WebDriver driver = new ChromeDriver();
  2. If you get an error about chromedriver not being found, you need to download it.
    System.setProperty("webdriver.chrome.driver", "path/to/chromedriver.exe");
    WebDriver driver = new ChromeDriver();
    

3. Scraping Quotes

We'll start with the "quotes" website.

πŸ“ Task 3: View Page Source
  1. In a browser visit:
  2. Inspect page source
    • Right Click > View Page Source
  3. Find these elements in the HTML:
    • <div class="quote">
    • <span class="text">
    • <small class="author">
    • <a href="/page/2/">Next →</a>

Let's start scraping!

πŸ“ Task 4: Scraping Quotes

Start with the provided example code

  1. change the url to https://quotes.toscrape.com/

  2. retrieve a list of all quote elements:

🐍 Python

quotes = driver.find_elements(By.CLASS_NAME, "quote")

β˜•οΈ Java

List<WebElement> quotes = driver.findElements(By.className("quote"));
  3. Print the number of quotes found.
    • It should be 10.

4. The WebElement Class

The find_element(...) method returns a single WebElement, and find_elements(...) returns a list of WebElements.

Both accept a By selector as an argument, which identifies which element(s) to find.

🐍 Python            | ☕️ Java                    | Description
---------------------|----------------------------|------------
By.ID                | By.id(String)              | Matches elements by the id attribute.
By.CLASS_NAME        | By.className(String)       | Matches elements by the class attribute.
By.TAG_NAME          | By.tagName(String)         | Matches elements by the tag name (e.g., div, a, input).
By.LINK_TEXT         | By.linkText(String)        | Matches anchor elements (<a>) by their exact text.
By.PARTIAL_LINK_TEXT | By.partialLinkText(String) | Matches anchor elements by partial text.
By.CSS_SELECTOR      | By.cssSelector(String)     | Matches elements using CSS selectors.

Once you have a WebElement you can interact with it.

Getting the text of an element

🐍 Python

element = driver.find_element(By.ID, "x")
text = element.text

β˜•οΈ Java

WebElement element = driver.findElement(By.id("x"));
String text = element.getText();

Clicking an element

🐍 Python

element.click()

β˜•οΈ Java

element.click();

Sending Keystrokes

🐍 Python

element.send_keys('some_text')

β˜•οΈ Java

element.sendKeys("some_text");

Now, let's scrape multiple pages.

πŸ“ Task 5: Clicking Links

The link we want to click is the "Next →" link at the bottom of the page. It's an <a> tag with the text "Next →", so we can use the By.LINK_TEXT selector.

  1. Find and click the "Next →" link

🐍 Python

next_button = driver.find_elements(By.LINK_TEXT, "Next →")
if len(next_button) > 0:
    next_button[0].click()

β˜•οΈ Java

List<WebElement> nextButton = driver.findElements(By.linkText("Next →"));
if (nextButton.size() > 0) {
    nextButton.get(0).click();
}
  2. Wrap your code in a loop to visit all the pages.
    • The last page doesn't have a "Next →" link,
    • so end the loop when you can't find the link.
    • Each iteration of the loop should find 10 quotes.
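
The scrape-then-click loop has a simple shape: collect the quotes, look for the link, stop when the link is gone. Here is that control flow sketched with plain lists standing in for the Selenium calls; the page data is made up, and the comments show where the real driver calls would go.

```python
# Hypothetical stand-in for the site: each inner list is one page of quotes.
pages = [
    ["quote 1", "quote 2"],   # page 1
    ["quote 3", "quote 4"],   # page 2
    ["quote 5"],              # page 3 (no "Next →" link after this one)
]

all_quotes = []
page_index = 0
while True:
    # In Selenium: quotes = driver.find_elements(By.CLASS_NAME, "quote")
    quotes = pages[page_index]
    all_quotes.extend(quotes)

    # In Selenium: next_button = driver.find_elements(By.LINK_TEXT, "Next →")
    has_next = page_index + 1 < len(pages)
    if not has_next:
        break  # no "Next →" link: we are on the last page
    page_index += 1  # in Selenium: next_button[0].click()

print(len(all_quotes))  # 5
```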

Extract the data from each quote.

πŸ“ Task 6: Extracting Data

The quote element has three child elements: text, author, and tags. We can find these elements easily by calling find_element again on the parent element.

  1. The following code snippets will print the text of each quote.

🐍 Python

for q in quotes:
    child_element = q.find_element(By.CLASS_NAME, "text")
    print(child_element.text)

β˜•οΈ Javaβ˜•οΈ Java

```java
for (WebElement q : quotes) {
    WebElement text = q.findElement(By.className("text"));
    System.out.println(text.getText());
}
  2. Print the author of each quote.
  3. Count the total number of quotes.
  4. Only print quotes by "Mark Twain".
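
The remaining steps are ordinary Python once each quote's text and author have been extracted. A minimal sketch, using hypothetical sample pairs in place of the scraped WebElements:

```python
# Hypothetical (text, author) pairs standing in for the scraped quotes.
quotes = [
    ("A day without sunshine is like, you know, night.", "Steve Martin"),
    ("Never put off till tomorrow what you can do the day after tomorrow.", "Mark Twain"),
    ("So many books, so little time.", "Frank Zappa"),
]

# Print the author of each quote.
for text, author in quotes:
    print(author)

# Count the total number of quotes.
print(len(quotes))  # 3

# Only print quotes by "Mark Twain".
for text, author in quotes:
    if author == "Mark Twain":
        print(text)
```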

HTML Attributes

Debugging can be tricky, but it’s much easier if you can print the raw HTML of WebElements.

🐍 Python

html = element.get_attribute("outerHTML")
print(html)

β˜•οΈ Java

String html = element.getAttribute("outerHTML");
System.out.println(html);

You can also use .get_attribute(...) to read other attributes, like the class names.

🐍 Python

classes = element.get_attribute("class")
print(classes)

β˜•οΈ Java

String classes = element.getAttribute("class");
System.out.println(classes);

5. Scraping to JSON

There is another web scraping sandbox that simulates an online bookstore: https://books.toscrape.com/

πŸ“ Task 7: Scraping Books

Here is the link: https://books.toscrape.com/

We want to answer the questions below. But scraping all the pages takes several minutes, so we will scrape the data once, save it to a JSON file, and then answer the questions from that file.

  1. How many books are there?
  2. Which book is cheapest?
  3. Which book is most expensive?
  4. Which books are rated 5-stars?

The JSON file might look something like:

[
    {
        "title": "A Light in the Attic",
        "price": "£51.77",
        "rating": "Three stars",
        "availability": "In stock"
    },
    {
        "title": "Tipping the Velvet",
        "price": "£53.74",
        "rating": "One star",
        "availability": "In stock"
    },
    ...
]
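
Once books.json exists, the four questions reduce to plain Python over the loaded list. Here is a sketch assuming the field format shown above; the sample entries, titles, and prices are made up for illustration, and the price is parsed by stripping the £ sign.

```python
import json

# Hypothetical entries in the format shown above.
book_data = [
    {"title": "Book A", "price": "£51.77", "rating": "Three stars", "availability": "In stock"},
    {"title": "Book B", "price": "£13.99", "rating": "Five stars", "availability": "In stock"},
    {"title": "Book C", "price": "£53.74", "rating": "One star", "availability": "In stock"},
]

# Round-trip through a file, as in the lab.
with open("books.json", "w") as f:
    json.dump(book_data, f)
with open("books.json") as f:
    books = json.load(f)

def price(book):
    # "£51.77" -> 51.77
    return float(book["price"].lstrip("£"))

print(len(books))                      # 1. how many books?
print(min(books, key=price)["title"])  # 2. cheapest
print(max(books, key=price)["title"])  # 3. most expensive
print([b["title"] for b in books if b["rating"].startswith("Five")])  # 4. 5-star books
```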

If adapting your previous code, you'll need to make some changes.

  • The text of the "Next →" link changes to "next" (all lowercase, no arrow).

Below is how you might save the data as JSON.

🐍 Python

Python supports JSON natively.

import json

book_data = [] # empty list

# create and add a book to the list
book = {}
book["title"] = "A Light in the Attic"
book["price"] = "£51.77"
book_data.append(book)

# save to file
with open("books.json", "w") as f:
    json.dump(book_data, f)

β˜•οΈ Java

Java does not support JSON natively, so we need to use a library. We will use json-simple.

// imports
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import java.io.FileWriter;
import java.io.IOException;

// create JSON array
JSONArray bookData = new JSONArray();

// add JSON object to the array
JSONObject book = new JSONObject();
book.put("title", "A Light in the Attic");
book.put("price", "£51.77");
bookData.add(book);

// save to file
try (FileWriter file = new FileWriter("books.json")) {
    file.write(bookData.toJSONString());
} catch (IOException e) {
    e.printStackTrace();
}

The dependencies for json-simple should already be included in the pom.xml file. If not, add the following to the <dependencies> section:

<dependency>
    <groupId>com.googlecode.json-simple</groupId>
    <artifactId>json-simple</artifactId>
    <version>1.1.1</version>
</dependency>

Good Luck!


X. Marking Off

This lab is worth marks, so be sure to get signed off.