Scraping information on books - Intermediate Data Programming

You will use BeautifulSoup to scrape a webpage and extract a list of product titles from an e-commerce page of books.

First run the code below to see an example of getting some book titles!

import requests
from bs4 import BeautifulSoup

# URL of an e-commerce or book listing page (https didn't work in class)
url = "http://books.toscrape.com/"  # Example website for scraping practice

# Send a request to the webpage
response = requests.get(url)

# Parse the webpage content
soup = BeautifulSoup(response.text, "html.parser")

# Find all elements with the attribute "title"
titles = soup.select('[title]')

# Read the "title" attribute of each matched element
book_titles = [title.get("title") for title in titles]

# Display the results
print("Book Titles:")
for title in book_titles:
    print(title)

Book Titles:
A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
The Black Maria
Starving Hearts (Triangular Trade Trilogy, #1)
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Rip it Up and Start Again
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Olio
Mesaerion: The Best Science Fiction Stories 1800-1849
Libertarianism for Beginners
It's Only the Himalayas
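
Note that the `[title]` selector matches any element carrying a title attribute, not only book links, so on other pages it could pick up extra items. A minimal sketch on a hand-written fragment (the HTML and the names `broad` and `narrow` are made up for illustration, not taken from the practice site) shows how a narrower CSS selector changes the result:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one unrelated element also has a title attribute.
html = """
<p title="Site notice">Welcome</p>
<h3><a href="/a" title="Book A">Book A</a></h3>
<h3><a href="/b" title="Book B">Book B</a></h3>
"""

soup = BeautifulSoup(html, "html.parser")

# Any element with a title attribute, in document order.
broad = [tag.get("title") for tag in soup.select("[title]")]

# Only <a> elements with a title attribute that are direct children of <h3>.
narrow = [tag.get("title") for tag in soup.select("h3 > a[title]")]

print(broad)
print(narrow)
```

Here `broad` includes the notice text while `narrow` keeps only the two link titles.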

Next, extract both the title and the price of each book from the webpage. To do this, start from the parent container of each book (the code has been started for you below).

Use the BeautifulSoup documentation here to help you get this information from the container.

Setting breakpoints in the debugger and inspecting the items returned at each step may help.

The hints are included in comments.
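
Before filling in the starter code, it may help to see the same navigation calls on a small hand-written fragment. This is only a sketch under assumptions: the tag names, the `item` and `cost` classes, and the variable names are hypothetical and differ from the practice site, so the selectors will need adapting.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not copied from the practice site.
html = """
<div class="item">
  <h2><a href="/one" title="First Full Title">First Fu...</a></h2>
  <span class="cost">10.00</span>
</div>
<div class="item">
  <h2><a href="/two" title="Second Full Title">Second F...</a></h2>
  <span class="cost">12.50</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.find_all("div", class_="item"):
    # Search only inside this container, not the whole page.
    link = item.find("a")
    name = link["title"]  # attribute access, gives the untruncated title
    cost = item.find("span", class_="cost").get_text(strip=True)
    rows.append((name, cost))

print(rows)
```

Calling `find` on the container rather than on `soup` keeps each title paired with the price from the same item.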

books = soup.find_all("article")  # Find all book containers instead

# Extract title and price for each book
book_data = []

for book in books:
    # Extract title (the title attribute of the <a> tag inside the <h3> tag)
    title = ""

    # Extract price (inside <p> tag with class 'price_color'), hint: class_ helps.
    price = "" 

    # Store as tuple
    book_data.append((title, price))

# Display the extracted data
print("Books with Prices:")
for title, price in book_data:
    print(f"{title} - {price}")

Your output should look something like this (probably with different books):

Books with Prices:
A Light in the Attic - Â£51.77
Tipping the Velvet - Â£53.74
Soumission - Â£50.10
Sharp Objects - Â£47.82
Sapiens: A Brief History of Humankind - Â£54.23
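
The stray "Â" in front of each pound sign is most likely an encoding artifact rather than part of the price. A plausible cause, stated here as an assumption about the server, is that the page bytes are UTF-8 but get decoded as ISO-8859-1, which requests falls back to for text responses that do not declare a charset. The sketch below reproduces the effect with plain strings, without any network access:

```python
# The pound sign is two bytes in UTF-8.
price_bytes = "£51.77".encode("utf-8")

# Decoding those bytes with the wrong codec yields the extra character.
wrong = price_bytes.decode("latin-1")

# Decoding with the matching codec recovers the original text.
right = price_bytes.decode("utf-8")

print(wrong)
print(right)
```

If this is what is happening, setting `response.encoding = "utf-8"` before reading `response.text`, or passing `response.content` to BeautifulSoup so it can detect the encoding itself, should give clean pound signs.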

Congratulations, you have scraped your web data. Raise your paddlepop and get signed off.

The next step is to save the data out to a CSV file!

Step 2: Save to CSV

import csv
import requests
from bs4 import BeautifulSoup

# Now we'll extend the previous solution for scraping book_data
# to also collect rating and availability

def scrape_books(page_url):
    """Scrapes book details (title, price, rating, availability) from a single page."""
    
    response = requests.get(page_url)
    soup = BeautifulSoup(response.text, "html.parser")

    books = soup.find_all("article", class_="product_pod")
    book_data = []

    for book in books:
        title = book.h3.a["title"]
        price = book.find("p", class_="price_color").get_text(strip=True)       
        # ************************************
        # TODO: Add rating and availability 
        # ************************************
        rating = ""
        availability = ""

        book_data.append([title, price, rating, availability])

    return book_data

# Base URL
base_url = "http://books.toscrape.com/catalogue/page-{}.html"

# Scrape data from page 1 and page 2
all_books = []
for page in range(1, 3):  # Loop through two pages
    url = base_url.format(page)
    print(f"Scraping {url}...")
    all_books.extend(scrape_books(url))

# Save to CSV
csv_filename = "books_data.csv"

with open(csv_filename, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    # ************************************
    # TODO: Add the output data to write here for headers and rows.
    writer.writerow(["TODO"])
    # ************************************


print(f"Data successfully saved to {csv_filename}")

# Open and print CSV content
with open(csv_filename, "r", encoding="utf-8") as file:
    for line in file:
        print(line.strip())  # Print each line without surrounding whitespace or the trailing newline
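
As a reference for the CSV step, here is a minimal sketch of writing a header row followed by data rows with the standard `csv` module. The column names and the two sample rows are made up for illustration, and an in-memory buffer stands in for the file so the example runs without touching disk.

```python
import csv
import io

# Hypothetical header and rows, shaped like the lists built by scrape_books.
header = ["title", "price", "rating", "availability"]
rows = [
    ["Example Book One", "10.00", "Three", "In stock"],
    ["Example, With Comma", "12.50", "Five", "In stock"],
]

# newline="" leaves line endings to the csv module, as with open(..., newline="").
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(header)   # one row: the column names
writer.writerows(rows)    # many rows: one list per book

text = buffer.getvalue()
print(text)

# Reading the text back returns the same header and rows.
parsed = list(csv.reader(io.StringIO(text, newline="")))
```

The writer quotes the title that contains a comma, so it still comes back as a single field when read.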