Quick Recap (Day 18)
You navigated the standard library, using `os` for file paths, `sys` for interpreter info, `datetime` for dates, `json` for serialization, and `collections.Counter` for counting data.
What You'll Learn Today
- How to send HTTP requests to retrieve web page content using `requests`.
- How to parse HTML with BeautifulSoup.
- Techniques to locate and extract specific elements from the page.
- How to handle common errors like missing pages or elements.
Understanding Web Scraping
Web scraping is the process of programmatically collecting information from websites when no direct API is available. Typical uses include:
- Price monitoring: Track changes in product prices over time.
- Data collection: Build datasets for research or analysis.
- Content aggregation: Gather news headlines, articles, or reviews.
Important: Always check and respect a site's robots.txt file and terms of service before scraping.
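Those robots.txt rules can be checked programmatically with the standard library's `urllib.robotparser`. The sketch below parses an invented rule set inline so it runs offline; against a real site you would call `set_url()` and `read()` instead:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; for a real site, use
# rp.set_url('https://example.com/robots.txt') followed by rp.read()
rules = [
    'User-agent: *',
    'Disallow: /private/',
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch('*', 'https://example.com/'))           # allowed
print(rp.can_fetch('*', 'https://example.com/private/x'))  # disallowed
```

Calling `can_fetch()` before each request keeps your scraper on the right side of a site's stated policy.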
Fetching Web Pages with requests
Install the library if needed:

```shell
pip install requests
```

Send a GET request:

```python
import requests

url = 'https://example.com'
response = requests.get(url)
```

Check the response status:

```python
if response.status_code == 200:
    html = response.text
else:
    print(f"Failed to fetch: {response.status_code}")
```

Inspect the first few characters:

```python
print(html[:200])  # preview HTML
```
Parsing HTML with BeautifulSoup
Install BeautifulSoup:

```shell
pip install beautifulsoup4
```

Create a parser object:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
```

Navigate the document tree:

Find the first occurrence:

```python
title = soup.find('h1')
if title:
    print(title.text)
```

Find all elements:

```python
links = soup.find_all('a')
for link in links:
    href = link.get('href')
    if href:
        print(href)
```

CSS selectors:

```python
items = soup.select('p.intro')
for item in items:
    print(item.text)
```
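These lookups combine naturally when pulling structured records out of a page. Here is a sketch — the markup below is an invented stand-in for downloaded HTML, parsed offline so you can try it without a network connection:

```python
from bs4 import BeautifulSoup

# Invented sample markup standing in for a fetched page
sample = """
<ul class="products">
  <li><a href="/item/1">Widget</a> <span class="price">9.99</span></li>
  <li><a href="/item/2">Gadget</a> <span class="price">19.50</span></li>
</ul>
"""

soup = BeautifulSoup(sample, 'html.parser')

products = []
for li in soup.select('ul.products li'):
    link = li.find('a')
    price = li.find('span', class_='price')
    if link and price:  # skip malformed rows instead of crashing
        products.append({
            'name': link.text,
            'url': link.get('href'),
            'price': float(price.text),
        })

print(products)
```

Guarding each lookup with `if link and price` is the same defensive habit you'll see in the error-handling section: real pages rarely match your selectors perfectly.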
Handling Errors Gracefully
Missing pages:

```python
bad_resp = requests.get(url + '/nonexistent')
if bad_resp.status_code != 200:
    print(f"Error {bad_resp.status_code}: Page not found")
```

Missing elements:

```python
subtitle = soup.find('h2')
print(subtitle.text if subtitle else 'No subtitle on page')
```
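Status codes only cover responses that actually arrive; timeouts, DNS failures, and refused connections raise exceptions instead. One way to sketch a defensive fetch helper — the function name and timeout value are illustrative choices, not part of the `requests` API:

```python
import requests

def fetch(url, timeout=5.0):
    """Return page HTML, or None on any network or HTTP failure."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()  # turn 4xx/5xx responses into exceptions
        return resp.text
    except requests.exceptions.RequestException as exc:
        # Base class covering timeouts, connection errors,
        # malformed URLs, and the HTTPError raised above
        print(f"Request failed: {exc}")
        return None

# A malformed URL fails without ever touching the network
print(fetch('not-a-url'))
```

Catching `requests.exceptions.RequestException` (rather than a bare `except`) handles every failure mode `requests` defines while letting unrelated bugs surface normally.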
Take the Wand and Try Yourself
Task: Build a simple scraper for https://example.com:

1. Create scrape_example.py.
2. Fetch the home page and check for status code 200.
3. Parse with BeautifulSoup.
4. Extract and print:
   - The <h1> text or a fallback message.
   - All href values from <a> tags.
5. Attempt to fetch a nonexistent page (/nonexistent) and print an error if it returns 404.
Solution Outline:

```python
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
resp = requests.get(url)

if resp.status_code == 200:
    soup = BeautifulSoup(resp.text, 'html.parser')

    # H1 extraction
    h1 = soup.find('h1')
    print(h1.text if h1 else 'No H1 found')

    # Link extraction
    for a in soup.find_all('a'):
        href = a.get('href')
        if href:
            print(href)
else:
    print(f"Error {resp.status_code}: Unable to fetch main page")

# Nonexistent page
bad = requests.get(url + '/nonexistent')
if bad.status_code == 404:
    print('Page not found: 404')
```

Expected output:
```
Example Domain
https://www.iana.org/domains/example
Page not found: 404
```

Run:

```shell
python scrape_example.py
```

Once you see the title and the link, and handle the 404 correctly, you've mastered the basics of web scraping!
Up next: Day 20: NumPy & Data Manipulation