πŸ”„ Quick Recap (Day 18)

  • You navigated the standard library, using os for file paths, sys for interpreter info, datetime for dates, json for serialization, and collections.Counter for counting data.

🎯 What You’ll Learn Today

  1. How to send HTTP requests to retrieve web page content using requests.

  2. How to parse HTML with BeautifulSoup.

  3. Techniques to locate and extract specific elements from the page.

  4. How to handle common errors like missing pages or elements.

πŸ“– Understanding Web Scraping

Web scraping is the process of programmatically collecting information from websites when no direct API is available. Typical uses include:

  • Price monitoring: Track changes in product prices over time.

  • Data collection: Build datasets for research or analysis.

  • Content aggregation: Gather news headlines, articles, or reviews.

❝

Important: Always check and respect a site’s robots.txt file and terms of service before scraping.
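The robots.txt check can be automated with the standard library’s urllib.robotparser. A minimal sketch, using a made-up robots.txt body parsed locally (in a real scraper you would point the parser at the site’s actual robots.txt URL and call read()):

```python
from urllib.robotparser import RobotFileParser

# Parse an illustrative robots.txt body locally (no network needed);
# against a live site you would use rp.set_url(...) followed by rp.read()
rp = RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

# can_fetch() reports whether a given user agent may crawl a path
print(rp.can_fetch('*', 'https://example.com/'))           # True
print(rp.can_fetch('*', 'https://example.com/private/x'))  # False
```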

πŸ“– Fetching Web Pages with requests

  1. Install the library if needed:

    pip install requests
  2. Send a GET request:

    import requests
    
    url = 'https://example.com'
    response = requests.get(url)
  3. Check the response status:

    if response.status_code == 200:
        html = response.text
    else:
        html = ''  # fall back to an empty string so later steps don't crash
        print(f"Failed to fetch: {response.status_code}")
  4. Inspect the first 200 characters:

    print(html[:200])  # preview HTML
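The basic pattern above omits two safeguards you will usually want in practice: a request timeout and raise_for_status(). A hedged sketch wrapping both into a helper (the fetch_html name is my own, not part of the lesson):

```python
import requests

def fetch_html(url, timeout=10):
    """Fetch a page, returning its HTML text or None on any failure."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
        return response.text
    except requests.RequestException as exc:  # timeouts, DNS errors, HTTP errors
        print(f"Failed to fetch {url}: {exc}")
        return None

html = fetch_html('https://example.com')
if html:
    print(html[:200])
```

Because every failure path returns None, callers only need a single truthiness check instead of inspecting status codes themselves.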

πŸ“– Parsing HTML with BeautifulSoup

  1. Install BeautifulSoup:

    pip install beautifulsoup4
  2. Create a parser object:

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'html.parser')
  3. Navigate the document tree:

    • Find first occurrence:

      title = soup.find('h1')
      if title:
          print(title.text)
    • Find all elements:

      links = soup.find_all('a')
      for link in links:
          href = link.get('href')
          if href:
              print(href)
    • CSS selectors:

      items = soup.select('p.intro')
      for item in items:
          print(item.text)
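You can try all three navigation techniques without fetching anything by parsing an inline HTML string. A small self-contained sketch (the sample markup is invented for illustration):

```python
from bs4 import BeautifulSoup

# A tiny inline document to practice on (no network required)
sample = """
<html><body>
  <h1>Demo Page</h1>
  <p class="intro">Welcome!</p>
  <a href="/about">About</a>
  <a href="/contact">Contact</a>
</body></html>
"""

soup = BeautifulSoup(sample, 'html.parser')

print(soup.find('h1').text)                         # Demo Page
print(soup.select_one('p.intro').text)              # Welcome!
print([a.get('href') for a in soup.find_all('a')])  # ['/about', '/contact']
```

Experimenting on a fixed string like this makes your extraction logic easy to test before pointing it at a live site.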

πŸ“– Handling Errors Gracefully

  • Missing pages:

    bad_resp = requests.get(url + '/nonexistent')
    if bad_resp.status_code != 200:
        print(f"Error {bad_resp.status_code}: Page not found")
  • Missing elements:

    subtitle = soup.find('h2')
    print(subtitle.text if subtitle else 'No subtitle on page')
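Status-code checks cover HTTP errors, but the request itself can also fail (DNS failure, timeout, dropped connection), and transient failures often succeed on a second try. A minimal retry sketch (the get_with_retries helper is my own invention, not part of the lesson):

```python
import time
import requests

def get_with_retries(url, attempts=3, delay=1.0):
    """Try a GET up to `attempts` times, waiting `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.get(url, timeout=5)
            if resp.status_code == 200:
                return resp
            print(f'Attempt {attempt}: got status {resp.status_code}')
        except requests.RequestException as exc:  # timeouts, connection errors, etc.
            print(f'Attempt {attempt} failed: {exc}')
        time.sleep(delay)
    return None  # every attempt failed

resp = get_with_retries('https://example.com', attempts=2, delay=0.5)
print('fetched' if resp else 'gave up')
```

Keep the attempt count small and the delay non-zero so a struggling server isn’t hammered with rapid repeats.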

πŸ§™β€β™‚οΈ Take the Wand and Try Yourself

Task: Build a simple scraper for https://example.com:

  1. Create scrape_example.py.

  2. Fetch the home page and check for status code 200.

  3. Parse with BeautifulSoup.

  4. Extract and print:

    • The <h1> text or a fallback message.

    • All href values from <a> tags.

  5. Attempt to fetch a nonexistent page (/nonexistent) and print an error if it returns 404.

Solution Outline:

import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
resp = requests.get(url)
if resp.status_code == 200:
    soup = BeautifulSoup(resp.text, 'html.parser')
    # H1 extraction
    h1 = soup.find('h1')
    print(h1.text if h1 else 'No H1 found')
    # Link extraction
    for a in soup.find_all('a'):
        href = a.get('href')
        if href:
            print(href)
else:
    print(f"Error {resp.status_code}: Unable to fetch main page")

# Nonexistent page
bad = requests.get(url + '/nonexistent')
if bad.status_code == 404:
    print('Page not found: 404')

Expected output:

Example Domain
https://www.iana.org/domains/example
Page not found: 404

Run:

python scrape_example.py

Once you see the title, the link, and handle the 404 correctly, you’ve mastered the basics of web scraping!

Up next: Day 20: NumPy & Data Manipulation

