Web Scrapers

 

🐍 Top Python-Based Web Scraping Libraries

  1. BeautifulSoup

    • Great for parsing HTML and XML documents.

    • Easy to use and integrate with requests.

    • Best for smaller projects or one-off scrapes (a full working example appears later in this post).

    • Website: https://www.crummy.com/software/BeautifulSoup/

  2. Scrapy

    • A powerful and scalable scraping and crawling framework.

    • Built-in support for handling requests, items, pipelines, and asynchronous scraping.

    • Great for large-scale or complex scraping projects (see the spider sketch after this list).

    • Website: https://scrapy.org/

  3. Selenium

    • Automates web browsers (used for scraping JavaScript-heavy websites).

    • Slower than Scrapy but supports dynamic pages.

    • Can run Chrome or Firefox headless (a short example follows this list).

    • Website: https://www.selenium.dev/

  4. Playwright (Python)

    • Headless browser automation similar to Selenium but faster and more modern.

    • Excellent for scraping SPAs (Single Page Applications); a sketch follows this list.

    • Website: https://playwright.dev/python/

  5. Pyppeteer

    • Unofficial Python port of Puppeteer for driving headless Chromium.

    • asyncio-based and largely superseded by Playwright for new work (a sketch follows this list).

    • Website: https://github.com/pyppeteer/pyppeteer

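
For Scrapy, here is a minimal single-file spider sketch. The spider name, start URL, and CSS selector are illustrative assumptions, not a tested crawl:

import scrapy

class HeadlineSpider(scrapy.Spider):
    # Name and start URL are placeholders for illustration
    name = 'headlines'
    start_urls = ['https://news.ycombinator.com/']

    def parse(self, response):
        # Assumed selector; adjust it to the target page's actual markup
        for link in response.css('span.titleline > a'):
            yield {
                'title': link.css('::text').get(),
                'url': link.attrib['href'],
            }

Run it with scrapy runspider headline_spider.py -o headlines.json; no full project scaffold is needed for a one-file spider.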
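
For Selenium, a short headless-Chrome sketch using Selenium 4 syntax (the URL and selector are placeholders):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Run Chrome without a visible window
options = Options()
options.add_argument('--headless=new')

driver = webdriver.Chrome(options=options)  # Selenium Manager fetches a matching driver
try:
    driver.get('https://example.com')  # placeholder URL
    # Placeholder selector: collect every link nested inside an <h3>
    for link in driver.find_elements(By.CSS_SELECTOR, 'h3 a'):
        print(link.text, link.get_attribute('href'))
finally:
    driver.quit()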
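
For Playwright, the synchronous API keeps dynamic pages readable (URL and selector are placeholders; run playwright install chromium once after pip-installing the package):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com')  # placeholder URL
    # Let client-side rendering settle before reading the DOM
    page.wait_for_load_state('networkidle')
    print(page.locator('h1').first.text_content())  # placeholder selector
    browser.close()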
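
And for Pyppeteer, a minimal async sketch of its Puppeteer-style API (the URL is a placeholder; a bundled Chromium is downloaded on first launch):

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()                 # headless Chromium by default
    page = await browser.newPage()
    await page.goto('https://example.com')   # placeholder URL
    print(await page.title())                # title of the rendered page
    await browser.close()

asyncio.run(main())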

💼 Enterprise & No-Code Web Scraping Tools

  1. Octoparse

    • No-code/low-code visual scraping tool.

    • Cloud-based or desktop version available.

    • Useful for business users and quick deployments.

    • Website: https://www.octoparse.com/

  2. ParseHub

    • Desktop app for point-and-click visual scraping, with a free tier.

    • Handles interactive pages (forms, dropdowns, infinite scroll).

    • Website: https://www.parsehub.com/

  3. Diffbot

    • AI-powered data extraction tool for structured web scraping.

    • Excellent for enterprise needs, with API access to article, product, and discussion data (an illustrative API call appears after this list).

    • Website: https://www.diffbot.com/

  4. Apify

    • Offers an ecosystem of web scraping and automation tools.

    • Based on JavaScript and integrates with Puppeteer and Playwright.

    • Has a marketplace of ready-made actors (scrapers).

    • Website: https://apify.com/

  5. Bright Data (formerly Luminati)

    • Large-scale data collection platform with rotating proxies and scraping tools.

    • Offers a Web Scraper IDE and data unblocking capabilities.

    • Website: https://brightdata.com/
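
To make the Diffbot entry concrete, here is an illustrative sketch of calling its Article API from Python. The token is a placeholder and the v3 endpoint shape is an assumption from public documentation, so verify it against Diffbot's current docs before relying on it:

import requests

DIFFBOT_TOKEN = 'YOUR_TOKEN'                    # placeholder credential
article_url = 'https://example.com/some-story'  # placeholder target page

# Assumed v3 Article API endpoint; confirm in Diffbot's documentation
resp = requests.get(
    'https://api.diffbot.com/v3/article',
    params={'token': DIFFBOT_TOKEN, 'url': article_url},
)

# Extracted fields come back nested under 'objects'
for obj in resp.json().get('objects', []):
    print(obj.get('title'), obj.get('date'))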


🛠️ Honorable Mentions

  • MechanicalSoup – Combines requests and BeautifulSoup for form-based scraping.

  • HTTPX + Selectolax – A lightweight, fast, modern replacement for requests + BeautifulSoup (see the sketch below).

  • Node.js Puppeteer – Popular in JavaScript ecosystems for browser automation.
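
As a quick taste of the HTTPX + Selectolax pairing mentioned above (the URL and selector are placeholders):

import httpx
from selectolax.parser import HTMLParser

# httpx mirrors the requests API for simple GETs
html = httpx.get('https://example.com', headers={'User-Agent': 'Mozilla/5.0'}).text

# Selectolax parses HTML with a fast C backend
tree = HTMLParser(html)
for node in tree.css('h3 a'):  # placeholder selector
    print(node.text().strip(), node.attributes.get('href'))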


📌 Choosing the Right Tool

  Use Case                          Recommended Tool
  Simple scraping                   BeautifulSoup + requests
  Large-scale or complex projects   Scrapy
  JavaScript-heavy websites         Playwright or Selenium
  Visual scraping / no coding       Octoparse or ParseHub
  Enterprise API-based extraction   Diffbot or Bright Data




Here's a simple Python web scraper that uses requests and BeautifulSoup to extract headlines from a website (e.g., https://finance.yahoo.com) and save them to a CSV file:


✅ Basic Web Scraper Example

python


import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin

# Target URL for Yahoo Finance news
url = 'https://finance.yahoo.com/'

# Make the request with a browser-like User-Agent so the site serves full HTML
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Find headline elements. Yahoo's generated CSS classes (e.g., 'Mb(5px)')
# change frequently, so grab every <h3> and keep the ones that contain a link.
headlines = soup.find_all('h3')

# Open CSV file to write
with open('yahoo_finance_headlines.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Headline', 'Link'])  # Column headers

    for headline in headlines:
        a_tag = headline.find('a')
        if a_tag and a_tag.get('href'):
            text = a_tag.get_text(strip=True)
            # The href may be relative or absolute; urljoin handles both
            link = urljoin(url, a_tag['href'])
            print(text, link)
            writer.writerow([text, link])



🔧 Requirements

Install the required libraries if you don't have them:

bash


pip install requests beautifulsoup4



🧠 Notes

  • This example scrapes Yahoo Finance headlines and writes them to yahoo_finance_headlines.csv.

  • Be respectful of websites' robots.txt rules and scraping policies (see the robots.txt check after these notes).

  • For more advanced scraping (handling JavaScript, etc.), consider using Selenium or Playwright.
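
As a hedged sketch of that robots.txt advice, the standard library can check a path before you fetch it (the URLs are placeholders matching the example above):

from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt
rp = RobotFileParser()
rp.set_url('https://finance.yahoo.com/robots.txt')
rp.read()

# '*' means any user agent; pass your own UA string for stricter checks
if rp.can_fetch('*', 'https://finance.yahoo.com/'):
    print('Allowed to fetch')
else:
    print('Disallowed by robots.txt')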




Create a directory for the BeautifulSoup example:



john_iacovacci1@sentiment-prod:~$ mkdir bs

john_iacovacci1@sentiment-prod:~$ cd bs


john_iacovacci1@sentiment-prod:~/bs$ pip install requests beautifulsoup4

Defaulting to user installation because normal site-packages is not writeable

Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (2.32.3)

Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.10/dist-packages (4.13.3)

Requirement already satisfied: certifi>=2017.4.17 in /usr/lib/python3/dist-packages (from requests) (2020.6.20)

Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/lib/python3/dist-packages (from requests) (1.26.5)

Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests) (3.4.1)

Requirement already satisfied: idna<4,>=2.5 in /usr/lib/python3/dist-packages (from requests) (3.3)

Requirement already satisfied: typing-extensions>=4.0.0 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4) (4.13.0)

Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4) (2.6)


john_iacovacci1@sentiment-prod:~/bs$ pico scrape.py



john_iacovacci1@sentiment-prod:~/bs$ python3 scrape.py

john_iacovacci1@sentiment-prod:~/bs$ ls -lt

total 8

-rw-rw-r-- 1 john_iacovacci1 john_iacovacci1  15 Jun 24 21:31 yahoo_finance_headlines.csv

-rw-rw-r-- 1 john_iacovacci1 john_iacovacci1 875 Jun 24 21:31 scrape.py

john_iacovacci1@sentiment-prod:~/bs$ cat yahoo_finance_headlines.csv

Headline,Link

john_iacovacci1@sentiment-prod:~/bs$

A CSV containing only the header row means the parser found no matching elements; that's the cue to re-inspect the page markup in the browser (much of Yahoo Finance is rendered client-side, so a browser-based tool like Playwright may capture more).



