Top Python-Based Web Scraping Libraries
BeautifulSoup
Great for parsing HTML and XML documents.
Easy to use and integrate with requests.
Best for smaller projects or one-off scrapes.
Website: https://www.crummy.com/software/BeautifulSoup/
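As a minimal illustration of what BeautifulSoup does (the HTML snippet, class names, and links below are invented for the example), it turns markup into a navigable tree you can search with tag names or CSS selectors:

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document to parse
html = """
<html><body>
  <h1>Daily Digest</h1>
  <ul class="stories">
    <li><a href="/a">First story</a></li>
    <li><a href="/b">Second story</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Grab the heading text and every link inside the story list
title = soup.h1.text
links = [(a.text, a["href"]) for a in soup.select("ul.stories a")]

print(title)  # Daily Digest
print(links)  # [('First story', '/a'), ('Second story', '/b')]
```

In a real scrape you would feed `response.text` from requests into the same `BeautifulSoup(...)` call instead of a literal string.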
Scrapy
A powerful and scalable scraping and crawling framework.
Built-in support for handling requests, items, pipelines, and asynchronous scraping.
Great for large-scale or complex scraping projects.
Website: https://scrapy.org/
Selenium
Automates web browsers (used for scraping JavaScript-heavy websites).
Slower than Scrapy but supports dynamic pages.
Can be used with headless browsers like Chrome and Firefox.
Website: https://www.selenium.dev/
Playwright (Python)
Headless browser automation similar to Selenium but faster and more modern.
Excellent for scraping SPAs (Single Page Applications).
Website: https://playwright.dev/python/
Pyppeteer
Python port of Puppeteer (Node.js headless browser library).
Useful for rendering JavaScript-heavy pages.
💼 Enterprise & No-Code Web Scraping Tools
Octoparse
No-code/low-code visual scraping tool.
Cloud-based or desktop version available.
Useful for business users and quick deployments.
Website: https://www.octoparse.com/
ParseHub
Point-and-click interface for scraping data from dynamic websites.
Cloud-based with API support.
Website: https://www.parsehub.com/
Diffbot
AI-powered data extraction tool for structured web scraping.
Excellent for enterprise needs with API access to article, product, and discussion data.
Website: https://www.diffbot.com/
Apify
Offers an ecosystem of web scraping and automation tools.
Based on JavaScript and integrates with Puppeteer and Playwright.
Has a marketplace of ready-made actors (scrapers).
Website: https://apify.com/
Bright Data (formerly Luminati)
Large-scale data collection platform with rotating proxies and scraping tools.
Offers a Web Scraper IDE and data unblocking capabilities.
Website: https://brightdata.com/
🛠️ Honorable Mentions
MechanicalSoup – Combines requests and BeautifulSoup for form-based scraping.
HTTPX + Selectolax – Lightweight and faster modern replacement for requests + BeautifulSoup.
Node.js Puppeteer – Popular in JavaScript ecosystems for browser automation.
📌 Choosing the Right Tool
Here's a simple Python web scraper using requests and BeautifulSoup to extract headlines from Yahoo Finance (https://finance.yahoo.com/):
✅ Basic Web Scraper Example
import requests
from bs4 import BeautifulSoup
import csv

# Target URL for Yahoo Finance news
url = 'https://finance.yahoo.com/'

# Make the request with a browser-like User-Agent so the site serves full HTML
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Find headlines by class name. Note: 'Mb(5px)' is a generated CSS class
# and can change at any time; if the script produces no rows, inspect the
# live page and update this selector.
headlines = soup.find_all('h3', class_='Mb(5px)')

# Open a CSV file and write one row per headline
with open('yahoo_finance_headlines.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Headline', 'Link'])  # Column headers
    for headline in headlines:
        a_tag = headline.find('a')
        if a_tag:
            text = a_tag.text.strip()
            # Some hrefs are already absolute; only prefix relative links
            link = a_tag['href']
            if link.startswith('/'):
                link = 'https://finance.yahoo.com' + link
            print(text, link)
            writer.writerow([text, link])
🔧 Requirements
Install the required libraries if you don't have them:
pip install requests beautifulsoup4
🧠 Notes
This example scrapes Yahoo Finance headlines.
Be respectful of websites' robots.txt rules and scraping policies.
For more advanced scraping (handling JavaScript, etc.), consider using Selenium or Playwright.
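Python's standard library can check a site's robots.txt rules before you fetch a page. This sketch parses a made-up robots.txt string for illustration; against a real site you would call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` instead:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration
rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Check specific URLs before requesting them
print(rp.can_fetch("MyScraper/1.0", "https://example.com/news"))       # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/private/x"))  # False
```

Calling `can_fetch()` before each request is a cheap way to stay within a site's stated scraping policy.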
Create a directory for the BeautifulSoup example:
john_iacovacci1@sentiment-prod:~$ mkdir bs
john_iacovacci1@sentiment-prod:~$ cd bs
john_iacovacci1@sentiment-prod:~/bs$ pip install requests beautifulsoup4
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (2.32.3)
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.10/dist-packages (4.13.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/lib/python3/dist-packages (from requests) (2020.6.20)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/lib/python3/dist-packages (from requests) (1.26.5)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/lib/python3/dist-packages (from requests) (3.3)
Requirement already satisfied: typing-extensions>=4.0.0 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4) (4.13.0)
Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4) (2.6)
john_iacovacci1@sentiment-prod:~/bs$ pico scrape.py
john_iacovacci1@sentiment-prod:~/bs$ python3 scrape.py
john_iacovacci1@sentiment-prod:~/bs$ ls -lt
total 8
-rw-rw-r-- 1 john_iacovacci1 john_iacovacci1 15 Jun 24 21:31 yahoo_finance_headlines.csv
-rw-rw-r-- 1 john_iacovacci1 john_iacovacci1 875 Jun 24 21:31 scrape.py
john_iacovacci1@sentiment-prod:~/bs$ cat yahoo_finance_headlines.csv
Headline,Link
john_iacovacci1@sentiment-prod:~/bs$
Note that the CSV contains only the header row: Yahoo Finance's generated class names have changed since this selector was written, so find_all matched nothing. Inspect the live page and update the class_ argument to get actual headlines.