Top 23 Python web-scraping Projects

Scrapy

182 51,343 9.6 Python

Scrapy, a fast high-level web crawling & scraping framework for Python.

Project mention: Scrapy Vs. Crawlee | dev.to | 2024-05-15

Scrapy is an open-source, Python-based web scraping framework that extracts data from websites. With Scrapy, you create spiders, which are autonomous scripts that download and process web content. Scrapy's main limitation is that it does not work well with JavaScript-rendered websites, because it was designed for static HTML pages. We will do a comparison later in the article about this.
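To make the static-HTML limitation concrete, here is a minimal sketch that uses only Python's standard-library `html.parser`, not Scrapy itself. The page markup, the `product` class name, and the `extract_products` helper are all hypothetical, chosen for illustration. A parser that reads the raw response body sees only what the server sent, so an item that a browser would add by running the inline script never shows up.

```python
from html.parser import HTMLParser

# Hypothetical page: one product is present in the static HTML, a second
# one would be added only if a browser executed the inline script.
PAGE = """
<html><body>
  <ul id="products">
    <li class="product">Static item</li>
  </ul>
  <script>
    const li = document.createElement("li");
    li.className = "product";
    li.textContent = "JS-rendered item";
    document.getElementById("products").appendChild(li);
  </script>
</body></html>
"""


class ProductParser(HTMLParser):
    """Collects the text of every <li class="product"> in the raw markup."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._in_product = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "li" and ("class", "product") in attrs:
            self._in_product = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_product = False

    def handle_data(self, data):
        # Script contents also arrive here, but never inside a product <li>
        if self._in_product and data.strip():
            self.products.append(data.strip())


def extract_products(html):
    """Return product names found in the static markup (no JS is executed)."""
    parser = ProductParser()
    parser.feed(html)
    return parser.products


print(extract_products(PAGE))
```

Only the item present in the served markup is returned. Picking up the second item would require something that actually executes the script, such as a headless browser.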

changedetection.io

196 15,285 9.5 Python

The best and simplest free, open-source web page change detection, website watcher, restock monitor, and notification service. Designed for simplicity: monitor which websites had a text change, for free. Also covers website defacement monitoring and price change notification.

Project mention: Google have removed RSS support from their developer blogs | news.ycombinator.com | 2023-12-11

I use ChangeDetection:
- https://changedetection.io/#features
- https://github.com/dgtlmoon/changedetection.io

Douyin_TikTok_Download_API

3 7,357 9.2 Python

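🚀 "Douyin_TikTok_Download_API" is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili, with support for API calls and online batch parsing and downloading.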
autoscraper

9 6,007 0.0 Python

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
trafilatura

14 3,071 8.8 Python

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

Project mention: Claude is now available in Europe | news.ycombinator.com | 2024-05-14

snoop

7 2,753 9.3 Python

Snoop: a reconnaissance tool based on open-source data (OSINT world)

Project mention: Osint update of the Snoop Project tool search for user by nickname | news.ycombinator.com | 2024-01-02

Grab

0 2,363 3.0 Python

Web Scraping Framework
curl_cffi

5 1,530 9.2 Python

Python binding for curl-impersonate via cffi. An HTTP client that can impersonate browser TLS/JA3/HTTP2 fingerprints.

Project mention: This Week In Python | dev.to | 2024-03-22

curl_cffi – An HTTP client that can impersonate browser TLS/JA3/HTTP2 fingerprints

botasaurus

5 1,036 9.5 Python

The All in One Framework to build Awesome Scrapers.

Project mention: This Week In Python | dev.to | 2024-04-05

botasaurus – The All in One Framework to build Awesome Scrapers

scrapy-fake-useragent

3 681 2.3 Python

Random User-Agent middleware based on fake-useragent
web-scraping

43 678 0.0 Python

Detailed web scraping tutorials for dummies with financial data crawlers on Reddit WallStreetBets, CME (both options and futures), US Treasury, CFTC, LME, MacroTrends, SHFE and alternative data crawlers on Tomtom, BBC, Wall Street Journal, Al Jazeera, Reuters, Financial Times, Bloomberg, CNN, Fortune, The Economist

Project mention: web-scraping: NEW Data - star count:554.0 | /r/algoprojects | 2023-09-25

google-search-results-python

4 532 4.5 Python

Google Search Results via SERP API pip Python Package
dude

28 412 9.0 Python

dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators
basketball_reference_web_scraper

2 414 7.3 Python

NBA Stats API via Basketball Reference
wayback-machine-scraper

6 408 0.0 Python

A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.

Project mention: wayback-machine-scraper: NEW Data - star count:380.0 | /r/algoprojects | 2023-12-10

twitter-scraper-selenium

2 285 6.0 Python

Python package to scrape Twitter's front end easily
letterboxd_recommendations

3 223 9.1 Python

Scraping publicly-accessible Letterboxd data and creating a movie recommendation model with it that can generate recommendations when provided with a Letterboxd username

Project mention: Self Hosted Content Recommender? | /r/selfhosted | 2023-06-20

Use an existing self-hostable tool for getting recommendations from there, such as letterboxd_recommendations

facebook_page_scraper

1 201 6.9 Python

Scrapes the front end of Facebook pages with no limitations and can turn the data into structured JSON or CSV
saveddit

8 165 1.7 Python

Bulk Downloader for Reddit
rymscraper

1 157 2.4 Python

Python library to extract data from rateyourmusic.com.
stock_screener

1 128 0.0 Python

Picking stocks through various screening methods. Focus on Northern Europe.
GoodreadsScraper

1 119 2.7 Python

Scrape data from Goodreads using Scrapy and Selenium
scrapper

1 123 8.2 Python

Web scraper with a simple REST API living in Docker and using a Headless browser and Readability.js for parsing.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python web-scraping related posts

Claude is now available in Europe
2 projects | news.ycombinator.com | 14 May 2024

wayback-machine-scraper: NEW Data - star count:380.0
1 project | /r/algoprojects | 10 Dec 2023

Trafilatura: Python tool to gather text on the Web
3 projects | news.ycombinator.com | 14 Aug 2023

Self Hosted Content Recommender?
2 projects | /r/selfhosted | 20 Jun 2023

Show HN: Build AI Dags with Memory; Run and Validate LLM Tools in Containers
2 projects | news.ycombinator.com | 21 Apr 2023

Powerful and free scraper with a headless browser under the hood and Readability for parsing
2 projects | /r/webscraping | 18 Mar 2023

No more rec requests
1 project | /r/Letterboxd | 10 Mar 2023

Index

What are some of the best open-source web-scraping projects in Python? This list will help you:

#	Project	Stars
1	Scrapy	51,343
2	changedetection.io	15,285
3	Douyin_TikTok_Download_API	7,357
4	autoscraper	6,007
5	trafilatura	3,071
6	snoop	2,753
7	Grab	2,363
8	curl_cffi	1,530
9	botasaurus	1,036
10	scrapy-fake-useragent	681
11	web-scraping	678
12	google-search-results-python	532
13	dude	412
14	basketball_reference_web_scraper	414
15	wayback-machine-scraper	408
16	twitter-scraper-selenium	285
17	letterboxd_recommendations	223
18	facebook_page_scraper	201
19	saveddit	165
20	rymscraper	157
21	stock_screener	128
22	GoodreadsScraper	119
23	scrapper	123
