Top 23 Python Crawler Projects
-
newspaper
newspaper3k is a Python 3 library for news, full-text, and article metadata extraction.
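newspaper3k's own API centers on an `Article` object (download, then parse), but the kind of metadata extraction it performs can be sketched with the standard library alone. This is a hedged illustration, not newspaper3k's implementation; the HTML string is a made-up example:

```python
from html.parser import HTMLParser

class MetadataExtractor(HTMLParser):
    """Pulls the <title> and Open Graph meta tags out of an HTML page,
    roughly the kind of fields newspaper3k exposes as Article.title etc."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("property", "").startswith("og:"):
            self.meta[attrs["property"]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

html = """<html><head>
<title>Example Headline</title>
<meta property="og:title" content="Example Headline">
<meta property="og:site_name" content="Example News">
</head><body><p>Body text.</p></body></html>"""

parser = MetadataExtractor()
parser.feed(html)
print(parser.title)                 # Example Headline
print(parser.meta["og:site_name"])  # Example News
```

Real extractors add readability heuristics (boilerplate removal, main-content detection) on top of this kind of tag walking.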
-
Douyin_TikTok_Download_API
🚀 Douyin_TikTok_Download_API is an out-of-the-box, high-performance asynchronous scraping tool for Douyin, Kuaishou, TikTok, and Bilibili data, supporting API calls, online batch parsing, and downloads.
-
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
-
news-please
news-please - an integrated web crawler and information extractor for news that just works
-
grab-site
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
Scrapy is an open-source Python web scraping framework for extracting data from websites. With Scrapy, you create spiders: autonomous scripts that download and process web content. Scrapy's main limitation is that it does not handle JavaScript-rendered websites well, since it was designed for static HTML pages; we compare the two approaches later in the article.
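Scrapy's real API differs (spiders subclass `scrapy.Spider` and yield requests and items from `parse` callbacks), but the core loop a spider runs — fetch a page, extract data, queue newly discovered links — can be sketched with the standard library. In this sketch the `PAGES` dict stands in for the network, and the crawl only ever sees static HTML, which is exactly the regime Scrapy was built for:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags, like a spider's link extraction."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# A fake "site": URL -> static HTML. A real spider would fetch over HTTP.
PAGES = {
    "/":  '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B again</a>',
    "/b": '<a href="/">home</a>',
}

def crawl(start):
    """Breadth-first crawl: visit each page once, following discovered links.
    This mirrors Scrapy's request deduplication and scheduling, minus the
    asynchronous I/O that makes Scrapy fast in practice."""
    seen, queue, order = {start}, [start], []
    while queue:
        url = queue.pop(0)
        order.append(url)
        extractor = LinkExtractor()
        extractor.feed(PAGES[url])
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl("/"))  # ['/', '/a', '/b']
```

A JavaScript-heavy site breaks this model because the `<a>` tags may not exist in the downloaded HTML at all; they are created in the browser at runtime, which is why Scrapy users pair it with a rendering backend for such sites.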
Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29
The format you want is WARC. Even the Library of Congress uses it. There are many, many WARC scrapers. I'd look at what the Internet Archive recommends. A quick search turned up this from the Archive Team and Jason Scott https://github.com/ArchiveTeam/grab-site (https://wiki.archiveteam.org/index.php/Who_We_Are) but I found that in less than 15 seconds of searching, so do your own diligence.
botasaurus – The All in One Framework to build Awesome Scrapers
It is apparently widely suspected that a certain "Books2" dataset mentioned by OpenAI is basically just LibGen:
https://blusharkmedia.medium.com/the-ongoing-battle-against-...
https://techhq.com/2023/09/can-libgen-shadow-library-survive...
https://www.twitter.com/theshawwn/status/1320282152689336320
https://qz.com/openai-books-piracy-microsoft-meta-google-cha...
https://qz.com/shadow-libraries-are-at-the-heart-of-the-moun...
https://goodereader.com/blog/e-book-news/authors-file-lawsui...
When asked whether this was true, they refused to answer, citing confidentiality concerns, then said they had deleted all copies of the dataset, stopped using it, and no longer employed the people who compiled it:
https://www.businessinsider.com/openai-destroyed-ai-training...
We do know for a fact that the (non-OpenAI-controlled) "books3" dataset is just "all of bibliotik":
https://www.twitter.com/theshawwn/status/1320282149329784833
https://github.com/soskek/bookcorpus/issues/27
And we also apparently know for a fact that this was included in the datasets used to train LLaMA:
https://en.wikipedia.org/wiki/The_Pile_(dataset)
https://aicopyright.substack.com/p/the-books-used-to-train-l...
https://aicopyright.substack.com/p/has-your-book-been-used-t...
Python Crawler related posts
-
Claude is now available in Europe
-
The Great GPT Firewall
-
Looking for something like ArchiveBox but with the recursive functionality of HTTrack
-
Show HN: New AI Dataset Based on LibGen and Sci-Hub
-
How to make scrapy run multiple times on the same URLs?
-
Multiparadigm Web Scraper!
-
struggling to download websites
Index
What are some of the best open-source crawler projects in Python? This list will help you find them:
| # | Project | Stars |
|---|---|---|
| 1 | Scrapy | 51,270 |
| 2 | pyspider | 16,373 |
| 3 | newspaper | 13,808 |
| 4 | Photon | 10,586 |
| 5 | Douyin_TikTok_Download_API | 7,263 |
| 6 | autoscraper | 5,998 |
| 7 | scrapy-redis | 5,470 |
| 8 | myGPTReader | 4,407 |
| 9 | ProxyBroker | 3,740 |
| 10 | toapi | 3,462 |
| 11 | weibo-crawler | 3,164 |
| 12 | trafilatura | 3,012 |
| 13 | TorBot | 2,675 |
| 14 | Grab | 2,362 |
| 15 | news-please | 1,957 |
| 16 | PSpider | 1,811 |
| 17 | OpenWPM | 1,317 |
| 18 | grab-site | 1,273 |
| 19 | mlscraper | 1,235 |
| 20 | botasaurus | 1,003 |
| 21 | XSRFProbe | 915 |
| 22 | scrapyrt | 816 |
| 23 | bookcorpus | 784 |