Top 10 Python warc Projects
- ArchiveBox
  🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
- grab-site
  The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
- cdx_toolkit
  A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
- forum-dl
  Scrape posts and threads from forums, news aggregators, and mail archives; export to JSONL, mailbox, or WARC
Project mention: 38% of webpages that existed in 2013 are no longer accessible a decade later | news.ycombinator.com | 2024-05-18
There's also https://archivebox.io, which can take your bookmarks and archive them in many ways. Unfortunately, when I last tried it, it was a bit buggy. I wish there were a better solution for building a nice archive of the sites I visit most often, just in case.
Project mention: YaCy, a distributed Web Search Engine, based on a peer-to-peer network | news.ycombinator.com | 2024-03-05
Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29
The format you want is WARC. Even the Library of Congress uses it. There are many WARC scrapers. I'd look at what the Internet Archive recommends. A quick search turned up this from the Archive Team and Jason Scott: https://github.com/ArchiveTeam/grab-site (https://wiki.archiveteam.org/index.php/Who_We_Are). I found that in less than 15 seconds of searching, though, so do your own due diligence.
Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29
You can try forum-dl, a forum scraping tool I've been writing for this purpose: https://github.com/mikwielgus/forum-dl
It's single-threaded, alpha-quality software, and still isn't compatible with many forums and themes. But it can export WARCs and may just happen to work for you.
Python warc related posts
- An Introduction to the WARC File
- Ask HN: How can I back up an old vBulletin forum without admin access?
- Best practices for archiving websites
- struggling to download websites
- Internet Archive Down, will be up and running soon (i hope).
- best tool for downloading forum posts in real-time?
- Alternative to HTTrack (website copier) as of 2023?
Index
What are some of the best open-source warc projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | ArchiveBox | 20,023 |
2 | conifer | 1,458 |
3 | grab-site | 1,273 |
4 | ipwb | 591 |
5 | WarcDB | 384 |
6 | warcio | 350 |
7 | bitextor | 282 |
8 | cdx_toolkit | 153 |
9 | forum-dl | 61 |
10 | warc2zim | 36 |