An Introduction to the WARC File

ArchiveBox

250 20,023 9.8 Python

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

API is coming soon (relatively, it's still a one-man project)! Stay tuned https://github.com/ArchiveBox/ArchiveBox/issues/496
I have an event-sourcing refactor in progress now to allow us to pluginize functionality like the API (similar to Home Assistant, with a plugin app store); it will take a month or two. Next up is the REST API using the new plugin system.

linkwarden

19 6,367 9.7 TypeScript

⚡️⚡️⚡️Self-hosted collaborative bookmark manager to collect, organize, and preserve webpages and articles.

monolith

23 10,149 7.2 Rust

⬛️ CLI tool for saving complete web pages as a single HTML file

I have never used monolith, so I can't say this with any certainty, but two things in your description are worth highlighting about the goals of WARC versus the umpteen bazillion "save this one page I'm looking at as a single file" type projects:
1. WARC is designed, as a goal, to archive the request-response handshake. It does not get into the business of trying to make it easy for a browser to subsequently display that content, since that's a browser's problem.
2. Using your cited project specifically, observe the number of "well, save it but ..." options <https://github.com/Y2Z/monolith#options>, which is in stark contrast to the archiving goals I just spoke about. It's not a good snapshot of history if the server responded with `content-type: text/html;charset=iso-8859-1` back in the 90s but "modern tools" want everything to be UTF-8 so we'll just convert it, shall we? Bah, I don't like JavaScript, so we'll just toss that out, shall we? And so on.
For 100% clarity: monolith, and similar tools, may work fantastically for any individual's workflow, and I'm not here to yuck anyone's yum; but I do want to highlight that, all things being equal, it should always be possible to derive monolith files from WARC files, because the WARC files are (or at least have the goal of being) a perfect-fidelity record of what the exchange was. I would guess only pcap files would be of higher fidelity, but they also carry a lot more extraneous or potentially privacy-violating details.
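To make the "archive the exchange, not the rendering" point concrete, here is a minimal sketch of what a WARC record roughly looks like on disk, using only the Python standard library. The helper names `build_warc_record` and `read_warc_record` are hypothetical, the HTTP messages are made up for illustration, and the record carries only a handful of headers; real tools write more headers and usually gzip each record, so treat this as an illustration of the idea rather than a conforming writer.

```python
import io
import uuid
from datetime import datetime, timezone
from typing import BinaryIO, Dict, Tuple


def build_warc_record(record_type: str, target_uri: str, http_message: bytes) -> bytes:
    """Wrap raw HTTP bytes in one WARC/1.0 record without touching them.

    Hypothetical helper: record_type is expected to be "request" or "response".
    """
    warc_headers = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: application/http; msgtype={record_type}",
        f"Content-Length: {len(http_message)}",
    ]
    head = ("\r\n".join(warc_headers) + "\r\n\r\n").encode("utf-8")
    # The captured bytes are stored verbatim, followed by two CRLFs.
    return head + http_message + b"\r\n\r\n"


def read_warc_record(stream: BinaryIO) -> Tuple[Dict[str, str], bytes]:
    """Read one uncompressed record back; returns (WARC headers, raw HTTP bytes)."""
    version = stream.readline().rstrip(b"\r\n").decode("utf-8")
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record")
    headers: Dict[str, str] = {}
    while True:
        line = stream.readline().rstrip(b"\r\n").decode("utf-8")
        if not line:  # blank line ends the WARC header block
            break
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    payload = stream.read(int(headers["Content-Length"]))
    stream.read(4)  # skip the CRLF CRLF that terminates the record
    return headers, payload


# Made-up exchange: note the legacy charset and the non-UTF-8 byte in the body.
raw_request = b"GET /index.html HTTP/1.0\r\nHost: example.com\r\n\r\n"
raw_response = (
    b"HTTP/1.0 200 OK\r\n"
    b"Content-Type: text/html;charset=iso-8859-1\r\n"
    b"\r\n"
    b"<html><body>caf\xe9</body></html>"
)

archive = io.BytesIO(
    build_warc_record("request", "http://example.com/index.html", raw_request)
    + build_warc_record("response", "http://example.com/index.html", raw_response)
)

request_headers, stored_request = read_warc_record(archive)
response_headers, stored_response = read_warc_record(archive)
```

Both halves of the handshake come back byte for byte, including the `iso-8859-1` header and the `\xe9` byte that a "convert everything to UTF-8" tool would have rewritten. Any single-file HTML view could, in principle, be derived from those stored bytes later, while the reverse is not possible.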

warc

1 232 10.0 Python

Python library for reading and writing warc files (by internetarchive)

I wrote a library to handle these and the older ARC files to use with an archiving proxy that a friend and I built for the Internet Archive. https://github.com/internetarchive/warc. He wrote the WARC parts while I did the ARC one using this https://archive.org/web/researcher/ArcFileFormat.php
Good memories.

WarcDB

7 384 7.2 Python

WarcDB: Web crawl data as SQLite databases.
full-text-tabs-forever

4 59 8.5 TypeScript

Full text search all your browsing history

A bit of a late response, but yes I've been storing full text of every website I visit and it's excellent for finding stuff again.
The idea is to index pages as you visit them using a browser extension, thus avoiding all the pitfalls of being treated like a bot.
Here's the project: https://github.com/iansinnott/full-text-tabs-forever

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Ask HN: How can I back up an old vBulletin forum without admin access?

9 projects | news.ycombinator.com | 29 Jan 2024
Best practices for archiving websites

2 projects | /r/datacurator | 6 Dec 2023
An easy solution to save entire websites for my dad?

7 projects | /r/DataHoarder | 17 Oct 2021
Looking for open source software to scrape webpages but also make them searchable with a webui. (locally hosted)

4 projects | /r/DataHoarder | 16 Jan 2021
Alternative to HTTrack (website copier) as of 2023?

4 projects | /r/DataHoarder | 10 Feb 2023

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Tags: warc, web-archiving, Archiving and Digital Preservation (DP), save-the-internet, Crawling
Post date: 29 Jan 2024
