An Introduction to the WARC File

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • ArchiveBox

    🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

  • API is coming soon (relatively, it's still a one-man project)! Stay tuned https://github.com/ArchiveBox/ArchiveBox/issues/496

    I have an event-sourcing refactor in progress now to allow us to pluginize functionality like the API (similar to Home Assistant with a plugin app store); it will take a month or two. Next up is the REST API using the new plugin system.

  • linkwarden

    ⚡️⚡️⚡️Self-hosted collaborative bookmark manager to collect, organize, and preserve webpages and articles.

  • monolith

    ⬛️ CLI tool for saving complete web pages as a single HTML file

  • I have never used monolith, so I can't say with any certainty, but two things in your description are worth highlighting about the goals of WARC versus the umpteen bazillion "save this one page I'm looking at as a single file" type projects:

    1. WARC is designed, as a goal, to archive the request-response handshake. It does not get into the business of trying to make it easy for a browser to subsequently display that content, since that's a browser's problem

    2. Using your cited project specifically, observe the number of "well, save it but ..." options <https://github.com/Y2Z/monolith#options> which is in stark contrast to the archiving goals I just spoke about. It's not a good snapshot of history if the server responded with `content-type: text/html;charset=iso-8859-1` back in the 90s but "modern tools" want everything to be UTF-8 so we'll just convert it, shall we? Bah, I don't like JavaScript, so we'll just toss that out, shall we? And so on

    For 100% clarity: monolith, and similar, may work fantastically for any individual's workflow, and I'm not here to yuck anyone's yum; but I do want to highlight that, all things being equal, it should always be possible to derive monolith files from WARC files, because the WARC files are (or at least have the goal of being) a perfect-fidelity record of what the exchange was. I would guess only pcap files would be of higher fidelity, but they also carry a lot more extraneous or potentially privacy-violating details.
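
To make the "archive the exchange as-is" idea concrete, here is a minimal sketch of wrapping raw HTTP response bytes in a WARC/1.0 `response` record using only the Python standard library. The function name `build_warc_response_record` and the sample response are hypothetical illustrations, not part of any project mentioned here; a real archiver would normally use a dedicated WARC library and also write `warcinfo` and `request` records.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, raw_http_response: bytes) -> bytes:
    """Wrap the raw bytes of an HTTP response in a WARC/1.0 'response' record.

    Hypothetical helper for illustration only. The payload is stored
    byte-for-byte: no charset conversion, no script stripping.
    """
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(raw_http_response)}",
    ]
    # WARC header lines are CRLF-terminated; an empty line ends the header block.
    header_block = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Two CRLFs after the content block terminate the record.
    return header_block + raw_http_response + b"\r\n\r\n"


# Made-up response, shaped like something a 1990s server might have sent.
raw = (
    b"HTTP/1.0 200 OK\r\n"
    b"Content-Type: text/html;charset=iso-8859-1\r\n"
    b"\r\n"
    b"<html><body>caf\xe9</body></html>"
)
record = build_warc_response_record("http://example.com/", raw)
```

The ISO-8859-1 byte `\xe9` survives untouched inside the record, which is the point made above: any conversion to UTF-8 or to a single-file HTML page can be derived later, but the archived bytes stay as the server sent them.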

  • warc

    Python library for reading and writing warc files (by internetarchive)

  • I wrote a library to handle these and the older ARC files to use with an archiving proxy that a friend and I built for the Internet Archive: https://github.com/internetarchive/warc. He wrote the WARC parts while I did the ARC part, using this format description: https://archive.org/web/researcher/ArcFileFormat.php

    Good memories.

  • WarcDB

    WarcDB: Web crawl data as SQLite databases.

  • full-text-tabs-forever

    Full text search all your browsing history

  • A bit of a late response, but yes I've been storing full text of every website I visit and it's excellent for finding stuff again.

    The idea is to index pages as you visit them using a browser extension, thus avoiding all the pitfalls of being treated like a bot.

    Here's the project: https://github.com/iansinnott/full-text-tabs-forever
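
As a rough illustration of "index pages as you visit them", the sketch below keeps a tiny in-memory inverted index in Python. It is not how full-text-tabs-forever is implemented (that project is a browser extension); the class name `PageIndex`, its methods, and the example URLs are all hypothetical, and a real tool would persist the index and rank results.

```python
import re
from collections import defaultdict


class PageIndex:
    """Toy inverted index mapping each word to the URLs whose text contains it."""

    def __init__(self):
        self._postings = defaultdict(set)  # token -> set of URLs

    def add_page(self, url, text):
        # Called once per visited page with the text the browser already rendered.
        for token in re.findall(r"\w+", text.lower()):
            self._postings[token].add(url)

    def search(self, query):
        # Return the URLs that contain every word of the query.
        tokens = re.findall(r"\w+", query.lower())
        if not tokens:
            return set()
        results = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            results &= self._postings.get(token, set())
        return results


index = PageIndex()
index.add_page("https://example.com/warc", "An introduction to the WARC file format")
index.add_page("https://example.com/sqlite", "Web crawl data stored as SQLite databases")
hits = index.search("warc format")
```

Because the text comes from pages the user has already loaded, nothing is re-fetched, which is why this approach sidesteps being treated like a bot.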

Related posts

  • Ask HN: How can I back up an old vBulletin forum without admin access?

    9 projects | news.ycombinator.com | 29 Jan 2024
  • Best practices for archiving websites

    2 projects | /r/datacurator | 6 Dec 2023
  • An easy solution to save entire websites for my dad?

    7 projects | /r/DataHoarder | 17 Oct 2021
  • Looking for open source software to scrape webpages but also make them searchable with a webui. (locally hosted)

    4 projects | /r/DataHoarder | 16 Jan 2021
  • Alternative to HTTrack (website copier) as of 2023?

    4 projects | /r/DataHoarder | 10 Feb 2023