Ask HN: Best way to keep the raw HTML of scraped pages?

warc-proxy

1 60 10.0 Python

Serving content from a WARC

I thought that mitmproxy did this, but a cursory search didn't turn up anything; that said, their actual format[1] has even more fidelity (I'd guess it's comparable to Wireshark).
One should be aware that WARC is great for preservation, but getting content back out of it requires specialized tooling à la https://github.com/alard/warc-proxy
1: https://github.com/mitmproxy/mitmproxy/blob/9.0.1/mitmproxy/...
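To make the "specialized tooling" point concrete, here is a minimal sketch of what a WARC response record looks like on disk and what it takes to get the HTML back out. It uses only the standard library and handles uncompressed WARC/1.0 records only; the function names (`write_response_record`, `read_records`, `html_body`) are hypothetical and are not taken from warc-proxy or mitmproxy. Real archives are usually gzip-compressed per record, so a maintained WARC library is likely the better choice in practice.

```python
import uuid
from datetime import datetime, timezone


def write_response_record(out, url, http_bytes):
    """Append one WARC/1.0 'response' record holding a raw HTTP response."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Target-URI", url),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_bytes))),
    ]
    out.write(b"WARC/1.0\r\n")
    for name, value in headers:
        out.write(f"{name}: {value}\r\n".encode("utf-8"))
    out.write(b"\r\n")      # blank line ends the WARC header block
    out.write(http_bytes)   # record block: status line, HTTP headers, body
    out.write(b"\r\n\r\n")  # two CRLFs terminate the record


def read_records(stream):
    """Yield (warc_headers, block_bytes) for each record in an uncompressed stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # clean end of file
        if not version.startswith(b"WARC/"):
            raise ValueError(f"not a WARC record: {version!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            # Split on the first colon only, so URLs in values stay intact.
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        block = stream.read(int(headers["Content-Length"]))
        stream.read(4)  # skip the CRLF CRLF record terminator
        yield headers, block


def html_body(http_bytes):
    """Strip the HTTP status line and headers, returning only the payload."""
    _, _, body = http_bytes.partition(b"\r\n\r\n")
    return body
```

Even in this toy form, recovering a page means parsing two layers (the WARC envelope, then the HTTP message inside it), which is why a plain "one file per page" layout is easier to grep but loses the request/response metadata that WARC preserves.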

mitmproxy

153 34,737 9.4 Python

An interactive TLS-capable intercepting HTTP proxy for penetration testers and software developers.


Scrapy

182 51,270 9.6 Python

Scrapy, a fast high-level web crawling & scraping framework for Python.

If you weren't already aware, Scrapy has strong support for this via its HttpCacheMiddleware; you can choose whether it actually behaves like a cache, returning already-scraped content on a match, or merely acts as a pass-through cache: https://docs.scrapy.org/en/2.7/topics/downloader-middleware....
Its out-of-the-box storage does what the sibling comment describes, SHA-1-ing the request and then sharding the output filename by the first two characters: https://github.com/scrapy/scrapy/blob/2.7.1/scrapy/extension...
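The hash-and-shard layout described above can be sketched in a few lines of standard-library Python. This is a simplified illustration, not Scrapy's code: the helper names (`cache_path`, `store`, `load`) are hypothetical, and the fingerprint here is a plain SHA-1 over method, URL, and body, whereas a real crawler would probably canonicalize the URL first and save response headers alongside the HTML.

```python
import hashlib
from pathlib import Path


def cache_path(root, method, url, body=b""):
    """Map a request to <root>/<first two hex chars>/<full sha1 hex>."""
    digest = hashlib.sha1()
    digest.update(method.upper().encode("utf-8"))
    digest.update(b"\n")
    digest.update(url.encode("utf-8"))
    digest.update(b"\n")
    digest.update(body)
    key = digest.hexdigest()
    # Sharding by the first two characters caps the top level at 256
    # directories, so no single directory grows without bound.
    return Path(root) / key[:2] / key


def store(root, method, url, html, body=b""):
    """Write the raw HTML bytes for a request and return the file path."""
    path = cache_path(root, method, url, body)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(html)
    return path


def load(root, method, url, body=b""):
    """Return the stored HTML bytes, or None if the request was never saved."""
    path = cache_path(root, method, url, body)
    return path.read_bytes() if path.exists() else None
```

Because the path is derived from the request alone, looking up a page later needs no index: recompute the hash and read the file. The trade-off is that the filenames are opaque, so keeping the URL in a sidecar file or database row is worth considering.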

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Tags: Python, HacktoberFest, Web Crawling, Troubleshooting, Security
Post date: 11 Nov 2022
