Cached Chrome Top Million Websites

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • ClickHouse

    ClickHouse® is a real-time analytics DBMS

  • If you are interested in the research on technologies used on the Internet, I recommend playing with the "Minicrawl" dataset.

    It contains data about ~7 million top websites. For every website, it also contains:
    - the full content of the main page;
    - the verbose output of curl, containing various timing info; the HTTP headers, protocol info...

    Using this dataset, you can build a service similar to https://builtwith.com/ for your research.

    Data: https://clickhouse-public-datasets.s3.amazonaws.com/minicraw... (129 GB compressed, ~1 TB uncompressed).

    Description: https://github.com/ClickHouse/ClickHouse/issues/18842

    You can easily try it with clickhouse-local without downloading:

      $ curl https://clickhouse.com/ | sh
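
    To give a feel for the BuiltWith-style analysis mentioned above, here is a minimal sketch in Python that guesses technologies from one site's HTTP headers and main-page HTML, the two kinds of data the dataset is described as holding. The signature table and the function name `detect_technologies` are hypothetical and purely illustrative; they are not part of the dataset or of ClickHouse, and a real service would need far more patterns.

```python
import re

# Hypothetical signature table: technology -> (where to look, pattern).
# "header:<name>" means a response header; "body" means the page HTML.
# These patterns are illustrative assumptions, not an authoritative list.
SIGNATURES = {
    "nginx": ("header:server", re.compile(r"nginx", re.I)),
    "Cloudflare": ("header:server", re.compile(r"cloudflare", re.I)),
    "PHP": ("header:x-powered-by", re.compile(r"php", re.I)),
    "WordPress": ("body", re.compile(r"wp-content|wp-includes", re.I)),
    "jQuery": ("body", re.compile(r"jquery[-.\w]*\.js", re.I)),
}


def detect_technologies(headers, body):
    """Return a sorted list of technologies whose signature matches."""
    # Header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    found = []
    for tech, (where, pattern) in SIGNATURES.items():
        if where == "body":
            haystack = body
        else:
            header_name = where.split(":", 1)[1]
            haystack = lowered.get(header_name, "")
        if pattern.search(haystack):
            found.append(tech)
    return sorted(found)


if __name__ == "__main__":
    sample_headers = {"Server": "nginx/1.18.0", "X-Powered-By": "PHP/7.4"}
    sample_body = '<link href="/wp-content/themes/x/style.css">'
    print(detect_technologies(sample_headers, sample_body))
```

    The same idea can be applied per row once the headers and page content have been extracted from the dataset; how those columns are named is not stated in the post, so the sketch takes them as plain arguments.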

  • crux-top-lists

    Downloadable snapshots of the Chrome Top Million Websites pulled from public CrUX data in Google BigQuery.

  • It's a tough thing to balance, but generally, bringing in someone's personal details as ammunition in an internet argument is not ok on HN (https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...). I'm not saying those are never relevant, but

    [editing...]

  • github-explorer

    Everything You Always Wanted To Know About GitHub (But Were Afraid To Ask)

  • Yes, it's continuously updated.

    The source code is here: https://github.com/ClickHouse/github-explorer

    This shell script updates it: https://github.com/ClickHouse/github-explorer/blob/main/upda...

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Related posts

  • We Built a 19 PiB Logging Platform with ClickHouse and Saved Millions

    1 project | news.ycombinator.com | 2 Apr 2024
  • 1 billion rows challenge in PostgreSQL and ClickHouse

    1 project | dev.to | 18 Jan 2024
  • We Executed a Critical Supply Chain Attack on PyTorch

    6 projects | news.ycombinator.com | 14 Jan 2024
  • Tell HN: Hacker News dataset on BigQuery hasn't been updated since Nov 2022

    1 project | news.ycombinator.com | 27 Dec 2023
  • Real-Time Data Enrichment and Analytics With RisingWave and ClickHouse

    1 project | dev.to | 25 Dec 2023