qstudio
arco-era5
| | qstudio | arco-era5 |
|---|---|---|
| Mentions | 1 | 5 |
| Stars | 56 | 179 |
| Growth | - | 6.7% |
| Activity | 7.4 | 5.9 |
| Last commit | about 1 month ago | 19 days ago |
| Language | Java | Python |
| License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
qstudio
- Loading a trillion rows of weather data into TimescaleDB
What's the process for adding support for other databases to your tool qStudio?
I'm thinking perhaps you could add support for Timeplus [1]? Timeplus is a streaming-first database built on ClickHouse. The core DB engine Timeplus Proton is open source [2].
It seems that qStudio is open source [3] and written in Java, so I assume it needs a JDBC driver to add support for a new RDBMS? If so, Timeplus Proton has an open source JDBC driver [4], based on ClickHouse's driver but with modifications for streaming use cases.
[1]: https://www.timeplus.com/
[2]: https://github.com/timeplus-io/proton
[3]: https://github.com/timeseries/qstudio
[4]: https://github.com/timeplus-io/proton-java-driver
arco-era5
- Loading a trillion rows of weather data into TimescaleDB
Why?
Most weather and climate datasets, including ERA5, are highly structured on regular latitude-longitude grids. Even if you were only doing timeseries analyses for specific locations plucked from this grid, the strength of this sort of dataset is its intrinsic spatiotemporal structure and context, and it makes very little sense to completely destroy that structure unless all you will ever do is extract point timeseries. And even then, you'd probably want to decimate the data pretty dramatically, since there is very little use for, say, a point timeseries of surface temperature in the middle of the ocean!
The vast majority of research and operational applications of datasets like ERA5 are probably better served by cloud-optimized replicas of the original dataset, such as ARCO-ERA5, published through the Google Public Datasets program [1]. These versions of the dataset preserve the original structure, and chunk it in ways that are amenable to massively parallel access via cloud storage. In almost every case I've encountered in my career, a generically chunked Zarr-based archive of a dataset like this has been more than performant enough for the use cases that one might care about.
[1]: https://cloud.google.com/storage/docs/public-datasets/era5
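To make the chunking argument above concrete, here is a minimal sketch that counts how many chunks a read has to touch under two different layouts. The grid size matches a 0.25 degree global grid with one year of hourly steps, but both chunk shapes are hypothetical choices for illustration, not the actual ARCO-ERA5 layout, and `chunks_touched` is a made-up helper rather than part of any Zarr API.

```python
def chunks_touched(selection, chunk_shape):
    """Count the chunks that a contiguous selection overlaps.

    selection: list of (start, stop) index pairs, one per dimension.
    chunk_shape: chunk length along each dimension.
    """
    total = 1
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size        # index of the first chunk hit
        last = (stop - 1) // size    # index of the last chunk hit
        total *= last - first + 1
    return total


# One year of hourly data on a 0.25 degree grid: (time, latitude, longitude).
n_time, n_lat, n_lon = 8760, 721, 1440

# Two hypothetical chunkings (assumptions for this sketch only).
map_chunks = (1, n_lat, n_lon)     # one full global field per time step
series_chunks = (n_time, 10, 10)   # full time axis for a small spatial tile

# Two typical access patterns.
point_series = [(0, n_time), (360, 361), (720, 721)]  # one grid cell, all hours
single_map = [(100, 101), (0, n_lat), (0, n_lon)]     # one hour, whole globe

for name, chunks in [("map-oriented", map_chunks),
                     ("series-oriented", series_chunks)]:
    print(name,
          "point series:", chunks_touched(point_series, chunks),
          "single map:", chunks_touched(single_map, chunks))
```

Under these assumed layouts, the map-oriented chunking reads a single chunk for a global snapshot but 8760 chunks for a point timeseries, while the series-oriented chunking reverses the trade-off. That is the tension between preserving the gridded structure and flattening the data into per-location rows.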
- GraphCast: AI model for faster and more accurate global weather forecasting
You can also get some of the historical data from here: https://cloud.google.com/storage/docs/public-datasets/era5 (if the official API is too slow).
To use the data in a live fashion, I think you would need to get a license from ECMWF...
- Open-source could finally get the world’s microscopes speaking the same language
This article misses one of the coolest things about the Zarr format - that it's flexible enough that it's also becoming widely used in climate science.
In particular the Pangeo project (https://pangeo.io/architecture.html) uses large Zarr stores as a performant format in the cloud which we can analyse in parallel at scale using distributed computing frameworks like dask.
More and more climate science data is being made publicly available as Zarr in the cloud, often through open data partnerships with cloud providers, e.g. on AWS (https://aws.amazon.com/blogs/publicsector/decrease-geospatia...) or ERA-5 on GCP (https://cloud.google.com/storage/docs/public-datasets/era5).
I personally think that the more that common tooling can be shared between scientific disciplines the better.
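As a rough illustration of the chunk-parallel analysis pattern mentioned above, the sketch below computes a mean by reducing each chunk independently and then combining the partial results. It uses the standard library's thread pool as a stand-in for a distributed framework like dask, and an in-memory list as a stand-in for a chunked store; the helper names and the synthetic temperature values are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, chunk_len):
    """Split a sequence into consecutive chunks of at most chunk_len items."""
    return [values[i:i + chunk_len] for i in range(0, len(values), chunk_len)]


def partial_stats(chunk):
    """Reduce one chunk to a (sum, count) pair that can be combined later."""
    return sum(chunk), len(chunk)


def chunked_mean(values, chunk_len, workers=4):
    """Compute the mean by reducing chunks in parallel, then combining."""
    chunks = split_into_chunks(values, chunk_len)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Synthetic hourly "temperatures" for ten days (made-up values, in kelvin).
temperatures = [280.0 + (hour % 24) * 0.5 for hour in range(240)]

# One chunk per day, each reduced independently of the others.
mean = chunked_mean(temperatures, chunk_len=24)
print(mean)
```

The point of the pattern is that each chunk can be reduced without knowing about any other chunk, so the work parallelises naturally. With a real chunked store each worker would read its own chunk from storage, which is where the parallelism actually pays off.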
- Analysis-Ready, Cloud Optimized ERA5
What are some alternatives?
proton-java-driver - JDBC driver for Timeplus Proton
bioformats2raw - Bio-Formats image file format to raw format converter
zarr-python - An implementation of chunked, compressed, N-dimensional arrays for Python.
era5_in_gee - Functions and Python scripts to ingest ERA5 data into Google Earth Engine
ome-zarr-py - Implementation of next-generation file format (NGFF) specifications for storing bioimaging data in the cloud.
czifile - Read Carl Zeiss(r) Image (CZI) files
ai-models