-
xidel
Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
-
CoinCap-firehose-s3-DynamicPartitioning
AWS CDK project using TypeScript. Services: Lambda, Kinesis Firehose, Glue, QuickSight.
-
astro-sdk
Astro SDK allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow.
Great recommendations by the rest of the members here. I would love to learn more about your use case if possible, as we are adding native REST, WebSocket, and gRPC support to our message broker (Memphis). Let's chat if possible; I would love to work on this together.
Xidel for extraction and pagination
GNU Parallel for parallelism, retry, and resumption
jq for JSON processing
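For anyone who would rather stay in one language, the parallelism, retry, and resumption part of that toolchain can be roughly approximated in Python. This is only a sketch and it does not use Xidel, GNU Parallel, or jq. The names `with_retry`, `fetch_pages`, and the `fetch_page` callable are hypothetical, and `done` stands in for whatever record of completed work you keep between runs.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_retry(fn, attempts=3, delay=0.0):
    """Call fn(); on an exception, retry until `attempts` total tries are used."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of tries: surface the last error
            time.sleep(delay)


def fetch_pages(fetch_page, pages, done=None, workers=4):
    """Fetch pages in parallel, skipping pages already present in `done`.

    `fetch_page` is a hypothetical callable taking a page number and
    returning that page's parsed data. `done` maps page -> result from
    an earlier run, which is what makes resumption possible.
    """
    done = {} if done is None else done
    todo = [p for p in pages if p not in done]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with `todo`.
        results = pool.map(lambda p: with_retry(lambda: fetch_page(p)), todo)
        for page, result in zip(todo, results):
            done[page] = result
    return done
```

Persisting `done` to disk between runs (for example as a JSON file) would give behavior loosely similar to resuming an interrupted job.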
I have a small project like this that I did before, which I am going to shamelessly plug, lol. https://github.com/PanzerFlow/aws_lambda_reddit_api
AWS: deploy using maintained Terraform scripts.
I agree with the cron-triggered Lambda approach. For inspiration, I have a small project where a Lambda pulls data from a public API and writes it to a Firehose, which buffers the data and writes it to S3. There is also a cron job on Glue which catalogues the data. https://github.com/TrygviZL/CoinCap-firehose-s3-DynamicPartitioning
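To make the Lambda-to-Firehose step concrete, here is a minimal sketch in Python with boto3. It is not code from the linked repository (which is described as a TypeScript CDK project). The API URL, the stream name, and the injectable `firehose` and `fetch` parameters are all assumptions added for illustration and testing.

```python
import json
import urllib.request

# Placeholders, not taken from the linked project.
API_URL = "https://api.example.com/v1/assets"
STREAM_NAME = "example-delivery-stream"


def fetch_json(url, timeout=10):
    """Fetch a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def to_firehose_record(item):
    """Serialize one item as a line of JSON so records stay separated in S3."""
    return {"Data": (json.dumps(item) + "\n").encode("utf-8")}


def handler(event, context, firehose=None, fetch=fetch_json):
    """Lambda entry point: pull from the API and forward each item to Firehose."""
    if firehose is None:
        import boto3  # imported lazily so the module loads without boto3

        firehose = boto3.client("firehose")
    payload = fetch(API_URL)
    items = payload if isinstance(payload, list) else [payload]
    for item in items:
        firehose.put_record(
            DeliveryStreamName=STREAM_NAME,
            Record=to_firehose_record(item),
        )
    return {"records_sent": len(items)}
```

Firehose then handles the buffering and the writes to S3, so the function itself stays stateless. The cron schedule and the Glue crawler would be configured outside this code.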
I have an example here using COVID data. Basically, you just write a Python function that reads the API and returns a dataframe (or any number of dataframes), and downstream tasks can then read the output as either a dataframe or a SQL table.
RudderStack is an open-source tool to build data pipelines with high-availability and high-precision event ordering. It is suitable for your use case as