Top 23 Data Open-Source Projects
-
TanStack Query
🤖 Powerful asynchronous state management, server-state utilities and data fetching for the web. TS/JS, React Query, Solid Query, Svelte Query and Vue Query.
-
Metabase
The simplest, fastest way to get business intelligence and analytics to everyone in your company
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
SheetJS js-xlsx
📗 SheetJS Spreadsheet Data Toolkit -- New home https://git.sheetjs.com/SheetJS/sheetjs
-
WorkOS
The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
-
airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
-
prql
PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
-
Bogus
A simple fake data generator for C#, F#, and VB.NET. Based on and ported from the famed faker.js.
-
machine-learning-roadmap
A roadmap connecting many of the most important concepts in machine learning, how to learn them and what tools to use to perform them.
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
-
Snowplow
The enterprise-grade behavioral data engine (web, mobile, server-side, webhooks), running cloud-natively on AWS and GCP
-
kestra
Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
Remote Code Execution via H2
ExcelJS and XLSX (SheetJS) are great libraries to work with XLSX files. The former I've found a bit easier to work with but less efficient in general.
Project mention: LlamaIndex: A data framework for your LLM applications | news.ycombinator.com | 2024-04-07
Link: https://swr.vercel.app/
Project mention: [USMNT] It only took 20 caps for Jesus Ferreira to get double-digit goals. The fastest in #USMNT history. | /r/MLS | 2023-06-29
You of course already know this answer, but just to put it into more perspective. Here are the SPI ranking equivalents to what he did with these 11 goals in Scotland and Switzerland.
We have some of this functionality in Presto (https://github.com/prestodb/presto), but it takes a fair bit of work to implement it for all the different backends.
Project mention: Prefect: A workflow orchestration tool for data pipelines | news.ycombinator.com | 2024-03-13
Project mention: Launch HN: Bracket (YC W22) – Two-Way Sync Between Salesforce and Postgres | news.ycombinator.com | 2023-12-12
I'll also give a shout-out to Airbyte (https://airbyte.com/), with which I've had some limited success integrating Salesforce with a local database. The particular pull for Airbyte is that we can self-host the open source version, rather than pay Fivetran a significant sum to do this for us.
It's an immature tool, so I don't yet know that I can claim we've spent _less_ than Fivetran on the additional engineering and ops time, but it feels like it has potential to do so once stabilized.
Instead of manually having to think of defaults for your interface properties, you could use Faker.
Project mention: Prolog language for PostgreSQL proof of concept | news.ycombinator.com | 2024-03-30
One production example I know: GitHub code navigation is written in Haskell https://github.com/github/semantic
The Bogus NuGet package is a fake data generator that can be helpful for populating database tables and for testing. If no database is used and Bogus populates a list of data each time the application runs, the data is random and never the same. Also, the random data generated by Bogus may not meet a developer's requirements.
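The point about data changing on every run can be illustrated with a small sketch. It uses only Python's standard library, not Bogus's own API, and the name lists and the `fake_people` helper are hypothetical, made up for illustration. The idea is simply that a fixed seed makes generated data repeatable, while no seed gives fresh data each run.

```python
import random

# Hypothetical sample values, chosen only for this illustration.
FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret", "Dennis"]
LAST_NAMES = ["Lovelace", "Hopper", "Torvalds", "Hamilton", "Ritchie"]


def fake_people(count, seed=None):
    """Return `count` fake person records.

    With a fixed `seed` the same records come back on every call;
    with `seed=None` the generator is seeded from system entropy.
    """
    rng = random.Random(seed)  # private generator, global random state untouched
    people = []
    for i in range(count):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        people.append(
            {"id": i + 1, "name": f"{first} {last}", "age": rng.randint(18, 90)}
        )
    return people


if __name__ == "__main__":
    # Same seed, same data: suitable for repeatable tests.
    print(fake_people(3, seed=42) == fake_people(3, seed=42))
    # No seed: the records will usually differ between runs.
    print(fake_people(3))
```

A fixed seed is usually what you want in unit tests and demo fixtures, where assertions depend on specific values.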
**[Mrdbourke/machine-learning-roadmap on GitHub](https://github.com/mrdbourke/machine-learning-roadmap)**: This GitHub repository is more focused on machine learning. It's a good choice if you're looking for a more community-driven approach, as GitHub repositories often encourage contributions and updates from various experts.
Project mention: A mage on the Hero’s Journey: a fantasy epic on how a startup rose from the ashes | dev.to | 2023-06-12
In the coming years, Mage will create a cooperative experience so that developers can build data pipelines with their team and level up together. After that journey, Mage will go on an epic quest to create the 1st open world community experience in the data universe.
Kestra's communication is asynchronous and based on a queuing mechanism. It leverages the Micronaut framework and offers two runners: one that uses a database (JDBC) for both the message queue and resource storage, and another that uses Kafka as the message queue and Elasticsearch as the resource storage. The platform is fully extensible and plugin-based, providing a rich set of plugins for various workflow tasks, triggers, and data storage options. For those interested, the GitHub repository is available here: https://github.com/kestra-io/kestra
I'm part of the team that built LlamaParse. It's a net improvement compared to other PDF->Structured Text extractors (I have built several in the past, including https://github.com/axa-group/Parsr).
For character extraction, LlamaParse uses a mixture of OCR and character extraction from the PDF (it's the only parser I'm aware of that addresses some of the buggy PDF font issues; check the 'text' mode to see the raw document before reconstruction), then uses a mixture of heuristics and machine learning models to reconstruct the document.
Once plugged into a recursive retrieval strategy, it allows you to get SOTA results on question answering over complex text (see notebook: https://github.com/run-llama/llama_parse/blob/main/examples/...).
AMA
Project mention: We might want to regularly keep track of how important each server is | news.ycombinator.com | 2024-02-06
Check out CloudQuery - https://github.com/cloudquery/cloudquery for an easy cloud asset inventory.
Data related posts
- Ask HN: Best way to mirror a Postgres database to parquet?
- LlamaIndex: A data framework for your LLM applications
- Build Responsive, Animated Charts in Minutes with EazyChart for React and Vue
- How to use fly.io and Tigris to deploy a Next.js app
- LlamaIndex is a data framework for your LLM applications
- Malloy: A language for describing data relationships and transformations
- Making a nice API of Amtrak's ugly API
-
A note from our sponsor - InfluxDB
www.influxdata.com | 19 Apr 2024
Index
What are some of the best open-source Data projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | TanStack Query | 39,548 |
2 | Metabase | 36,417 |
3 | SheetJS js-xlsx | 34,449 |
4 | llama_index | 30,639 |
5 | SWR | 29,331 |
6 | data | 16,617 |
7 | Presto | 15,582 |
8 | Prefect | 14,512 |
9 | airbyte | 13,821 |
10 | awesome-bigdata | 12,773 |
11 | faker | 11,683 |
12 | chinese-xinhua | 10,627 |
13 | prql | 9,414 |
14 | semantic-source | 8,858 |
15 | akshare | 8,321 |
16 | Bogus | 8,208 |
17 | machine-learning-roadmap | 7,164 |
18 | Mage | 6,953 |
19 | Snowplow | 6,731 |
20 | kestra | 6,260 |
21 | Tabulator | 6,167 |
22 | Parsr | 5,640 |
23 | cloudquery | 5,565 |