Polars

polars

144 26,514 10.0 Rust

Dataframes powered by a multithreaded, vectorized query engine, written in Rust

- Handling of categoricals in Polars seemed a little underbaked, though my main complaint, that categories cannot be predefined, seems to have been addressed recently: https://github.com/pola-rs/polars/issues/10705

prql

106 9,459 9.9 Rust

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement

I am very curious to know how you feel about PRQL (prql-lang.org)? IMHO it gives you the ergonomics and DX of Polars or Pandas with the power and universality of SQL, because you can still execute your queries on any SQL-compatible query execution engine of your choice, including Polars and Pandas but also DuckDB, ClickHouse, BigQuery, Redshift, Postgres, Trino/Presto, and SQLite, to name just a few popular ones.
The join syntax and semantics are one of the trickiest parts and have recently come under discussion again. Joins are a key part of any data transformation platform and are foundational to relational algebra, being right there in the "Relational" part and also the R in PRQL. Most of the PRQL built-in primitive transforms are just simple list manipulations like map, filter, or reduce, but joins require care to preserve monadic composition (see, for example, the design of SelectMany in LINQ or flatmap in the List monad). See this comment for some of my thoughts on this: https://github.com/PRQL/prql/issues/3782#issuecomment-181131...
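To make the SelectMany/flatmap remark concrete, here is a minimal Python sketch of an inner join written as a flat map over plain lists. It is not PRQL's implementation or syntax; the `flat_map` and `inner_join` helpers and the sample tables are hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def flat_map(f: Callable[[A], Iterable[B]], xs: Iterable[A]) -> list[B]:
    """Apply f to each element and concatenate the resulting sequences."""
    return [y for x in xs for y in f(x)]


# Hypothetical sample tables, represented as lists of dicts.
employees = [
    {"name": "Ann", "dept_id": 1},
    {"name": "Bob", "dept_id": 2},
    {"name": "Cy", "dept_id": 3},  # has no matching department
]
departments = [
    {"id": 1, "dept": "Data"},
    {"id": 2, "dept": "Infra"},
]


def inner_join(left, right, on):
    """Inner join: each left row maps to zero or more combined rows."""
    return flat_map(
        lambda left_row: [
            {**left_row, **right_row}
            for right_row in right
            if on(left_row, right_row)
        ],
        left,
    )


joined = inner_join(
    employees,
    departments,
    lambda emp, dep: emp["dept_id"] == dep["id"],
)
```

Unlike map or filter, one input row can yield zero, one, or many output rows, which is why a join fits flatmap rather than the simpler list transforms.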

db-benchmark

91 320 0.0 R

reproducible benchmark of database-like ops

Real-world performance is complicated since data science covers a lot of use cases.
If you're just reading a small CSV to do analysis on it, then there will be no human-perceptible difference between Polars and Pandas. If you're reading a larger CSV with 100k rows, there still won't be much of a perceptible difference.
Per this (old) benchmark, there are differences once you get into 500MB+ territory: https://h2oai.github.io/db-benchmark/

explorer

20 986 9.4 Elixir

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir

The Explorer library [0] in Elixir uses Polars underneath it.
[0] https://github.com/elixir-explorer/explorer

db-benchmark

11 124 8.0 R

reproducible benchmark of database-like ops (by duckdblabs)

DuckDB maintains a benchmark of open-source database-like tools, including Polars and Pandas:
https://duckdblabs.github.io/db-benchmark/

scikit-learn

82 58,265 9.9 Python

scikit-learn: machine learning in Python

sklearn is adding support through the dataframe interchange protocol (https://github.com/scikit-learn/scikit-learn/issues/25896). scipy, as far as I know, doesn't explicitly support dataframes (it just happens to work when you wrap a Series in `np.array` or `np.asarray`). I don't know about PyTorch but in general you can convert to numpy.
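As a rough illustration of the "convert to numpy" route mentioned above, the sketch below passes a column-like object through `np.asarray` before handing it to array-only numeric code. `TinySeries` and `zscore` are hypothetical stand-ins invented for this example, not classes or functions from Polars, pandas, or SciPy; the only real API assumed is NumPy's `__array__` conversion hook.

```python
import numpy as np


class TinySeries:
    """Hypothetical stand-in for a dataframe column (not a real library class)."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this hook when the object is passed to np.asarray/np.array.
        return np.array(self._values, dtype=dtype)


def zscore(x):
    """Array-only numeric routine: standardize values to mean 0 and std 1."""
    arr = np.asarray(x, dtype=float)
    return (arr - arr.mean()) / arr.std()


scores = zscore(TinySeries([1.0, 2.0, 3.0]))
```

The numeric routine never needs to know about the dataframe library; it only relies on the input being convertible to an array.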

quivr

2 21 9.1 Python

Python library for working with Arrow data in tabular form (by B612-Asteroid-Institute)

Polars is cool, but man, I really have come to think that dataframes are disastrous for software. The mess of internal state and confusion of writing functions that take “df” and manipulate it - it's all so hard to clean up once you’re deep in the mess.
Quivr (https://github.com/spenczar/quivr) is an alternative approach that has been working for me. Maybe types are good!
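The "maybe types are good" point can be sketched with the standard library alone. The example below is not Quivr's API; it is a hypothetical typed table (`Observations`, with made-up column names) showing how a function signature can state exactly which columns it consumes and returns, instead of accepting an opaque "df".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observations:
    """Hypothetical typed table: columns are named, typed, and immutable."""

    object_id: tuple[str, ...]
    magnitude: tuple[float, ...]

    def __post_init__(self):
        # Enforce the basic table invariant: all columns have equal length.
        if len(self.object_id) != len(self.magnitude):
            raise ValueError("columns must have the same length")

    def __len__(self) -> int:
        return len(self.object_id)


def brighter_than(obs: Observations, limit: float) -> Observations:
    """Return a new table of rows with magnitude below `limit` (lower is brighter)."""
    keep = [i for i, m in enumerate(obs.magnitude) if m < limit]
    return Observations(
        object_id=tuple(obs.object_id[i] for i in keep),
        magnitude=tuple(obs.magnitude[i] for i in keep),
    )


obs = Observations(object_id=("a", "b", "c"), magnitude=(21.5, 18.2, 19.9))
bright = brighter_than(obs, limit=20.0)
```

Because the table is frozen, `brighter_than` cannot mutate its input, so the hidden-state problem described above does not arise in this sketch.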

datafusion-ballista

12 1,302 8.2 Rust

Apache Arrow Ballista Distributed Query Engine

Not super on topic, because this is all immature and not yet integrated, but there is a scaled-out Rust dataframes-on-Arrow implementation called Ballista that could maybe form the backend of a Polars scale-out approach: https://github.com/apache/arrow-ballista

r-polars

5 397 9.8 R

Bring polars to R

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Related posts

Pure Python Distributed SQL Engine
9 projects | news.ycombinator.com | 30 Dec 2022

How moving from Pandas to Polars made me write better code without writing better code
2 projects | dev.to | 5 Mar 2024

I used multiprocessing and multithreading at the same time to drop the execution time of my code from 155+ seconds to just over 2+ seconds
1 project | /r/Python | 29 May 2023

Data Engineering with Rust
5 projects | /r/rust | 9 May 2023

Any job processing framework like Spark but in Rust?
4 projects | /r/dataengineering | 23 Mar 2023

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Python Rust SQL Arrow Dataframe
Post date: 8 Jan 2024
