Top 16 C++ Data Analysis Projects

cudf

Mentions: 23 | Stars: 7,333 | Activity: 9.9 | Language: C++

cuDF - GPU DataFrame Library

Project mention: A Polars exploration into Kedro | dev.to | 2023-05-17

The interesting thing about Polars is that it does not try to be a drop-in replacement for pandas, like Dask, cuDF, or Modin, and instead has its own expressive API. Despite being a young project, it quickly got popular thanks to its easy installation process and its “lightning fast” performance.

matplotplusplus

Mentions: 26 | Stars: 3,965 | Activity: 5.8 | Language: C++

Matplot++: A C++ Graphics Library for Data Visualization 📊🗾

Project mention: Creating k-NN with C++ (from Scratch) | dev.to | 2024-01-11

cmake_minimum_required(VERSION 3.5)
project(knn_cpp CXX)

# Set up C++ version and properties
include(CheckIncludeFileCXX)
check_include_file_cxx(any HAS_ANY)
check_include_file_cxx(string_view HAS_STRING_VIEW)
check_include_file_cxx(coroutine HAS_COROUTINE)
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_BUILD_TYPE Debug)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)

# Copy data file to build directory
file(COPY ${CMAKE_CURRENT_SOURCE_DIR}/iris.data DESTINATION ${CMAKE_CURRENT_BINARY_DIR})

# Download library using FetchContent
include(FetchContent)
FetchContent_Declare(matplotplusplus
    GIT_REPOSITORY https://github.com/alandefreitas/matplotplusplus
    GIT_TAG origin/master)
FetchContent_GetProperties(matplotplusplus)
if(NOT matplotplusplus_POPULATED)
    FetchContent_Populate(matplotplusplus)
    add_subdirectory(${matplotplusplus_SOURCE_DIR} ${matplotplusplus_BINARY_DIR} EXCLUDE_FROM_ALL)
endif()

FetchContent_Declare(
    fmt
    GIT_REPOSITORY https://github.com/fmtlib/fmt.git
    GIT_TAG 7.1.3 # Adjust the version as needed
)
FetchContent_MakeAvailable(fmt)

# Add executable and link project libraries and folders
add_executable(${PROJECT_NAME} main.cc)
target_link_libraries(${PROJECT_NAME} PUBLIC matplot fmt::fmt)
aux_source_directory(lib LIB_SRC)
target_include_directories(${PROJECT_NAME} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR})
target_sources(${PROJECT_NAME} PRIVATE ${LIB_SRC})
add_subdirectory(tests)

root

Mentions: 31 | Stars: 2,425 | Activity: 10.0 | Language: C++

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically

Project mention: If you can't reproduce the model then it's not open-source | news.ycombinator.com | 2024-01-17

I think the process of data acquisition isn't so clear-cut. Take CERN as an example: they release loads of data from various experiments under the CC0 license [1]. This isn't just a few small datasets for classroom use; we're talking big-league data, like the entire first run data from LHCb [2].
On their portal, they don't just dump the data and leave you to it. They've got guides on analysis and the necessary tools (mostly open source stuff like ROOT [3] and even VMs). This means anyone can dive in. You could potentially discover something new or build on existing experiment analyses. This setup, with open data and tools, ticks the boxes for reproducibility. But does it mean people need to recreate the data themselves?
Ideally, yeah, but realistically, while you could theoretically rebuild the LHC (since most technical details are public), it would take an army of skilled people, billions of dollars, and years to do it.
This contrasts with open source models, where you can retrain models using data to get the weights. But getting hold of the data and the cost to reproduce the weights is usually prohibitive. I get that CERN's approach might seem to counter this, but remember, they're not releasing raw data (which is mostly noise), but a more refined version. If not, try downloading several petabytes of raw data; good luck with that. But for training something like an LLM, you might need the whole dataset, which in many cases has its own problems with copyright, etc.
[1] https://opendata.cern.ch/docs/terms-of-use
[2] https://opendata.cern.ch/docs/lhcb-releases-entire-run1-data...
[3] https://root.cern/

DataFrame

Mentions: 109 | Stars: 2,280 | Activity: 9.4 | Language: C++

C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage

Project mention: New multithreaded version of C++ DataFrame was released | news.ycombinator.com | 2024-02-13

datatable

Mentions: 9 | Stars: 1,790 | Activity: 6.1 | Language: C++

A Python package for manipulating 2-dimensional tabular data structures

Project mention: Cheat Sheets for data.table to Python's pandas syntax? | /r/Rlanguage | 2023-06-20

Aside from that, there is a Python translation of data.table (see documentation here), which might be worth looking into. However, it hasn't had any major updates in a while: the last release was 2 years ago ...

TileDB

Mentions: 14 | Stars: 1,771 | Activity: 9.7 | Language: C++

The Universal Storage Engine

Project mention: Ask HN: Who is hiring? (May 2024) | news.ycombinator.com | 2024-05-01

TileDB, Inc. | Full-Time | REMOTE | USA, Greece/EU | [https://tiledb.com](https://tiledb.com/)
TileDB has recently announced a $34 million Series B fund-raise and is actively hiring for engineers across a range of roles (SRE, backend/distributed systems, database internals, and more). You will have the opportunity to work on innovative technology that creates impact for challenging problems in genomics, geospatial, machine learning, distributed systems, and many other areas.
TileDB Cloud is the modern database, allowing developers and scientists to capture, analyze, and share any data with any tool. We build on a broad foundation of open source, maintaining the TileDB storage engine, libraries for genomics (single-cell and population), geospatial (raster, point clouds, and more), a TileDB visualization engine extending Babylon.js, and much more ([github.com/TileDB-Inc/TileDB](http://github.com/TileDB-Inc/TileDB))
With TileDB, all data — tables, genomics, images, videos, location, time-series — is captured as multi-dimensional arrays. To supercharge this data, TileDB Cloud implements a serverless infrastructure delivering query execution, access control, data and code sharing, and distributed computing at global scale — eliminating cluster management, minimizing TCO, and promoting scientific collaboration and reproducibility.
Website: [https://tiledb.com](https://tiledb.com/) | GitHub: https://github.com/TileDB-Inc/TileDB | Blog: https://tiledb.com/blog
We are actively hiring for several roles including:
- Site Reliability Engineer (k8s, Terraform, automation, Prometheus, CloudWatch, GitOps; Golang, Python)

ArcticDB

Mentions: 4 | Stars: 1,123 | Activity: 9.8 | Language: C++

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.

Project mention: Speed Test - ArcticDB, HDF, Feather, Parquet | /r/algotrading | 2023-11-21

ArcticDB is a new data store for pandas DataFrames (https://arcticdb.io/). I have no affiliation with the project but wanted to see how it would compare on speed versus the other file format storage options available in pandas: HDF, Feather, and Parquet. I could not find much online about how Arctic compares to the other options in terms of speed, so I ran some tests myself.

oneDAL

Mentions: 1 | Stars: 593 | Activity: 9.3 | Language: C++

oneAPI Data Analytics Library (oneDAL)

gdl

Mentions: 2 | Stars: 265 | Activity: 9.5 | Language: C++

GDL - GNU Data Language

AlphaPlot

Mentions: 1 | Stars: 238 | Activity: 3.1 | Language: C++

📈 Application for statistical analysis and data visualization which can generate different types of publication quality 2D and 3D plots with extensive visual customization.

volbx

Mentions: 1 | Stars: 231 | Activity: 3.6 | Language: C++

Graphical tool for data manipulation written in C++/Qt.

Graphia

Mentions: 8 | Stars: 227 | Activity: 9.7 | Language: C++

A visualisation tool for the creation and analysis of graphs

Project mention: NetworkX – Network Analysis in Python | news.ycombinator.com | 2023-12-08

Export the graph to GML or to GraphML or to GraphViz DOT or to some other Graph format. BTW I recommend 3D graph visualization over 2D when possible, that is when you're exploring interactively as opposed to printing figures. The Graphia tool is the only FOSS tool for this purpose that I know of:
https://graphia.app
https://github.com/graphia-app/graphia
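As a rough illustration of the export step mentioned above, the following sketch writes an undirected edge list as Graphviz DOT text, one of the formats named in the comment. The function name `to_dot` and the edge-list representation are hypothetical and chosen only for this example; it assumes integer node IDs and a graph name that is already a valid DOT identifier.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of any library mentioned above):
// serialise an undirected edge list as Graphviz DOT text.
// Assumes `name` is a valid DOT identifier and nodes are plain integers.
std::string to_dot(const std::string& name,
                   const std::vector<std::pair<int, int>>& edges) {
    std::ostringstream out;
    out << "graph " << name << " {\n";
    for (const auto& [u, v] : edges) {
        // "--" is the DOT edge operator for undirected graphs.
        out << "    " << u << " -- " << v << ";\n";
    }
    out << "}\n";
    return out.str();
}
```

Writing the returned string to a `.dot` file gives something a DOT-aware viewer can open; a directed graph would use `digraph` and `->` instead.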

nebula

Mentions: 9 | Stars: 150 | Activity: 7.4 | Language: C++

A distributed block-based data storage and compute engine (by varchar-io)

vinum

Mentions: 5 | Stars: 65 | Activity: 0.0 | Language: C++

Vinum is a SQL processor for Python, designed for data analysis workflows and in-memory analytics.

MachineLearning

Mentions: 6 | Stars: 17 | Activity: 6.9 | Language: C++

From linear regression towards neural networks... (by aromanro)

Project mention: Get gradient of Softmax activation | /r/learnmachinelearning | 2023-07-12

Softmax is at the end of this source file: https://github.com/aromanro/MachineLearning/blob/master/MachineLearning/MachineLearning/ActivationFunctions.h
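For readers who want the gradient itself, here is a minimal, generic sketch of softmax and its Jacobian in C++. It is not taken from the linked repository; the function names `softmax` and `softmax_jacobian` are illustrative. It uses the standard identity that, for s = softmax(z), the derivative of s_i with respect to z_j is s_i * (delta_ij - s_j).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax: subtracting the largest logit before
// exponentiating avoids overflow and does not change the result.
std::vector<double> softmax(const std::vector<double>& logits) {
    assert(!logits.empty());
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> probs(logits.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (double& p : probs) {
        p /= sum;
    }
    return probs;
}

// Jacobian of softmax given its output: J[i][j] = s_i * (delta_ij - s_j).
std::vector<std::vector<double>> softmax_jacobian(const std::vector<double>& probs) {
    const std::size_t n = probs.size();
    std::vector<std::vector<double>> jac(n, std::vector<double>(n));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            const double delta = (i == j) ? 1.0 : 0.0;
            jac[i][j] = probs[i] * (delta - probs[j]);
        }
    }
    return jac;
}
```

Calling `softmax_jacobian(softmax(z))` yields a symmetric matrix whose rows each sum to zero, which is a quick sanity check. When softmax is paired with a cross-entropy loss, the combined gradient with respect to the logits simplifies to the probabilities minus the one-hot target, so the full Jacobian is often not needed.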

vif

Mentions: 1 | Stars: 11 | Activity: 3.7 | Language: C++

Easy, robust, and fast numerics in C++. (by cschreib)

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

C++ Data Analysis related posts

New multithreaded version of C++ DataFrame was released

1 project | news.ycombinator.com | 13 Feb 2024

If you can't reproduce the model then it's not open-source

2 projects | news.ycombinator.com | 17 Jan 2024

What software is used to generate plots/graphs like this seen in many particle physics papers?

1 project | /r/PhysicsStudents | 10 Dec 2023

DataFrame: NEW Data - star count:2013.0

1 project | /r/algoprojects | 21 Nov 2023

DataFrame: NEW Data - star count:2013.0

1 project | /r/algoprojects | 20 Nov 2023

DataFrame: NEW Data - star count:2013.0

1 project | /r/algoprojects | 19 Nov 2023

C++ DataFrame vs. Polars

1 project | /r/cpp | 18 Nov 2023

Index

What are some of the best open-source Data Analysis projects in C++? This list will help you:

#	Project	Stars
1	cudf	7,333
2	matplotplusplus	3,965
3	root	2,425
4	DataFrame	2,280
5	datatable	1,790
6	TileDB	1,771
7	ArcticDB	1,123
8	oneDAL	593
9	gdl	265
10	AlphaPlot	238
11	volbx	231
12	Graphia	227
13	nebula	150
14	vinum	65
15	MachineLearning	17
16	vif	11
