C++ Data Analysis

Open-source C++ projects categorized as Data Analysis

Top 16 C++ Data Analysis Projects

  • cudf

    cuDF - GPU DataFrame Library

  • Project mention: A Polars exploration into Kedro | dev.to | 2023-05-17

    The interesting thing about Polars is that it does not try to be a drop-in replacement for pandas, like Dask, cuDF, or Modin, and instead has its own expressive API. Despite being a young project, it quickly became popular thanks to its easy installation process and its “lightning fast” performance.

  • matplotplusplus

    Matplot++: A C++ Graphics Library for Data Visualization 📊🗾

  • Project mention: Creating k-NN with C++ (from Scratch) | dev.to | 2024-01-11

    # FetchContent_MakeAvailable requires CMake 3.14 or newer
    cmake_minimum_required(VERSION 3.14)
    project(knn_cpp CXX)

    # Set up C++ version and properties
    include(CheckIncludeFileCXX)
    check_include_file_cxx(any HAS_ANY)
    check_include_file_cxx(string_view HAS_STRING_VIEW)
    check_include_file_cxx(coroutine HAS_COROUTINE)
    set(CMAKE_CXX_STANDARD 20)
    set(CMAKE_BUILD_TYPE Debug)
    set(CMAKE_CXX_STANDARD_REQUIRED ON)
    set(CMAKE_CXX_EXTENSIONS OFF)

    # Copy data file to build directory
    file(COPY ${CMAKE_CURRENT_SOURCE_DIR}/iris.data DESTINATION ${CMAKE_CURRENT_BINARY_DIR})

    # Download library using FetchContent
    include(FetchContent)
    FetchContent_Declare(matplotplusplus
        GIT_REPOSITORY https://github.com/alandefreitas/matplotplusplus
        GIT_TAG origin/master)
    FetchContent_GetProperties(matplotplusplus)
    if(NOT matplotplusplus_POPULATED)
        FetchContent_Populate(matplotplusplus)
        add_subdirectory(${matplotplusplus_SOURCE_DIR} ${matplotplusplus_BINARY_DIR} EXCLUDE_FROM_ALL)
    endif()

    FetchContent_Declare(
        fmt
        GIT_REPOSITORY https://github.com/fmtlib/fmt.git
        GIT_TAG 7.1.3 # Adjust the version as needed
    )
    FetchContent_MakeAvailable(fmt)

    # Add executable and link project libraries and folders
    add_executable(${PROJECT_NAME} main.cc)
    target_link_libraries(${PROJECT_NAME} PUBLIC matplot fmt::fmt)
    aux_source_directory(lib LIB_SRC)
    target_include_directories(${PROJECT_NAME} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR})
    target_sources(${PROJECT_NAME} PRIVATE ${LIB_SRC})
    add_subdirectory(tests)

  • root

    The official repository for ROOT: analyzing, storing and visualizing big data, scientifically

  • Project mention: If you can't reproduce the model then it's not open-source | news.ycombinator.com | 2024-01-17

    I think the process of data acquisition isn't so clear-cut. Take CERN as an example: they release loads of data from various experiments under the CC0 license [1]. This isn't just a few small datasets for classroom use; we're talking big-league data, like the entire first run data from LHCb [2].

    On their portal, they don't just dump the data and leave you to it. They've got guides on analysis and the necessary tools (mostly open source stuff like ROOT [3] and even VMs). This means anyone can dive in. You could potentially discover something new or build on existing experiment analyses. This setup, with open data and tools, ticks the boxes for reproducibility. But does it mean people need to recreate the data themselves?

    Ideally, yeah, but realistically, while you could theoretically rebuild the LHC (since most technical details are public), it would take an army of skilled people, billions of dollars, and years to do it.

    This contrasts with open source models, where you can retrain models using data to get the weights. But the cost of getting hold of the data and reproducing the weights is usually prohibitive. I get that CERN's approach might seem to counter this, but remember, they're not releasing raw data (which is mostly noise) but a more refined version. If they did, you would have to download several petabytes of raw data; good luck with that. But for training something like an LLM, you might need the whole dataset, which in many cases has its own problems with copyright, etc.

    [1] https://opendata.cern.ch/docs/terms-of-use

    [2] https://opendata.cern.ch/docs/lhcb-releases-entire-run1-data...

    [3] https://root.cern/

  • DataFrame

    C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage

  • Project mention: New multithreaded version of C++ DataFrame was released | news.ycombinator.com | 2024-02-13
  • datatable

    A Python package for manipulating 2-dimensional tabular data structures

  • Project mention: Cheat Sheets for data.table to Python's pandas syntax? | /r/Rlanguage | 2023-06-20

    Aside from that, there is a Python translation of data.table (see documentation here), which might be worth looking into. However, it hasn't had any major updates in a while: the last release 2 years ago ...

  • TileDB

    The Universal Storage Engine

  • Project mention: Ask HN: Who is hiring? (May 2024) | news.ycombinator.com | 2024-05-01

    TileDB, Inc. | Full-Time | REMOTE | USA, Greece/EU | [https://tiledb.com](https://tiledb.com/)

    TileDB has recently announced a $34 million Series B fund-raise and is actively hiring for engineers across a range of roles (SRE, backend/distributed systems, database internals, and more). You will have the opportunity to work on innovative technology that creates impact for challenging problems in genomics, geospatial, machine learning, distributed systems, and many other areas.

    TileDB Cloud is the modern database, allowing developers and scientists to capture, analyze, and share any data with any tool. We build on a broad foundation of open source, maintaining the TileDB storage engine, libraries for genomics (single-cell and population), geospatial (raster, point clouds, and more), a TileDB visualization engine extending Babylon.js, and much more ([github.com/TileDB-Inc/TileDB](http://github.com/TileDB-Inc/TileDB))

    With TileDB, all data — tables, genomics, images, videos, location, time-series — is captured as multi-dimensional arrays. To supercharge this data, TileDB Cloud implements a serverless infrastructure delivering query execution, access control, data and code sharing, and distributed computing at global scale — eliminating cluster management, minimizing TCO, and promoting scientific collaboration and reproducibility.

    Website: [https://tiledb.com](https://tiledb.com/) | GitHub: https://github.com/TileDB-Inc/TileDB | Blog: https://tiledb.com/blog

    We are actively hiring for several roles including:

    - Site Reliability Engineer (k8s, Terraform, automation, Prometheus, CloudWatch, GitOps; Golang, Python)

  • ArcticDB

    ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.

  • Project mention: Speed Test - ArcticDB, HDF, Feather, Parquet | /r/algotrading | 2023-11-21

    ArcticDB is a new data store for pandas DataFrames (https://arcticdb.io/). I have no affiliation with the project but wanted to see how it would compare on speed versus the other file format storage options available in Pandas: HDF, Feather, and Parquet. I could not find much on-line about how Arctic compares to the other options in terms of speed, so I ran some tests myself.

  • oneDAL

    oneAPI Data Analytics Library (oneDAL)

  • gdl

    GDL - GNU Data Language

  • AlphaPlot

    📈 Application for statistical analysis and data visualization which can generate different types of publication quality 2D and 3D plots with extensive visual customization.

  • volbx

    Graphical tool for data manipulation written in C++/Qt.

  • Graphia

    A visualisation tool for the creation and analysis of graphs

  • Project mention: NetworkX – Network Analysis in Python | news.ycombinator.com | 2023-12-08

    Export the graph to GML or to GraphML or to GraphViz DOT or to some other Graph format. BTW I recommend 3D graph visualization over 2D when possible, that is when you're exploring interactively as opposed to printing figures. The Graphia tool is the only FOSS tool for this purpose that I know of:

    https://graphia.app

    https://github.com/graphia-app/graphia

  • nebula

    A distributed block-based data storage and compute engine (by varchar-io)

  • vinum

    Vinum is a SQL processor for Python, designed for data analysis workflows and in-memory analytics.

  • MachineLearning

    From linear regression towards neural networks... (by aromanro)

  • Project mention: Get gradient of Softmax activation | /r/learnmachinelearning | 2023-07-12

    Softmax is at the end of this source file: https://github.com/aromanro/MachineLearning/blob/master/MachineLearning/MachineLearning/ActivationFunctions.h

  • vif

    Easy, robust, and fast numerics in C++. (by cschreib)

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

C++ Data Analysis related posts

  • New multithreaded version of C++ DataFrame was released

    1 project | news.ycombinator.com | 13 Feb 2024
  • If you can't reproduce the model then it's not open-source

    2 projects | news.ycombinator.com | 17 Jan 2024
  • What software is used to generate plots/graphs like this seen in many particle physics papers?

    1 project | /r/PhysicsStudents | 10 Dec 2023
  • DataFrame: NEW Data - star count:2013.0

    1 project | /r/algoprojects | 21 Nov 2023
  • DataFrame: NEW Data - star count:2013.0

    1 project | /r/algoprojects | 20 Nov 2023
  • DataFrame: NEW Data - star count:2013.0

    1 project | /r/algoprojects | 19 Nov 2023
  • C++ DataFrame vs. Polars

    1 project | /r/cpp | 18 Nov 2023

Index

What are some of the best open-source Data Analysis projects in C++? This list will help you:

#   Project          Stars
1   cudf             7,333
2   matplotplusplus  3,965
3   root             2,425
4   DataFrame        2,280
5   datatable        1,790
6   TileDB           1,771
7   ArcticDB         1,123
8   oneDAL             593
9   gdl                265
10  AlphaPlot          238
11  volbx              231
12  Graphia            227
13  nebula             150
14  vinum               65
15  MachineLearning     17
16  vif                 11
