Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality. Learn more →
Top 16 C++ Data Analysis Projects
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
DataFrame
C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage
-
ArcticDB
ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
-
AlphaPlot
:chart_with_upwards_trend: Application for statistical analysis and data visualization which can generate different types of publication quality 2D and 3D plots with extensive visual customization.
-
vinum
Vinum is a SQL processor for Python, designed for data analysis workflows and in-memory analytics.
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
The interesting thing about Polars is that it does not try to be a drop-in replacement to pandas, like Dask, cuDF, or Modin, and instead has its own expressive API. Despite being a young project, it quickly got popular thanks to its easy installation process and its “lightning fast” performance.
cmake_minimum_required(VERSION 3.5) project(knn_cpp CXX) # Set up C++ version and properties include(CheckIncludeFileCXX) check_include_file_cxx(any HAS_ANY) check_include_file_cxx(string_view HAS_STRING_VIEW) check_include_file_cxx(coroutine HAS_COROUTINE) set(CMAKE_CXX_STANDARD 20) set(CMAKE_BUILD_TYPE Debug) set(CMAKE_CXX_STANDARD_REQUIRED ON) set(CMAKE_CXX_EXTENSIONS OFF) # Copy data file to build directory file(COPY ${CMAKE_CURRENT_SOURCE_DIR}/iris.data DESTINATION ${CMAKE_CURRENT_BINARY_DIR}) # Download library usinng FetchContent include(FetchContent) FetchContent_Declare(matplotplusplus GIT_REPOSITORY https://github.com/alandefreitas/matplotplusplus GIT_TAG origin/master) FetchContent_GetProperties(matplotplusplus) if(NOT matplotplusplus_POPULATED) FetchContent_Populate(matplotplusplus) add_subdirectory(${matplotplusplus_SOURCE_DIR} ${matplotplusplus_BINARY_DIR} EXCLUDE_FROM_ALL) endif() FetchContent_Declare( fmt GIT_REPOSITORY https://github.com/fmtlib/fmt.git GIT_TAG 7.1.3 # Adjust the version as needed ) FetchContent_MakeAvailable(fmt) # Add executable and link project libraries and folders add_executable(${PROJECT_NAME} main.cc) target_link_libraries(${PROJECT_NAME} PUBLIC matplot fmt::fmt) aux_source_directory(lib LIB_SRC) target_include_directories(${PROJECT_NAME} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}) target_sources(${PROJECT_NAME} PRIVATE ${LIB_SRC}) add_subdirectory(tests)
Project mention: If you can't reproduce the model then it's not open-source | news.ycombinator.com | 2024-01-17I think the process of data acquisition isn't so clear-cut. Take CERN as an example: they release loads of data from various experiments under the CC0 license [1]. This isn't just a few small datasets for classroom use; we're talking big-league data, like the entire first run data from LHCb [2].
On their portal, they don't just dump the data and leave you to it. They've got guides on analysis and the necessary tools (mostly open source stuff like ROOT [3] and even VMs). This means anyone can dive in. You could potentially discover something new or build on existing experiment analyses. This setup, with open data and tools, ticks the boxes for reproducibility. But does it mean people need to recreate the data themselves?
Ideally, yeah, but realistically, while you could theoretically rebuild the LHC (since most technical details are public), it would take an army of skilled people, billions of dollars, and years to do it.
This contrasts with open source models, where you can retrain models using data to get the weights. But getting hold of the data and the cost to reproduce the weights is usually prohibitive. I get that CERN's approach might seem to counter this, but remember, they're not releasing raw data (which is mostly noise), but a more refined version. Try downloading several petabytes of raw data if not; good luck with that. But for training something like a LLM, you might need the whole dataset, which in many cases have its own problems with copyrights…etc.
[1] https://opendata.cern.ch/docs/terms-of-use
[2] https://opendata.cern.ch/docs/lhcb-releases-entire-run1-data...
[3] https://root.cern/
Project mention: New multithreaded version of C++ DataFrame was released | news.ycombinator.com | 2024-02-13
Aside from that, there is a Python translation of data.table (see documentation here), which might be worth looking into. However, it hasn't had any major updates in a while: the last release 2 years ago ...
TileDB, Inc. | Full-Time | REMOTE | USA, Greece/EU | [https://tiledb.com](https://tiledb.com/)
TileDB has recently announced a $34 million Series B fund-raise and is actively hiring for engineers across a range of roles (SRE, backend/distributed systems, database internals, and more). You will have the opportunity to work on innovative technology that creates impact for challenging problems in genomics, geospatial, machine learning, distributed systems, and many other areas.
TileDB Cloud is the modern database, allowing developers and scientists to capture, analyze, and share any data with any tool. We build on a broad foundation of open source, maintaining the TileDB storage engine, libraries for genomics (single-cell and population), geospatial (raster, point clouds, and more), a TileDB visualization engine extending Babylon.js, and much more ([github.com/TileDB-Inc/TileDB](http://github.com/TileDB-Inc/TileDB))
With TileDB, all data — tables, genomics, images, videos, location, time-series — is captured as multi-dimensional arrays. To supercharge this data, TileDB Cloud implements a serverless infrastructure delivering query execution, access control, data and code sharing, and distributed computing at global scale — eliminating cluster management, minimizing TCO, and promoting scientific collaboration and reproducibility.
Website: [https://tiledb.com](https://tiledb.com/) | GitHub: https://github.com/TileDB-Inc/TileDB | Blog: https://tiledb.com/blog
We are actively hiring for several roles including:
- Site Reliability Engineer (k8s, Terraform, automation, Prometheus, CloudWatch, GitOps; Golang, Python)
ArcticDB is a new data store for pandas DataFrames (https://arcticdb.io/). I have no affiliation with the project but wanted to see how it would compare on speed versus the other file format storage options available in Pandas: HDF, Feather, and Parquet. I could not find much on-line about how Arctic compares to the other options in terms of speed, so I ran some tests myself.
Export the graph to GML or to GraphML or to GraphViz DOT or to some other Graph format. BTW I recommend 3D graph visualization over 2D when possible, that is when you're exploring interactively as opposed to printing figures. The Graphia tool is the only FOSS tool for this purpose that I know of:
https://graphia.app
https://github.com/graphia-app/graphia
Softmax is at the end of this source file: https://github.com/aromanro/MachineLearning/blob/master/MachineLearning/MachineLearning/ActivationFunctions.h
C++ Data Analysis related posts
-
New multithreaded version of C++ DataFrame was released
-
If you can't reproduce the model then it's not open-source
-
What software is used to generate plots/graphs like this seen in many particle physics papers?
-
DataFrame: NEW Data - star count:2013.0
-
DataFrame: NEW Data - star count:2013.0
-
DataFrame: NEW Data - star count:2013.0
-
C++ DataFrame vs. Polars
-
A note from our sponsor - InfluxDB
www.influxdata.com | 15 May 2024
Index
What are some of the best open-source Data Analysis projects in C++? This list will help you:
Project | Stars | |
---|---|---|
1 | cudf | 7,333 |
2 | matplotplusplus | 3,965 |
3 | root | 2,425 |
4 | DataFrame | 2,280 |
5 | datatable | 1,790 |
6 | TileDB | 1,771 |
7 | ArcticDB | 1,123 |
8 | oneDAL | 593 |
9 | gdl | 265 |
10 | AlphaPlot | 238 |
11 | volbx | 231 |
12 | Graphia | 227 |
13 | nebula | 150 |
14 | vinum | 65 |
15 | MachineLearning | 17 |
16 | vif | 11 |
Sponsored