Python Data Mining

Open-source Python projects categorized as Data Mining

Top 23 Python Data Mining Projects

  • ML-From-Scratch

    Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

  • EasyOCR

    Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

  • Project mention: I built an online PDF management platform using open-source software | news.ycombinator.com | 2024-05-12

    Ok on cleaned aligned data, but there are a few newer ones like EasyOCR [0] that can deal with much less organized text (albeit more slowly)

    [0] https://github.com/JaidedAI/EasyOCR

  • gensim

    Topic Modelling for Humans

  • Project mention: Aggregating news from different sources | /r/learnprogramming | 2023-07-08

  • pyod

    A Comprehensive and Scalable Python Library for Outlier Detection (Anomaly Detection)

  • Project mention: A Comprehensive Guide for Building Rag-Based LLM Applications | news.ycombinator.com | 2023-09-13

    This is a feature in many commercial products already, as well as open source libraries like PyOD. https://github.com/yzhao062/pyod

  • anomaly-detection-resources

    Anomaly detection related books, papers, videos, and toolboxes

  • Project mention: anomaly-detection-resources: NEW Extended Research - star count:7507.0 | /r/algoprojects | 2023-10-24

  • catboost

    A fast, scalable, high-performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression, and other machine learning tasks for Python, R, Java, and C++. Supports computation on CPU and GPU.

  • Project mention: CatBoost: Open-source gradient boosting library | news.ycombinator.com | 2024-03-05

  • sktime

    A unified framework for machine learning with time series

  • orange

    🍊 📊 💡 Orange: Interactive data analysis

  • Project mention: Hierarchical Clustering | news.ycombinator.com | 2024-04-20

    I know I've tooted its horn before, but Orange3 is a pretty neat Python-based GUI platform that makes this and a metric buttload of other statistical/ML techniques available to non-programmer types.

    Just watch out for the null character `\x00` in the corpus. That always seems to kill it stone dead.

    https://orangedatamining.com/

    https://orange3.readthedocs.io/projects/orange-visual-progra...

  • pdftabextract

    A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

  • invoice2data

    Extract structured data from PDF invoices

  • awesome-fraud-detection-papers

    A curated list of data mining papers about fraud detection.

  • pycm

    Multi-class confusion matrix library in Python

  • Project mention: PyCM 4.0 Released: Multilabel Confusion Matrix Support | /r/coolgithubprojects | 2023-06-07

  • CleverCSV

    CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.

  • deep_gcns_torch

    PyTorch repo for DeepGCNs (ICCV'2019 Oral, TPAMI'2021), DeeperGCN (arXiv'2020), and GNN1000 (ICML'2021): https://www.deepgcns.org

  • nfstream

    NFStream: a Flexible Network Data Analysis Framework.

  • aeon

    A toolkit for machine learning from time series

  • Project mention: FLaNK 15 Jan 2024 | dev.to | 2024-01-15

  • ADBench

    Official implementation of "ADBench: Anomaly Detection Benchmark", NeurIPS 2022.

  • UnityPy

    UnityPy is a Python module that makes it possible to extract/unpack and edit Unity assets.

  • PyPOTS

    A Python toolkit/library for reality-centric machine/deep learning and data mining on partially-observed time series. It includes SOTA neural network models for the scientific analysis tasks of imputation, classification, clustering, forecasting, and anomaly detection on incomplete industrial (irregularly sampled) multivariate time series with NaN missing values.

  • Project mention: Missing values in time series collected from the real world are common to see and very pesky. A new state-of-the-art and fast neural network called SAITS is proposed to impute missing data in partially-observed multivariate time series. The code is open source on GitHub. | /r/datascience | 2023-06-28

    Oh, wow, thanks for sharing it here! PyPOTS still has a long way to go, and I'm making it better. If you have any suggestions for PyPOTS, please let me know. Your feedback is always welcome and means a lot to the community of PyPOTS! If you like PyPOTS, please star 🌟 PyPOTS repo on GitHub and share it with people you know who may need it to help others notice this helpful work. Thank you very much!

  • pm4py-core

    Public repository for the PM4Py (Process Mining for Python) project.

  • ail-framework

    AIL framework - Analysis Information Leak framework

  • Project mention: Ask HN: Show me your half baked project | news.ycombinator.com | 2023-10-12

    First time coming across this, looks very cool! Definitely some ideas there that I'd like to implement for osintbuddy. Another project I'm going to be taking some ideas from is: https://github.com/ail-project/ail-framework - a modular framework to analyse potential information leaks

  • matrixprofile

    A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms, accessible to everyone.

  • grimoirelab-perceval

    Send Sir Perceval on a quest to retrieve and gather data from software repositories.
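
The CleverCSV entry above describes improved dialect detection over the built-in csv module. For context, here is a minimal sketch of the built-in baseline it is compared against, using the standard library's `csv.Sniffer`. It does not call CleverCSV itself; the sample data, the helper name `read_with_detected_dialect`, and the candidate delimiter set are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample: a small semicolon-delimited file held in memory.
SAMPLE = "name;age;city\nAlice;30;Paris\nBob;25;Berlin\n"


def read_with_detected_dialect(text, candidates=";,\t|"):
    """Guess the CSV dialect with the standard library, then parse the text.

    `candidates` limits the delimiters the sniffer may choose from.
    """
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    rows = list(csv.reader(io.StringIO(text), dialect))
    return dialect.delimiter, rows


delimiter, rows = read_with_detected_dialect(SAMPLE)
print(delimiter)
print(rows)
```

The standard-library sniffer works on tidy input like this; messier files are the case the CleverCSV description says it targets.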

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source Data Mining projects in Python? This list will help you:

Rank  Project                          Stars
   1  ML-From-Scratch                 23,287
   2  EasyOCR                         22,312
   3  gensim                          15,311
   4  pyod                             8,013
   5  anomaly-detection-resources      7,937
   6  catboost                         7,795
   7  sktime                           7,465
   8  orange                           4,648
   9  pdftabextract                    2,152
  10  invoice2data                     1,708
  11  awesome-fraud-detection-papers   1,559
  12  pycm                             1,432
  13  CleverCSV                        1,225
  14  deep_gcns_torch                  1,118
  15  nfstream                         1,048
  16  aeon                               821
  17  ADBench                            789
  18  UnityPy                            733
  19  PyPOTS                             728
  20  pm4py-core                         664
  21  ail-framework                      511
  22  matrixprofile                      356
  23  grimoirelab-perceval               285
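
Each row of the table above follows a rank, project name, star count pattern, so it can be read programmatically. The following is a minimal sketch that parses a few rows copied from the table and checks the ordering rule stated in the note (descending stars). The names `ROW_PATTERN` and `parse_rows` are illustrative, not part of any listed project.

```python
import re

# A few rows copied from the table; the full table can be pasted in the same way.
ROWS = """\
1 ML-From-Scratch 23,287
2 EasyOCR 22,312
3 gensim 15,311
23 grimoirelab-perceval 285
"""

# rank, project name (no spaces), star count with optional thousands separators
ROW_PATTERN = re.compile(r"^\s*(\d+)\s+(\S+)\s+([\d,]+)\s*$")


def parse_rows(text):
    """Return a list of (rank, project, stars) tuples from table text."""
    parsed = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line)
        if match is None:
            continue  # skip header and blank lines
        rank, project, stars = match.groups()
        parsed.append((int(rank), project, int(stars.replace(",", ""))))
    return parsed


projects = parse_rows(ROWS)
# The list is described as ordered by stars, so counts should never increase.
is_sorted = all(a[2] >= b[2] for a, b in zip(projects, projects[1:]))
print(projects[0], is_sorted)
```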
