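Top 23 Python Data Analysis Projects

Each entry below lists, in order: number of mentions, GitHub stars, activity score, and primary language.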

scikit-learn

82 58,265 9.9 Python

scikit-learn: machine learning in Python

Project mention: How to Build a Logistic Regression Model: A Spam-filter Tutorial | dev.to | 2024-05-05

Online Courses:
  Coursera: "Machine Learning" by Andrew Ng
  edX: "Introduction to Machine Learning" by MIT
Tutorials:
  Scikit-learn documentation: https://scikit-learn.org/
  Kaggle Learn: https://www.kaggle.com/learn
Books:
  "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow" by Aurélien Géron
  "The Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman

By understanding the core concepts of logistic regression and its limitations, and by exploring further resources, you'll be well equipped to navigate the exciting world of machine learning!

Pandas

399 42,104 10.0 Python

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Project mention: The ultimate guide to creating a secure Python package | dev.to | 2024-05-08

It's also possible for you to give a package an alias by using the as keyword. For instance, you could use the pandas package as pd like this:
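The snippet from the quoted post is not reproduced in this excerpt. A minimal sketch of the aliasing it describes, assuming pandas is installed, might look like the following. The DataFrame contents are illustrative only, borrowed from the star counts in this list.

```python
# Import the pandas package under the conventional short alias "pd".
import pandas as pd

# Illustrative data only: two projects and their star counts from this list.
df = pd.DataFrame(
    {"project": ["scikit-learn", "pandas"], "stars": [58265, 42104]}
)

# From here on, the alias stands in for the full package name.
print(df["stars"].max())
```

The alias is purely a local name: `pd` and `pandas` refer to the same module, so `pd.DataFrame` is the same class as `pandas.DataFrame`.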

streamlit

258 32,051 9.8 Python

Streamlit — A faster way to build and share data apps.

Project mention: Developing a Generic Streamlit UI to Test Amazon Bedrock Agents | dev.to | 2024-05-05

I decided to use Streamlit to build the UI as it is a popular and fitting choice. Streamlit is an open-source Python library used for building interactive web applications, especially for AI and data applications. Since the application code is written only in Python, it is easy to learn and build with.

gradio

116 29,400 9.9 Python

Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!

Project mention: AI enthusiasm #9 - A multilingual chatbot📣🈸 | dev.to | 2024-05-01

gradio is a package developed to ease the development of app interfaces in Python and other languages (GitHub)

best-of-ml-python

16 15,633 7.8 Python

🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.

airbyte

140 14,217 10.0 Python

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

Project mention: How to Build a Chat App with Your Postgres Data using Agent Cloud | dev.to | 2024-05-13

AgentCloud uses Airbyte to build data pipelines, which allow us to split, chunk, and embed data from over 300 data sources, including Postgres.

ydata-profiling

43 12,085 8.5 Python

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.

Project mention: FLaNK 25 December 2023 | dev.to | 2023-12-26

pandas-ai

14 11,140 9.8 Python

Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.

Project mention: PandasAI is great but is there a more general library? | news.ycombinator.com | 2023-08-23

pygwalker

22 9,930 9.6 Python

PyGWalker: Turn your pandas dataframe into an interactive UI for visual analysis

Project mention: Show HN: Use an "eraser" to clean data on flight without breaking your workflow | news.ycombinator.com | 2024-03-15

statsmodels

8 9,591 9.4 Python

Statsmodels: statistical modeling and econometrics in Python

mlcourse.ai

85 9,454 3.4 Python

Open Machine Learning Course

Project mention: Open Machine Learning Course | news.ycombinator.com | 2023-10-22

cleanlab

69 8,719 9.4 Python

The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

Project mention: [Research] Detecting Annotation Errors in Semantic Segmentation Data | /r/MachineLearning | 2023-11-05

We have freely open-sourced our new method for improving segmentation data, published a paper on the research behind it, and released a 5-min code tutorial. You can also read more in the blog if you'd like.

akshare

0 8,445 9.8 Python

AKShare is an elegant and simple financial data interface library for Python, built for human beings! An open-source financial data interface library (by akfamily)

pyod

7 7,994 7.5 Python

A Comprehensive and Scalable Python Library for Outlier Detection (Anomaly Detection)

Project mention: A Comprehensive Guide for Building Rag-Based LLM Applications | news.ycombinator.com | 2023-09-13

This is a feature in many commercial products already, as well as open source libraries like PyOD. https://github.com/yzhao062/pyod

imbalanced-learn

1 6,714 7.5 Python

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

Project mention: What’s your approach to highly imbalanced data sets? | /r/datascience | 2023-05-26

There's a plethora of undersampling and oversampling models you can try out. To avoid removing information from the dataset, you can focus on oversampling techniques. You can try imbalanced-learn or smote-variants. Given enough data, using fully synthetic data is also an option; you can check ydata-synthetic for it. Let us know how it turned out!
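To make the oversampling idea in the comment above concrete, here is a minimal plain-Python sketch of random oversampling: minority-class rows are duplicated at random until every class is as large as the biggest one, so no original row is dropped. This is not imbalanced-learn's API and it is not SMOTE (which synthesizes new points rather than duplicating existing ones). The function name `random_oversample` and the toy data are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    # Start from the full original data so no information is removed.
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw (with replacement) as many extra rows as this class is short.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical toy data: six majority samples (label 0), two minority samples (label 1).
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [2.0], [2.1]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

In practice, resampling is usually applied only to the training split, so that duplicated rows do not leak into the evaluation data.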

knowledge-repo

2 5,441 4.1 Python

A next-generation curated knowledge sharing platform for data scientists and other technical professions.

Resume-Matcher

8 4,546 8.7 Python

Resume Matcher is an open source, free tool to improve your resume. It works by using language models to compare and rank resumes with job descriptions.

Project mention: Hacktoberfest 2023: The Complete Guide | dev.to | 2023-09-22

GitHub: https://github.com/srbhr/Resume-Matcher
Website: https://www.resumematcher.fyi/
Discord: Resume Matcher's Discord
Tech Stack: Python, NextJS, FastAPI, TypeScript

plotnine

36 3,835 9.6 Python

A Grammar of Graphics for Python

Project mention: FLaNK AI Weekly 18 March 2024 | dev.to | 2024-03-18

AWS Data Wrangler

9 3,811 9.4 Python

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

Project mention: Read files from s3 using Pandas/s3fs or AWS Data Wrangler? | /r/dataengineering | 2023-12-06

I had no problem with awswrangler (https://github.com/aws/aws-sdk-pandas). It supports reading and writing partitions, which was really helpful, along with a few other optimizations that made it a great tool.

missingno

5 3,771 1.9 Python

Missing data visualization module for Python.

running_page

3 3,301 9.0 Python

Make your own running home page

Project mention: Ask HN: Comment here about whatever you're passionate about at the moment | news.ycombinator.com | 2023-11-06

A resource recently shared on HN for running tech lovers: https://github.com/yihong0618/running_page

igel

11 3,080 1.1 Python

A delightful machine learning tool that allows you to train, test, and use models without writing code

sweetviz

1 2,840 6.7 Python

Visualize and compare datasets, target values and associations, with one line of code.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Data Analysis related posts

The Birth of Parquet
3 projects | news.ycombinator.com | 8 May 2024

The ultimate guide to creating a secure Python package
4 projects | dev.to | 8 May 2024

How to Build a Logistic Regression Model: A Spam-filter Tutorial
1 project | dev.to | 5 May 2024

PDEP-13: The Pandas Logical Type System
1 project | news.ycombinator.com | 4 May 2024

Cold-(Brew) Outreach: Landing my first big client at a coffee shop
1 project | news.ycombinator.com | 30 Apr 2024

Pandas reset_index(): How To Reset Indexes in Pandas
1 project | dev.to | 27 Apr 2024

The Design Philosophy of Great Tables (Software Package)
7 projects | news.ycombinator.com | 4 Apr 2024

Index

What are some of the best open-source Data Analysis projects in Python? This list will help you:

#	Project	Stars
1	scikit-learn	58,265
2	Pandas	42,104
3	streamlit	32,051
4	gradio	29,400
5	best-of-ml-python	15,633
6	airbyte	14,217
7	ydata-profiling	12,085
8	pandas-ai	11,140
9	pygwalker	9,930
10	statsmodels	9,591
11	mlcourse.ai	9,454
12	cleanlab	8,719
13	akshare	8,445
14	pyod	7,994
15	imbalanced-learn	6,714
16	knowledge-repo	5,441
17	Resume-Matcher	4,546
18	plotnine	3,835
19	AWS Data Wrangler	3,811
20	missingno	3,771
21	running_page	3,301
22	igel	3,080
23	sweetviz	2,840
