Top 23 Python Data Analysis Projects
-
Pandas
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
-
airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
-
ydata-profiling
1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
-
pandas-ai
Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.
-
cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
-
knowledge-repo
A next-generation curated knowledge sharing platform for data scientists and other technical professions.
-
Resume-Matcher
Resume Matcher is an open source, free tool to improve your resume. It works by using language models to compare and rank resumes with job descriptions.
-
AWS Data Wrangler
pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
-
igel
a delightful machine learning tool that allows you to train, test, and use models without writing code
Project mention: How to Build a Logistic Regression Model: A Spam-filter Tutorial | dev.to | 2024-05-05
Online Courses: Coursera, "Machine Learning" by Andrew Ng; edX, "Introduction to Machine Learning" by MIT
Tutorials: Scikit-learn documentation (https://scikit-learn.org/); Kaggle Learn (https://www.kaggle.com/learn)
Books: "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow" by Aurélien Géron; "The Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
By understanding the core concepts of logistic regression and its limitations, and by exploring further resources, you'll be well-equipped to navigate the exciting world of machine learning!
It's also possible for you to give a package an alias by using the as keyword. For instance, you could use the pandas package as pd like this:
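A minimal sketch of the aliasing syntax described above. The standard-library `statistics` module and the alias `st` are stand-ins chosen here so the snippet runs without extra installs; the pandas form mentioned in the text is shown as a comment and assumes pandas is installed.

```python
# `import <module> as <alias>` binds the module to a shorter name.
# `statistics` is used only for illustration; any importable module works.
import statistics as st

# The alias is then used exactly like the full module name.
mean_value = st.mean([1, 2, 3, 4])

# The pandas form referred to in the text (requires `pip install pandas`):
# import pandas as pd
# df = pd.DataFrame({"a": [1, 2, 3]})
```

The alias only changes the local name; the module that gets imported is the same.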
Project mention: Developing a Generic Streamlit UI to Test Amazon Bedrock Agents | dev.to | 2024-05-05
I decided to use Streamlit to build the UI as it is a popular and fitting choice. Streamlit is an open-source Python library used for building interactive web applications, especially for AI and data applications. Since the application code is written only in Python, it is easy to learn and build with.
gradio is a package developed to ease the development of app interfaces in Python and other languages (GitHub)
Project mention: How to Build a Chat App with Your Postgres Data using Agent Cloud | dev.to | 2024-05-13
AgentCloud uses Airbyte to build data pipelines, which allow us to split, chunk, and embed data from over 300 data sources, including Postgres.
Project mention: PandasAI is great but is there a more general library? | news.ycombinator.com | 2023-08-23
Project mention: Show HN: Use an "eraser" to clean data on flight without breaking your workflow | news.ycombinator.com | 2024-03-15
Project mention: [Research] Detecting Annotation Errors in Semantic Segmentation Data | /r/MachineLearning | 2023-11-05
We have freely open-sourced our new method for improving segmentation data, published a paper on the research behind it, and released a 5-min code tutorial. You can also read more in the blog if you'd like.
Project mention: A Comprehensive Guide for Building Rag-Based LLM Applications | news.ycombinator.com | 2023-09-13
This is a feature in many commercial products already, as well as open source libraries like PyOD. https://github.com/yzhao062/pyod
There's a plethora of undersampling and oversampling models you can try out. To avoid removing information from the dataset, you can focus on oversampling techniques. You can try imbalanced-learn or smote-variants. Given enough data, using fully synthetic data is also an option; you can check ydata-synthetic for it. Let us know how it turned out!
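To make the oversampling idea concrete, here is a rough sketch of plain random oversampling using only the standard library. It is not SMOTE and not the imbalanced-learn API; the function name `random_oversample` and the toy data are hypothetical, chosen purely for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class. No original row is removed."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw (with replacement) the number of rows this class is missing.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy imbalanced dataset: four rows of class 0, one row of class 1.
rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
```

Because the new rows are exact copies, this keeps all of the original information but adds nothing new; the libraries named above offer more refined variants that synthesize new samples instead.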
GitHub: https://github.com/srbhr/Resume-Matcher Website: https://www.resumematcher.fyi/ Discord: Resume Matcher's Discord Tech Stack: Python, NextJS, FastAPI, TypeScript
Project mention: Read files from s3 using Pandas/s3fs or AWS Data Wrangler? | /r/dataengineering | 2023-12-06
I had no problem with awswrangler (https://github.com/aws/aws-sdk-pandas). It supports reading and writing partitions, which was really helpful, and a few other optimizations that made it a great tool.
Project mention: Ask HN: Comment here about whatever you're passionate about at the moment | news.ycombinator.com | 2023-11-06
A resource recently shared in HN for running tech lovers https://github.com/yihong0618/running_page
Python Data Analysis related posts
-
The Birth of Parquet
-
The ultimate guide to creating a secure Python package
-
How to Build a Logistic Regression Model: A Spam-filter Tutorial
-
PDEP-13: The Pandas Logical Type System
-
Cold-(Brew) Outreach: Landing my first big client at a coffee shop
-
Pandas reset_index(): How To Reset Indexes in Pandas
-
The Design Philosophy of Great Tables (Software Package)
Index
What are some of the best open-source Data Analysis projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | scikit-learn | 58,265 |
2 | Pandas | 42,104 |
3 | streamlit | 32,051 |
4 | gradio | 29,400 |
5 | best-of-ml-python | 15,633 |
6 | airbyte | 14,217 |
7 | ydata-profiling | 12,085 |
8 | pandas-ai | 11,140 |
9 | pygwalker | 9,930 |
10 | statsmodels | 9,591 |
11 | mlcourse.ai | 9,454 |
12 | cleanlab | 8,719 |
13 | akshare | 8,445 |
14 | pyod | 7,994 |
15 | imbalanced-learn | 6,714 |
16 | knowledge-repo | 5,441 |
17 | Resume-Matcher | 4,546 |
18 | plotnine | 3,835 |
19 | AWS Data Wrangler | 3,811 |
20 | missingno | 3,771 |
21 | running_page | 3,301 |
22 | igel | 3,080 |
23 | sweetviz | 2,840 |