How do you deal with parallelising parts of an ML pipeline especially on Python?

This page summarizes the projects mentioned and recommended in the original post on /r/mlops

  • ploomber

    The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

  • Multiprocessing works well, but you probably need an abstraction on top to make it work reliably. For starters, it's best to use a pool of processes, because creating new ones is expensive. You also need to ensure that errors in the sub-processes are correctly displayed in the main process; otherwise, it becomes frustrating. Also, sub-processes sometimes get stuck, so you have to monitor them. I implemented something that takes care of all of that for a project I'm working on; it'll give you an idea of what it looks like (of course, you can use the framework as well, which lets you parallelize functions and notebooks).
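
    The pattern described above (a reusable pool, error propagation back to the main process, and a guard against stuck workers) can be sketched with the standard library alone. This is a minimal illustration, not Ploomber's actual implementation; all names here are made up for the example:

    ```python
    import multiprocessing as mp
    import traceback

    def _safe_call(fn, *args):
        # Run the task and return (ok, payload): the result on success, or
        # the formatted traceback on failure. Sending the traceback back as
        # a string means the error is displayed in the main process instead
        # of dying silently inside the worker.
        try:
            return True, fn(*args)
        except Exception:
            return False, traceback.format_exc()

    def run_in_pool(fn, items, processes=4, timeout=30):
        # Reuse one pool for all tasks instead of spawning a process per
        # task; .get(timeout=...) keeps a stuck sub-process from hanging
        # the main process forever.
        with mp.Pool(processes=processes) as pool:
            pending = [pool.apply_async(_safe_call, (fn, x)) for x in items]
            return [p.get(timeout=timeout) for p in pending]

    def square(x):
        return x * x

    if __name__ == "__main__":
        for ok, value in run_in_pool(square, range(5)):
            print("ok:" if ok else "failed:", value)
    ```

    Returning the traceback as a string also sidesteps the fact that some exception objects don't pickle cleanly across process boundaries.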

  • debuglater

    Store Python traceback for later debugging. 🐛

  • Finally, debugging. If you're running code in sub-processes, debugging becomes a real pain because, out of the box, you won't be able to start a debugger in the sub-processes. Furthermore, there's a chance that more than one fails. One solution is to dump the traceback when any sub-process fails, so you can start a debugging session afterward; look at this project for an example.
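
    A minimal, stdlib-only version of that idea is sketched below (debuglater itself does more, such as letting you resume a debugger session from the stored file; the function names here are hypothetical):

    ```python
    import pickle
    import sys
    import traceback

    def dump_failure(path="worker_failure.dump"):
        # Call this from inside an `except` block. It captures the formatted
        # traceback plus a repr() of each frame's locals (repr keeps the dump
        # picklable even when the live objects are not) and writes everything
        # to disk for a post-mortem session later.
        _, _, tb = sys.exc_info()
        frames = []
        while tb is not None:
            frame = tb.tb_frame
            frames.append({
                "function": frame.f_code.co_name,
                "file": frame.f_code.co_filename,
                "line": tb.tb_lineno,
                "locals": {k: repr(v) for k, v in frame.f_locals.items()},
            })
            tb = tb.tb_next
        with open(path, "wb") as fh:
            pickle.dump({"traceback": traceback.format_exc(),
                         "frames": frames}, fh)
        return path

    def worker(x):
        try:
            return 1 / x
        except Exception:
            dump_failure()  # each failing sub-process leaves its own dump
            raise
    ```

    Because each sub-process writes its own file, you can inspect every failure after the run, even when several workers fail at once.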

  • mpire

    A Python package for easy multiprocessing, but faster than multiprocessing

  • https://github.com/Slimmer-AI/mpire is a nice lib, with better performance than multiprocessing.

  • orchest

    Build data pipelines, the easy way 🛠️

  • We automatically provide container level parallelism in Orchest: https://github.com/orchest/orchest

NOTE: The mention count for each project combines mentions in common posts and user-suggested alternatives, so a higher number means a more popular project.


Related posts

  • Decent low code options for orchestration and building data flows?

    1 project | /r/dataengineering | 23 Dec 2022
  • Build ML workflows with Jupyter notebooks

    1 project | /r/programming | 23 Dec 2022
  • Building container images in Kubernetes, how would you approach it?

    2 projects | /r/kubernetes | 6 Dec 2022
  • Ideas for infrastructure and tooling to use for frequent model retraining?

    1 project | /r/mlops | 9 Sep 2022
  • Looking for a mentor in MLOps. I am a lead developer.

    1 project | /r/mlops | 25 Aug 2022