How do you deal with parallelising parts of an ML pipeline especially on Python?

This page summarizes the projects mentioned and recommended in the original post on /r/mlops

  • ploomber

    The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

  • Multiprocessing works well, but you probably need an abstraction on top to make it work reliably. For starters, it's best to use a pool of processes, because creating new ones is expensive. You also need to ensure that errors in the sub-processes are correctly displayed in the main process; otherwise, it becomes frustrating. Also, sub-processes sometimes get stuck, so you have to monitor them. I implemented something that takes care of all of that for a project I'm working on; it'll give you an idea of what it looks like (of course, you can use the framework as well, which lets you parallelize functions and notebooks).
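
    The pattern described above (a reusable pool, error propagation back to the main process, and a guard against stuck workers) can be sketched with the standard library alone. This is a minimal illustration, not Ploomber's actual implementation; all names here are made up for the example:

    ```python
    import multiprocessing as mp
    import traceback

    def _safe_call(fn, *args):
        # Run the task and return (ok, payload): the result on success, or
        # the formatted traceback on failure. Sending the traceback back as
        # a string means the error is displayed in the main process instead
        # of dying silently inside the worker.
        try:
            return True, fn(*args)
        except Exception:
            return False, traceback.format_exc()

    def run_in_pool(fn, items, processes=4, timeout=30):
        # Reuse one pool for all tasks instead of spawning a process per
        # task; .get(timeout=...) keeps a stuck sub-process from hanging
        # the main process forever.
        with mp.Pool(processes=processes) as pool:
            pending = [pool.apply_async(_safe_call, (fn, x)) for x in items]
            return [p.get(timeout=timeout) for p in pending]

    def square(x):
        return x * x

    if __name__ == "__main__":
        for ok, value in run_in_pool(square, range(5)):
            print("ok:" if ok else "failed:", value)
    ```

    Returning the traceback as a string also sidesteps the fact that some exception objects don't pickle cleanly across process boundaries.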

  • debuglater

    Store Python traceback for later debugging. 🐛

  • Finally, debugging. If you're running code in sub-processes, debugging becomes a real pain because, out of the box, you won't be able to start a debugger in the sub-processes. Furthermore, there's a chance that more than one fails. One solution is to dump the traceback when any sub-process fails, so you can start a debugging session afterward; look at this project for an example.
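
    A minimal, stdlib-only version of that idea is sketched below (debuglater itself does more, such as letting you resume a debugger session from the stored file; the function names here are hypothetical):

    ```python
    import pickle
    import sys
    import traceback

    def dump_failure(path="worker_failure.dump"):
        # Call this from inside an `except` block. It captures the formatted
        # traceback plus a repr() of each frame's locals (repr keeps the dump
        # picklable even when the live objects are not) and writes everything
        # to disk for a post-mortem session later.
        _, _, tb = sys.exc_info()
        frames = []
        while tb is not None:
            frame = tb.tb_frame
            frames.append({
                "function": frame.f_code.co_name,
                "file": frame.f_code.co_filename,
                "line": tb.tb_lineno,
                "locals": {k: repr(v) for k, v in frame.f_locals.items()},
            })
            tb = tb.tb_next
        with open(path, "wb") as fh:
            pickle.dump({"traceback": traceback.format_exc(),
                         "frames": frames}, fh)
        return path

    def worker(x):
        try:
            return 1 / x
        except Exception:
            dump_failure()  # each failing sub-process leaves its own dump
            raise
    ```

    Because each sub-process writes its own file, you can inspect every failure after the run, even when several workers fail at once.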

  • mpire

    A Python package for easy multiprocessing, but faster than multiprocessing

  • https://github.com/Slimmer-AI/mpire is a nice lib, with better performance than multiprocessing.

  • orchest

    Build data pipelines, the easy way 🛠️

  • We automatically provide container level parallelism in Orchest: https://github.com/orchest/orchest

NOTE: The mention count for each project combines mentions in common posts and user-suggested alternatives, so a higher number means a more popular project.


Related posts

  • Decent low code options for orchestration and building data flows?

    1 project | /r/dataengineering | 23 Dec 2022
  • Build ML workflows with Jupyter notebooks

    1 project | /r/programming | 23 Dec 2022
  • Building container images in Kubernetes, how would you approach it?

    2 projects | /r/kubernetes | 6 Dec 2022
  • Ideas for infrastructure and tooling to use for frequent model retraining?

    1 project | /r/mlops | 9 Sep 2022
  • Looking for a mentor in MLOps. I am a lead developer.

    1 project | /r/mlops | 25 Aug 2022