Completed my first Data Engineering project with Kafka, Spark, GCP, Airflow, dbt, Terraform, Docker and more!

This page summarizes the projects mentioned and recommended in the original post on /r/dataengineering.

  • streamify

    A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

  • I've documented it in the Git repository itself. The documentation is slightly more focused on the setup, but you can still get an idea of the data flow.

  • eventsim

    Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.

  • Eventsim is a program that generates event data to replicate page requests for a fake music web site. The results look like real use data, but are totally fake. The docker image is borrowed from viirya's fork of it, as the original project has gone without maintenance for a few years now.

  • eventsim

    Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic. (by viirya)

  • terraform

    Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.

  • Infrastructure as Code software - Terraform

  • CPython

    The Python programming language

  • Language - Python

  • ApacheKafka

    A curated list of resources for awesome Apache Kafka

  • Stream Processing - Kafka, Spark Streaming

  • Docker Compose

    Define and run multi-container applications with Docker

  • Containerization - Docker, Docker Compose

  • nodejs-bigquery

    Node.js client for Google Cloud BigQuery: A fast, economical and fully-managed enterprise data warehouse for large-scale data analytics.

  • Data Warehouse - BigQuery

  • Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

  • Orchestration - Airflow

  • spark-bigquery-connector

    BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.

  • corp

    Assets related to the operation of Fishtown Analytics.

  • Just a slight critique, but I noticed some of the dbt models are a bit hard to read, especially your dim_users SCD2 model, which uses lots of nested subqueries and puts multiple columns on the same line. You may want to refer to this style guide from dbt Labs. I find CTEs a lot easier to parse and read.
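
To illustrate the readability point in the last comment, here is a minimal sketch that builds the same SCD2-style result twice: once with a nested (correlated) subquery and once with CTEs. It is not the project's actual dim_users model. The table `user_changes`, its columns, and the sample rows are hypothetical, and it runs on SQLite through Python's standard library rather than on dbt and BigQuery, purely so it can be executed anywhere.

```python
import sqlite3

# Hypothetical change log: one row per change of a user's subscription level.
ROWS = [
    (1, "free", "2021-01-01"),
    (1, "paid", "2021-02-01"),
    (2, "free", "2021-01-15"),
]

# Nested style: the logic for valid_to is buried inside the select list.
NESTED_SQL = """
SELECT user_id, level, changed_at AS valid_from,
       (SELECT MIN(n.changed_at) FROM user_changes n
         WHERE n.user_id = c.user_id AND n.changed_at > c.changed_at) AS valid_to
FROM user_changes c
ORDER BY user_id, valid_from
"""

# CTE style: each step is named, one column per line, final select at the end.
CTE_SQL = """
WITH changes AS (
    SELECT
        user_id,
        level,
        changed_at
    FROM user_changes
),
next_change AS (
    SELECT
        c.user_id,
        c.changed_at,
        MIN(n.changed_at) AS next_changed_at
    FROM changes AS c
    LEFT JOIN changes AS n
        ON n.user_id = c.user_id
        AND n.changed_at > c.changed_at
    GROUP BY c.user_id, c.changed_at
),
final AS (
    SELECT
        c.user_id,
        c.level,
        c.changed_at AS valid_from,
        nc.next_changed_at AS valid_to
    FROM changes AS c
    INNER JOIN next_change AS nc
        ON nc.user_id = c.user_id
        AND nc.changed_at = c.changed_at
)
SELECT * FROM final
ORDER BY user_id, valid_from
"""


def build_dim_users(sql):
    """Load the sample rows into an in-memory database and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_changes (user_id INTEGER, level TEXT, changed_at TEXT)"
    )
    conn.executemany("INSERT INTO user_changes VALUES (?, ?, ?)", ROWS)
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in build_dim_users(CTE_SQL):
        print(row)
```

Both queries return the same rows, with `valid_to` left as NULL for each user's current record. The CTE version is longer, but each intermediate step has a name and can be inspected on its own, which is the property the comment is pointing at.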
