marckeelingiv/pipeline
pipeline

Introduction

  • What is Data Engineering?
  • Why is it needed?
  • Two types of data engineers (A & B)

Chapter 1 (Data Generation and Transfer)

  1. Set up Environment
    • python -m venv ./env
    • .\env\Scripts\Activate.ps1
    • python.exe -m pip install --upgrade pip
    • python -m pip install -r .\requirements.txt
  2. Turn on MySQL (3306), ClickHouse (8123), Airbyte (8000) (start generating data)
    • docker compose -f .\00_infrastructure\DockerCompose.yaml up -d
    • change into the data generation folder: cd .\01_data_generation
    • run the seeding script: python -m data_gen_and_seed (then wait about 5 minutes)
    • docker compose -f .\00_infrastructure\airbyte\DockerCompose_airbyte.yaml up -d
  3. Show Relational Diagram
  4. Show how the data is generated
  5. Show data in source (connect to mysql using dbeaver)
  6. Create a connection from MySQL to ClickHouse through Airbyte
    1. log in to Airbyte at localhost:8000 (username: airbyte, password: password)
      • create database on clickhouse: CREATE DATABASE mysql_extracts;
      • make sure to land the data in a new schema called mysql_extracts in ClickHouse
    2. Talk about enabling binary logging on MySQL (required for Airbyte's CDC replication)
    3. Describe what Airbyte is
    4. Describe what ClickHouse is
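The binary-logging step above is what lets Airbyte read change events from MySQL. A sketch of the relevant my.cnf settings, assuming the compose file does not already set them (values are illustrative):

```ini
# my.cnf fragment (illustrative values)
[mysqld]
server-id        = 1
log_bin          = mysql-bin
binlog_format    = ROW     # CDC tooling requires row-based logging
binlog_row_image = FULL
```

Row-based logging records the actual before/after row values rather than the SQL statements, which is what change-data-capture replication needs.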
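Before moving on, it is worth confirming that all three services from step 2 are actually listening. A minimal sketch of a health check using only the Python standard library (the host and ports mirror the compose files above; the script name and structure are illustrative, not part of the repo):

```python
import socket

# Services started by the compose files above (MySQL 3306, ClickHouse 8123, Airbyte 8000).
SERVICES = {"MySQL": 3306, "ClickHouse": 8123, "Airbyte": 8000}

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "up" if port_open("localhost", port) else "down"
        print(f"{name} (port {port}): {state}")
```

If any service reports "down", re-run the docker compose commands before attempting the Airbyte connection.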

Chapter 2 (Organizing the Data)

  1. create the dbt repo shown in 02_transformation
    • dbt --version
    • dbt init dbt_visits
    • create profiles.yml file
    • create sources.yml file
    • add group to dbt_project.yml file
    • dbt debug
    • create base and intermediate folders
  2. dbt models
    • create models
    • show documentation
    • (optional) patient attributes model (v1 & v2)
    • (optional) visits joined (v1 & v2)
  3. dbt tests
    • what are they
    • quick demo

Chapter 3 (Orchestration)

  1. Create Dagster environment
    • cd into dbt_visits
    • dagster-dbt project scaffold --project-name my_dagster
    • move the my_dagster folder up one level: mv .\my_dagster\ ../
  2. Connect dbt to Dagster
    • change dbt_project_dir = Path(__file__).joinpath("..", "..", "..").resolve() TO dbt_project_dir = Path('..','dbt_visits').resolve()
    • turn on dagster
      • linux DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=1 dagster dev
      • windows $env:DAGSTER_DBT_PARSE_PROJECT_ON_LOAD = "1"; dagster dev
    • now let's add more models to dbt and watch Dagster pull them in
  3. Connect Airbyte to Dagster
  4. Set up report delivery schedules
  5. (future task) Set up Alerting
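The path change in step 2 can be illustrated with plain pathlib; the scaffolded default shown in the comment is based on dagster-dbt's generated layout and may differ by version:

```python
from pathlib import Path

# Scaffolded default: walk up from the generated module's own file location.
# dbt_project_dir = Path(__file__).joinpath("..", "..", "..").resolve()

# After moving my_dagster up one level, dbt_visits is a sibling directory,
# so a relative path from the working directory resolves correctly:
dbt_project_dir = Path("..", "dbt_visits").resolve()
print(dbt_project_dir)  # absolute path to the dbt project
```

Resolving to an absolute path up front avoids the path breaking when Dagster changes the working directory at load time.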

Things not shown here

  • Working with people you have no control over: communicating the vision and convincing them to grant the permissions needed to execute
  • Working with Data Science to figure out what data is important to them
  • How to secure your connections and data
  • How to deploy to production

Possible Changes/Improvements

  • Swap out ClickHouse for DuckDB
  • Improve dates in fake data
  • Set up notifications of failure

Link To Video Demo

Data Pipeline Implementation at Big Mountain Data and Dev Conference
