- What is Data Engineering?
- Why is it needed?
- Two types of data engineers (A & B)
- Set up Environment
python -m venv ./env
.\env\Scripts\Activate.ps1
python.exe -m pip install --upgrade pip
python -m pip install -r .\requirements.txt
- Turn on MySQL (3306), ClickHouse (8123), and Airbyte (8000), then start generating data
- docker compose -f .\00_infrastructure\DockerCompose.yaml up -d
- go into data generation folder
cd .\01_data_generation
- run the python script to seed the data
python -m data_gen_and_seed
- (then wait for 5 min)
- docker compose -f .\00_infrastructure\airbyte\DockerCompose_airbyte.yaml up -d
- Show Relational Diagram
- Show how the data is generated
- Show data in source (connect to mysql using dbeaver)
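The real generator lives in 01_data_generation; as a talking point, a minimal sketch of the idea, assuming the Faker library and hypothetical patients/visits tables (not the repo's actual schema):

```python
# illustrative only -- table and column names here are assumptions
from faker import Faker

fake = Faker()

def make_patient(patient_id: int) -> dict:
    return {
        "id": patient_id,
        "name": fake.name(),
        "date_of_birth": fake.date_of_birth(minimum_age=18, maximum_age=90),
    }

def make_visit(visit_id: int, patient_id: int) -> dict:
    return {
        "id": visit_id,
        "patient_id": patient_id,
        "visit_date": fake.date_between(start_date="-1y", end_date="today"),
    }

if __name__ == "__main__":
    patients = [make_patient(i) for i in range(1, 11)]
    visits = [make_visit(i, fake.random_int(1, 10)) for i in range(1, 51)]
    print(patients[0], visits[0])
```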
- Create connection from Mysql to ClickHouse through Airbyte
- login to airbyte localhost:8000 username: airbyte password: password
- create database on clickhouse:
CREATE DATABASE mysql_extracts;
- make sure to land the data in a new schema called mysql_extracts in ClickHouse
- Talk about enabling binary logging on MySQL
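Airbyte's MySQL CDC source reads the binary log, so it helps to confirm it is on and row-based before creating the connection (these are checks only; the actual settings live in my.cnf or the compose file):

```sql
-- binary logging is on by default in MySQL 8, but CDC needs row-based logging
SHOW VARIABLES LIKE 'log_bin';          -- expect ON
SHOW VARIABLES LIKE 'binlog_format';    -- expect ROW
SHOW VARIABLES LIKE 'binlog_row_image'; -- expect FULL
```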
- describe what airbyte is
- describe what ClickHouse is
- login to airbyte localhost:8000 username: airbyte password: password
- create the dbt repo shown in 02_transformation
dbt --version
dbt init dbt_visits
- create profiles.yml file
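A minimal profiles.yml sketch, assuming the dbt-clickhouse adapter and the local ClickHouse from the compose file (credentials and target schema are placeholders):

```yaml
dbt_visits:
  target: dev
  outputs:
    dev:
      type: clickhouse
      host: localhost
      port: 8123
      user: default
      password: ""
      schema: dbt_visits
```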
- create sources.yml file
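And a sources.yml sketch pointing dbt at the tables Airbyte lands in mysql_extracts (the table names are assumptions):

```yaml
version: 2

sources:
  - name: mysql_extracts
    schema: mysql_extracts
    tables:
      - name: patients
      - name: visits
```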
- add group to dbt_project.yml file
- dbt debug
- create base and intermediate folders
- dbt models
- create models
- show documentation
- (optional)
- patient attributes model (v1 & v2)
- visits joined (v1 & v2)
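Before moving on to tests, a sketch of what one of the base models above might look like (column names are assumptions; the intermediate and v1/v2 models build on selects like this):

```sql
-- models/base/base_visits.sql (illustrative)
select
    id as visit_id,
    patient_id,
    visit_date
from {{ source('mysql_extracts', 'visits') }}
```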
- dbt tests
- what are they
- quick demo
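For the quick demo, generic tests are declared in a YAML file next to the models; a minimal sketch (model and column names are assumptions):

```yaml
version: 2

models:
  - name: base_visits
    columns:
      - name: visit_id
        tests:
          - unique
          - not_null
```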
- Create Dagster environment
- cd into dbt_visits
dagster-dbt project scaffold --project-name my_dagster
- move my_dagster folder up one level
mv .\my_dagster\ ../
- Connect dbt to Dagster
- change
dbt_project_dir = Path(__file__).joinpath("..", "..", "..").resolve()
- to
dbt_project_dir = Path('..', 'dbt_visits').resolve()
- turn on dagster
- linux
DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=1 dagster dev
- windows
$env:DAGSTER_DBT_PARSE_PROJECT_ON_LOAD = "1"; dagster dev
- now let's add more models to dbt and watch Dagster pull them in
- Connect Airbyte to Dagster
- Create an airbyte resource (https://docs.dagster.io/concepts/resources & https://docs.dagster.io/integrations/airbyte#using-airbyte-with-dagster)
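A minimal sketch of wiring Airbyte into Dagster, assuming the dagster-airbyte package and the local instance started earlier (host, port, and credentials as above; exact API varies by version):

```python
from dagster_airbyte import AirbyteResource, load_assets_from_airbyte_instance

# points Dagster at the local Airbyte instance from the compose file
airbyte_instance = AirbyteResource(
    host="localhost",
    port="8000",
    username="airbyte",
    password="password",
)

# turns every enabled Airbyte connection into Dagster assets upstream of the dbt models
airbyte_assets = load_assets_from_airbyte_instance(airbyte_instance)
```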
- Set up report delivery schedules
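Report delivery can start as a plain Dagster schedule over the dbt assets; a sketch (job name and cron expression are placeholders), with both objects then registered in the project's Definitions alongside the assets:

```python
from dagster import AssetSelection, ScheduleDefinition, define_asset_job

# materialize every asset in the project once a day at 06:00
daily_report_job = define_asset_job(
    name="daily_report_job",
    selection=AssetSelection.all(),
)

daily_report_schedule = ScheduleDefinition(
    job=daily_report_job,
    cron_schedule="0 6 * * *",
)
```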
- (future task) Set up Alerting
- Working with people you have no control over to communicate the vision and convince them to give you the permissions needed to execute
- Working with Data Science to figure out what data is important to them
- How to Secure your connections and data
- How to deploy to production
- Swap out ClickHouse for DuckDB
- Improve dates in fake data
- Set up notifications of failure
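When failure notifications get picked up, a Dagster run-failure sensor is one option; a minimal sketch that only logs for now (swap the log call for Slack or email once alerting is chosen):

```python
import logging

from dagster import RunFailureSensorContext, run_failure_sensor

@run_failure_sensor
def notify_on_failure(context: RunFailureSensorContext):
    # replace this log line with a Slack/email call once alerting is set up
    logging.getLogger("dbt_visits").error(
        "Run %s failed", context.dagster_run.run_id
    )
```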
Data Pipeline Implementation at Big Mountain Data and Dev Conference