Course Modules
Module 1: Introduction to Databricks & Modern Data Engineering
What is Databricks?
Lakehouse Architecture Overview
Difference between Data Lake, Data Warehouse, and Lakehouse
Setting up Databricks workspace
Brief tour of the Databricks UI
Hands-on: Launching your first Databricks Notebook
Module 2: Spark Essentials for Data Engineers
Spark architecture & execution model
DataFrames vs Datasets vs RDDs
Working with PySpark in Databricks
Hands-on: Reading and writing data using Spark
Transformations and Actions
Module 3: Delta Lake and Data Lakehouse
What is Delta Lake?
Features: ACID Transactions, Time Travel, Schema Enforcement
Creating & managing Delta Tables
Data versioning and rollback
Hands-on: Converting Parquet to Delta, using Time Travel
Module 4: Ingesting Data at Scale
Ingesting batch data (CSV, JSON, Parquet, external sources)
Streaming data ingestion (Kafka, Auto Loader)
Using Databricks Auto Loader
Hands-on: Streaming data ingestion from cloud storage (S3/ADLS)
Best practices for scalable ingestion
Module 5: Building ETL Pipelines
Writing modular ETL jobs in notebooks
Using Databricks Workflows (Jobs API)
Orchestrating pipelines using Task Dependencies
Scheduling jobs and alerts
Hands-on: Building a reusable ETL pipeline
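The modular structure taught in this module can be sketched framework-agnostically: each stage is a small, independently testable function. In the course these stages would be PySpark notebook cells wired together with Workflows task dependencies; the record shapes and field names below are made up for illustration.

```python
def extract():
    """Stand-in for reading raw records from source files or tables."""
    return [
        {"order_id": 1, "amount": "19.99", "country": "us"},
        {"order_id": 2, "amount": "5.00", "country": "de"},
    ]

def transform(records):
    """Normalize types and values; keep each rule small and explicit."""
    return [
        {**r, "amount": float(r["amount"]), "country": r["country"].upper()}
        for r in records
    ]

def load(records, sink):
    """Stand-in for writing to a Delta table; here we append to a list."""
    sink.extend(records)
    return len(records)

sink = []
written = load(transform(extract()), sink)
print(written, sink[0]["country"])  # 2 US
```

Keeping extract/transform/load separate is what makes the pipeline reusable: each function can be scheduled, retried, or swapped out as its own Workflows task.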
Module 6: Data Quality and Validation
Data validation with Deequ or Great Expectations
Implementing quality checks in Delta
Handling bad records and nulls
Logging and alerting with MLflow or other tools
Hands-on: Quality checks with Delta table constraints
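A common pattern from this module is routing bad records aside instead of failing the whole job. Real pipelines would enforce this with Delta table constraints, Great Expectations, or Deequ; the validation rules below are invented examples in plain Python to show the shape of the idea.

```python
def check(record):
    """Return the list of violated rules for one record (empty = clean)."""
    problems = []
    if record.get("id") is None:
        problems.append("id_is_null")
    if not isinstance(record.get("amount"), (int, float)) or record["amount"] < 0:
        problems.append("amount_invalid")
    return problems

def split_good_bad(records):
    """Quarantine failing records so clean ones still flow downstream."""
    good, bad = [], []
    for r in records:
        problems = check(r)
        (bad if problems else good).append({**r, "problems": problems})
    return good, bad

good, bad = split_good_bad([
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": -3},
])
print(len(good), len(bad), bad[0]["problems"])
```

The quarantined records (with their recorded violations) would typically land in a separate Delta table for inspection and alerting.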
Module 7: Performance Optimization & Best Practices
Caching, partitioning, and Z-ordering
Query optimization and cost reduction
Choosing the right cluster type and size
Managing job performance metrics
Hands-on: Optimizing a slow-running query
Module 8: Real-Time Data Processing
Spark Structured Streaming on Databricks
Window functions and aggregations
Streaming joins and watermarks
Handling late data
Hands-on: Building a real-time dashboard pipeline
Module 9: CI/CD and Production Readiness
Version control with Git in Databricks
CI/CD using Databricks Repos & Workflows
Managing environments (dev, test, prod)
Logging, monitoring & alerting strategies
Hands-on: End-to-end pipeline deployment with Git
Module 10: Project
Project Ideas (Choose One):
Build a real-time data pipeline
ETL pipeline for a retail sales dataset with Delta Lake
End-to-end streaming + batch processing pipeline
Outcomes
By the end of this course, learners will be able to:
Build production-grade ETL pipelines in Databricks
Use Spark and Delta Lake effectively
Automate and schedule jobs using Workflows
Process batch and streaming data at scale
Implement CI/CD and deploy end-to-end data pipelines
