Role of a Data Engineer
Data Engineering vs Data Science vs Analytics
OLTP vs OLAP Systems
Batch vs Streaming Processing
Data Lifecycle & Modern Data Architectures
Lambda, Kappa & Medallion Architectures
Introduction to Cloud Data Platforms
Python Refresher for Data Engineers
Data Types & Control Structures
Functions, Modules & Packages
File Handling (CSV, JSON, Parquet, Avro)
Error Handling & Logging
Python Virtual Environments
Working with APIs
Project: Data Processing with Python
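A minimal sketch tying together the file handling, error handling, and logging topics above: read a CSV file and persist it as Parquet. It assumes pandas and pyarrow are installed; the file names are illustrative only.

```python
# Minimal sketch: CSV -> Parquet with logging and basic error handling.
# Assumes pandas + pyarrow; "orders.csv" / "orders.parquet" are hypothetical.
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ingest")

def csv_to_parquet(src: str, dest: str) -> None:
    """Read a CSV file and write it out as Parquet, logging the outcome."""
    try:
        df = pd.read_csv(src)
        df.to_parquet(dest, index=False)   # requires pyarrow or fastparquet
        logger.info("Wrote %d rows from %s to %s", len(df), src, dest)
    except FileNotFoundError:
        logger.error("Source file not found: %s", src)
        raise

if __name__ == "__main__":
    csv_to_parquet("orders.csv", "orders.parquet")
```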
SQL Fundamentals
Joins & Subqueries
Window Functions
Common Table Expressions (CTEs)
Indexes & Query Optimization
Handling Large Datasets
Data Quality Checks
Project: Analytical Queries on Large Data
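A small sketch of the CTE and window-function topics above, run from Python against an in-memory SQLite database (window functions need SQLite 3.25+). The table and column names are illustrative.

```python
# Minimal sketch: a CTE plus window functions (SUM OVER, RANK OVER) in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, order_id INTEGER, amount REAL);
    INSERT INTO sales VALUES
        ('east', 1, 120.0), ('east', 2, 80.0),
        ('west', 3, 200.0), ('west', 4, 50.0);
""")

query = """
WITH regional AS (
    SELECT region, order_id, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total,
           RANK()      OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
)
SELECT * FROM regional WHERE rnk = 1;
"""
for row in conn.execute(query):
    print(row)   # top order per region, with the regional total
```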
Dimensional Modeling (Kimball)
Star & Snowflake Schema
Fact & Dimension Tables
Slowly Changing Dimensions (SCD 1, 2, 3)
Data Marts
Partitioning & Clustering
Performance Optimization Techniques
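A hand-rolled SCD Type 2 sketch in pandas for the slowly changing dimension topic above: expire the current row for changed customers and append a new current row. Column names and dates are illustrative.

```python
# Minimal SCD Type 2 sketch: expire old versions, insert new current versions.
import pandas as pd

dim = pd.DataFrame({
    "customer_id": [1],
    "city": ["Pune"],
    "effective_from": ["2023-01-01"],
    "effective_to": [None],
    "is_current": [True],
})
incoming = pd.DataFrame({"customer_id": [1], "city": ["Mumbai"]})
load_date = "2024-06-01"

merged = dim.merge(incoming, on="customer_id", suffixes=("", "_new"))
changed = merged[merged["city"] != merged["city_new"]]

# Expire the current version of changed rows.
mask = dim["customer_id"].isin(changed["customer_id"]) & dim["is_current"]
dim.loc[mask, "effective_to"] = load_date
dim.loc[mask, "is_current"] = False

# Append the new current version.
new_rows = changed[["customer_id", "city_new"]].rename(columns={"city_new": "city"})
new_rows["effective_from"] = load_date
new_rows["effective_to"] = None
new_rows["is_current"] = True
dim = pd.concat([dim, new_rows], ignore_index=True)
print(dim)
```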
Spark Architecture
RDDs vs DataFrames vs Datasets
Spark Transformations & Actions
PySpark DataFrame API
Spark SQL
Joins & Window Functions
Performance Tuning
Handling Skew & Partitioning
Project: Large-Scale Data Processing
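A compact PySpark sketch of the DataFrame API, transformations vs actions, window functions, and explicit partitioning covered above. It assumes a local Spark installation; the data is inline and illustrative.

```python
# Minimal PySpark sketch: lazy transformations, a window function, repartition,
# and an action that triggers execution.
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("sales-demo").getOrCreate()

df = spark.createDataFrame(
    [("east", "2024-01-01", 120.0), ("east", "2024-01-02", 80.0),
     ("west", "2024-01-01", 200.0)],
    ["region", "order_date", "amount"],
)

w = Window.partitionBy("region").orderBy(F.col("amount").desc())
ranked = (
    df.withColumn("rnk", F.row_number().over(w))   # transformation (lazy)
      .filter(F.col("rnk") == 1)
      .repartition("region")                       # control partitioning
)
ranked.show()                                      # action triggers execution
spark.stop()
```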
Azure Data Engineering Architecture
Azure Data Lake Storage Gen2 (ADLS Gen2)
Azure Data Factory (ADF)
Pipelines & Activities
Linked Services & Datasets
Triggers & Scheduling
Error Handling & Monitoring
Azure Databricks
Workspace & Clusters
PySpark in Databricks
Delta Lake
Notebooks & Jobs
Project: End-to-End Azure Data Pipeline
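A minimal Databricks notebook sketch for the ADLS Gen2, PySpark, and Delta Lake topics above: read raw CSV from a lake container and land it as a Delta table. The storage account, containers, and paths are hypothetical, and `spark` is the session Databricks provides in a notebook.

```python
# Minimal sketch: raw CSV in ADLS Gen2 -> bronze Delta table (Databricks).
raw_path = "abfss://raw@mystorageacct.dfs.core.windows.net/orders/2024/"
bronze_path = "abfss://bronze@mystorageacct.dfs.core.windows.net/orders/"

orders = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv(raw_path)
)

(orders.write
       .format("delta")
       .mode("append")
       .save(bronze_path))
```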
AWS Data Architecture Overview
Amazon S3 (Data Lake)
AWS Glue (ETL with PySpark)
AWS Glue Data Catalog
Amazon Athena
Amazon Redshift (Data Warehouse)
AWS Lambda for Data Processing
Project: AWS-Based Data Pipeline
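A small sketch for the AWS Lambda and S3 topics above: a handler that reacts to an S3 "object created" event, reads the file with boto3, and writes a processed copy back. Bucket names, key prefixes, and the transform itself are placeholders.

```python
# Minimal sketch of an S3-triggered AWS Lambda handler using boto3.
import urllib.parse
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = [line.upper() for line in body.splitlines()]   # placeholder transform

    s3.put_object(
        Bucket=bucket,
        Key=f"processed/{key}",
        Body="\n".join(rows).encode("utf-8"),
    )
    return {"status": "ok", "rows": len(rows)}
```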
Data Lake vs Lakehouse
Delta Lake Architecture
ACID Transactions
Schema Enforcement & Evolution
Time Travel
Upserts & Merges
Performance Optimization (Z-Ordering)
Project: Delta Lake Implementation
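A sketch of the upsert/merge and time-travel topics above. It assumes a SparkSession (`spark`) with the Delta Lake extensions enabled (for example a Databricks cluster, or delta-spark configured locally), and an existing Delta table; the path and columns are hypothetical.

```python
# Minimal Delta Lake sketch: MERGE (upsert) into an existing table, then a
# time-travel read of an earlier version.
from delta.tables import DeltaTable

path = "/tmp/delta/customers"

updates = spark.createDataFrame(
    [(1, "Mumbai"), (7, "Chennai")], ["customer_id", "city"]
)

target = DeltaTable.forPath(spark, path)
(target.alias("t")
       .merge(updates.alias("s"), "t.customer_id = s.customer_id")
       .whenMatchedUpdateAll()       # update existing keys
       .whenNotMatchedInsertAll()    # insert new keys
       .execute())

# Time travel: read the table as it was at version 0.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```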
Data Pipeline Orchestration
Apache Airflow Concepts
DAG Design & Scheduling
Error Handling & Alerts
Azure Data Factory vs Airflow
AWS Step Functions
Project: End-to-End Orchestrated Pipeline
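A minimal Apache Airflow sketch (assuming Airflow 2.x) for the DAG design, scheduling, and alerting topics above: a daily DAG with two Python tasks, retries, and an email alert on failure. The DAG id, task logic, and alert address are hypothetical.

```python
# Minimal Airflow 2.x DAG: daily schedule, retries, email on failure.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    print("pull data from the source system")

def load(**context):
    print("load data into the warehouse")

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "email": ["data-alerts@example.com"],
    "email_on_failure": True,
}

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task
```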
Data Quality Dimensions
Validation & Reconciliation
Schema Validation
Data Lineage
Metadata Management
IAM & Access Control
PII Handling & Compliance
Monitoring & Auditing
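A hand-rolled sketch for the data quality and schema validation topics above: schema, null, and row-count checks a pipeline could run before loading. The expected schema and thresholds are illustrative, and a dedicated framework could replace this in practice.

```python
# Minimal data quality sketch with pandas: schema, completeness, and volume checks.
import pandas as pd

EXPECTED_SCHEMA = {"order_id": "int64", "amount": "float64", "region": "object"}

def validate(df: pd.DataFrame) -> list[str]:
    failures = []
    # Schema validation: required columns with expected dtypes.
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            failures.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            failures.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    # Completeness: no nulls in the key column.
    if "order_id" in df.columns and df["order_id"].isnull().any():
        failures.append("order_id contains nulls")
    # Volume check: reject an empty batch.
    if len(df) == 0:
        failures.append("empty batch")
    return failures

df = pd.DataFrame({"order_id": [1, 2], "amount": [10.5, 20.0], "region": ["east", "west"]})
print(validate(df) or "all checks passed")
```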
Version Control for Data Pipelines
CI/CD Concepts
Infrastructure as Code (Terraform / CloudFormation Basics)
Environment Promotion
Monitoring & Logging
Cost Optimization Strategies
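A sketch of the kind of unit test a CI pipeline (for example GitHub Actions or Azure DevOps) would run on every commit before promoting pipeline code to the next environment, illustrating the version control and CI/CD topics above. The transformation under test is hypothetical.

```python
# Minimal pytest-style test for a pipeline transformation, run in CI.
import pandas as pd

def add_revenue(df: pd.DataFrame) -> pd.DataFrame:
    """Pipeline transformation under test: revenue = quantity * unit_price."""
    out = df.copy()
    out["revenue"] = out["quantity"] * out["unit_price"]
    return out

def test_add_revenue():
    df = pd.DataFrame({"quantity": [2, 3], "unit_price": [10.0, 5.0]})
    result = add_revenue(df)
    assert list(result["revenue"]) == [20.0, 15.0]
    assert "revenue" not in df.columns   # input DataFrame is not mutated
```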
Batch Data Pipeline (AWS & Azure)
Streaming Pipeline (Kafka / Kinesis)
Delta Lake + Databricks Project
Data Warehouse Modeling Project
Real-World Industry Use Case (Healthcare / Finance / Retail)
Business Problem Understanding
Architecture Design
Data Ingestion (Batch + Streaming)
Transformation & Storage
Analytics & Reporting Layer
Monitoring & Optimization
Production-Ready Implementation
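A minimal streaming-ingestion sketch for the Kafka/streaming elements of the projects above, using kafka-python: consume JSON events from a topic and land them in small micro-batches. The broker address, topic name, and output path are hypothetical.

```python
# Minimal Kafka consumer sketch: JSON events -> local JSONL micro-batches.
import json
import os
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

os.makedirs("landing", exist_ok=True)
batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 100:                       # flush a micro-batch
        with open("landing/orders.jsonl", "a") as f:
            for event in batch:
                f.write(json.dumps(event) + "\n")
        batch.clear()
```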
Data Engineer Resume & GitHub Portfolio
SQL & Python Interview Questions
Spark & Databricks Interview Prep
AWS & Azure Data Engineering Interview Prep
System Design Interviews
Mock Interviews
Industry Best Practices