Course #: WA2950

Data Engineering with PySpark Training (Coming Soon)


Audience: Data Warehouse and Data Lake Specialists, Software Developers


Prerequisites: General background in programming and/or data processing; ability to learn a new language (Python) through step-by-step exercises


Duration: Three days

Outline of Data Engineering with PySpark Training

Chapter 1. Defining Data Engineering

  • What is Data Engineering?
  • How is it different from Data Science?

Chapter 2. The Data Engineer Role

  • The scope of the DE role
  • Data Scientists, Machine Learning Specialists, and Data Engineers

Chapter 3. Data Processing Phases

  • Data Ingestion
  • Data Cleansing

Chapter 4. Distributed Computing Concepts

  • Data Physics
  • CAP Theorem
  • Hadoop

Chapter 5. Apache Spark

  • Supported Languages
  • Distributed Data Processing with PySpark

Chapter 6. Apache Spark Dev Environments

  • Spark Shells
  • Jupyter Notebooks

Chapter 7. Introduction to Functional Programming

  • Why Do I Need Functional Programming?
  • Functional Programming with Python
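
As an illustration of the functional style this chapter refers to, the sketch below chains map, filter, and reduce over a small list using only the Python standard library. The sample numbers and variable names are hypothetical and are not taken from the course material.

```python
from functools import reduce

# Hypothetical sample data, chosen only for illustration.
numbers = [1, 2, 3, 4, 5, 6]

# map: apply a pure function to every element.
squares = list(map(lambda n: n * n, numbers))

# filter: keep only the elements that satisfy a predicate.
even_squares = list(filter(lambda n: n % 2 == 0, squares))

# reduce: fold the remaining elements into a single value.
total = reduce(lambda acc, n: acc + n, even_squares, 0)

print(squares)       # [1, 4, 9, 16, 25, 36]
print(even_squares)  # [4, 16, 36]
print(total)         # 56
```

None of the steps mutates its input, which is the property that lets a framework such as Spark distribute the same kind of pipeline across many machines.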

Chapter 8. Functional Programming using Spark RDD API

  • RDD Transformations and Actions
  • Data Partitioning

Chapter 9. ETL Jobs with RDD

  • Using map-reduce FP for Data Processing
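
The map-reduce pattern named above can be sketched as a word count in plain Python. This is a single-machine stand-in for the idea, not the Spark RDD API; the input lines and all variable names are assumptions made for the example.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain

# Hypothetical input records, chosen only for illustration.
lines = ["spark makes etl simple", "etl with spark"]

# "Flat map" step: split every line into words and flatten the result.
words = list(chain.from_iterable(line.split() for line in lines))

# "Map" step: pair every word with an initial count of 1.
pairs = [(word, 1) for word in words]

# "Shuffle" step: group the counts by key (the word).
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# "Reduce" step: sum the counts collected for each key.
counts = {word: reduce(lambda a, b: a + b, values)
          for word, values in groups.items()}

print(counts["spark"], counts["etl"], counts["with"])  # 2 2 1
```

In a distributed engine the grouping step is what moves data between workers, so it is usually the most expensive part of such a job.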

Chapter 10. Spark SQL DataFrames

  • What are DataFrames?
  • Relationship with RDDs
  • Ways to Create DataFrames
  • Schema of Datasets
  • Inferring the Schema

Chapter 11. SQL-centric Programming using DataFrames API

  • Using the sql Method and the Native DataFrame API
  • Data Aggregation

Chapter 12. ETL Jobs with DataFrames

  • Using Spark SQL DataFrame API
  • Contrasting with Spark RDD API

Chapter 13. Repairing and Normalizing Data

  • What May Be Wrong With My Data?
  • Detecting and Removing Bad Data

Chapter 14. Data Visualization with seaborn

  • Exploratory Data Analysis (EDA)
  • Available Options

Chapter 15. Working with Various File Formats: CSV, Parquet, ORC, and JSON

  • What Are Columnar Data Storage Formats?
  • Comparing Various Formats
  • Ways to Read and Store Data in Various Formats
