Why take Data Engineering Bootcamp Training with Web Age?

Course #:WA3020

To succeed as a data engineer in today's fast-paced workplace, you must have a strong understanding of programming, automation and scripting, data stores, data processing techniques, and more. In this comprehensive 5-day Data Engineering bootcamp, you will gain foundational data engineering skills such as analyzing and repairing data with Python and Spark SQL, crafting visualizations of your results, and building production-ready distributed data infrastructure.

With a combination of theoretical learning modules and hands-on labs, you will gain essential skills you can put into action in the workplace right away. With demand for data engineers growing across all industries, this highly rated bootcamp is the best place to enhance your skill set.

Sign up for Data Engineering Bootcamp Training today.

Web Age is a global leader in Data Engineer skills training.

Enroll now in our expert-led Data Engineering Bootcamp training to build production-ready data infrastructure, and learn essential data engineering skills.

Upcoming Classes

04/17/2023 - 04/21/2023  |  10:00 AM - 06:00 PM  |  Online Virtual Class  |  USD $3,144.75
05/22/2023 - 05/26/2023  |  10:00 AM - 06:00 PM  |  Online Virtual Class  |  USD $3,140.00
06/26/2023 - 06/30/2023  |  10:00 AM - 06:00 PM  |  Online Virtual Class  |  USD $3,140.00

Delivery Focus for this Data Engineering Bootcamp

Data engineering and data manipulation are heavily weighted, with some coverage of data science concepts.

Topics

  • Big Data Concepts and Systems Overview for Data Engineers
  • Defining Data Engineering
  • Data Processing Phases
  • Python 3 Introduction
  • Python Variables and Types
  • Control Statements and Data Collections
  • Functions and Modules
  • Working with File I/O and Useful Modules
  • Practical Introduction to NumPy
  • Practical Introduction to pandas
  • Data Grouping and Aggregation with pandas
  • Repairing and Normalizing Data
  • Data Visualization in Python
  • Python as a Cloud Scripting Language
  • Introduction to Apache Spark
  • The Spark Shell
  • Spark RDDs
  • Parallel Data Processing with Spark
  • Introduction to Spark SQL

Objectives

  • Data Availability and Consistency
  • A/B Testing Data Engineering Tasks Project
  • Learning the Databricks Community Cloud Lab Environment
  • Python Variables
  • Dates and Times
  • The if, for, and try Constructs
  • Dictionaries
  • Sets, Tuples
  • Functions, Functional Programming
  • Understanding NumPy and pandas
  • PySpark

Audience

This Data Engineering Bootcamp training is targeted at current and aspiring data engineers.

Prerequisites

Some working experience in any programming language is required; students will be introduced to programming in Python. A basic understanding of SQL and data processing concepts, including data grouping and aggregation, is also expected.

Duration

Five days.

Lab Setup Guide

Outline for Data Engineering Bootcamp Training using Python and PySpark

Chapter 1 - Big Data Concepts and Systems Overview for Data Engineers

  • Gartner's Definition of Big Data
  • The Big Data Confluence Diagram
  • A Practical Definition of Big Data
  • Challenges Posed by Big Data
  • The Traditional Client–Server Processing Pattern
  • Enter Distributed Computing
  • Data Physics
  • Data Locality (Distributed Computing Economics)
  • The CAP Theorem
  • Mechanisms to Guarantee a Single CAP Property
  • Eventual Consistency
  • NoSQL Systems CAP Triangle
  • Big Data Sharding
  • Sharding Example
  • Apache Hadoop
  • Hadoop Ecosystem Projects
  • Other Hadoop Ecosystem Projects
  • Hadoop Design Principles
  • Hadoop's Main Components
  • Hadoop Simple Definition
  • Hadoop Component Diagram
  • HDFS
  • Storing Raw Data in HDFS and Schema-on-Demand
  • MapReduce Defined
  • MapReduce Shared-Nothing Architecture
  • MapReduce Phases
  • The Map Phase
  • The Reduce Phase
  • Similarity with SQL Aggregation Operations
  • Summary

Chapter 2 - Defining Data Engineering

  • Data is King
  • Translating Data into Operational and Business Insights
  • What is Data Engineering
  • The Data-Related Roles
  • The Data Science Skill Sets
  • The Data Engineer Role
  • Core Skills and Competencies
  • An Example of a Data Product
  • What is Data Wrangling (Munging)?
  • The Data Exchange Interoperability Options
  • Summary

Chapter 3 - Data Processing Phases

  • Typical Data Processing Pipeline
  • Data Discovery Phase
  • Data Harvesting Phase
  • Data Priming Phase
  • Exploratory Data Analysis
  • Model Planning Phase
  • Model Building Phase
  • Communicating the Results
  • Production Roll-out
  • Data Logistics and Data Governance
  • Data Processing Workflow Engines
  • Apache Airflow
  • Data Lineage and Provenance
  • Apache NiFi
  • Summary

Chapter 4 - Python 3 Introduction

  • What is Python?
  • Python Documentation
  • Where Can I Use Python?
  • Which version of Python am I running?
  • Running Python Programs
  • Python Shell
  • Dev Tools and REPLs
  • IPython
  • Jupyter
  • Hands-On Exercise
  • The Anaconda Python Distribution
  • Summary

Chapter 5 - Python Variables and Types

  • Variables and Types
  • More on Variables
  • Assigning Multiple Values to Multiple Variables
  • More on Types
  • Variable Scopes
  • The Layout of Python Programs
  • Comments and Triple-Delimited String Literals
  • Sample Python Code
  • PEP8
  • Getting Help on Python Objects
  • Null (None)
  • Strings
  • Finding Index of a Substring
  • String Splitting
  • Raw String Literals
  • String Formatting and Interpolation
  • String Public Method Names
  • The Boolean Type
  • Boolean Operators
  • Relational Operators
  • Numbers
  • "Easy Numbers"
  • Looking Up the Runtime Type of a Variable
  • Divisions
  • Assignment-with-Operation
  • Hands-On Exercise
  • Dates and Times
  • Hands-On Exercise
  • Summary
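
To give a flavor of the material, here is a minimal, hedged sketch of the Python basics this chapter covers (all names and values are invented for the demo):

    # Variables are dynamically typed; type() reveals the runtime type.
    city = "Toronto"          # str
    population = 2_794_356    # int; underscores improve readability
    density = 4427.8          # float
    is_capital = False        # bool

    # Assigning multiple values to multiple variables in one statement.
    x, y = 10, 20

    # String formatting and interpolation with an f-string.
    print(f"{city}: population {population:,}, density {density:.1f}/km^2")

    # Looking up the runtime type of a variable.
    print(type(population))   # <class 'int'>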

Chapter 6 - Control Statements and Data Collections

  • Control Flow with The if-elif-else Triad
  • An if-elif-else Example
  • Conditional Expressions (a.k.a. Ternary Operator)
  • The While-Break-Continue Triad
  • The for Loop
  • The range() Function
  • Examples of Using range()
  • The try-except-finally Construct
  • Hands-On Exercise
  • The assert Expression
  • Lists
  • Main List Methods
  • List Comprehension
  • Zipping Lists
  • Enumerate
  • Hands-On Exercise
  • Dictionaries
  • Working with Dictionaries
  • Other Dictionary Methods
  • Sets
  • Set Methods
  • Set Operations
  • Set Operations Examples
  • Finding Unique Elements in a List
  • Common Collection Functions and Operators
  • Hands-On Exercise
  • Tuples
  • Unpacking Tuples
  • Hands-On Exercise
  • Summary
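
A minimal sketch, using invented sample data, of the control statements and collections listed above:

    # Lists, list comprehension, dictionaries, sets, tuples, and try-except.
    scores = [72, 88, 95, 61, 88]
    passing = [s for s in scores if s >= 70]   # list comprehension

    grades = {"alice": 95, "bob": 61}          # dictionary
    for name, score in grades.items():         # the for loop
        if score >= 90:                        # if-elif-else
            label = "A"
        elif score >= 70:
            label = "pass"
        else:
            label = "fail"
        print(name, label)

    unique_scores = set(scores)                # sets find unique elements
    low, high = min(scores), max(scores)       # unpacking a tuple

    try:                                       # try-except-finally
        ratio = high / low
    except ZeroDivisionError:
        ratio = float("inf")
    finally:
        print(f"range {low}-{high}, ratio {ratio:.2f}")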

Chapter 7 - Functions and Modules

  • Built-in Functions
  • Functions
  • The "Call by Sharing" Parameter Passing
  • Global and Local Variable Scopes
  • Default Parameters
  • Named Parameters
  • Dealing with Arbitrary Number of Parameters
  • Keyword Function Parameters
  • Hands-On Exercise
  • What is Functional Programming (FP)?
  • Concept: Pure Functions
  • Concept: Recursion
  • Concept: Higher-Order Functions
  • Lambda Functions in Python
  • Examples of Using Lambdas
  • Lambdas in the Sorted Function
  • Hands-On Exercise
  • Python Modules
  • Importing Modules
  • Installing Modules
  • Listing Methods in a Module
  • Creating Your Own Modules
  • Creating a Module's Entry Point
  • Summary
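
As an illustrative sketch of the functions and functional-programming material (all names and values are invented):

    # Default parameters, arbitrary-argument lists, and keyword parameters.
    def describe(name, *scores, sep=": "):
        """Join a name with its scores (illustrative only)."""
        return name + sep + ", ".join(str(s) for s in scores)

    print(describe("alice", 95, 88))           # alice: 95, 88
    print(describe("bob", 61, sep=" -> "))     # keyword parameter

    # Higher-order functions: sorted() takes a key function (here a lambda).
    words = ["pandas", "NumPy", "Spark", "airflow"]
    print(sorted(words, key=lambda w: w.lower()))

    # map/filter with lambdas, a common functional-programming idiom.
    squares_of_evens = list(map(lambda n: n * n,
                                filter(lambda n: n % 2 == 0, range(10))))
    print(squares_of_evens)                    # [0, 4, 16, 36, 64]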

Chapter 8 - File I/O and Useful Modules

  • Reading Command-Line Parameters
  • Hands-On Exercise (N/A in DCC)
  • Working with Files
  • Reading and Writing Files
  • Hands-On Exercise
  • Hands-On Exercise
  • Random Numbers
  • Hands-On Exercise
  • Regular Expressions
  • The re Object Methods
  • Using Regular Expressions Examples
  • Hands-On Exercise
  • Summary
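
A short, hedged sketch of file I/O, regular expressions, and random numbers (the file name and record format are invented for the demo):

    import random
    import re

    # Writing and then reading a text file.
    with open("sample.txt", "w") as f:
        f.write("order-1001 shipped\norder-1002 pending\n")

    with open("sample.txt") as f:
        for line in f:
            # Regular expressions extract structured pieces from raw text.
            m = re.match(r"order-(\d+) (\w+)", line)
            if m:
                print(f"order {m.group(1)} is {m.group(2)}")

    # Random numbers, e.g. for generating sample test data.
    print(random.randint(1, 6))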

Chapter 9 - Practical Introduction to NumPy

  • NumPy
  • The First Take on NumPy Arrays
  • The ndarray Data Structure
  • Getting Help
  • Understanding Axes
  • Indexing Elements in a NumPy Array
  • Understanding Types
  • Re-Shaping
  • Commonly Used Array Metrics
  • Commonly Used Aggregate Functions
  • Sorting Arrays
  • Vectorization
  • Vectorization Visually
  • Broadcasting
  • Broadcasting Visually
  • Filtering
  • Array Arithmetic Operations
  • Reductions: Finding the Sum of Elements by Axis
  • Array Slicing
  • 2-D Array Slicing
  • Slicing and Stepping Through
  • The Linear Algebra Functions
  • Summary
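
For a taste of the chapter's core NumPy ideas (axes, vectorization, broadcasting, filtering, and slicing), a minimal sketch:

    import numpy as np

    # A 2-D ndarray; axis 0 runs down the rows, axis 1 across the columns.
    a = np.array([[1, 2, 3],
                  [4, 5, 6]])

    print(a.shape)           # (2, 3)
    print(a.sum(axis=0))     # column sums by axis: [5 7 9]

    # Vectorization: the operation applies element-wise, with no Python loop.
    print(a * 10)

    # Broadcasting: the 1-D row stretches across both rows of `a`.
    print(a + np.array([100, 200, 300]))

    # Boolean filtering and slicing.
    print(a[a > 3])          # [4 5 6]
    print(a[:, 1])           # second column: [2 5]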

Chapter 10 - Practical Introduction to pandas

  • What is pandas?
  • The Series Object
  • Accessing Values and Indexes in Series
  • Setting Up Your Own Index
  • Using the Series Index as a Lookup Key
  • Can I Pack a Python Dictionary into a Series?
  • The DataFrame Object
  • The DataFrame's Value Proposition
  • Creating a pandas DataFrame
  • Getting DataFrame Metrics
  • Accessing DataFrame Columns
  • Accessing DataFrame Rows
  • Accessing DataFrame Cells
  • Using iloc
  • Using loc
  • Examples of Using loc
  • DataFrames are Mutable via Object Reference!
  • The Axes
  • Deleting Rows and Columns
  • Adding a New Column to a DataFrame
  • Appending / Concatenating DataFrame and Series Objects
  • Example of Appending / Concatenating DataFrames
  • Re-indexing Series and DataFrames
  • Getting Descriptive Statistics of DataFrame Columns
  • Navigating Rows and Columns For Data Reduction
  • Getting Descriptive Statistics of DataFrames
  • Applying a Function
  • Sorting DataFrames
  • Reading From CSV Files
  • Writing to the System Clipboard
  • Writing to a CSV File
  • Fine-Tuning the Column Data Types
  • Changing the Type of a Column
  • What May Go Wrong with Type Conversion
  • Summary
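
A minimal pandas sketch of the DataFrame operations listed above (the data is invented for the demo):

    import pandas as pd

    df = pd.DataFrame({
        "city": ["Toronto", "Dallas", "Ottawa"],
        "population": [2_794_356, 1_304_379, 1_017_449],
        "country": ["CA", "US", "CA"],
    })

    print(df.shape)                       # (3, 3): basic DataFrame metrics
    print(df["population"].mean())        # accessing a column
    print(df.loc[df["country"] == "CA"])  # label/boolean selection with loc
    print(df.iloc[0])                     # positional selection with iloc

    # DataFrames are mutable via object reference: adding a column in place.
    df["pop_millions"] = df["population"] / 1_000_000
    print(df.sort_values("population", ascending=False))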

Chapter 11 - Data Grouping and Aggregation with pandas

  • Data Aggregation and Grouping
  • Sample Data Set
  • The pandas.core.groupby.SeriesGroupBy Object
  • Grouping by Two or More Columns
  • Emulating SQL's WHERE Clause
  • The Pivot Tables
  • Cross-Tabulation
  • Summary
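
As an illustration, a small grouping and aggregation sketch on invented sales data:

    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["east", "east", "west", "west", "west"],
        "product": ["a", "b", "a", "a", "b"],
        "amount":  [100, 250, 80, 120, 300],
    })

    # groupby() mirrors SQL's GROUP BY; aggregate functions then apply per group.
    print(sales.groupby("region")["amount"].sum())

    # Grouping by two columns, then a pivot table over the same data.
    print(sales.groupby(["region", "product"])["amount"].mean())
    print(sales.pivot_table(values="amount", index="region",
                            columns="product", aggfunc="sum"))

    # Emulating SQL's WHERE clause with boolean indexing before grouping.
    print(sales[sales["amount"] > 100].groupby("region")["amount"].count())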

Chapter 12 - Repairing and Normalizing Data

  • Repairing and Normalizing Data
  • Dealing with the Missing Data
  • Sample Data Set
  • Getting Info on Null Data
  • Dropping a Column
  • Interpolating Missing Data in pandas
  • Replacing the Missing Values with the Mean Value
  • Scaling (Normalizing) the Data
  • Data Preprocessing with scikit-learn
  • Scaling with the scale() Function
  • The MinMaxScaler Object
  • Summary
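
A hedged sketch of the repair and scaling techniques above, assuming pandas and scikit-learn are installed (the data is invented):

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, scale

    # Sample data with missing values (NaN).
    df = pd.DataFrame({"temp": [21.0, np.nan, 19.5, 23.1, np.nan]})
    print(df.isnull().sum())                  # getting info on null data

    # Two common repair strategies: interpolate, or plug in the mean.
    print(df["temp"].interpolate().tolist())
    filled = df["temp"].fillna(df["temp"].mean())

    # Scaling: zero mean / unit variance, then normalizing to the 0-1 range.
    print(scale(filled.to_frame()))
    print(MinMaxScaler().fit_transform(filled.to_frame()))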

Chapter 13 - Data Visualization in Python

  • Data Visualization
  • Data Visualization in Python
  • Matplotlib
  • Getting Started with matplotlib
  • The matplotlib.pyplot.plot() Function
  • The matplotlib.pyplot.bar() Function
  • The matplotlib.pyplot.pie() Function
  • The matplotlib.pyplot.subplot() Function
  • A Subplot Example
  • Figures
  • Saving Figures to a File
  • Seaborn
  • Getting Started with seaborn
  • Histograms and KDE
  • Plotting Bivariate Distributions
  • Scatter plots in seaborn
  • Pair plots in seaborn
  • Heatmaps
  • A Seaborn Scatterplot with Varying Point Sizes and Hues
  • ggplot
  • Summary
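
A minimal plotting sketch, assuming matplotlib and seaborn are installed (the data and output file name are invented):

    import matplotlib.pyplot as plt
    import seaborn as sns

    months = ["Jan", "Feb", "Mar", "Apr"]
    values = [10, 14, 9, 17]

    # Subplots: a line plot and a bar plot side by side in one figure.
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.plot(months, values, marker="o")
    ax1.set_title("trend")
    ax2.bar(months, values)
    ax2.set_title("by month")
    fig.savefig("demo.png")        # saving a figure to a file

    # seaborn builds on matplotlib; histplot draws a histogram with a KDE overlay.
    plt.figure()
    sns.histplot(values, kde=True)
    plt.show()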

Chapter 14 - Python as a Cloud Scripting Language

  • Python's Value
  • Python on AWS
  • AWS SDK For Python (boto3)
  • What is Serverless Computing?
  • How Functions Work
  • The AWS Lambda Event Handler
  • What is AWS Glue?
  • PySpark on Glue - Sample Script
  • Summary
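
A hedged sketch of the AWS material, assuming boto3 is installed and AWS credentials are configured in the environment; the bucket name is hypothetical:

    import boto3  # the AWS SDK for Python

    # Listing objects in an S3 bucket (the bucket name is made up).
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="my-example-bucket")
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])

    # A minimal AWS Lambda event handler: serverless computing means AWS
    # invokes this function with an event payload and a context object.
    def lambda_handler(event, context):
        name = event.get("name", "world")
        return {"statusCode": 200, "body": f"hello, {name}"}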

Chapter 15 - Introduction to Apache Spark

  • What is Apache Spark
  • The Spark Platform
  • Spark vs Hadoop's MapReduce (MR)
  • Common Spark Use Cases
  • Languages Supported by Spark
  • Running Spark on a Cluster
  • The Spark Application Architecture
  • The Driver Process
  • The Executor and Worker Processes
  • Spark Shell
  • Jupyter Notebook Shell Environment
  • Spark Applications
  • The spark-submit Tool
  • The spark-submit Tool Configuration
  • Interfaces with Data Storage Systems
  • The Resilient Distributed Dataset (RDD)
  • Datasets and DataFrames
  • Spark SQL, DataFrames, and Catalyst Optimizer
  • Project Tungsten
  • Spark Machine Learning Library
  • Spark (Structured) Streaming
  • GraphX
  • Extending Spark Environment with Custom Modules and Files
  • Spark 3
  • Spark 3 Updates at a Glance
  • Summary

Chapter 16 - The Spark Shell

  • The Spark Shell
  • The Spark v2+ Command-Line Shells
  • The Spark Shell UI
  • Spark Shell Options
  • Getting Help
  • Jupyter Notebook Shell Environment
  • Example of a Jupyter Notebook Web UI (Databricks Cloud)
  • The Spark Context (sc) and Spark Session (spark)
  • Creating a Spark Session Object in Spark Applications
  • The Shell Spark Context Object (sc)
  • The Shell Spark Session Object (spark)
  • Loading Files
  • Saving Files
  • Summary
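
A minimal sketch of creating a Spark session and loading and saving files in a standalone PySpark application (in the shell, `spark` and `sc` already exist; the file paths here are illustrative):

    from pyspark.sql import SparkSession

    # In applications you create the session yourself with the builder.
    spark = SparkSession.builder.appName("demo").getOrCreate()
    sc = spark.sparkContext   # the Spark context behind the session

    # Loading and saving files.
    df = spark.read.csv("input.csv", header=True, inferSchema=True)
    df.write.mode("overwrite").parquet("output.parquet")

    spark.stop()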

Chapter 17 - Spark RDDs

  • The Resilient Distributed Dataset (RDD)
  • Ways to Create an RDD
  • Supported Data Types
  • RDD Operations
  • RDDs are Immutable
  • Spark Actions
  • RDD Transformations
  • Other RDD Operations
  • Chaining RDD Operations
  • RDD Lineage
  • The Big Picture
  • What May Go Wrong
  • Miscellaneous Pair RDD Operations
  • RDD Caching
  • Summary
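
An illustrative RDD sketch (the data is invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
    sc = spark.sparkContext

    # Creating an RDD from a Python collection.
    rdd = sc.parallelize([1, 2, 3, 4, 5])

    # Transformations (map, filter) are lazy; actions (collect, reduce) execute.
    doubled = rdd.map(lambda n: n * 2).filter(lambda n: n > 4)
    print(doubled.collect())                # [6, 8, 10]
    print(rdd.reduce(lambda a, b: a + b))   # 15

    # RDDs are immutable: each transformation yields a new RDD, and lineage
    # lets Spark recompute lost partitions. cache() keeps results for reuse.
    doubled.cache()

    spark.stop()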

Chapter 18 - Parallel Data Processing with Spark

  • Running Spark on a Cluster
  • Data Partitioning
  • Data Partitioning Diagram
  • Single Local File System RDD Partitioning
  • Multiple File RDD Partitioning
  • Special Cases for Small-sized Files
  • Parallel Data Processing of Partitions
  • Spark Application, Jobs, and Tasks
  • Stages and Shuffles
  • The "Big Picture"
  • Summary

Chapter 19 - Introduction to Spark SQL

  • What is Spark SQL?
  • Uniform Data Access with Spark SQL
  • Using JDBC Sources
  • Hive Integration
  • What is a DataFrame?
  • Creating a DataFrame in PySpark
  • Creating a DataFrame in PySpark (Cont'd)
  • Commonly Used DataFrame Methods and Properties in PySpark
  • Commonly Used DataFrame Methods and Properties in PySpark (Cont'd)
  • Grouping and Aggregation in PySpark
  • The "DataFrame to RDD" Bridge in PySpark
  • The SQLContext Object
  • Examples of Spark SQL / DataFrame (PySpark Example)
  • Converting an RDD to a DataFrame Example
  • Example of Reading / Writing a JSON File
  • Performance, Scalability, and Fault-tolerance of Spark SQL
  • Summary
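
A minimal Spark SQL / DataFrame sketch in PySpark (the sample data is invented):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()

    # Creating a DataFrame from a Python list of tuples.
    df = spark.createDataFrame(
        [("east", 100), ("east", 250), ("west", 80)],
        ["region", "amount"])

    # Grouping and aggregation through the DataFrame API...
    df.groupBy("region").agg(F.sum("amount").alias("total")).show()

    # ...and the same query through SQL on a temporary view.
    df.createOrReplaceTempView("sales")
    spark.sql("SELECT region, SUM(amount) AS total "
              "FROM sales GROUP BY region").show()

    # The "DataFrame to RDD" bridge.
    print(df.rdd.map(lambda row: row.amount).collect())

    spark.stop()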

Lab Exercises

Lab 1. Data Availability and Consistency
Lab 2. A/B Testing Data Engineering Tasks Project
Lab 3. Learning the Databricks Community Cloud Lab Environment
Lab 4. Python Variables
Lab 5. Dates and Times
Lab 6. The if, for, and try Constructs
Lab 7. Understanding Lists
Lab 8. Dictionaries
Lab 9. Sets
Lab 10. Tuples
Lab 11. Functions
Lab 12. Functional Programming
Lab 13. File I/O
Lab 14. Using HTTP and JSON
Lab 15. Random Numbers
Lab 16. Regular Expressions
Lab 17. Understanding NumPy
Lab 18. A NumPy Project
Lab 19. Understanding pandas
Lab 20. Data Grouping and Aggregation
Lab 21. Repairing and Normalizing Data
Lab 22. Data Visualization and EDA with pandas and seaborn
Lab 23. Correlating Cause and Effect
Lab 24. Learning PySpark Shell Environment
Lab 25. Understanding Spark DataFrames
Lab 26. Learning the PySpark DataFrame API
Lab 27. Data Repair and Normalization in PySpark
Lab 28. Working with Parquet File Format in PySpark and pandas

Frequently Asked Questions

What is a Data Engineer?

A data engineer conceives, builds, and maintains the data infrastructure that holds your enterprise's advanced analytics capabilities together.

A data engineer is responsible for building and maintaining the data architecture of a data science project, and for the analytics infrastructure that enables almost every other function in the data world. Data engineers develop, construct, maintain, and test architectures such as databases and large-scale processing systems. As part of this, they also create the data set processes used in modeling, mining, acquisition, and verification.

What is the Data Engineering Role?

  • Data engineers are software engineers whose primary focus is data engineering tasks
  • They work closely with system administrators, data analysts, and data scientists to prepare data and make it consumable in subsequent advanced data processing scenarios
  • Most of these activities fall under the category of ETL (Extract, Transform, and Load) processes
  • Practitioners in this field deal with such aspects of data processing as processing efficiency, scalable computation, system interoperability, and security
  • Data engineers have knowledge of the appropriate data storage systems (RDBMS or NoSQL types) used by the organizations they work for, and of those systems' external interfaces
  • Typically, data engineers have to understand the databases and data schemas as well as the APIs used to access the stored data
  • Depending on the task at hand, internal standards, etc., they may be required to use a variety of programming languages, including Java, Scala, Python, C#, and C

What is the difference between a Data Scientist and a Data Engineer?

It is important to know the distinction between these two roles.

While there is frequent collaboration between data scientists and data engineers, they’re different positions that prioritize different skill sets. Data scientists focus on advanced statistics and mathematical analysis of the data that’s generated and stored, all in the interest of identifying trends and solving business needs or industry questions. But they can’t do their job without a team of data engineers who have advanced programming skills (Java, Scala, Python) and an understanding of distributed systems and data pipelines.

Broadly speaking, a data scientist builds models using a combination of statistics, mathematics, machine learning, and domain knowledge, and has to code and build these models using the tools, languages, and frameworks that the organization supports.

A data engineer on the other hand has to build and maintain data structures and architectures for data ingestion, processing, and deployment for large-scale data-intensive applications. To build a pipeline for data collection and storage, to funnel the data to the data scientists, to put the model into production – these are just some of the tasks a data engineer has to perform.

Data scientists and data engineers need to work together for any large-scale data science project to succeed.

What are the different roles in Data Engineering?

Data Engineer: A data engineer needs to have knowledge of database tools, languages like Python and Java, and distributed systems like Hadoop, among other things. The role combines many tasks into one.

Data Architect: A data architect lays down the foundation for data management systems to ingest, integrate and maintain all the data sources. This role requires knowledge of tools like SQL, XML, Hive, Pig, Spark, etc.

Database Administrator: As the name suggests, a person working in this role requires extensive knowledge of databases. Responsibilities entail ensuring that the databases are available to all required users, are maintained properly, and function without any hiccups when new features are added.

What are the core Data Engineering skills?

Core data engineering skills include programming (commonly Python, Java, or Scala), automation and scripting, working with SQL and NoSQL data stores, data processing and pipeline techniques, workflow scheduling, and familiarity with distributed systems and the cloud.

What is the future for Data Engineering?

The data engineering field is expected to continue growing rapidly over the next several years, and there’s huge demand for data engineers across industries.

The global Big Data and data engineering services market is expected to grow at a CAGR of 31.3 percent through 2025.

What is Data Wrangling (Munging)?

Data wrangling (a.k.a. munging) is the process of organized data transformation from one data format (or structure) into another.

Normally, it is part of a data processing workflow whose output is consumed downstream, typically by systems focused on data analytics or machine learning.

Usually, it is part of automated ETL (Extract, Transform, and Load) activities.

Specific activities include data cleansing, removing data outliers, repairing data (by plugging in surrogate values in place of missing ones), data enhancement/augmentation, aggregation, sorting, filtering, and normalizing.
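
As a hedged illustration, here is a minimal pandas sketch of a few of these wrangling steps (the data, column name, and outlier threshold are invented):

    import numpy as np
    import pandas as pd

    # Raw data with a missing value and an obvious outlier.
    raw = pd.DataFrame({"reading": [10.2, 11.0, np.nan, 10.7, 99.9]})

    wrangled = (
        raw
        .assign(reading=lambda d: d["reading"].fillna(d["reading"].mean()))  # repair
        .query("reading < 50")                                               # drop outliers
        .sort_values("reading")                                              # sort
    )

    # Normalizing to the 0-1 range.
    wrangled = wrangled.assign(
        scaled=lambda d: (d["reading"] - d["reading"].min())
                         / (d["reading"].max() - d["reading"].min()))
    print(wrangled)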

Can I take Data Engineering Courses online?

Yes! We know your busy work schedule may prevent you from getting to one of our classrooms, which is why we offer convenient Data Engineering Courses online to meet your needs wherever you are. We offer our Data Engineering courses as public or dedicated classes. Ask us about taking a Data Engineering course online!

Click here to see our Guaranteed to Run Virtual Online Class Schedules

Proven Results in Data Engineer Bootcamp Training

For over 20 years, we have trained thousands of developers at some of the country's largest tech companies, including many Fortune 500 companies. We're proud to be distinguished by our clients, partners, and awards.


  • "This was a great course. I loved the blend of Python concepts plus just enough data science to be productive."
  • "Instructor was very thorough, yet practical. He was a great communicator and explained everything in layman's terms."
  • "Great tutorials! I will go back to these."
  • "This course is excellent. It gave me an overview of data science and a good understanding. It put me in the right direction of data analysis in my work."
  • "I was SO impressed with the level of knowledge that the instructor has. I will definitely consider classes from Web Age in the future! The instructor gave us many real-life examples and provided links to further learning."

How to Build Your Data Science Team

Data engineers work across the various stages of a data pipeline, from acquisition and transport to storage, processing, and serving, continually improving their methods and practices. Today's data engineer must become proficient at programming, learn automation and scripting, understand many different data stores, master data processing techniques, efficiently schedule workflows, know the ever-changing cloud landscape, and keep up with trends.

This data engineering bootcamp training video delves into today's best tools and techniques that great data scientists use to efficiently and effectively understand outcomes from their datasets and to capture, transform, and shape their data stores.

Why Choose Web Age Solutions for Data Engineer Bootcamp Training?

Best price in the industry

You won’t find better value in the marketplace. If you do find a lower price, we will beat it.

Various delivery methods

Flexible delivery methods are available depending on your learning style.

Resources

Resources are included for a comprehensive learning experience.

We regularly offer Data Engineering Bootcamp Training courses in these and other cities. Atlanta, Austin, Baltimore, Calgary, Chicago, Cleveland, Dallas, Denver, Detroit, Houston, Jacksonville, Miami, Montreal, New York City, Orlando, Ottawa, Philadelphia, Phoenix, Pittsburgh, Seattle, Toronto, Vancouver, Washington DC.