Web Age Solutions Inc
Providing Technology Training and Mentoring For Modern Technology Adoption
US Inquiries / 1.877.517.6540
Canadian Inquiries / 1.877.812.8887

Data Engineering Bootcamp Training (Using Python and PySpark)

Why take Data Engineering Bootcamp Training with Web Age?


Web Age is a global leader in Data Engineer skills training.

Enroll now in our expert-led Data Engineering Bootcamp training to build production-ready data infrastructure, and learn essential data engineering skills.

Data Engineering with Python and PySpark

Course #: WA3020


Learn about the world of data engineering in this 5-Day Data Engineering Bootcamp training.

Data engineering is the aspect of data science that focuses on practical applications of data collection and analysis.

A data engineer conceives, builds and maintains the data infrastructure that holds your enterprise’s advanced analytics capacities together.

Today’s Data Engineer must become proficient at programming, learn automation and scripting, understand many different data stores, master data processing techniques, efficiently schedule workflows, know the ever-changing cloud landscape, and keep up with trends.

This top-rated Data Engineering Bootcamp training course includes hands-on labs that help attendees reinforce the theory covered in class.

Overview of Data Engineering Bootcamp Training (Using Python and PySpark)

Delivery Focus for this Data Engineering Bootcamp:

Data engineering and data manipulation are weighted most heavily, with some coverage of data science.

Data Engineering Bootcamp Training Objectives

  • Defining Data Engineering
  • Distributed Computing Concepts for Data Engineers
  • Data Processing Phases
  • NumPy
  • Pandas
  • Repairing and Normalizing Data
  • Data Visualization in Python
  • Python as a Cloud Scripting Language
  • Apache Spark
  • Introduction to Spark SQL
  • Operational Data Analytics with Splunk
  • Apache Airflow


This Data Engineer Bootcamp training is targeted at data engineers.


Five days.

Outline of Data Engineering Bootcamp Training (Using Python and PySpark)

Chapter 1. Defining Data Engineering

  • Data is King
  • Translating Data into Operational and Business Insights
  • What is Data Engineering
  • The Data-Related Roles
  • The Data Science Skill Sets
  • The Data Engineer Role
  • Core Skills and Competencies
  • An Example of a Data Product
  • What is Data Wrangling (Munging)?
  • The Data Exchange Interoperability Options
  • Summary

Chapter 2. Distributed Computing Concepts for Data Engineers

  • The Traditional Client–Server Processing Pattern
  • Enter Distributed Computing
  • Data Physics
  • Data Locality (Distributed Computing Economics)
  • The CAP Theorem
  • Mechanisms to Guarantee a Single CAP Property
  • Eventual Consistency
  • The NoSQL Systems CAP Triangle
  • Summary

Chapter 3. Data Processing Phases

  • Typical Data Processing Pipeline
  • Data Discovery Phase
  • Data Harvesting Phase
  • Data Priming Phase
  • Exploratory Data Analysis
  • Model Planning Phase
  • Model Building Phase
  • Communicating the Results
  • Production Roll-out
  • Data Logistics and Data Governance
  • Data Processing Workflow Engines
  • Apache Airflow
  • Data Lineage and Provenance
  • Apache NiFi
  • Summary

Chapter 4. Practical Introduction to NumPy

  • SciPy
  • NumPy
  • The First Take on NumPy Arrays
  • Getting Help
  • Understanding Axes
  • Indexing Elements in a NumPy Array
  • NumPy Arrays
  • Understanding Types
  • Re-Shaping
  • Commonly Used Array Metrics
  • Commonly Used Aggregate Functions
  • Sorting Arrays
  • Vectorization
  • Broadcasting
  • Filtering
  • Array Arithmetic Operations
  • Array Slicing
  • 2-D Array Slicing
  • The Linear Algebra Functions
  • Summary

Chapter 5. Practical Introduction to pandas

  • What is pandas?
  • The Series Object
  • Accessing Values and Indexes in Series
  • Setting Up Your Own Index
  • Using the Series Index as a Lookup Key
  • Can I Pack a Python Dictionary into a Series?
  • The DataFrame Object
  • The DataFrame’s Value Proposition
  • Creating a pandas DataFrame
  • Getting DataFrame Metrics
  • Accessing DataFrame Columns
  • Accessing DataFrame Rows
  • Accessing DataFrame Cells
  • Using iloc
  • Using loc
  • Examples of Using loc
  • DataFrames are Mutable via Object Reference!
  • Deleting Rows and Columns
  • Adding a New Column to a DataFrame
  • Appending / Concatenating DataFrame and Series Objects
  • Example of Appending / Concatenating DataFrames
  • Re-indexing Series and DataFrames
  • Getting Descriptive Statistics of DataFrame Columns
  • Getting Descriptive Statistics of DataFrames
  • Applying a Function
  • Sorting DataFrames
  • Reading From CSV Files
  • Writing to the System Clipboard
  • Writing to a CSV File
  • Fine-Tuning the Column Data Types
  • Changing the Type of a Column
  • What May Go Wrong with Type Conversion
  • Summary

Chapter 6. Data Grouping and Aggregation with pandas

  • Data Aggregation and Grouping
  • Sample Data Set
  • The pandas.core.groupby.SeriesGroupBy Object
  • Grouping by Two or More Columns
  • Emulating SQL’s WHERE Clause
  • The Pivot Tables
  • Cross-Tabulation
  • Summary

Chapter 7. Repairing and Normalizing Data

  • Repairing and Normalizing Data
  • Dealing with the Missing Data
  • Sample Data Set
  • Getting Info on Null Data
  • Dropping a Column
  • Interpolating Missing Data in pandas
  • Replacing the Missing Values with the Mean Value
  • Scaling (Normalizing) the Data
  • Data Preprocessing with scikit-learn
  • Scaling with the scale() Function
  • The MinMaxScaler Object
  • Summary
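
As a rough illustration of two topics listed in Chapter 7 (replacing missing values with the mean, and min-max scaling), here is a minimal sketch in plain Python. The course itself covers these with pandas and scikit-learn; the function names and sample data below are hypothetical and are not taken from the course material.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample column with one missing reading.
readings = [10.0, None, 30.0, 20.0]
repaired = fill_missing_with_mean(readings)
scaled = min_max_scale(repaired)
```

Imputing before scaling matters here: the minimum and maximum are only well defined once every entry is a number.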

Chapter 8. Data Visualization in Python

  • Data Visualization
  • Data Visualization in Python
  • Matplotlib
  • Getting Started with matplotlib
  • The matplotlib.pyplot.plot() Function
  • The matplotlib.pyplot.bar() Function
  • The matplotlib.pyplot.pie() Function
  • Subplots
  • Using the matplotlib.gridspec.GridSpec Object
  • The matplotlib.pyplot.subplot() Function
  • Figures
  • Saving Figures to a File
  • Seaborn
  • Getting Started with seaborn
  • Histograms and KDE
  • Plotting Bivariate Distributions
  • Scatter plots in seaborn
  • Pair plots in seaborn
  • Heatmaps
  • ggplot
  • Summary

Chapter 9. Python as a Cloud Scripting Language

  • Python’s Value
  • Python on AWS
  • AWS SDK For Python (boto3)
  • What is Serverless Computing?
  • How Functions Work
  • The AWS Lambda Event Handler
  • What is AWS Glue?
  • PySpark on Glue – Sample Script
  • Summary

Chapter 10. Introduction to Apache Spark

  • What is Apache Spark
  • The Spark Platform
  • Spark vs Hadoop’s MapReduce (MR)
  • Common Spark Use Cases
  • Languages Supported by Spark
  • Running Spark on a Cluster
  • The Spark Application Architecture
  • The Driver Process
  • The Executor and Worker Processes
  • Spark Shell
  • Jupyter Notebook Shell Environment
  • Spark Applications
  • The spark-submit Tool
  • The spark-submit Tool Configuration
  • Interfaces with Data Storage Systems
  • Project Tungsten
  • The Resilient Distributed Dataset (RDD)
  • Datasets and DataFrames
  • Spark SQL, DataFrames, and Catalyst Optimizer
  • Spark Machine Learning Library
  • GraphX
  • Extending Spark Environment with Custom Modules and Files
  • Summary

Chapter 11. The Spark Shell

  • The Spark Shell
  • The Spark v2+ Command-Line Shells
  • The Spark Shell UI
  • Spark Shell Options
  • Getting Help
  • Jupyter Notebook Shell Environment
  • Example of a Jupyter Notebook Web UI (Databricks Cloud)
  • The Spark Context (sc) and Spark Session (spark)
  • Creating a Spark Session Object in Spark Applications
  • The Shell Spark Context Object (sc)
  • The Shell Spark Session Object (spark)
  • Loading Files
  • Saving Files
  • Summary

Chapter 12. Spark RDDs

  • The Resilient Distributed Dataset (RDD)
  • Ways to Create an RDD
  • Supported Data Types
  • RDD Operations
  • RDDs are Immutable
  • Spark Actions
  • RDD Transformations
  • Other RDD Operations
  • Chaining RDD Operations
  • RDD Lineage
  • The Big Picture
  • What May Go Wrong
  • Miscellaneous Pair RDD Operations
  • RDD Caching
  • Summary

Chapter 13. Parallel Data Processing with Spark

  • Running Spark on a Cluster
  • Data Partitioning
  • Data Partitioning Diagram
  • Single Local File System RDD Partitioning
  • Multiple File RDD Partitioning
  • Special Cases for Small-sized Files
  • Parallel Data Processing of Partitions
  • Spark Application, Jobs, and Tasks
  • Stages and Shuffles
  • The “Big Picture”
  • Summary

Chapter 14. Introduction to Spark SQL

  • What is Spark SQL?
  • Uniform Data Access with Spark SQL
  • Hive Integration
  • Hive Interface
  • Integration with BI Tools
  • What is a DataFrame?
  • Creating a DataFrame in PySpark
  • Commonly Used DataFrame Methods and Properties in PySpark
  • Grouping and Aggregation in PySpark
  • The “DataFrame to RDD” Bridge in PySpark
  • The SQLContext Object
  • Examples of Spark SQL / DataFrame (PySpark Example)
  • Converting an RDD to a DataFrame Example
  • Example of Reading / Writing a JSON File
  • Using JDBC Sources
  • JDBC Connection Example
  • Performance, Scalability, and Fault-tolerance of Spark SQL
  • Summary

Chapter 15. Operational Data Analytics with Splunk

  • Splunk Defined
  • Splunk Products
  • Splunk Editions
  • Deployment Options
  • Common Components
  • Splunk Admin Dashboard (Web UI)
  • Events
  • Data Indexing
  • Web UI for Adding Data to Indexer
  • Distributed Splunk Indexing and Searching
  • Architecture for a Multi-Tier Splunk Enterprise Deployment
  • Data Source Types
  • The Source Types Automatically Recognized by Splunk
  • The “Pre-trained” Data Source Types
  • Windows® Data Sources
  • Custom Event Format
  • Web UI: Adding Data Flow for Local File Upload
  • Web UI: Add Data for Monitoring
  • Data Searching
  • Search Processing Language (SPL)
  • Searching and Reporting Activities
  • The Search Page
  • Core Search Concepts
  • The Search Basics
  • Search Command Categories
  • Command Examples
  • More Examples of Search Commands
  • Statistical and Time Functions
  • From SQL to SPL – the Translation Table
  • Visualizations
  • Save Your Searches as Dashboards
  • Summary

Chapter 16. Apache Airflow Introduction

  • A Traditional ETL Approach
  • Apache Airflow Defined
  • Airflow Core Components
  • The Component Collaboration Diagram
  • Workflow Building Blocks and Concepts
  • Airflow CLI
  • Main Configuration File
  • Extending Airflow
  • Jinja Templates
  • Variables and Macros
  • Summary
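
Airflow models a workflow as a directed acyclic graph (DAG) of tasks. The sketch below shows only that underlying idea, using Python's standard-library graphlib module; it is not the Airflow API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# static_order() yields the tasks so that every task appears after its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
```

A scheduler can run tasks in this order, or run independent branches in parallel. If the dependencies contained a cycle, graphlib would raise a CycleError, which is why a workflow must be acyclic.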

Chapter 17. Apache Airflow Web UI

  • Web UI – the Landing (DAGs) Page
  • Web UI – the DAG Graph View
  • Run Status Legends
  • The Pause Button (Trigger Latch)
  • The DAG Triggering/Job Checking Sequence
  • The Control Panel for a Task
  • Sample Log File Messages (Abridged for Space)
  • Summary

Lab 1 – A/B Testing Data Engineering Tasks Project

Lab 2 – Data Availability and Consistency

Lab 3 – Learning the Databricks Community Cloud Lab Environment

Lab 4 – Functional Programming

Lab 5 – Using HTTP and JSON

Lab 6 – Random Numbers

Lab 7 – Regular Expressions

Lab 8 – Understanding NumPy

Lab 9 – A NumPy Project

Lab 10 – Understanding pandas

Lab 11 – Data Grouping and Aggregation

Lab 12 – Repairing and Normalizing Data

Lab 13 – Data Visualization and EDA with pandas and seaborn

Lab 14 – Correlating Cause and Effect

Lab 15 – Learning PySpark Shell Environment

Lab 16 – Understanding Spark DataFrames

Lab 17 – Learning the PySpark DataFrame API

Lab 18 – Data Repair and Normalization in PySpark

Lab 19 – Working with Parquet File Format in PySpark and pandas

Lab 20 – Learning the Lab Environment

Lab 21 – Local File Upload

Lab 22 – Using Search and Reporting App

Lab 23 – Querying for Insights

What is a Data Engineer?

A data engineer conceives, builds and maintains the data infrastructure that holds your enterprise’s advanced analytics capacities together.

A data engineer is responsible for building and maintaining the data architecture of a data science project. Data Engineers are responsible for the creation and maintenance of analytics infrastructure that enables almost every other function in the data world. They are responsible for the development, construction, maintenance and testing of architectures, such as databases and large-scale processing systems. As part of this, Data Engineers are also responsible for the creation of data set processes used in modeling, mining, acquisition, and verification.

Read more in our Featured Data Engineering Training Article:

What is Data Engineering?

What is the difference between a Data Scientist and a Data Engineer?

It is important to know the distinction between these two roles.

While there is frequent collaboration between data scientists and data engineers, they’re different positions that prioritize different skill sets. Data scientists focus on advanced statistics and mathematical analysis of the data that’s generated and stored, all in the interest of identifying trends and solving business needs or industry questions. But they can’t do their job without a team of data engineers who have advanced programming skills (Java, Scala, Python) and an understanding of distributed systems and data pipelines.

Broadly speaking, a data scientist builds models using a combination of statistics, mathematics, machine learning, and domain-based knowledge. They have to code and build these models using the tools, languages, and frameworks that the organization supports.

A data engineer, on the other hand, has to build and maintain data structures and architectures for data ingestion, processing, and deployment for large-scale, data-intensive applications. Building a pipeline for data collection and storage, funneling the data to the data scientists, and putting the model into production are just some of the tasks a data engineer has to perform.
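
To make the pipeline idea concrete, here is a minimal extract-transform-load sketch using only the Python standard library. The CSV feed, table name, and column names are hypothetical, and a production pipeline would more likely use tools such as Spark or Airflow from the course outline; this is an illustration under those assumptions, not a reference implementation.

```python
import csv
import io
import sqlite3

# Hypothetical raw feed; the second row has a missing amount.
RAW_CSV = "user_id,amount\n1,20.50\n2,\n1,4.50\n"


def extract(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with a missing amount and convert fields to typed tuples."""
    return [(int(r["user_id"]), float(r["amount"])) for r in rows if r["amount"]]


def load(records, conn):
    """Store the cleaned records in a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases (user_id INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

# A downstream consumer, such as a data scientist, can now query the clean data.
totals = conn.execute(
    "SELECT user_id, SUM(amount) FROM purchases GROUP BY user_id ORDER BY user_id"
).fetchall()
```

Keeping extract, transform, and load as separate functions lets each stage be tested and replaced independently.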

Data scientists and data engineers need to work together for any large-scale data science project to succeed.

What are the different roles in Data Engineering?

Data Engineer: A data engineer needs knowledge of database tools, languages like Python and Java, and distributed systems like Hadoop, among other things. The role combines many tasks into one.

Data Architect: A data architect lays down the foundation for data management systems to ingest, integrate and maintain all the data sources. This role requires knowledge of tools like SQL, XML, Hive, Pig, Spark, etc.

Database Administrator: As the name suggests, a person working in this role requires extensive knowledge of databases. Responsibilities include ensuring that the databases are available to all required users, are maintained properly, and function without any hiccups when new features are added.

What are the core Data Engineering skills?

What is the future for Data Engineering?

The data engineering field is expected to continue growing rapidly over the next several years, and there’s huge demand for data engineers across industries.

The global Big Data and data engineering services market is expected to grow at a CAGR of 31.3 percent through 2025.

Can I take this Data Engineering Bootcamp training online?

Yes! We know your busy work schedule may prevent you from getting to one of our classrooms, which is why we offer convenient online Data Engineer Bootcamp training wherever you are. We offer our Data Engineering courses as public classes or as dedicated training for your team. Ask us about taking a Data Engineer Bootcamp training course online!

Proven Results in Data Engineer Bootcamp Training

For over 20 years, we have trained thousands of developers at some of the country’s largest tech companies – including many Fortune 500 companies. Here are a few of the clients we have delivered Data Engineering Bootcamp Training Courses to:

Booz Allen Hamilton, Liberty Mutual, FedEx Ground, Fidelity Investments, Lockheed Martin

Here are some reviews from past students who completed our Data Engineering Bootcamp Training Courses:

“This was a great course. I loved the blend of Python Concepts Plus Just enough Data science to be productive”

“Instructor was very thorough, yet practical. He was a great communicator and explained everything in layman’s terms.”

“Great tutorials! I will go back to these”

“This course is excellent. It gave me an overview of data science and a good understanding. It put me in the right direction of data analysis in my work.”


How to Build Your Data Science Team

Data engineers work across every stage of a pipeline, from acquisition and transport to storage, processing, and servicing, continually improving their methods and practices. Today’s Data Engineer must become proficient at programming, learn automation and scripting, understand many different data stores, master data processing techniques, efficiently schedule workflows, know the ever-changing cloud landscape, and keep up with trends.

This data engineering bootcamp training video will delve into today’s best tools and techniques that great data scientists utilize to efficiently and effectively understand outcomes from their datasets, and capture, transform and shape their data stores.

Why Choose Web Age Solutions for Data Engineer Bootcamp Training?

Best price in the industry

You won’t find better value in the marketplace. If you do find a lower price, we will beat it.

Various delivery methods

Flexible delivery methods are available depending on your learning style.

Resources are included for a comprehensive learning experience.

We regularly offer Data Engineering Bootcamp Training courses in these and other cities: Atlanta, Austin, Baltimore, Calgary, Chicago, Cleveland, Dallas, Denver, Detroit, Houston, Jacksonville, Miami, Montreal, New York City, Orlando, Ottawa, Philadelphia, Phoenix, Pittsburgh, Seattle, Toronto, Vancouver, Washington DC.
