Web Age Solutions Inc
Providing Technology Training and Mentoring For Modern Technology Adoption
US Inquiries / 1.877.517.6540
Canadian Inquiries / 1.877.812.8887

Data Engineering Bootcamp Training


Learn about the world of data engineering in this 5-day Data Engineering Bootcamp training!

A data engineer conceives, builds, and maintains the data infrastructure that holds your enterprise’s advanced analytics capabilities together. Data engineering is the foundation for the new world of Big Data. Enroll now in our Data Engineering Bootcamp training to build production-ready data infrastructure and learn essential data engineering skills.

Data Engineer Training
Data Engineering with Python

Course #:WA2926


Data engineering is the aspect of data science that focuses on practical applications of data collection and analysis.

A data engineer conceives, builds, and maintains the data infrastructure that holds your enterprise’s advanced analytics capabilities together.

Learn about the world of data engineering in this 5-Day Data Engineering Bootcamp training.

This Data Engineering Bootcamp training course is supplemented by hands-on labs that help attendees reinforce their theoretical knowledge of the material.

Delivery Methods For This Data Engineer Bootcamp

  • Live Online
  • Onsite
  • Classroom
View related courses:
Data Engineer Training Courses.

Overview of Data Engineering Bootcamp Training

Delivery Focus for this Data Engineering Bootcamp:

Data engineering and data manipulation are heavily weighted, with some coverage of data science.


This Data Engineer Bootcamp training is targeted at data engineers.


Duration: five days.

Outline of Data Engineering Bootcamp Training

Chapter 1. Big Data for Data Engineers

  • Gartner’s Definition of Big Data
  • The Big Data Confluence Diagram
  • A Practical Definition of Big Data
  • The Traditional Client–Server Processing Pattern
  • Enter Distributed Computing
  • Data Physics
  • Data Locality (Distributed Computing Economics)
  • The CAP Theorem
  • Eventual Consistency
  • NoSQL Systems CAP Triangle
  • Big Data And Analytics Landscape 2019
  • Hands-on Exercise: Learning the Lab Environment
  • Apache Hadoop
  • Hadoop Ecosystem Projects
  • Other Hadoop Ecosystem Projects
  • Hadoop’s Main Components
  • Hadoop Component Diagram
  • HDFS
  • Storing Raw Data in HDFS
  • Hands-on Exercise: Hadoop Distributed File System
  • MapReduce Defined
  • MapReduce Phases
  • The Map Phase
  • The Reduce Phase
  • Similarity with SQL Aggregation Operations
  • Hadoop’s MapReduce
  • MapReduce Word Count Job
  • Hive Demo
  • Summary
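
The map and reduce phases behind the word-count job listed above can be sketched in plain Python. This is a simplified, single-machine illustration of the pattern (the input lines are invented for the demo), not Hadoop code:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct word. The 'shuffle'
    grouping step is modeled here with a dictionary."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data is big", "data locality matters"]
print(reduce_phase(map_phase(lines)))
# {'big': 2, 'data': 2, 'is': 1, 'locality': 1, 'matters': 1}
```

On a real cluster, the map and reduce phases run in parallel on many nodes, with Hadoop handling the shuffle between them.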

Chapter 2. Defining Data Engineering

  • What is Data Engineering?
  • The Data-Related Roles
  • The Data Science Skill Sets
  • Core Skills and Competencies
  • What is Data Wrangling (Munging)?
  • Summary

Chapter 3. Data Processing Phases

  • Typical Data Processing Pipeline
  • Data Discovery Phase
  • Data Harvesting Phase
  • Data Priming Phase
  • Exploratory Data Analysis
  • Machine Learning Planning Phase (Optional)
  • Model Building Phase (Optional)
  • Communicating the Results
  • Production Roll-out
  • Data Logistics and Data Governance
  • Data Processing Workflow Engines
  • Apache Airflow
  • Data Lineage and Provenance
  • Apache NiFi
  • Summary

Chapter 4. Apache Hive

  • Traditional RDBMS Capabilities and TCO
  • What is Hive?
  • Apache Hive Logo
  • Hive’s Value Proposition
  • Who uses Hive?
  • What Hive Does Not Have
  • Hive’s Main Sub-Systems
  • Hive Features
  • The “Classic” Hive Architecture
  • The New Hive Architecture (Hive Server 2)
  • Multi-Client Concurrency in Hive Server 2
  • Components
  • Where are the Hive Tables Located?
  • Data Organization in Hive
  • Hive Tables
  • Managed and External Tables
  • Partitions
  • Buckets
  • Buckets and Partitions
  • Buckets Visually
  • Partitions Visually
  • HiveQL
  • The “Classic” Hive Command-line Interface (CLI)
  • The Beeline Command Shell
  • Summary

Chapter 5. Hive Command-line Interface

  • Hive Command-line Interface (CLI)
  • The Hive Interactive Shell
  • Running Host OS Commands from the Hive Shell
  • Interfacing with HDFS from the Hive Shell
  • The Hive in Unattended Mode
  • The Hive CLI Integration with the OS Shell
  • Executing HiveQL Scripts
  • Comments in Hive Scripts
  • Variables and Properties in Hive CLI
  • Setting Properties in CLI
  • Example of Setting Properties in CLI
  • Passing Arguments to Hive Script
  • Hive Namespaces
  • Using the SET Command
  • Setting Properties in the Shell
  • Setting Properties for the New Shell Session
  • Setting Alternative Hive Execution Engines
  • The Beeline Shell
  • Connecting to the Hive Server in Beeline
  • Beeline Command Switches
  • Beeline Internal Commands
  • Summary

Chapter 6. Hive Data Definition Language

  • Hive Data Definition Language
  • Creating Databases in Hive
  • Using Databases
  • Creating Tables in Hive
  • Supported Data Type Categories
  • Common Primitive Types
  • String and Date / Time Types
  • Complex Types
  • Miscellaneous Types
  • Example of CREATE TABLE Statement
  • Working with Complex Types
  • Table Partitioning
  • Partitions Benefits
  • Table Partitioning on Multiple Columns
  • Viewing Table Partitions
  • Bucketed Table DDL
  • Loading Data into Bucketed Table
  • File Format Storage
  • ORC, Parquet, and Avro Binary Data Formats Compared
  • Data Serializers / Deserializers
  • Row Format
  • Visualizing Row Format
  • Row Format with the SerDe Definition
  • A RegexSerDe Example
  • The ORC Data Format
  • Converting Text to ORC Data Format
  • The Parquet Data Storage Format
  • File Compression
  • The EXTERNAL DDL Parameter
  • Example of Using EXTERNAL
  • Features Comparison
  • What Type is my Table?
  • Temporary Tables
  • Creating an Empty Table
  • Dropping a Table
  • Table / Partition(s) Truncation
  • Alter Table/Partition/Column
  • Views
  • Create View Statement
  • Why Use Views?
  • Restricting Amount of Viewable Data
  • Examples of Restricting Amount of Viewable Data
  • Hive Indexing
  • Describing Data
  • Summary

Chapter 7. HiveQL

  • What is HiveQL?
  • HiveQL Main Features
  • Alternative Execution Engines
  • Data Validation
  • Hive Data Manipulation Language (DML)
  • Using the LOAD DATA statement
  • Examples of Loading Data into a Hive Table
  • Loading Data with the INSERT Statement
  • Appending and Replacing Data with the INSERT Statement
  • Examples of Using the INSERT Statement
  • Multi-Table Inserts
  • Multi-table Inserts Syntax
  • Multi-Table Inserts Example
  • The Skewed Tables Concept
  • A Skewed Tables Example
  • Controlling the Number of Reducers
  • Computing Table Statistics
  • DESCRIBE Command Variants
  • Summary

Chapter 8. Hive Select Statement and Built-In Functions

  • The SELECT Statement Syntax
  • The WHERE Clause
  • Examples of the WHERE Statement
  • Partition-Based Queries
  • Example of an Efficient Use Of Partitions in SELECT Statement
  • Create Table As Select Operation
  • Supported Numeric Operators
  • Built-in Mathematical Functions
  • Built-in Aggregate Functions
  • Built-in Statistical Functions
  • Other Useful Built-in Functions
  • The GROUP BY Clause
  • The HAVING Clause
  • The LIMIT Clause
  • The ORDER BY Clause
  • The JOIN Clause
  • Types of Joins
  • The Shuffle Join Visually
  • Map (Broadcast) Join Visually
  • Setting Up the Map Side (Broadcast) Join
  • Sort-Merge-Bucket Join Visually
  • The CASE … Clause
  • Example of CASE … Clause
  • Re-Writing SELECT Statements
  • The TRANSFORM Clause
  • Performance Enhancements with Vectorization + ORC
  • Summary

Chapter 9. Introduction to Functional Programming

  • What is Functional Programming (FP)?
  • Terminology: Higher-Order Functions
  • Lambda Functions
  • A Short List of Languages that Support FP
  • Common High Order (HO) Functions
  • Common High-Order Functions in Python
  • The Map Function in pandas
  • Functional Programming APIs in PySpark
  • Summary
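
The higher-order functions covered above (map, filter, reduce) and lambda functions can be demonstrated in a few lines of Python:

```python
from functools import reduce

nums = [1, 2, 3, 4, 5]

# map: apply a lambda (anonymous function) to every element
squares = list(map(lambda x: x * x, nums))        # [1, 4, 9, 16, 25]

# filter: keep only the elements matching a predicate
evens = list(filter(lambda x: x % 2 == 0, nums))  # [2, 4]

# reduce: fold a sequence down to a single value
total = reduce(lambda acc, x: acc + x, nums, 0)   # 15

print(squares, evens, total)
```

The same map/filter/reduce vocabulary carries over directly to PySpark's RDD API, which is one reason functional programming is introduced here.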

Chapter 10. Introduction to Apache Spark

  • What is Apache Spark
  • The Spark Platform
  • Spark vs Hadoop’s MapReduce (MR)
  • Common Spark Use Cases
  • Languages Supported by Spark
  • Running Spark on a Cluster
  • The Spark Application Architecture
  • The Driver Process
  • The Executor and Worker Processes
  • Spark Shell
  • Jupyter Notebook Shell Environment
  • Spark Applications
  • The spark-submit Tool
  • The spark-submit Tool Configuration
  • Interfaces with Data Storage Systems
  • Project Tungsten
  • The Resilient Distributed Dataset (RDD)
  • Datasets and DataFrames
  • Spark SQL, DataFrames, and Catalyst Optimizer
  • Spark Machine Learning Library
  • GraphX
  • Extending Spark Environment with Custom Modules and Files
  • Summary

Chapter 11. How Spark Works Visually

  • Spark 2+ Architecture
  • Spark Application Execution Diagram
  • Spark Applications: The Big Picture

Chapter 12. The Spark Shell

  • The Spark Shell
  • The Spark v.2 + Command-Line Shells
  • The Spark Shell UI
  • Spark Shell Options
  • Getting Help
  • Jupyter Notebook Shell Environment
  • Example of a Jupyter Notebook Web UI (Databricks Cloud)
  • The Spark Context (sc) and Spark Session (spark)
  • Creating a Spark Session Object in Spark Applications
  • The Shell Spark Context Object (sc)
  • The Shell Spark Session Object (spark)
  • Loading Files
  • Saving Files
  • Summary

Chapter 13. Spark RDDs

  • The Resilient Distributed Dataset (RDD)
  • Ways to Create an RDD
  • Supported Data Types
  • RDD Operations
  • RDDs are Immutable
  • Spark Actions
  • RDD Transformations
  • Other RDD Operations
  • Chaining RDD Operations
  • RDD Lineage
  • The Big Picture
  • What May Go Wrong
  • Checkpointing RDDs
  • Local Checkpointing
  • Parallelized Collections
  • More on parallelize() Method
  • The Pair RDD
  • Where do I use Pair RDDs?
  • Example of Creating a Pair RDD with Map
  • Example of Creating a Pair RDD with keyBy
  • Miscellaneous Pair RDD Operations
  • RDD Caching
  • RDD Persistence
  • Summary

Chapter 14. Parallel Data Processing with Spark

  • Running Spark on a Cluster
  • Data Partitioning
  • Data Partitioning Diagram
  • Single Local File System RDD Partitioning
  • Multiple File RDD Partitioning
  • Special Cases for Small-sized Files
  • Parallel Data Processing of Partitions
  • Spark Application, Jobs, and Tasks
  • Stages and Shuffles
  • The “Big Picture”
  • Summary
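
The data-partitioning idea above can be illustrated with a toy sketch in plain Python (not Spark code): split a dataset into partitions, process each independently, then combine the results.

```python
def partition(data, n):
    """Split a dataset into n roughly equal partitions."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_partition(part):
    # Each partition can be processed independently (on a real
    # cluster, by a different worker); here we just sum it.
    return sum(part)

parts = partition(list(range(10)), 3)   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
results = [process_partition(p) for p in parts]
print(parts, results, sum(results))
```

In Spark, this is what happens when an RDD's partitions are processed as parallel tasks and the partial results are combined by an action such as `reduce`.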

Chapter 15. Shared Variables in Spark

  • Shared Variables in Spark
  • Broadcast Variables
  • Creating and Using Broadcast Variables
  • Example of Using Broadcast Variables
  • Problems with Global Variables
  • Example of the Closure Problem
  • Accumulators
  • Creating and Using Accumulators
  • Example of Using Accumulators (Scala Example)
  • Example of Using Accumulators (Python Example)
  • Custom Accumulators
  • Summary

Chapter 16. Introduction to Spark SQL

  • What is Spark SQL?
  • Uniform Data Access with Spark SQL
  • Hive Integration
  • Hive Interface
  • Integration with BI Tools
  • What is a DataFrame?
  • Creating a DataFrame in PySpark
  • Commonly Used DataFrame Methods and Properties in PySpark
  • Grouping and Aggregation in PySpark
  • The “DataFrame to RDD” Bridge in PySpark
  • The SQLContext Object
  • Examples of Spark SQL / DataFrame (PySpark Example)
  • Converting an RDD to a DataFrame Example
  • Example of Reading / Writing a JSON File
  • Using JDBC Sources
  • JDBC Connection Example
  • Performance, Scalability, and Fault-tolerance of Spark SQL
  • Summary

Lab Exercises

Lab 1. Learning the Lab Environment
Lab 2. The Hadoop Distributed File System
Lab 3. Comparing Hive with Spark SQL
Lab 4. The Hive and Beeline Shells
Lab 5. Understanding Tables in Hive
Lab 6. Querying Hive Tables
Lab 7. Working with the Parquet Data Format in Hive
Lab 8. The PySpark Shell
Lab 9. Learning the Databricks Community Cloud Lab Environment
Lab 10. Learning PySpark Shell Environment
Lab 11. Understanding Spark DataFrames
Lab 12. Learning the PySpark DataFrame API
Lab 13. Processing Data in PySpark using the DataFrame API (Project)
Lab 14. Working with Pivot Tables in PySpark (Project)
Lab 15. Data Visualization and EDA in PySpark
Lab 16. Data Visualization and EDA in PySpark (Project)

What is a Data Engineer?

A data engineer conceives, builds and maintains the data infrastructure that holds your enterprise’s advanced analytics capacities together.

A data engineer is responsible for building and maintaining the data architecture of a data science project. Data engineers create and maintain the analytics infrastructure that enables almost every other function in the data world: they develop, construct, maintain, and test architectures such as databases and large-scale processing systems. As part of this, they also build the data set processes used in modeling, mining, acquisition, and verification.

What is the difference between a Data Scientist and a Data Engineer?

It is important to know the distinction between these two roles.

While there is frequent collaboration between data scientists and data engineers, they’re different positions that prioritize different skill sets. Data scientists focus on advanced statistics and mathematical analysis of the data that’s generated and stored, all in the interest of identifying trends and solving business needs or industry questions. But they can’t do their job without a team of data engineers who have advanced programming skills (Java, Scala, Python) and an understanding of distributed systems and data pipelines.

Broadly speaking, a data scientist builds models using a combination of statistics, mathematics, machine learning, and domain knowledge. They code and build these models using the tools, languages, and frameworks that the organization supports.

A data engineer, on the other hand, builds and maintains data structures and architectures for data ingestion, processing, and deployment of large-scale, data-intensive applications. Building a pipeline for data collection and storage, funneling the data to data scientists, and putting models into production are just some of the tasks a data engineer performs.

Data scientists and data engineers need to work together for any large-scale data science project to succeed.

What are the different roles in Data Engineering?

Data Engineer: A data engineer needs knowledge of database tools, languages like Python and Java, and distributed systems like Hadoop, among other things. The role combines many of these tasks into a single position.

Data Architect: A data architect lays down the foundation for data management systems to ingest, integrate and maintain all the data sources. This role requires knowledge of tools like SQL, XML, Hive, Pig, Spark, etc.

Database Administrator: As the name suggests, this role requires extensive knowledge of databases. Responsibilities include ensuring that databases are available to all required users, are maintained properly, and function without hiccups when new features are added.

What are the core Data Engineering skills?

Core data engineering skills include strong programming (Python, Java, Scala), SQL and data warehouse tools such as Hive, an understanding of distributed systems such as Hadoop and Spark, and experience designing, building, and maintaining data pipelines.

What is the future for Data Engineering?

The data engineering field is expected to continue growing rapidly over the next several years, and there’s huge demand for data engineers across industries.

The global Big Data and data engineering services market is expected to grow at a CAGR of 31.3 percent through 2025.

Can I take this Data Engineering Bootcamp training online?

Yes! We know a busy work schedule may keep you from getting to one of our classrooms, which is why we offer convenient Data Engineer Bootcamp training online, wherever you are. We offer our Data Engineering courses as public Data Engineer Bootcamp training classes or as dedicated Data Engineer Bootcamp training. Ask us about taking a Data Engineer Bootcamp training course online!

Click here to see our Guaranteed to Run Virtual Online Class Schedules

Data Engineering with Python



In this Data Engineer Training video, we’ll review the core capabilities of Python that enable developers to solve a variety of data engineering problems.

We’ll also review NumPy and pandas libraries, with a focus on such topics as the need for understanding your data, selecting the right data types, improving performance of your applications, common data repairing techniques, and so on.
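
A taste of the data repairing and data-type selection techniques mentioned above, sketched with pandas (the column names and values here are invented for the demo):

```python
import pandas as pd

# A small frame with common quality problems: a missing value and
# numeric data stored as strings.
df = pd.DataFrame({
    "user_id": ["101", "102", "103"],   # numbers stored as text
    "score":   [9.5, None, 7.0],        # missing value
})

# Repair: fill the gap with the column mean, then choose a compact,
# correct dtype (int32 uses half the memory of the default int64).
df["score"] = df["score"].fillna(df["score"].mean())
df["user_id"] = df["user_id"].astype("int32")

print(df.dtypes)
print(df)
```

Choosing the right dtypes early (for example, downcasting integers or using categoricals) is one of the simplest ways to improve the performance of pandas-based pipelines.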

Related Course: WA2905

Proven Results in Data Engineer Training

For over 20 years, we have trained thousands of developers at some of the country’s largest tech companies – including many Fortune 500 companies. Here are a few of the clients we have delivered Data Engineering Courses to:

Booz Allen Hamilton, Liberty Mutual, FedEx Ground, Fidelity Investments, and Lockheed Martin.

Here are some reviews from past students who completed our Data Engineering Courses:

“This was a great course. I loved the blend of Python concepts plus just enough data science to be productive.”

“Instructor was very thorough, yet practical. He was a great communicator and explained everything in layman’s terms.”

“Great tutorials! I will go back to these.”

“This course is excellent. It gave me an overview of data science and a good understanding. It put me in the right direction of data analysis in my work.”

PySpark for Data Engineering & Machine Learning

In this data engineering training video, we review the core capabilities of PySpark as well as PySpark’s areas of specialization in data engineering, ETL, and machine learning use cases.


Related Data Engineering courses:

Practical Machine Learning with Apache Spark (WA2845)

Data Engineering with Python Training (WA2905)

Why Choose Web Age Solutions for Data Engineer Bootcamp Training?


Best price in the industry

You won’t find better value in the marketplace. If you do find a lower price, we will beat it.


Various delivery methods

Flexible delivery methods are available depending on your learning style.

Comprehensive resources

Resources are included for a comprehensive learning experience.

We regularly offer Data Engineering courses in these and other cities. Atlanta, Austin, Baltimore, Calgary, Chicago, Cleveland, Dallas, Denver, Detroit, Houston, Jacksonville, Miami, Montreal, New York City, Orlando, Ottawa, Philadelphia, Phoenix, Pittsburgh, Seattle, Toronto, Vancouver, Washington DC.
