Web Age Solutions Inc
Providing Technology Training and Mentoring For Modern Technology Adoption
US Inquiries / 1.877.517.6540
Canadian Inquiries / 1.877.812.8887

Data Engineer Training: Data Engineering Bootcamp

Course #: WA2926

Data engineering is the aspect of data science that focuses on practical applications of data collection and analysis.


A data engineer conceives, builds and maintains the data infrastructure that holds your enterprise’s advanced analytics capacities together.

Learn about the world of data engineering.

This five-day Data Engineering Bootcamp training course is supplemented by hands-on labs that help attendees reinforce their theoretical knowledge of the material.

Special Offer!
July 27-31 – Guaranteed to Run!

Retail: $2,995
Now: $1,000


Delivery Methods

  • Live Virtual Training
  • Onsite Training
  • Classroom Training

Overview of Data Engineering Bootcamp Training

Delivery Focus:

Data Engineering and Data Manipulation are heavily weighted, with some coverage of Data Science.

Audience

This Data Engineer training is targeted at data engineers.

Duration

Five days.

Outline of Data Engineering Bootcamp Training

Chapter 1. Introduction – The Big Data Landscape & Key Components

  • The Big Data Ecosystem
  • YARN, Spark, Spark Streaming, Kafka
  • Containers: Docker/Kubernetes
  • Monitoring and Logging: Prometheus

Chapter 2. Data Engineering Defined

  • Data is King
  • Translating Data into Business Insights
  • What is Data Engineering
  • The Data-Related Roles
  • The Data Science Skill Sets
  • The Data Engineer Role
  • An Example of a Data Product
  • Data Schema for Data Exchange Interoperability
  • The Data Exchange Interoperability Options
  • Big Data and NoSQL
  • Data Physics
  • The Traditional Client-Server Processing Pattern
  • Data Locality (Distributed Computing Economics)
  • The CAP Theorem
  • Mechanisms to Guarantee a Single CAP Property
  • The CAP Triangle
  • Eventual Consistency

Chapter 3. Data Processing Phases

  • Typical Data Processing Pipeline
  • Data Discovery Phase
  • Data Harvesting Phase
  • Data Priming Phase
  • Data Logistics and Data Governance
  • Exploratory Data Analysis
  • Model Planning Phase
  • Model Building Phase
  • Communicating the Results
  • Production Roll-out

Chapter 4. Core Data Engineering Tasks

  • Data acquisition in Python
  • Database and Web interfaces
  • Ensuring data quality
  • Repairing and normalizing data
  • Descriptive statistics computing features in Python
  • Processing data at scale
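
To give a flavor of these tasks, here is a minimal sketch of repairing and summarizing a small, hypothetical dataset with pandas (the column names and the -1 sentinel value are made up for illustration):

  import pandas as pd
  import numpy as np

  # Hypothetical raw data with quality problems: a missing age and a -1 sentinel salary
  df = pd.DataFrame({
      "name": ["Ann", "Bob", "Cai"],
      "age": [34, None, 29],
      "salary": [72000, 68000, -1],
  })

  df["age"] = df["age"].fillna(df["age"].median())   # repair missing values
  df["salary"] = df["salary"].replace(-1, np.nan)    # normalize the bad sentinel to NaN
  print(df.describe())                               # descriptive statistics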

Chapter 5. Functional Programming Primer

  • What is Functional Programming
  • Benefits of Functional Programming
  • Functions as Data
  • Using Map Function
  • Using Filter Function
  • Lambda expressions
  • list.sort() Using a Lambda Expression
  • Difference Between Simple Loops and map/filter Type Functions
  • Additional Functions
  • Summary
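
As a taste of the chapter's map/filter/lambda material, a minimal, self-contained Python sketch (the sample data is made up):

  # Functions as data: map and filter take a function as their first argument
  nums = [3, 1, 4, 1, 5, 9, 2, 6]

  squares = list(map(lambda x: x * x, nums))        # apply a function to every element
  evens = list(filter(lambda x: x % 2 == 0, nums))  # keep only the matching elements

  # list.sort() with a lambda expression as the sort key
  words = ["kafka", "spark", "yarn"]
  words.sort(key=lambda w: len(w))

  print(squares, evens, words)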

Chapter 6. Introduction to PySpark

  • What is Apache Spark
  • Spark use cases
  • Architectural overview

Chapter 7. PySpark Shell

  • What is the PySpark Shell
  • Starting and using the shell
  • Spark context
  • PySpark Shell vs Spark Shell
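
A typical shell session might look like the transcript below (assuming Spark is installed locally and pyspark is on your PATH; the exact banner and output vary by version):

  $ pyspark                  # launch the PySpark shell
  >>> sc                     # the shell pre-creates a SparkContext named sc
  <SparkContext master=local[*] appName=PySparkShell>
  >>> sc.parallelize(range(10)).sum()
  45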

Chapter 8. Resilient Distributed Dataset

  • What are Resilient Distributed Datasets (RDDs)
  • Creating RDDs
  • Transformations and operations
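
For illustration, a minimal PySpark sketch of creating an RDD and applying a lazy transformation followed by an action (the app name is arbitrary):

  from pyspark import SparkContext

  sc = SparkContext("local[*]", "rdd-demo")

  rdd = sc.parallelize([1, 2, 3, 4, 5])   # create an RDD from a local collection
  doubled = rdd.map(lambda x: x * 2)      # transformation: lazy, nothing runs yet
  print(doubled.collect())                # action: triggers computation -> [2, 4, 6, 8, 10]
  sc.stop()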

Chapter 9. Parallel Processing

  • Spark cluster
  • Data partitioning
  • Applications, jobs and tasks
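
A small sketch of how partitioning surfaces in the PySpark API (local mode stands in for a real cluster here):

  from pyspark import SparkContext

  sc = SparkContext("local[4]", "partition-demo")   # 4 local worker threads

  rdd = sc.parallelize(range(100), numSlices=8)     # request 8 partitions
  print(rdd.getNumPartitions())                     # 8
  print(rdd.coalesce(2).getNumPartitions())         # shrink to 2 without a full shuffle
  sc.stop()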

Chapter 10. Shared Variables

  • What are shared variables
  • Broadcast variables
  • Accumulators
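
A minimal sketch of both kinds of shared variable (the lookup table and keys are invented for illustration):

  from pyspark import SparkContext

  sc = SparkContext("local[*]", "shared-vars-demo")

  lookup = sc.broadcast({"a": 1, "b": 2})   # read-only copy shipped to each executor
  errors = sc.accumulator(0)                # counter that tasks can only add to

  def score(key):
      if key not in lookup.value:
          errors.add(1)
          return 0
      return lookup.value[key]

  print(sc.parallelize(["a", "b", "x"]).map(score).collect())   # [1, 2, 0]
  print(errors.value)                                           # 1, read back on the driver
  sc.stop()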

Chapter 11. Spark SQL

  • What is Spark SQL
  • Uniform data access
  • Hive
  • SQL Context object
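
For example, a minimal Spark SQL session (SparkSession is the modern entry point that wraps the older SQLContext; the sample rows are made up):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("sql-demo").getOrCreate()

  df = spark.createDataFrame(
      [("kafka", 2011), ("spark", 2014)], ["project", "year"])
  df.createOrReplaceTempView("projects")   # register the DataFrame for SQL queries

  spark.sql("SELECT project FROM projects WHERE year > 2012").show()
  spark.stop()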

Chapter 12. The Spark Machine Learning Library

  • What is MLlib?
  • Supported Languages
  • MLlib Packages
  • Dense and Sparse Vectors
  • Labeled Point
  • Python Example of Using the LabeledPoint Class
  • LIBSVM format
  • An Example of a LIBSVM File
  • Loading LIBSVM Files
  • Local Matrices
  • Example of Creating Matrices in MLlib
  • Distributed Matrices
  • Example of Using a Distributed Matrix
  • Classification and Regression Algorithms
  • Clustering
  • Summary
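
As a preview of the vector and labeled-point material, a short sketch using the RDD-based MLlib API (the feature values are arbitrary):

  from pyspark.mllib.linalg import Vectors
  from pyspark.mllib.regression import LabeledPoint

  dense = Vectors.dense([1.0, 0.0, 3.0])           # stores every value
  sparse = Vectors.sparse(3, [0, 2], [1.0, 3.0])   # size, non-zero indices, values

  # A labeled point pairs a label with a feature vector
  pos = LabeledPoint(1.0, dense)
  neg = LabeledPoint(0.0, sparse)
  print(pos.label, pos.features, neg.features)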

Chapter 13. Streaming – Kafka and Spark

  • Installing Apache Kafka
  • Configuration Files
  • Starting Kafka
  • Using Kafka Command Line Client Tools
  • Setting up a Multi-Broker Cluster
  • Using Multi-Broker Cluster
  • Kafka Connect
  • Kafka Connect Configuration Files
  • Using Kafka Connect to Import/Export Data
  • Building Data Pipelines
  • Considerations When Building Data Pipelines
  • Timeliness
  • Reliability
  • High and Varying Throughput
  • Data Formats
  • Transformations
  • Security
  • Failure Handling
  • Coupling and Agility
  • Ad-hoc Pipelines
  • Loss of Metadata
  • Extreme Processing
  • Kafka Connect Versus Producer and Consumer
  • Spark Streaming Features
  • How It Works
  • Basic Data Stream Sources
  • Advanced Data Stream Sources
  • The DStream Object
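
To illustrate the DStream model, here is a minimal Spark Streaming word count over a basic socket source (Kafka sources require the separate spark-streaming-kafka connector package; the host and port are placeholders you might feed with `nc -lk 9999`):

  from pyspark import SparkContext
  from pyspark.streaming import StreamingContext

  sc = SparkContext("local[2]", "dstream-demo")   # at least 2 threads: receiver + processing
  ssc = StreamingContext(sc, batchDuration=5)     # 5-second micro-batches

  lines = ssc.socketTextStream("localhost", 9999)   # a basic data stream source
  counts = (lines.flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))
  counts.pprint()                                   # print each batch's counts

  ssc.start()
  ssc.awaitTermination()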

Chapter 14. Infrastructure Optimization: What considerations are involved in getting the best performance from clusters and infrastructure?

  • Monitoring Distributed Systems: Retrieving performance statistics from cluster members, aggregating output, consolidating application logs.
  • Operations Strategy: What approaches can be used to find errors and bugs in distributed applications, and devise solutions for them?
  • Case Study/Demonstration
  • Lab: Explore log aggregation in Splunk

Chapter 15. Making Big Data Secure

  • What is required to secure Big Data infrastructure?
  • How can centralized security management software, such as Kerberos and LDAP, be configured as part of a broader security architecture?
  • What special considerations are there for applications and users who need to access protected resources?
  • How are permissions and roles managed so that Big Data processing resources, such as Spark applications running on top of YARN or Kubernetes, are able to access data stored within HDFS or in an object storage like Amazon S3?
  • Lab: Configuring Secure Access to Big Data Resources

Chapter 16. How does DevOps work in a data context?

  • Infrastructure: Version Control (Git, GitHub), Automation (Jenkins), Processing (Spark, Hadoop, YARN), Data Management (Kafka)
  • Process Differences: DataOps is more than DevOps and data
  • Lifecycle and Differences
  • Incorporating Complex Data Infrastructure into Continuous Integration/Deployment
  • Standardization of runtime environment using containers
  • Accounting for Infrastructure Differences within IaC configuration
  • Incorporating orchestration to handle supporting component deployment and management
  • Statistical Process Control (SPC) to ensure pipeline and model repeatability
  • Case Study/Demonstration

Chapter 17. Tooling

  • GitHub: Source Control
  • Docker and Jenkins: Continuous Integration
  • Spinnaker: Continuous Deployment
  • Lab: Continuous Integration of a Kafka Based Application Using Jenkins

What is a Data Engineer?

A data engineer conceives, builds and maintains the data infrastructure that holds your enterprise’s advanced analytics capacities together.

A data engineer is responsible for building and maintaining the data architecture of a data science project. Data engineers create and maintain the analytics infrastructure that enables almost every other function in the data world: they develop, construct, maintain, and test architectures such as databases and large-scale processing systems. As part of this, they also create the data set processes used in modeling, mining, acquisition, and verification.

What is the difference between a Data Scientist and a Data Engineer?

It is important to know the distinction between these two roles.

While there is frequent collaboration between data scientists and data engineers, they’re different positions that prioritize different skill sets. Data scientists focus on advanced statistics and mathematical analysis of the data that’s generated and stored, all in the interest of identifying trends and solving business needs or industry questions. But they can’t do their job without a team of data engineers who have advanced programming skills (Java, Scala, Python) and an understanding of distributed systems and data pipelines.

Broadly speaking, a data scientist builds models using a combination of statistics, mathematics, machine learning, and domain knowledge, and has to code and build these models with the tools, languages, and frameworks that the organization supports.

A data engineer, on the other hand, has to build and maintain data structures and architectures for data ingestion, processing, and deployment of large-scale, data-intensive applications. Building a pipeline for data collection and storage, funneling the data to data scientists, and putting models into production are just some of the tasks a data engineer has to perform.

Data scientists and data engineers need to work together for any large-scale data science project to succeed.

What are the different roles in Data Engineering?

Data Engineer: A data engineer needs knowledge of database tools, languages like Python and Java, and distributed systems like Hadoop, among other things. The role combines several of these skill sets into a single position.

Data Architect: A data architect lays down the foundation for data management systems to ingest, integrate and maintain all the data sources. This role requires knowledge of tools like SQL, XML, Hive, Pig, Spark, etc.

Database Administrator: As the name suggests, a person working in this role requires extensive knowledge of databases. Responsibilities entail ensuring the databases are available to all required users, are maintained properly, and function without any hiccups when new features are added.

What are the core Data Engineering skills?

Core data engineering skills include programming in languages such as Python, Java, and Scala; working with databases and SQL; and building and maintaining data pipelines on distributed systems such as Hadoop, Spark, and Kafka.

What is the future for Data Engineering?

The data engineering field is expected to continue growing rapidly over the next several years, and there’s huge demand for data engineers across industries.

The global Big Data and data engineering services market is expected to grow at a compound annual growth rate (CAGR) of 31.3 percent through 2025.

Can I take this Data Engineering Bootcamp training online?

Yes! We know your busy work schedule may prevent you from getting to one of our classrooms, which is why we offer convenient Data Engineer training online to meet your needs wherever you are. We offer our Data Engineering courses as public Data Engineer training classes or dedicated Data Engineer training. Ask us about taking a Data Engineer training online course!

Click here to see our Guaranteed to Run Virtual Online Class Schedules

View related courses:
Data Engineer Training Courses.

Data Engineering with Python


In this Data Engineer Training video, we’ll review the core capabilities of Python that enable developers to solve a variety of data engineering problems.

We’ll also review the NumPy and pandas libraries, with a focus on topics such as understanding your data, selecting the right data types, improving the performance of your applications, and common data repair techniques.
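
As one concrete example of the data-type topic, here is a small sketch of right-sizing a pandas dtype to cut memory use (the data is random and purely illustrative):

  import pandas as pd
  import numpy as np

  s = pd.Series(np.random.randint(0, 100, size=1_000_000))  # typically int64 by default

  print(s.memory_usage(deep=True))   # about 8 MB at 8 bytes per value
  s8 = s.astype(np.int8)             # values 0-99 fit in a single byte
  print(s8.memory_usage(deep=True))  # about 1 MB after right-sizing the dtype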

Related Course: WA2905

Proven Results

For over 20 years, we have trained thousands of developers at some of the country’s largest tech companies – including many Fortune 500 companies. Here are a few of the clients we have delivered Data Engineering Courses to:

Booz Allen Hamilton, Liberty Mutual, FedEx Ground, Fidelity Investments, and Lockheed Martin.

Here are some reviews from past students who completed our Data Engineering Courses:

“This was a great course. I loved the blend of Python Concepts Plus Just enough Data science to be productive”

“Instructor was very thorough, yet practical. He was a great communicator and explained everything in layman’s terms.”

“Great tutorials! I will go back to these”

“This course is excellent. It gave me an overview of data science and a good understanding. It put me in the right direction of data analysis in my work.”

PySpark for Data Engineering & Machine Learning

In this data engineering training video, we'll review the core capabilities of PySpark as well as PySpark's areas of specialization in data engineering, ETL, and machine learning use cases.


Related courses:

Practical Machine Learning with Apache Spark (WA2845)

Data Engineering with Python Training (WA2905)

Why Choose Web Age Solutions?


Best price in the industry

You won’t find better value in the marketplace. If you do find a lower price, we will beat it.


Various delivery methods

Flexible delivery methods are available depending on your learning style.


Resources

Resources are included for a comprehensive learning experience.

We regularly offer classes in these and other cities: Atlanta, Austin, Baltimore, Calgary, Chicago, Cleveland, Dallas, Denver, Detroit, Houston, Jacksonville, Miami, Montreal, New York City, Orlando, Ottawa, Philadelphia, Phoenix, Pittsburgh, Seattle, Toronto, Vancouver, and Washington DC.

