Providing Technology Training and Mentoring for Modern Technology Adoption
US Inquiries / 1.877.517.6540
Canadian Inquiries / 1.877.812.8887
Course #: WA2746

Hadoop Fundamentals with ETL Use Cases Training

This training course introduces students to Apache Hadoop and key Hadoop ecosystem projects. Through lectures and hands-on labs, students learn the underlying theory and gain practical experience with Apache Hadoop and related Apache projects.

Topics

  • Hadoop Ecosystem Overview
  • MapReduce
  • Apache Sqoop
  • Functional Programming with Python
  • Spark
  • Spark SQL

Audience

Developers, architects, team leads, data analysts, and data scientists.

Prerequisites

Participants should have general knowledge of programming in Java and SQL, as well as experience working in Unix environments (e.g., running shell commands).

Duration

Three days.

Outline of Hadoop Fundamentals with ETL Use Cases Training

Chapter 1. MapReduce Overview

  •  The Client-Server Processing Pattern
  •  Distributed Computing Challenges
  •  MapReduce Defined
  •  Google's MapReduce
  •  The Map Phase of MapReduce
  •  The Reduce Phase of MapReduce
  •  MapReduce Explained
  •  MapReduce Word Count Job
  •  MapReduce Shared-Nothing Architecture
  •  Similarity with SQL Aggregation Operations
  •  Example of Map & Reduce Operations using JavaScript
  •  Problems Suitable for Solving with MapReduce
  •  Typical MapReduce Jobs
  •  Fault-tolerance of MapReduce
  •  Distributed Computing Economics
  •  MapReduce Systems
  •  Summary

 

Chapter 2. Hadoop Overview

  •  Apache Hadoop
  •  Apache Hadoop Logo
  •  Typical Hadoop Applications
  •  Hadoop Clusters
  •  Hadoop Design Principles
  •  Hadoop Versions
  •  Hadoop's Main Components
  •  Hadoop Simple Definition
  •  Side-by-Side Comparison: Hadoop 1 and Hadoop 2
  •  Hadoop-based Systems for Data Analysis
  •  Other Hadoop Ecosystem Projects
  •  Hadoop Caveats
  •  Hadoop Distributions
  •  Cloudera Distribution of Hadoop (CDH)
  •  Cloudera Distributions
  •  Hortonworks Data Platform (HDP)
  •  MapR
  •  Summary

 

Chapter 3. Hadoop Distributed File System Overview

  •  Hadoop Distributed File System (HDFS)
  •  HDFS High Availability
  •  HDFS "Fine Print"
  •  Storing Raw Data in HDFS
  •  Hadoop Security
  •  HDFS Rack-awareness
  •  Data Blocks
  •  Data Block Replication Example
  •  HDFS NameNode Directory Diagram
  •  Accessing HDFS
  •  Examples of HDFS Commands
  •  Other Supported File Systems
  •  WebHDFS
  •  Examples of WebHDFS Calls
  •  Client Interactions with HDFS for the Read Operation
  •  Read Operation Sequence Diagram
  •  Client Interactions with HDFS for the Write Operation
  •  Communication inside HDFS
  •  Summary

 

Chapter 4. MapReduce with Hadoop

  •  Hadoop's MapReduce
  •  MapReduce 1 and MapReduce 2
  •  Why Do I Need a Discussion of the Old MapReduce?
  •  MapReduce v1 ("Classic MapReduce")
  •  JobTracker and TaskTracker (the "Classic MapReduce")
  •  YARN (MapReduce v2)
  •  YARN vs MR1
  •  YARN As Data Operating System
  •  MapReduce Programming Options
  •  Java MapReduce API
  •  The Structure of a Java MapReduce Program
  •  The Mapper Class
  •  The Reducer Class
  •  The Driver Class
  •  Compiling Classes
  •  Running the MapReduce Job
  •  The Structure of a Single MapReduce Program
  •  Combiner Pass (Optional)
  •  Hadoop's Streaming MapReduce
  •  Python Word Count Mapper Program Example
  •  Python Word Count Reducer Program Example
  •  Setting up Java Classpath for Streaming Support
  •  Streaming Use Cases
  •  The Streaming API vs Java MapReduce API
  •  Amazon Elastic MapReduce
  •  Apache Tez
  •  Summary

 

Chapter 5. Apache Sqoop

  •  What is Sqoop?
  •  Apache Sqoop Logo
  •  Sqoop Import / Export
  •  Sqoop Help
  •  Examples of Using Sqoop Commands
  •  Data Import Example
  •  Fine-tuning Data Import
  •  Controlling the Number of Import Processes
  •  Data Splitting
  •  Helping Sqoop Out
  •  Example of Executing Sqoop Load in Parallel
  •  A Word of Caution: Avoid Complex Free-Form Queries
  •  Using Direct Export from Databases
  •  Example of Using Direct Export from MySQL
  •  More on Direct Mode Import
  •  Changing Data Types
  •  Example of Default Types Overriding
  •  File Formats
  •  The Apache Avro Serialization System
  •  Binary vs Text
  •  More on the SequenceFile Binary Format
  •  Generating the Java Table Record Source Code
  •  Data Export from HDFS
  •  Export Tool Common Arguments
  •  Data Export Control Arguments
  •  Data Export Example
  •  Using a Staging Table
  •  INSERT and UPDATE Statements
  •  INSERT Operations
  •  UPDATE Operations
  •  Example of the Update Operation
  •  Failed Exports
  •  Sqoop2
  •  Sqoop2 Architecture
  •  Summary

 

Chapter 6. Introduction to Functional Programming with Python

  •  What is Functional Programming (FP)?
  •  Terminology: First-Class and Higher-Order Functions
  •  Terminology: Lambda vs Closure
  •  A Short List of Languages that Support FP
  •  FP with Java
  •  FP with JavaScript
  •  Imperative Programming in JavaScript
  •  The JavaScript map (FP) Example
  •  The JavaScript reduce (FP) Example
  •  Using reduce to Flatten an Array of Arrays (FP) Example
  •  The JavaScript filter (FP) Example
  •  Common Higher-Order Functions in Python
  •  Common Higher-Order Functions in Scala
  •  Elements of FP in R
  •  Summary

 

Chapter 7. Introduction to Apache Spark

  •  What is Spark?
  •  A Short History of Spark
  •  Where to Get Spark?
  •  The Spark Platform
  •  Spark Logo
  •  Common Spark Use Cases
  •  Languages Supported by Spark
  •  Running Spark on a Cluster
  •  The Driver Process
  •  Spark Applications
  •  Spark Shell
  •  The spark-submit Tool
  •  The spark-submit Tool Configuration
  •  The Executor and Worker Processes
  •  The Spark Application Architecture
  •  Interfaces with Data Storage Systems
  •  Limitations of Hadoop's MapReduce
  •  Spark vs MapReduce
  •  Spark as an Alternative to Apache Tez
  •  The Resilient Distributed Dataset (RDD)
  •  Spark Streaming (Micro-batching)
  •  Spark SQL
  •  Example of Spark SQL
  •  Spark Machine Learning Library
  •  GraphX
  •  Spark vs R
  •  Summary

 

Chapter 8. The Spark Shell

  •  The Spark Shell
  •  The Spark Shell UI
  •  Spark Shell Options
  •  Getting Help
  •  The Spark Context (sc) and SQL Context (sqlContext)
  •  The Shell Spark Context
  •  Loading Files
  •  Saving Files
  •  Basic Spark ETL Operations
  •  Summary

 

Chapter 9. Spark RDDs

  •  The Resilient Distributed Dataset (RDD)
  •  Ways to Create an RDD
  •  Custom RDDs
  •  Supported Data Types
  •  RDD Operations
  •  RDDs are Immutable
  •  Spark Actions
  •  RDD Transformations
  •  Other RDD Operations
  •  Chaining RDD Operations
  •  RDD Lineage
  •  The Big Picture
  •  What May Go Wrong
  •  Checkpointing RDDs
  •  Local Checkpointing
  •  Parallelized Collections
  •  More on the parallelize() Method
  •  The Pair RDD
  •  Where do I use Pair RDDs?
  •  Example of Creating a Pair RDD with Map
  •  Example of Creating a Pair RDD with keyBy
  •  Miscellaneous Pair RDD Operations
  •  RDD Caching
  •  RDD Persistence
  •  The Tachyon Storage
  •  Summary

 

Chapter 10. Parallel Data Processing with Spark

  •  Running Spark on a Cluster
  •  Spark Stand-alone Option
  •  The High-Level Execution Flow in Stand-alone Spark Cluster
  •  Data Partitioning
  •  Data Partitioning Diagram
  •  Single Local File System RDD Partitioning
  •  Multiple File RDD Partitioning
  •  Special Cases for Small-sized Files
  •  Parallel Data Processing of Partitions
  •  Spark Application, Jobs, and Tasks
  •  Stages and Shuffles
  •  The "Big Picture"
  •  Summary

 

Chapter 11. Shared Variables in Spark

  •  Shared Variables in Spark
  •  Broadcast Variables
  •  Creating and Using Broadcast Variables
  •  Example of Using Broadcast Variables
  •  Accumulators
  •  Creating and Using Accumulators
  •  Example of Using Accumulators
  •  Custom Accumulators
  •  Summary

 

Chapter 12. Introduction to Spark SQL

  •  What is Spark SQL?
  •  Uniform Data Access with Spark SQL
  •  Hive Integration
  •  Hive Interface
  •  Integration with BI Tools
  •  Spark SQL is No Longer an Experimental Developer API!
  •  What is a DataFrame?
  •  The SQLContext Object
  •  The SQLContext API
  •  Changes Between Spark SQL 1.3 and 1.4
  •  Example of Spark SQL (Scala Example)
  •  Example of Working with a JSON File
  •  Example of Working with a Parquet File
  •  Using JDBC Sources
  •  JDBC Connection Example
  •  Performance & Scalability of Spark SQL
  •  Summary



Lab Exercises

  • Lab 1. Learning the Lab Environment
  • Lab 2. The Hadoop Distributed File System
  • Lab 3. Hadoop Streaming MapReduce
  • Lab 4. Programming Java MapReduce Jobs on Hadoop
  • Lab 5. Data Import and Export with Sqoop
  • Lab 6. The Spark Shell
  • Lab 7. Spark ETL and HDFS Interface
  • Lab 8. Common Map / Reduce Programs in Spark
  • Lab 9. Using Broadcast Variables
  • Lab 10. Using Accumulators
  • Lab 11. Spark SQL



We regularly offer classes in these and other cities: Atlanta, Austin, Baltimore, Calgary, Chicago, Cleveland, Dallas, Denver, Detroit, Houston, Jacksonville, Miami, Montreal, New York City, Orlando, Ottawa, Philadelphia, Phoenix, Pittsburgh, Seattle, Toronto, Vancouver, and Washington DC.