Course #: TP2750

Spark Deep Dive Using SCALA Training for HDP Developers

This is an in-depth training course in Spark Core, Spark SQL and Spark Streaming using Scala. The course uses Spark 2.0 (or higher) on HDP 2.5 (or higher). Zeppelin notebooks are used for interactive data exploration with Spark, and Eclipse is used for developing batch and micro-batch applications in Spark.

Objectives

This intensive training course combines lectures and hands-on labs that help students learn the theory of the Spark libraries listed above and gain practical experience with them. Hands-on exercises enable participants to work with common data sources such as HDFS, MySQL, HBase and Kafka. During the course, participants also work with a variety of data formats, including CSV, JSON, XML, log files, Avro, Parquet and ORC, using the Spark framework.

Topics

•    Scala for Spark
•    Introduction to Spark
•    Quick recap of HDFS and YARN
•    Spark Overview, Architecture and Concepts
•    Spark Core
•    Spark SQL
•    Spark Streaming

Audience

Developers/Analysts/Architects/Team Leads

Prerequisites

Participants should have knowledge of, and experience working with, a programming language. SQL experience is also required, along with a working knowledge of Linux environments (e.g. running shell commands).

Duration

5 Days

Outline of Spark Deep Dive Using SCALA Training for HDP Developers

CHAPTER 1. INTRODUCTION TO SCALA

•    History of Scala Language
•    What is Scala?
•    Design Goals of Scala
•    Advantages of Functional Programming
•    Scala vs Java
•    Scala and Java
•    Introduction to Eclipse IDE
•    Scala Shell Overview
•    Scala with Zeppelin Notebooks

CHAPTER 2. QUICK RECAP OF HADOOP FOR SPARK

•    Recap of HDFS for Spark
•    Recap of YARN w.r.t. Spark
•    Recap of HBase
•    How to use YARN Commands?
•    Recap of MapReduce Logical Architecture
•    Hands-on Exercise

CHAPTER 3. SCALA BASICS

•    Variables and Constants
•    Key Datatypes in Scala
•    Dealing with Numeric, Boolean and String types
•    Scala Shell Commands
•    Scala Key Built-in Functions
•    Scala Collections
•    Manipulating Tuples, Seq, Map, List etc.
•    Flow Control in Scala
•    Hands-on: Exercises

CHAPTER 4. MODULAR SCALA

•    User Defined Functions
•    Anonymous Functions
•    Classes and Objects
•    Packages
•    Traits
•    Ways to compile Scala Code
•    Compiling and Deploying Scala Code
•    Hands-on: Exercises

CHAPTER 5. DEALING WITH KEY DATA FORMATS IN SCALA

•    Processing CSV data using Scala
•    Dealing with XML files in Scala
•    JSON processing using Scala
•    Regular expressions
•    Processing Semi-structured data
•    Extending Hive using Use
•    Hands-on: Exercises

CHAPTER 6. INTRODUCTION TO SPARK

•    Spark Overview
•    Detailed discussion on “Why Spark”
•    Quick Recap of MapReduce
•    Spark vs MapReduce
•    Why Scala for Spark?

CHAPTER 7. SPARK CORE FRAMEWORK AND API

•    High level Spark Architecture
•    Role of Executor, Driver, SparkContext etc.
•    Resilient Distributed Datasets
•    Basic operations in the Spark Core API, i.e. Actions and Transformations
•    Using the Spark REPL for performing interactive data analysis
•    Hands-on Exercises

CHAPTER 8. DELVING DEEPER INTO SPARK API

•    Pair RDDs
•    Implementing MapReduce Algorithms using Spark
•    Ways to create Pair RDDs
•    JSON Processing
•    Code Example on JSON Processing
•    XML Processing
•    Joins
•    Playing with Regular Expressions
•    Log File Processing using Regular Expressions
•    Hands-on Exercises

CHAPTER 9. EXECUTING A SPARK APPLICATION

•    Writing Standalone Spark Application
•    Building Standalone Scala Spark Application using Maven
•    Commands to execute and configure Spark applications in different modes
•    Discussion on Application, Job, Stage, Executor, Tasks
•    Interpreting RDD Metadata/Lineage/DAG
•    Controlling degree of Parallelism in Spark Job
•    Physical execution of a Spark application
•    Discussion: How is Spark better than MapReduce?
•    Hands-on Exercises

CHAPTER 10. ADVANCED FEATURES OF SPARK

•    Persistence
•    Location
•    Data Format of Persistence
•    Replication
•    Partitioned By
•    Coalesce
•    Accumulators
•    Broadcasting for optimizing performance of Spark jobs
•    Hands-on Exercises

CHAPTER 11. SPARK STREAMING

•    Analyzing streaming data using Spark
•    Stateless Streaming
•    Stateful Streaming
•    Quick introduction to Kafka Architecture
•    Role of Zookeeper, Brokers etc.
•    Hands-on Exercises

CHAPTER 12. SPARK SQL

•    Introduction
•    DataFrame API
•    Performing ad-hoc query analysis using Spark SQL
•    Working with Hive Partitioning
•    Hands-on Exercises

CHAPTER 13. ITERATIVE PROCESSING USING SPARK

•    Introduction to Iterative Processing
•    Quick Introduction to Machine Learning
•    Checkpointing
•    Checkpointing vs Persist
•    Example of Iterative Processing
•    K Means Clustering
•    Hands-on Exercises

CHAPTER 14. DATASET API

•    Introduction to Datasets
•    Why Datasets?
•    Datasets vs DataFrames
•    Using Dataset API
•    Hands-on Exercises

CHAPTER 15. STRUCTURED STREAMING

•    Structured Streaming Overview
•    How is it better than Spark Streaming?
•    Structured Streaming API
•    Hands-on Exercises


LAB EXERCISES

•    Lab 1. Learning the Lab Environment
•    Lab 2. Running application on YARN
•    Lab 3. Multiple Scala Hands-on Exercises on Basics, Collections etc.
•    Lab 4. CSV, JSON and XML Files Manipulation related Scala exercises
•    Lab 5. Working with Spark Shell and Zeppelin Notebook
•    Lab 6. Interactive Data Exploration using Spark
•    Lab 7. Working with Pair RDDs
•    Lab 8. Dealing with XML files in Spark
•    Lab 9. Processing JSON data in Spark  
•    Lab 10. Processing Log file data in Spark  
•    Lab 11. Caching in Spark
•    Lab 12. Using Broadcast Variables
•    Lab 13. Using Accumulators
•    Lab 14. Working with DataFrame API
•    Lab 15. Data processing using Avro and Parquet
•    Lab 16. Integrating Hive with Spark SQL
•    Lab 17. Spark SQL – Multiple exercises
•    Lab 18. Working with Dataset API
•    Lab 19. Spark Streaming: Part 1
•    Lab 20. Spark Streaming: Part 2
•    Lab 21. Integrating Kafka and Spark Streaming
•    Lab 22. Integrating Spark with HBase

We regularly offer classes in these and other cities: Atlanta, Austin, Baltimore, Calgary, Chicago, Cleveland, Dallas, Denver, Detroit, Houston, Jacksonville, Miami, Montreal, New York City, Orlando, Ottawa, Philadelphia, Phoenix, Pittsburgh, Seattle, Toronto, Vancouver, Washington DC.