OBJECTIVES

This intensive training course combines lectures and hands-on labs that help participants build theoretical knowledge of, and practical experience with, the Hortonworks Data Platform (HDP) components listed in the topics below.

TOPICS

•    Holistic View of HDP
•    HDFS Architecture and Physical Structure
•    Sqoop Overview and Architecture
•    Working with Data Formats
•    Intermediate to Advanced Hive
•    In-depth HBase
•    Spark Core
•    Spark SQL
•    Case Study

AUDIENCE

Analysts/Architects/Team Leads

PREREQUISITES

Participants should have general knowledge of programming and SQL, as well as experience working in Unix environments (e.g., running shell commands). Participants should also be familiar with the basics of HDFS and Hive.

DURATION

5 Days

Outline for Big Data Architecture On The Hortonworks Distribution Training

CHAPTER 1. HOLISTIC VIEW (3 HOURS THEORY)

• Hadoop, Spark, Kafka and NoSQL Overview
• Big Data Pipeline – Batch and Streaming
• Common Big Data Architectures such as Lambda and Kappa
• Comparison of Key Technologies – Which One to Choose and When?
• Understanding Data Lake
• Connecting the dots

CHAPTER 2. HDFS (2 HOURS THEORY AND 30 MINUTES HANDS-ON)

• HDFS Overview
• HDFS Architecture
• Data Formats – Sequence File, Avro, ORC, Parquet
• Concepts – Schema Evolution, Container File Formats, etc.
• Compression Formats
• Data Organization
• Hands-on: Working with file formats
• Summary

CHAPTER 3. DISTRIBUTED INGESTION USING SQOOP (3 HOURS THEORY AND 2 HOURS HANDS-ON)

• Apache Sqoop Overview
• MapReduce as a Concept – Logical and Physical Architecture
• Sqoop Capabilities
• Ways to import data through Sqoop
• Sqoop Commands
• Hands-on: Extensive exercises on Sqoop commands
• Summary

CHAPTER 4. HIVE (4 HOURS THEORY AND 3 HOURS HANDS-ON)

• Hands-on: Hive warm-up exercise to bring everyone to the same level
• Use Cases
• Data Formats
• Hive with various Data Formats
• Working with compressed data
• Converting data from one format to another
• Partitioning
• Joining
• Bucketing
• Indexing
• De-Duplication
• Processing Semi-structured data
• Extending Hive using User Defined Functions
• Hive Optimization Parameters/Configurations
• Hands-on: Advanced exercises
• Summary

CHAPTER 5. APACHE HBASE (5 HOURS THEORY AND 3 HOURS HANDS-ON)

• What is HBase?
• Use Cases
• HBase Design
• HBase Features
• HBase High Availability
• The Write-Ahead Log (WAL) and MemStore
• HBase vs RDBMS
• HBase vs Apache Cassandra
• When Not to Use HBase
• Interfacing with HBase
• HBase Table Design
• Column Families
• Cell Value Versioning
• Timestamps
• Accessing Cells
• HBase Table Design Digest
• Table Horizontal Partitioning with Regions
• HBase Compaction
• Loading Data in HBase
• Column Families Notes
• Rowkey Notes
• HBase Shell
• HBase Shell Command Groups
• Creating and Populating a Table in HBase Shell
• Getting a Cell's Value
• Counting Rows in an HBase Table
• HBase Data Modelling
• Integrating Hive with HBase
• Integrating HBase with Spark
• Summary

CHAPTER 6. INTRODUCTION TO APACHE SPARK (2 HOURS THEORY AND 2 HOURS HANDS-ON)

• What is Apache Spark?
• A Short History of Spark
• Where to Get Spark?
• The Spark Platform
• Spark Logo
• Common Spark Use Cases
• Languages Supported by Spark
• Running Spark on a Cluster
• The Driver Process
• Spark Applications
• Spark Shell
• The spark-submit Tool
• The spark-submit Tool Configuration
• The Executor and Worker Processes
• The Spark Application Architecture
• Interfaces with Data Storage Systems
• Limitations of Hadoop's MapReduce
• Spark vs MapReduce
• Spark as an Alternative to Apache Tez
• The Resilient Distributed Dataset (RDD)
• Spark Streaming (Micro-batching)
• Spark SQL
• Example of Spark SQL
• The Spark Shell
• Spark Shell Options
• The Spark Context (sc) and SQL Context (sqlContext)
• The Shell Spark Context
• Loading Files
• Saving Files
• Basic Spark ETL Operations
• Summary

CHAPTER 7. SPARK RDD (3 HOURS THEORY AND 2 HOURS HANDS-ON)

• The Resilient Distributed Dataset (RDD)
• Ways to Create an RDD
• Custom RDDs
• Supported Data Types
• RDD Operations
• RDDs are Immutable
• Spark Actions
• RDD Transformations
• Other RDD Operations
• Chaining RDD Operations
• RDD Lineage
• The Big Picture
• What May Go Wrong
• Checkpointing RDDs
• Local Checkpointing
• Parallelized Collections
• More on parallelize() Method
• The Pair RDD
• Where do I use Pair RDDs?
• Example of Creating a Pair RDD with Map
• Example of Creating a Pair RDD with keyBy
• Miscellaneous Pair RDD Operations
• RDD Caching
• RDD Persistence
• The Tachyon (now Alluxio) Storage
• Summary

CHAPTER 8. SPARK SQL (1 HOUR THEORY AND 4 HOURS HANDS-ON)

• What is Spark SQL?
• Uniform Data Access with Spark SQL
• Hive Integration
• Hive Interface
• Integration with BI Tools
• Spark SQL is No Longer an Experimental Developer API!
• What is a DataFrame?
• The SQLContext Object
• The SQLContext API
• Example of Spark SQL
• Example of Working with a JSON File
• Example of Working with a Parquet File
• Using JDBC Sources
• JDBC Connection Example
• Performance & Scalability of Spark SQL
• Summary

CHAPTER 9. CASE STUDY DISCUSSION (3 HOURS)

• Migrating an RDBMS Data Model to a Hadoop Data Model Using the Movies and Ratings Dataset