WA2867

Hive Programming Training

This beginner-to-advanced training course on Apache Hive combines lectures with hands-on labs, giving students both the theoretical background and practical experience of working on Hive projects.

Course Details

Duration

2 days

Prerequisites

  • General knowledge of programming and SQL
  • Experience working in Unix environments (e.g., running shell commands)
  • Familiarity with HDFS

Target Audience

  • Developers
  • Architects
  • Team Leads
  • Data Analysts
  • Data Scientists

Course Outline

  • Apache Hive
    • Traditional RDBMS Capabilities and TCO
    • What is Hive?
    • Apache Hive Logo
    • Hive's Value Proposition
    • Who uses Hive?
    • What Hive Does Not Have
    • Hive's Main Sub-Systems
    • Hive Features
    • The "Classic" Hive Architecture
    • The New Hive Architecture (HiveServer2)
    • Multi-Client Concurrency in HiveServer2
    • Components
    • Where are the Hive Tables Located?
    • Data Organization in Hive
    • Hive Tables
    • Managed and External Tables
    • Partitions
    • Buckets
    • Buckets and Partitions
    • Buckets Visually
    • Partitions Visually
    • HiveQL
    • The "Classic" Hive Command-line Interface (CLI)
    • The Beeline Command Shell
  • Hive Command-line Interface
    • Hive Command-line Interface (CLI)
    • The Hive Interactive Shell
    • Running Host OS Commands from the Hive Shell
    • Interfacing with HDFS from the Hive Shell
    • Hive in Unattended Mode
    • The Hive CLI Integration with the OS Shell
    • Executing HiveQL Scripts
    • Comments in Hive Scripts
    • Variables and Properties in Hive CLI
    • Setting Properties in CLI
    • Passing Arguments to Hive Script
    • Hive Namespaces
    • Using the SET Command
    • Setting Properties in the Shell
    • Setting Properties for the New Shell Session
    • Setting Alternative Hive Execution Engines
    • The Beeline Shell
    • Connecting to the Hive Server in Beeline
    • Beeline Command Switches
    • Beeline Internal Commands
  • Hive Data Definition Language
    • Hive Data Definition Language
    • Creating Databases in Hive
    • Using Databases
    • Creating Tables in Hive
    • Supported Data Type Categories
    • Common Primitive Types
    • String and Date / Time Types
    • Complex Types
    • Miscellaneous Types
    • Example of CREATE TABLE Statement
    • Working with Complex Types
    • Table Partitioning
    • Partition Benefits
    • Table Partitioning on Multiple Columns
    • Viewing Table Partitions
    • Bucketed Table DDL
    • Loading Data into Bucketed Table
    • File Format Storage
    • ORC, Parquet, and Avro Binary Data Formats Compared
    • Data Serializers / Deserializers
    • Row Format
    • Visualizing Row Format
    • Row Format with the SerDe Definition
    • A RegexSerDe Example
    • The ORC Data Format
    • Converting Text to ORC Data Format
    • The Parquet Data Storage Format
    • File Compression
    • The EXTERNAL DDL Parameter
    • Features Comparison
    • What Type is my Table?
    • Temporary Tables
    • Creating an Empty Table
    • Dropping a Table
    • Table / Partition(s) Truncation
    • Alter Table/Partition/Column
    • Views
    • Create View Statement
    • Why Use Views?
    • Restricting Amount of Viewable Data
    • Examples of Restricting Amount of Viewable Data
    • Hive Indexing
    • Describing Data
  • HiveQL
    • What is HiveQL?
    • HiveQL Main Features
    • Alternative Execution Engines
    • Data Validation
    • Hive Data Manipulation Language (DML)
    • Using the LOAD DATA statement
    • Loading Data with the INSERT Statement
    • Appending and Replacing Data with the INSERT Statement
    • Multi-Table Inserts
    • Multi-Table Inserts Syntax
    • Multi-Table Inserts Example
    • INSERT … DIRECTORY
    • The Skewed Tables Concept
    • A Skewed Tables Example
    • Controlling the Number of Reducers
    • Computing Table Statistics
    • ANALYZE TABLE Command
    • DESCRIBE Command Variants
  • Hive Select Statement and Built-In Functions
    • The SELECT Statement Syntax
    • The WHERE Clause
    • Examples of the WHERE Clause
    • Partition-Based Queries
    • Create Table As Select Operation
    • Supported Numeric Operators
    • Built-in Mathematical Functions
    • Built-in Aggregate Functions
    • Built-in Statistical Functions
    • Other Useful Built-in Functions
    • The GROUP BY Clause
    • The HAVING Clause
    • The LIMIT Clause
    • The ORDER BY Clause
    • The JOIN Clause
    • Types of Joins
    • The Shuffle Join Visually
    • Map (Broadcast) Join Visually
    • Setting Up the Map Side (Broadcast) Join
    • Sort-Merge-Bucket Join Visually
    • The CASE … Clause
    • Re-Writing SELECT Statements
    • The TRANSFORM Clause
    • Performance Enhancements with Vectorization + ORC
  • Apache HUE
    • What is Apache HUE?
    • HUE Login Page
    • HUE Web UI at a Glance
    • Supported Editors and Dashboards
    • Hive / Impala Query Editor
    • Command Auto-completion and Metastore Look-Ups
    • Parameterizing Queries
    • Hue Configuration
  • Lab Exercises
    • Lab 1. Learning the Lab Environment
    • Lab 2. The Hadoop Distributed File System
    • Lab 3. The Hive and Beeline Shells
    • Lab 4. Understanding Tables in Hive
    • Lab 5. Querying Hive Tables
    • Lab 6. Extending Hive with UDFs
    • Lab 7. Partitioned and Skewed Tables in Hive
    • Lab 8. Working with the Parquet Data Format in Hive
    • Lab 9. Working with the Avro Data Format in Hive
    • Lab 10. Working with Regular Expressions in Hive
    • Lab 11. Working with Indexes in Hive (Optional)