WA2867
Hive Programming Training
This intensive, beginner-to-advanced training course on Hive combines lectures with hands-on labs so that students learn the theory and gain practical experience working on Hive projects.
Course Details
Duration
2 days
Prerequisites
- General knowledge of programming and SQL
- Experience working in Unix environments (e.g. running shell commands)
- Familiarity with HDFS
Target Audience
- Developers
- Architects
- Team Leads
- Data Analysts
- Data Scientists
Course Outline
- Apache Hive
- Traditional RDBMS Capabilities and TCO
- What is Hive?
- Apache Hive Logo
- Hive's Value Proposition
- Who uses Hive?
- What Hive Does Not Have
- Hive's Main Sub-Systems
- Hive Features
- The "Classic" Hive Architecture
- The New Hive Architecture (Hive Server 2)
- Multi-Client Concurrency in Hive Server 2
- Components
- Where are the Hive Tables Located?
- Data Organization in Hive
- Hive Tables
- Managed and External Tables
- Partitions
- Buckets
- Buckets and Partitions
- Buckets Visually
- Partitions Visually
- HiveQL
- The "Classic" Hive Command-line Interface (CLI)
- The Beeline Command Shell
- Hive Command-line Interface
- Hive Command-line Interface (CLI)
- The Hive Interactive Shell
- Running Host OS Commands from the Hive Shell
- Interfacing with HDFS from the Hive Shell
- Hive in Unattended Mode
- The Hive CLI Integration with the OS Shell
- Executing HiveQL Scripts
- Comments in Hive Scripts
- Variables and Properties in Hive CLI
- Setting Properties in CLI
- Passing Arguments to Hive Script
- Hive Namespaces
- Using the SET Command
- Setting Properties in the Shell
- Setting Properties for the New Shell Session
- Setting Alternative Hive Execution Engines
- The Beeline Shell
- Connecting to the Hive Server in Beeline
- Beeline Command Switches
- Beeline Internal Commands
- Hive Data Definition Language
- Hive Data Definition Language
- Creating Databases in Hive
- Using Databases
- Creating Tables in Hive
- Supported Data Type Categories
- Common Primitive Types
- String and Date / Time Types
- Complex Types
- Miscellaneous Types
- Example of CREATE TABLE Statement
- Working with Complex Types
- Table Partitioning
- Benefits of Partitions
- Table Partitioning on Multiple Columns
- Viewing Table Partitions
- Bucketed Table DDL
- Loading Data into Bucketed Table
- File Format Storage
- ORC, Parquet, and Avro Binary Data Formats Compared
- Data Serializers / Deserializers
- Row Format
- Visualizing Row Format
- Row Format with the SerDe Definition
- A RegexSerDe Example
- The ORC Data Format
- Converting Text to ORC Data Format
- The Parquet Data Storage Format
- File Compression
- The EXTERNAL DDL Parameter
- Features Comparison
- What Type is my Table?
- Temporary Tables
- Creating an Empty Table
- Dropping a Table
- Table / Partition(s) Truncation
- Alter Table/Partition/Column
- Views
- Create View Statement
- Why Use Views?
- Restricting Amount of Viewable Data
- Examples of Restricting Amount of Viewable Data
- Hive Indexing
- Describing Data
- HiveQL
- What is HiveQL?
- HiveQL Main Features
- Alternative Execution Engines
- Data Validation
- Hive Data Manipulation Language (DML)
- Using the LOAD DATA statement
- Loading Data with the INSERT Statement
- Appending and Replacing Data with the INSERT Statement
- Multi-Table Inserts
- Multi-Table Inserts Syntax
- Multi-Table Inserts Example
- INSERT … DIRECTORY
- The Skewed Tables Concept
- A Skewed Tables Example
- Controlling the Number of Reducers
- Computing Table Statistics
- ANALYZE TABLE Command
- DESCRIBE Command Variants
- Hive Select Statement and Built-In Functions
- The SELECT Statement Syntax
- The WHERE Clause
- Examples of the WHERE Clause
- Partition-Based Queries
- Create Table As Select Operation
- Supported Numeric Operators
- Built-in Mathematical Functions
- Built-in Aggregate Functions
- Built-in Statistical Functions
- Other Useful Built-in Functions
- The GROUP BY Clause
- The HAVING Clause
- The LIMIT Clause
- The ORDER BY Clause
- The JOIN Clause
- Types of Joins
- The Shuffle Join Visually
- Map (Broadcast) Join Visually
- Setting Up the Map Side (Broadcast) Join
- Sort-Merge-Bucket Join Visually
- The CASE … Clause
- Re-Writing SELECT Statements
- The TRANSFORM Clause
- Performance Enhancements with Vectorization + ORC
- Apache HUE
- What is Apache HUE?
- HUE Login Page
- HUE Web UI at a Glance
- Supported Editors and Dashboards
- Hive / Impala Query Editor
- Command Auto-completion and Metastore Look-Ups
- Parameterizing Queries
- HUE Configuration
- Lab Exercises
- Lab 1. Learning the Lab Environment
- Lab 2. The Hadoop Distributed File System
- Lab 3. The Hive and Beeline Shells
- Lab 4. Understanding Tables in Hive
- Lab 5. Querying Hive Tables
- Lab 6. Extending Hive with UDFs
- Lab 7. Partitioned and Skewed Tables in Hive
- Lab 8. Working with the Parquet Data Format in Hive
- Lab 9. Working with the Avro Data Format in Hive
- Lab 10. Working with Regular Expressions in Hive
- Lab 11. Working with Indexes in Hive (Optional)