Objectives
Upon completion of the AWS Advanced Analytics for Structured Data course, participants will be able to:
- Navigate the AWS Console for key areas discussed in this class
- Utilize AWS for data processing and data management
- Describe patterns for handling structured data with AWS services
- Understand how to use Amazon Elastic MapReduce (EMR)
- Understand the facilities provided by Amazon Elastic MapReduce (EMR)
- Identify the facilities provided by Apache Airflow for workflow
- Outline the facilities provided by Glue (Data Catalog)
- Describe the facilities provided by Aurora MySQL
- Define the facilities provided by S3 – Simple Storage Service
- Understand the facilities provided by Informatica Cloud (ICS)
- Identify the features and functions of AWS Lambda
- Describe the features of Hive, HiveQL, and the Hive CLI
- Discuss file formats used in Advanced Analytics
- Understand how Amazon Athena is used across varied data sources
Audience
This is a general introductory course for anyone who wants a technical introduction to understanding and creating digital data supply chains for advanced analytics with AWS.
Prerequisites
- A basic understanding of coding, the AWS console, and cloud computing is helpful
Duration
2 Days
Outline for AWS Advanced Analytics for Structured Data Training
Chapter 1. Advanced Analytics with AWS
- What are advanced analytics?
- Introduction to AWS services for Analytics
- AWS Public Data Sets
- Forces and Trends in Cloud Analytics
- Data Storage Platforms
- Data Lifecycle and Events
- What is JSON?
Chapter 2. Elastic MapReduce
- What is Amazon EMR?
- Getting started with EMR
- EMR planning
- Running Hadoop Applications for data processing
- Hive and EMR
- Spark and EMR
- Kinesis and EMR
- ETL with EMR
- AWS CLI and EMR
- AWS Console Walkthrough: EMR
Chapter 3. AWS Glue
- What is Glue?
- How Glue works
- AWS Glue Console
- Getting started with Glue
- Security management
- Glue Data Catalog
- Authoring with Glue
- Auto-population and schema inference
- Events and monitoring
- Troubleshooting
- ETL with Glue
- Glue Application Programming Interface (API)
- AWS Console Walkthrough: Glue
Chapter 4. Apache Airflow
- What is Apache Airflow?
- Introduction to Apache Airflow components
- Visualizing DAGs
- Authoring DAGs
- Performance Insights
- Performance Graphs
- Airflow Features
- Use Cases
- Workflow Table Stakes
- Incubation of Airflow
- Airflow at Work
Chapter 5. Amazon Aurora
- What is Amazon RDS?
- Introduction to Aurora
- MySQL and Aurora compatibility
- Service-oriented Architecture and RDS
- Data replication
- Fully managed
- Shared accountability
- Data encryption at rest and in motion
- Aurora as a meta store
- AWS Console Walkthrough: Aurora
Chapter 6. Introduction to Informatica Cloud (ICS)
- What is Informatica Cloud?
- Integration Platform as a Service
- Cloud-native migration and ICS
- Use cases for Informatica Cloud
- Cloud Connectors
- ICS Connectors
- Informatica Cloud Options
- Citizen developers and ICS
- Secure Agent
- Cloud Integration Hub
- ICS Console Walkthrough
Chapter 7. S3 – Simple Storage Service
- What is S3?
- Introduction to S3
- Storage
- Replication
- CAP Theorem
- Data Consistency
- Buckets
- Amazon Resource Name (ARN)
- Resource Sharing
- Versioning
- Lifecycle
- Security in S3
- Use cases for S3
- AWS Console Walkthrough: S3
Chapter 8. AWS Lambda
- What is Lambda?
- Introduction to Serverless Computing
- What can you do with Lambda?
- Lambda services
- Triggering for digital data supply chain
- Data processing with Lambda and Glue
- Managed analytics pipeline with Lambda
- AWS Console Walkthrough: Lambda
Chapter 9. HIVE
- What is Hive?
- Hive's value proposition
- Hive's Main Sub-Systems
- Hive Features
- The "Classic" Hive Architecture
- The New Hive Architecture
- HiveQL
- Where are the Hive tables located?
- Hive Command-line Interface (CLI)
- The Beeline Command Shell
- Differences and considerations for Hive on Amazon EMR
- Configuring an External Metastore for Hive
- Use the Hive JDBC Driver
- Hive release history
- Hive Walkthrough
Chapter 10. HIVE CLI
- Hive Command-line Interface (CLI)
- The Hive Interactive Shell
- Running Host OS Commands from the Hive Shell
- Interfacing with HDFS from the Hive Shell
- The Hive CLI in Unattended Mode
- The Hive CLI Integration with the OS Shell
- Executing HiveQL Scripts
- Comments in Hive Scripts
- Variables and Properties in Hive CLI
- Setting Properties in CLI
- Example of Setting Properties in CLI
- Hive Namespaces
- Using the SET Command
- Setting Properties in the Shell
- Setting Properties for the New Shell Session
- Setting Alternative Hive Execution Engines
- The Beeline Shell
- Connecting to the Hive Server in Beeline
- Beeline Command Switches
- Beeline Internal Commands
Chapter 11. HIVE DDL
- Hive Data Definition Language
- Creating Databases in Hive
- Using Databases
- Creating Tables in Hive
- Supported Data Type Categories
- Common Numeric Types
- String and Date / Time Types
- Miscellaneous Types
- Example of the CREATE TABLE Statement
- Working with Complex Types
- Table Partitioning
- Table Partitioning on Multiple Columns
- Viewing Table Partitions
- Row Format
- Data Serializers / Deserializers
- File Format Storage
- File Compression
- More on File Formats
- The EXTERNAL DDL Parameter
- Example of Using EXTERNAL
- Creating an Empty Table
- Dropping a Table
- Table / Partition(s) Truncation
- Alter Table/Partition/Column
- Views
- Create View Statement
- Why Use Views?
- Restricting Amount of Viewable Data
- Examples of Restricting Amount of Viewable Data
- Creating and Dropping Indexes
- Describing Data
Chapter 13. HIVE DML
- Hive Data Manipulation Language (DML)
- Using the LOAD DATA statement
- Example of Loading Data into a Hive Table
- Loading Data with the INSERT Statement
- Appending and Replacing Data with the INSERT Statement
- Examples of Using the INSERT Statement
- Multi Table Inserts
- Multi Table Inserts Syntax
- Multi Table Inserts Example
Chapter 14. Amazon Athena
- What is Amazon Athena?
- Athena in context
- Athena Policy
- Athena Data Sources
- Connectivity
- Getting started with Athena
Chapter 15. High Performance File System Formats
- Why file systems for Advanced Analytics?
- Columnar Data Storage
- Introduction to ORC
- Introduction to Parquet
- Creating ORC and Parquet from CSV with Hive
- Converting Text to ORC Data Format
Chapter 16. Introduction to Monitoring in AWS
- Evolution of monitoring in AWS Cloud
- What is CloudWatch?
- What is CloudTrail?
- What is AWS Config?
- Event-driven models
- Notifications driving events
- Serverless computing
- Introduction to Lambda
Lab Exercises
Lab 1. Learning the AWS Management Console
Lab 2. Managing Keys for Secure Connection
Lab 3. Using S3 Through Management Console
Lab 4. Managing IAM Users
Lab 5. Getting Started with the EC2 Service
Lab 6. Using AWS Lambda
Lab 7. Using S3 and Aurora MySQL in AWS Lambda