Course #: WA3032
Intermediate Data Engineering with Python Training

Schedule
04/12/2021 - 04/13/2021   USD$1,575.00   Instructor Led Virtual
05/10/2021 - 05/11/2021   USD$1,575.00   Instructor Led Virtual
05/17/2021 - 05/18/2021   USD$1,575.00   Instructor Led Virtual
07/12/2021 - 07/13/2021   USD$1,575.00   Instructor Led Virtual
07/19/2021 - 07/20/2021   USD$1,575.00   Instructor Led Virtual

This fast-paced, two-day course focuses on data analytics through the use of the Python language, the Spark platform for highly scalable operations, and AWS Glue for comprehensive data access. Extensive hands-on exercises are provided to ensure that students come away with the practical experience required to perform successfully.

Duration: 2 Days

Outline of Intermediate Data Engineering with Python Training

Chapter 1. Introduction to Apache Spark
What is Apache Spark
The Spark Platform
Spark vs Hadoop's MapReduce (MR)
Common Spark Use Cases
Languages Supported by Spark
Running Spark on a Cluster
The Spark Application Architecture
The Driver Process
The Executor and Worker Processes
Spark Shell
Jupyter Notebook Shell Environment
Spark Applications
The spark-submit Tool
The spark-submit Tool Configuration
Interfaces with Data Storage Systems
Project Tungsten
The Resilient Distributed Dataset (RDD)
Datasets and DataFrames
Spark SQL, DataFrames, and Catalyst Optimizer
Spark Machine Learning Library
GraphX
Extending Spark Environment with Custom Modules and Files
Summary

Chapter 2. The Spark Shell
The Spark Shell
The Spark v.2 + Command-Line Shells
The Spark Shell UI
Spark Shell Options
Getting Help
Jupyter Notebook Shell Environment
Example of a Jupyter Notebook Web UI (Databricks Cloud)
The Spark Context (sc) and Spark Session (spark)
Creating a Spark Session Object in Spark Applications
The Shell Spark Context Object (sc)
The Shell Spark Session Object (spark)
Loading Files
Saving Files
Summary

Chapter 3. Spark RDDs
The Resilient Distributed Dataset (RDD)
Ways to Create an RDD
Supported Data Types
RDD Operations
RDDs are Immutable
Spark Actions
RDD Transformations
Other RDD Operations
Chaining RDD Operations
RDD Lineage
The Big Picture
What May Go Wrong
Miscellaneous Pair RDD Operations
RDD Caching
Summary

Chapter 4. Introduction to Spark SQL
What is Spark SQL?
Uniform Data Access with Spark SQL
Hive Integration
Hive Interface
Integration with BI Tools
What is a DataFrame?
Creating a DataFrame in PySpark
Commonly Used DataFrame Methods and Properties in PySpark
Grouping and Aggregation in PySpark
The "DataFrame to RDD" Bridge in PySpark
The SQLContext Object
Examples of Spark SQL / DataFrame (PySpark Example)
Converting an RDD to a DataFrame Example
Example of Reading / Writing a JSON File
Using JDBC Sources
JDBC Connection Example
Performance, Scalability, and Fault-tolerance of Spark SQL
Summary

Chapter 5. Overview of the Amazon Web Services (AWS)
Amazon Web Services
The History of AWS
The Initial Iteration of Moving amazon.com to AWS
The AWS (Simplified) Service Stack
Accessing AWS
Direct Connect
Shared Responsibility Model
Trusted Advisor
The AWS Distributed Architecture
AWS Services
Managed vs Unmanaged Amazon Services
Amazon Resource Name (ARN)
Compute and Networking Services
Elastic Compute Cloud (EC2)
AWS Lambda
Auto Scaling
Elastic Load Balancing (ELB)
Virtual Private Cloud (VPC)
Route53 Domain Name System
Elastic Beanstalk
Security and Identity Services
Identity and Access Management (IAM)
AWS Directory Service
AWS Certificate Manager
AWS Key Management Service (KMS)
Storage and Content Delivery
Elastic Block Storage (EBS)
Simple Storage Service (S3)
Glacier
CloudFront Content Delivery Service
Database Services
Relational Database Service (RDS)
DynamoDB
Amazon ElastiCache
Redshift
Messaging Services
Simple Queue Service (SQS)
Simple Notifications Service (SNS)
Simple Email Service (SES)
AWS Monitoring with CloudWatch
Other Services Example
Summary

Chapter 6. Introduction to AWS Glue
What is AWS Glue?
AWS Glue Components
AWS Glue Components (Cont'd)
Managing Notebooks
AWS Glue Components (Cont'd)
Putting it Together: The AWS Glue Environment Architecture
AWS Glue Main Activities
Additional Glue Services
When To Use AWS Glue?
Integration with other AWS Services
Summary

Chapter 7. Introduction to Apache Spark
What is Apache Spark
The Spark Platform
Uniform Data Access with Spark SQL
Common Spark Use Cases
Languages Supported by Spark
Running Spark on a Cluster
The Spark Application Architecture
The Driver Process
The Executor and Worker Processes
Spark Shell
Jupyter Notebook Shell Environment
Interfaces with Data Storage Systems
The Resilient Distributed Dataset (RDD)
Datasets and DataFrames
Data Partitioning
Data Partitioning Diagram
Summary

Chapter 8. AWS Glue PySpark Extensions
AWS Glue and Spark
The DynamicFrame Object
The DynamicFrame API
The GlueContext Object
Glue Transforms
A Sample Glue PySpark Script
Using PySpark
AWS Glue PySpark SDK
Summary

Lab Exercises
Lab 1. Learning the Databricks Community Cloud Lab Environment
Lab 2. Data Visualization and EDA with pandas and seaborn
Lab 3. Correlating Cause and Effect
Lab 4. Learning PySpark Shell Environment
Lab 5. Understanding Spark DataFrames
Lab 6. Learning the PySpark DataFrame API
Lab 7. Data Repair and Normalization in PySpark
Lab 8. Working with Parquet File Format in PySpark and pandas
Lab 9. AWS Glue Overview
Lab 10. AWS Glue Crawlers and Classifiers
Lab 11. Creating an S3 Bucket for AWS Glue ETL Script Output
Lab 12. Creating and Working with Glue Scripts Using Dev Endpoints
Lab 13. Using PySpark API Directly
Lab 14. Understanding AWS Glue ETL Jobs

We regularly offer classes in these and other cities: Atlanta, Austin, Baltimore, Calgary, Chicago, Cleveland, Dallas, Denver, Detroit, Houston, Jacksonville, Miami, Montreal, New York City, Orlando, Ottawa, Philadelphia, Phoenix, Pittsburgh, Seattle, Toronto, Vancouver, Washington DC.