WA2936

Advanced Data Analytics with PySpark Training

Leverage the Apache Spark platform's massively parallel processing capabilities with PySpark, Spark's Python API. Along with introducing PySpark, this course covers the Spark Shell for interactively exploring and manipulating data, and Spark SQL, which provides a uniform programming API for working with structured data. The course concludes with pandas for data manipulation and analysis, and data visualization with seaborn.

Course Details

Duration

2 days

Prerequisites

Knowledge of SQL, familiarity with Python (or the ability to learn the basics of a new language)

Target Audience

Business Analysts who want a scalable platform for solving SQL-centric problems

Skills Gained

  • Learn the PySpark Shell Environment
  • Understand Spark DataFrames
  • Process Data with the PySpark DataFrame API
  • Work with Pivot Tables in PySpark
  • Perform Data Visualization and Exploratory Data Analysis (EDA) in PySpark

Course Outline

  • Introduction to Apache Spark (a short code sketch follows this module's topics)
    • What is Apache Spark?
    • The Spark Platform
    • Spark vs Hadoop's MapReduce (MR)
    • Common Spark Use Cases
    • Languages Supported by Spark
    • Running Spark on a Cluster
    • The Spark Application Architecture
    • The Driver Process
    • The Executor and Worker Processes
    • Spark Shell
    • Jupyter Notebook Shell Environment
    • Spark Applications
    • The spark-submit Tool
    • The spark-submit Tool Configuration
    • Interfaces with Data Storage Systems
    • Project Tungsten
    • The Resilient Distributed Dataset (RDD)
    • Datasets and DataFrames
    • Spark SQL, DataFrames, and Catalyst Optimizer
    • Spark Machine Learning Library
    • GraphX
    • Extending the Spark Environment with Custom Modules and Files
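
To give a feel for the pieces named above, here is a minimal sketch of a self-contained PySpark application, assuming a local Spark installation; the file name, application name, and data values are illustrative and not part of the course materials:

    # sketch_app.py: a minimal PySpark application (illustrative only)
    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        # The driver process creates the SparkSession entry point
        spark = SparkSession.builder.appName("SketchApp").getOrCreate()

        # A DataFrame built from in-memory rows; Catalyst optimizes the query
        df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
        df.filter(df.age > 30).show()

        # The underlying RDD remains reachable for low-level work
        print(df.rdd.map(lambda row: row.name).collect())

        spark.stop()

Such a script would typically be launched with the spark-submit tool, for example: spark-submit --master local[2] sketch_app.py
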
  • The Spark Shell (see the example session after this module)
    • The Spark Shell
    • The Spark v2+ Command-Line Shells
    • The Spark Shell UI
    • Spark Shell Options
    • Getting Help
    • Jupyter Notebook Shell Environment
    • The Spark Context (sc) and Spark Session (spark)
    • Creating a Spark Session Object in Spark Applications
    • The Shell Spark Context Object (sc)
    • The Shell Spark Session Object (spark)
    • Loading Files
    • Saving Files
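
A taste of the interactive workflow covered in this module, assuming the PySpark shell (pyspark) or a Jupyter notebook where the sc and spark objects are pre-created; the file paths are placeholders:

    # Inside the PySpark shell, sc (SparkContext) and spark (SparkSession)
    # are already defined; there is no need to create them.
    rdd = sc.textFile("input.txt")      # load a text file as an RDD of lines
    print(rdd.count())

    # Load a CSV file as a DataFrame, then save it in Parquet format
    df = spark.read.csv("data.csv", header=True, inferSchema=True)
    df.write.mode("overwrite").parquet("data_parquet")
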
  • Introduction to Spark SQL (see the example after this module)
    • What is Spark SQL?
    • Uniform Data Access with Spark SQL
    • Hive Integration
    • Hive Interface
    • Integration with BI Tools
    • What is a DataFrame?
    • Creating a DataFrame in PySpark
    • Commonly Used DataFrame Methods and Properties in PySpark
    • Grouping and Aggregation in PySpark
    • The "DataFrame to RDD" Bridge in PySpark
    • The SQLContext Object
    • Using JDBC Sources
    • Performance, Scalability, and Fault-tolerance of Spark SQL
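
The sketch below illustrates Spark SQL's uniform data access: the same aggregation expressed through the DataFrame API and through SQL against a temporary view. The session setup, column names, and values are illustrative assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("SqlSketch").getOrCreate()
    df = spark.createDataFrame(
        [("books", 12.0), ("books", 30.0), ("toys", 7.5)],
        ["category", "price"])

    # Grouping and aggregation through the DataFrame API
    df.groupBy("category").agg(F.avg("price").alias("avg_price")).show()

    # The same query through the SQL interface
    df.createOrReplaceTempView("sales")
    spark.sql("SELECT category, AVG(price) AS avg_price "
              "FROM sales GROUP BY category").show()

    # The "DataFrame to RDD" bridge
    print(df.rdd.take(2))
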
  • Practical Introduction to pandas (a pandas sketch follows this module's topics)
    • What is pandas?
    • The Series Object
    • Accessing Values and Indexes in Series
    • Setting Up Your Own Index
    • Using the Series Index as a Lookup Key
    • Can I Pack a Python Dictionary into a Series?
    • The DataFrame Object
    • The DataFrame's Value Proposition
    • Creating a pandas DataFrame
    • Getting DataFrame Metrics
    • Accessing DataFrame Columns
    • Accessing DataFrame Rows
    • Accessing DataFrame Cells
    • Using iloc
    • Using loc
    • DataFrames are Mutable via Object Reference!
    • Deleting Rows and Columns
    • Adding a New Column to a DataFrame
    • Appending / Concatenating DataFrame and Series Objects
    • Re-indexing Series and DataFrames
    • Getting Descriptive Statistics of DataFrame Columns
    • Getting Descriptive Statistics of DataFrames
    • Applying a Function
    • Sorting DataFrames
    • Reading From CSV Files
    • Writing to the System Clipboard
    • Writing to a CSV File
    • Fine-Tuning the Column Data Types
    • Changing the Type of a Column
    • What May Go Wrong with Type Conversion
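
A compressed sketch of the pandas material above, touching Series lookups, DataFrame access with loc and iloc, type conversion, descriptive statistics, applying a function, sorting, and CSV output; all names and values are illustrative:

    import pandas as pd

    # A Series with a custom index that doubles as a lookup key
    s = pd.Series([10, 20, 30], index=["a", "b", "c"])
    print(s["b"])                        # 20

    # A DataFrame built from a dictionary of columns
    df = pd.DataFrame({"name": ["alice", "bob"], "score": ["91", "78"]})
    print(df.shape)                      # DataFrame metrics: (rows, columns)
    print(df.loc[0, "name"])             # label-based access
    print(df.iloc[1, 1])                 # position-based access

    # Fine-tuning column types: score arrived as strings
    df["score"] = df["score"].astype(int)
    print(df["score"].describe())        # descriptive statistics of a column
    df["bonus"] = df["score"].apply(lambda x: x * 0.1)  # applying a function

    df = df.sort_values("score", ascending=False)       # sorting
    df.to_csv("scores.csv", index=False)                # writing to a CSV file
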
  • Data Visualization with seaborn in Python (see the plotting sketch after this module)
    • Data Visualization
    • Data Visualization in Python
    • Matplotlib
    • Getting Started with matplotlib
    • Figures
    • Saving Figures to a File
    • Seaborn
    • Getting Started with seaborn
    • Histograms and KDE
    • Plotting Bivariate Distributions
    • Scatter plots in seaborn
    • Pair plots in seaborn
    • Heatmaps
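
A plotting sketch covering the chart types above, assuming seaborn 0.11+ and its bundled 'iris' sample dataset (fetched over the network on first use); the output file names are placeholders:

    import matplotlib.pyplot as plt
    import seaborn as sns

    iris = sns.load_dataset("iris")      # sample data, illustrative only

    # Histogram with a KDE overlay, saved to a file
    sns.histplot(data=iris, x="sepal_length", kde=True)
    plt.savefig("hist.png")
    plt.clf()

    # Scatter plot of a bivariate relationship
    sns.scatterplot(data=iris, x="sepal_length", y="petal_length",
                    hue="species")
    plt.savefig("scatter.png")
    plt.clf()

    # Correlation heatmap of the numeric columns
    sns.heatmap(iris.drop(columns="species").corr(), annot=True)
    plt.savefig("heatmap.png")

    # Pair plot across all numeric columns
    sns.pairplot(iris, hue="species").savefig("pairs.png")
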
  • Quick Introduction to Python for Data Engineers [OPTIONAL] (a short tour in code follows this module's topics)
    • What is Python?
    • Additional Documentation
    • Which version of Python am I running?
    • Python Dev Tools and REPLs
    • IPython
    • Jupyter
    • Jupyter Operation Modes
    • Jupyter Common Commands
    • Anaconda
    • Python Variables and Basic Syntax
    • Variable Scopes
    • PEP8
    • The Python Programs
    • Getting Help
    • Variable Types
    • Assigning Multiple Values to Multiple Variables
    • Null (None)
    • Strings
    • Finding Index of a Substring
    • String Splitting
    • Triple-Delimited String Literals
    • Raw String Literals
    • String Formatting and Interpolation
    • Boolean
    • Boolean Operators
    • Numbers
    • Looking Up the Runtime Type of a Variable
    • Divisions
    • Assignment-with-Operation
    • Comments
    • Relational Operators
    • The if-elif-else Triad
    • Conditional Expressions (a.k.a. Ternary Operator)
    • The While-Break-Continue Triad
    • The for Loop
    • try-except-finally
    • Lists
    • Main List Methods
    • Dictionaries
    • Working with Dictionaries
    • Sets
    • Common Set Operations
    • Finding Unique Elements in a List
    • Enumerate
    • Tuples
    • Unpacking Tuples
    • Functions
    • Dealing with Arbitrary Number of Parameters
    • Keyword Function Parameters
    • The range Object
    • Random Numbers
    • Python Modules
    • Importing Modules
    • Installing Modules
    • Listing Methods in a Module
    • Creating Your Own Modules
    • Creating a Runnable Application
    • List Comprehension
    • Zipping Lists
    • Working with Files
    • Reading and Writing Files
    • Reading Command-Line Parameters
    • Accessing Environment Variables
    • What is Functional Programming (FP)?
    • Terminology: Higher-Order Functions
    • Lambda Functions in Python
    • Regular Expressions
    • Python Data Science-Centric Libraries
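
As a flavor of the optional Python module, a compact tour of several constructs listed above; the values and file path are illustrative:

    # Multiple assignment and string interpolation
    name, version = "Python", 3
    print(f"{name} {version}")

    squares = [n * n for n in range(5)]   # list comprehension over a range
    unique = set([1, 2, 2, 3])            # finding unique elements

    def describe(*args, **kwargs):        # arbitrary and keyword parameters
        return f"{len(args)} positional, {sorted(kwargs)} keyword"

    print(describe(1, 2, flag=True))

    # Higher-order functions: passing a lambda to sorted
    words = ["pear", "fig", "banana"]
    print(sorted(words, key=lambda w: len(w)))

    # try-except-finally around file I/O
    try:
        with open("notes.txt") as f:
            print(f.readline())
    except FileNotFoundError:
        print("notes.txt not found")
    finally:
        print("done")
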
  • Lab Exercises (see the pivot-table sketch after this list)
    • Lab 1. Learning the Databricks Community Cloud Lab Environment
    • Lab 2. Learning PySpark Shell Environment
    • Lab 3. Understanding Spark DataFrames
    • Lab 4. Learning the PySpark DataFrame API
    • Lab 5. Processing Data in PySpark using the DataFrame API (Project)
    • Lab 6. Working with Pivot Tables in PySpark (Project)
    • Lab 7. Data Visualization and EDA in PySpark
    • Lab 8. Data Visualization and EDA in PySpark (Project)
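
As a hint of what Lab 6 works toward, a pivot-table sketch in PySpark; the session setup, column names, and figures are illustrative, not the actual lab data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PivotSketch").getOrCreate()
    sales = spark.createDataFrame(
        [("east", "Q1", 100), ("east", "Q2", 150), ("west", "Q1", 90)],
        ["region", "quarter", "amount"])

    # groupBy().pivot() turns the quarter values into columns
    sales.groupBy("region").pivot("quarter").sum("amount").show()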