When you feel constrained by the computing power of a single computer, you can leverage the massively parallel processing capabilities of the Apache Spark platform through PySpark, Spark's Python API. Along with introducing PySpark, this course covers the Spark Shell for interactively exploring and manipulating data, and Spark SQL, which provides a uniform programming API for working with structured data. The course ends with pandas for data manipulation and analysis, and data visualization with seaborn.
Objectives
• Learn the PySpark Shell Environment
• Understand Spark DataFrames
• Process Data with the PySpark DataFrame API
• Work with Pivot Tables in PySpark
• Perform Data Visualization and Exploratory Data Analysis (EDA) in PySpark
Audience
- Business Analysts who want a scalable platform for solving SQL-centric problems
Prerequisites
Knowledge of SQL and familiarity with Python (or the ability to learn the basics of a new language)
Duration
Two days
Outline for Advanced Data Analytics with PySpark Training
Chapter 1. Introduction to Apache Spark
- What is Apache Spark
- The Spark Platform
- Spark vs Hadoop's MapReduce (MR)
- Common Spark Use Cases
- Languages Supported by Spark
- Running Spark on a Cluster
- The Spark Application Architecture
- The Driver Process
- The Executor and Worker Processes
- Spark Shell
- Jupyter Notebook Shell Environment
- Spark Applications
- The spark-submit Tool
- The spark-submit Tool Configuration
- Interfaces with Data Storage Systems
- Project Tungsten
- The Resilient Distributed Dataset (RDD)
- Datasets and DataFrames
- Spark SQL, DataFrames, and Catalyst Optimizer
- Spark Machine Learning Library
- GraphX
- Extending Spark Environment with Custom Modules and Files
- Summary
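To give a flavor of the driver/executor architecture and the spark-submit tool covered in this chapter, here is a minimal sketch of a standalone PySpark application. The application name and file names are illustrative assumptions, not part of the course materials.

import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # The driver process creates a SparkSession, the entry point to Spark.
    spark = SparkSession.builder.appName("LineCount").getOrCreate()

    # Read a text file into a DataFrame with a single 'value' column;
    # the actual work is distributed to executor processes on the workers.
    lines = spark.read.text(sys.argv[1])
    print(f"Line count: {lines.count()}")

    spark.stop()

A script like this would be launched with the spark-submit tool, e.g. spark-submit linecount.py input.txt (both file names are hypothetical).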
Chapter 2. The Spark Shell
- The Spark Shell
- The Spark v2+ Command-Line Shells
- The Spark Shell UI
- Spark Shell Options
- Getting Help
- Jupyter Notebook Shell Environment
- Example of a Jupyter Notebook Web UI (Databricks Cloud)
- The Spark Context (sc) and Spark Session (spark)
- Creating a Spark Session Object in Spark Applications
- The Shell Spark Context Object (sc)
- The Shell Spark Session Object (spark)
- Loading Files
- Saving Files
- Summary
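As a taste of the shell workflow this chapter walks through, here is a short sketch of an interactive session. In the PySpark shell the sc and spark objects come pre-created; the file paths below are illustrative assumptions.

# The shell pre-creates the SparkContext (sc) and SparkSession (spark).
# Loading files: a text file as an RDD, a CSV file as a DataFrame.
rdd = sc.textFile("data/input.txt")
df = spark.read.csv("data/input.csv", header=True, inferSchema=True)

# Saving files: write the DataFrame back out in Parquet format.
df.write.mode("overwrite").parquet("data/output.parquet")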
Chapter 3. Introduction to Spark SQL
- What is Spark SQL?
- Uniform Data Access with Spark SQL
- Hive Integration
- Hive Interface
- Integration with BI Tools
- What is a DataFrame?
- Creating a DataFrame in PySpark
- Commonly Used DataFrame Methods and Properties in PySpark
- Grouping and Aggregation in PySpark
- The "DataFrame to RDD" Bridge in PySpark
- The SQLContext Object
- Examples of Spark SQL / DataFrame (PySpark Example)
- Converting an RDD to a DataFrame Example
- Example of Reading / Writing a JSON File
- Using JDBC Sources
- JDBC Connection Example
- Performance, Scalability, and Fault-tolerance of Spark SQL
- Summary
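The sketch below illustrates several of the topics above in one place: creating a DataFrame, grouping and aggregation, uniform data access via SQL, and the "DataFrame to RDD" bridge. The sample data and application name are assumptions made for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLDemo").getOrCreate()

# Create a DataFrame from an in-memory list of tuples (sample data).
df = spark.createDataFrame(
    [("HR", 50000), ("HR", 60000), ("IT", 75000)],
    ["dept", "salary"])

# Grouping and aggregation with the DataFrame API...
df.groupBy("dept").avg("salary").show()

# ...and uniform data access: the same query expressed in SQL.
df.createOrReplaceTempView("employees")
spark.sql("SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept").show()

# The "DataFrame to RDD" bridge: each record surfaces as a Row object.
print(df.rdd.map(lambda row: row.salary).collect())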
Chapter 4. Practical Introduction to Pandas
- What is pandas?
- The Series Object
- Accessing Values and Indexes in Series
- Setting Up Your Own Index
- Using the Series Index as a Lookup Key
- Can I Pack a Python Dictionary into a Series?
- The DataFrame Object
- The DataFrame's Value Proposition
- Creating a pandas DataFrame
- Getting DataFrame Metrics
- Accessing DataFrame Columns
- Accessing DataFrame Rows
- Accessing DataFrame Cells
- Using iloc
- Using loc
- Examples of Using loc
- DataFrames are Mutable via Object Reference!
- Deleting Rows and Columns
- Adding a New Column to a DataFrame
- Appending / Concatenating DataFrame and Series Objects
- Example of Appending / Concatenating DataFrames
- Re-indexing Series and DataFrames
- Getting Descriptive Statistics of DataFrame Columns
- Getting Descriptive Statistics of DataFrames
- Applying a Function
- Sorting DataFrames
- Reading From CSV Files
- Writing to the System Clipboard
- Writing to a CSV File
- Fine-Tuning the Column Data Types
- Changing the Type of a Column
- What May Go Wrong with Type Conversion
- Summary
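A minimal sketch tying together a few of the pandas topics above: Series with a custom index, DataFrame creation, loc/iloc access, type conversion, descriptive statistics, and CSV output. All names and sample values are illustrative assumptions.

import pandas as pd

# A Series with a custom index, used as a lookup key.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])                       # -> 20

# Creating a DataFrame from a dictionary of columns (sample data).
df = pd.DataFrame({"name": ["Ann", "Bob"], "age": ["34", "29"]})

# Accessing rows and cells: iloc is positional, loc is label-based.
print(df.iloc[0])
print(df.loc[0, "name"])

# Fine-tuning a column's data type; conversion raises on bad values.
df["age"] = df["age"].astype(int)

# Descriptive statistics of the numeric columns, then writing to a CSV
# file (the out.csv file name is illustrative).
print(df.describe())
df.to_csv("out.csv", index=False)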
Chapter 5. Data Visualization with seaborn in Python
- Data Visualization
- Data Visualization in Python
- Matplotlib
- Getting Started with matplotlib
- Figures
- Saving Figures to a File
- seaborn
- Getting Started with seaborn
- Histograms and KDE
- Plotting Bivariate Distributions
- Scatter plots in seaborn
- Pair plots in seaborn
- Heatmaps
- Summary
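Here is a brief sketch of the plot types covered in this chapter. It assumes seaborn's bundled 'iris' sample dataset (fetched on first use); the output file names are illustrative.

import matplotlib.pyplot as plt
import seaborn as sns

# Load a sample dataset shipped with seaborn, used here for illustration.
iris = sns.load_dataset("iris")

# A histogram with a KDE overlay.
sns.histplot(data=iris, x="sepal_length", kde=True)
plt.savefig("histogram.png")        # saving a figure to a file
plt.close()

# A pair plot of the numeric columns, colored by species.
sns.pairplot(iris, hue="species")
plt.savefig("pairs.png")
plt.close()

# A heatmap of pairwise correlations between the numeric columns.
sns.heatmap(iris.drop(columns="species").corr(), annot=True)
plt.savefig("heatmap.png")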
Chapter 6. (Optional) Quick Introduction to Python for Data Engineers
- What is Python?
- Additional Documentation
- Which version of Python am I running?
- Python Dev Tools and REPLs
- IPython
- Jupyter
- Jupyter Operation Modes
- Jupyter Common Commands
- Anaconda
- Python Variables and Basic Syntax
- Variable Scopes
- PEP8
- The Python Programs
- Getting Help
- Variable Types
- Assigning Multiple Values to Multiple Variables
- Null (None)
- Strings
- Finding Index of a Substring
- String Splitting
- Triple-Delimited String Literals
- Raw String Literals
- String Formatting and Interpolation
- Boolean
- Boolean Operators
- Numbers
- Looking Up the Runtime Type of a Variable
- Divisions
- Assignment-with-Operation
- Comments
- Relational Operators
- The if-elif-else Triad
- An if-elif-else Example
- Conditional Expressions (a.k.a. Ternary Operator)
- The while-break-continue Triad
- The for Loop
- try-except-finally
- Lists
- Main List Methods
- Dictionaries
- Working with Dictionaries
- Sets
- Common Set Operations
- Set Operations Examples
- Finding Unique Elements in a List
- Enumerate
- Tuples
- Unpacking Tuples
- Functions
- Dealing with an Arbitrary Number of Parameters
- Keyword Function Parameters
- The range Object
- Random Numbers
- Python Modules
- Importing Modules
- Installing Modules
- Listing Methods in a Module
- Creating Your Own Modules
- Creating a Runnable Application
- List Comprehension
- Zipping Lists
- Working with Files
- Reading and Writing Files
- Reading Command-Line Parameters
- Accessing Environment Variables
- What is Functional Programming (FP)?
- Terminology: Higher-Order Functions
- Lambda Functions in Python
- Example: Lambdas in the Sorted Function
- Other Examples of Using Lambdas
- Regular Expressions
- Using Regular Expressions Examples
- Python Data Science-Centric Libraries
- Summary
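The snippet below is a minimal sketch pulling together several of the Python idioms above; all names and sample values are illustrative assumptions.

# Lists, list comprehension, and the range object.
squares = [n * n for n in range(1, 6)]        # [1, 4, 9, 16, 25]

# Dictionaries and f-string formatting/interpolation.
ages = {"Ann": 34, "Bob": 29}
for name, age in ages.items():
    print(f"{name} is {age}")

# A lambda as the key in the sorted() function (a higher-order function).
words = ["pear", "fig", "banana"]
print(sorted(words, key=lambda w: len(w)))    # ['fig', 'pear', 'banana']

# Zipping lists and unpacking tuples.
for name, score in zip(["Ann", "Bob"], [90, 85]):
    print(name, score)

# The try-except-finally triad.
try:
    value = int("not a number")
except ValueError as e:
    print("Conversion failed:", e)
finally:
    print("Done.")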
Lab Exercises
Lab 1. Learning the Databricks Community Cloud Lab Environment
Lab 2. Learning the PySpark Shell Environment
Lab 3. Understanding Spark DataFrames
Lab 4. Learning the PySpark DataFrame API
Lab 5. Processing Data in PySpark using the DataFrame API (Project)
Lab 6. Working with Pivot Tables in PySpark (Project)
Lab 7. Data Visualization and EDA in PySpark
Lab 8. Data Visualization and EDA in PySpark (Project)