Web Age Solutions Inc
Providing Technology Training and Mentoring For Modern Technology Adoption
US Inquiries / 1.877.517.6540
Canadian Inquiries / 1.877.812.8887
Course #: WA2845

Practical Machine Learning with Apache Spark Training

Courseware: Available for sale

This intensive hands-on training introduces the core aspects of scalable data processing using Python on the Apache Spark platform. Students learn the essentials of Python, with the primary focus on the capabilities of the Apache Spark platform and its Machine Learning module. Students are also introduced to the terminology, concepts, and algorithms used in Machine Learning.


Audience: Data Scientists, Business Analysts, Software Developers, IT Architects


Prerequisites: Participants should have general knowledge of statistics and programming.


Duration: Three days.

Outline of Practical Machine Learning with Apache Spark Training

Chapter 1. Defining Data Science

  • Data Science, Machine Learning, AI?
  • The Data-Related Roles
  • Data Science Ecosystem
  • Business Analytics vs. Data Science
  • Who is a Data Scientist?
  • The Break-Down of Data Science Project Activities
  • Data Scientists at Work
  • The Data Engineer Role
  • What is Data Wrangling (Munging)?
  • Examples of Data Science Projects
  • Data Science Gotchas
  • Summary

Chapter 2. Machine Learning Life-cycle Phases

  • Data Analytics Pipeline
  • Data Discovery Phase
  • Data Harvesting Phase
  • Data Priming Phase
  • Data Cleansing
  • Feature Engineering
  • Data Logistics and Data Governance
  • Exploratory Data Analysis
  • Model Planning Phase
  • Model Building Phase
  • Communicating the Results
  • Production Roll-out
  • Summary

Chapter 3. Quick Introduction to Python Programming

  • Module Overview
  • Some Basic Facts about Python
  • Dynamic Typing Examples
  • Code Blocks and Indentation
  • Importing Modules
  • Lists and Tuples
  • Dictionaries
  • List Comprehension
  • What is Functional Programming (FP)?
  • Terminology: Higher-Order Functions
  • A Short List of Languages that Support FP
  • Lambda
  • Common Higher-Order Functions in Python 3
  • Summary
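
The functional programming topics listed in this chapter (lambda, higher-order functions, list comprehension) can be illustrated with a minimal sketch. It is not taken from the courseware; the sample data and variable names are assumptions chosen for illustration only.

```python
from functools import reduce

# Hypothetical sample data (not from the courseware).
numbers = [1, 2, 3, 4, 5, 6]

# map() applies a function to every element.
squares = list(map(lambda x: x * x, numbers))

# filter() keeps the elements for which the predicate is true.
evens = list(filter(lambda x: x % 2 == 0, numbers))

# reduce() folds the sequence into a single value (here, a sum).
total = reduce(lambda acc, x: acc + x, numbers, 0)

# A list comprehension expressing filter + map in one step.
even_squares = [x * x for x in numbers if x % 2 == 0]

print(squares, evens, total, even_squares)
```

The same map/filter/reduce style carries over to the transformations applied to distributed data sets in Spark, which is presumably why the outline covers it before the Spark chapters.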

Chapter 4. Introduction to Apache Spark

  • What is Apache Spark?
  • Where to Get Spark?
  • The Spark Platform
  • Spark Logo
  • Common Spark Use Cases
  • Languages Supported by Spark
  • Running Spark on a Cluster
  • The Driver Process
  • Spark Applications
  • Spark Shell
  • The spark-submit Tool
  • The spark-submit Tool Configuration
  • The Executor and Worker Processes
  • The Spark Application Architecture
  • Interfaces with Data Storage Systems
  • Limitations of Hadoop's MapReduce
  • Spark vs MapReduce
  • Spark as an Alternative to Apache Tez
  • The Resilient Distributed Dataset (RDD)
  • Datasets and DataFrames
  • Spark SQL
  • Spark Machine Learning Library
  • GraphX
  • Summary

Chapter 5. The Spark Shell

  • The Spark Shell
  • The Spark v.2+ Shells
  • The Spark Shell UI
  • Spark Shell Options
  • Getting Help
  • The Spark Context (sc) and Spark Session (spark)
  • The Shell Spark Context Object (sc)
  • The Shell Spark Session Object (spark)
  • Loading Files
  • Saving Files
  • Summary

Chapter 6. Quick Intro to Jupyter Notebooks

  • Python Dev Tools and REPLs
  • IPython
  • Jupyter
  • Jupyter Operation Modes
  • Basic Edit Mode Shortcuts
  • Basic Command Mode Shortcuts
  • Summary

Chapter 7. Data Visualization in Python using matplotlib

  • Data Visualization
  • What is matplotlib?
  • Getting Started with matplotlib
  • The matplotlib.pyplot.plot() Function
  • The matplotlib.pyplot.scatter() Function
  • Labels and Titles
  • Styles
  • The matplotlib.pyplot.bar() Function
  • The matplotlib.pyplot.hist() Function
  • The matplotlib.pyplot.pie() Function
  • The Figure Object
  • The matplotlib.pyplot.subplot() Function
  • Selecting a Grid Cell
  • Saving Figures to a File
  • Summary

Chapter 8. Data Science and ML Algorithms with PySpark

  • In-Class Discussion
  • Types of Machine Learning
  • Supervised vs Unsupervised Machine Learning
  • Supervised Machine Learning Algorithms
  • Classification (Supervised ML) Examples
  • Unsupervised Machine Learning Algorithms
  • Clustering (Unsupervised ML) Examples
  • Choosing the Right Algorithm
  • Terminology: Observations, Features, and Targets
  • Representing Observations
  • Terminology: Labels
  • Terminology: Continuous and Categorical Features
  • Continuous Features
  • Categorical Features
  • Common Distance Metrics
  • The Euclidean Distance
  • What is a Model?
  • Model Evaluation
  • The Classification Error Rate
  • Data Split for Training and Test Data Sets
  • Data Splitting in PySpark
  • Hold-Out Data
  • Cross-Validation Technique
  • Spark ML Overview
  • DataFrame-based API is the Primary Spark ML API
  • Estimators, Models, and Predictors
  • Descriptive Statistics
  • Data Visualization and EDA
  • Correlations
  • Hands-on Exercise
  • Feature Engineering
  • Scaling of the Features
  • Feature Blending (Creating Synthetic Features)
  • Hands-on Exercise
  • The 'One-Hot' Encoding Scheme
  • Example of 'One-Hot' Encoding Scheme
  • Bias-Variance (Underfitting vs Overfitting) Trade-off
  • The Modeling Error Factors
  • One Way to Visualize Bias and Variance
  • Underfitting vs Overfitting Visualization
  • Balancing Off the Bias-Variance Ratio
  • Linear Model Regularization
  • ML Model Tuning Visually
  • Linear Model Regularization in Spark
  • Regularization, Take Two
  • Dimensionality Reduction
  • PCA and Isomap
  • The Advantages of Dimensionality Reduction
  • Spark Dense and Sparse Vectors
  • Labeled Point
  • Python Example of Using the LabeledPoint Class
  • The LIBSVM format
  • LIBSVM in PySpark
  • Example of Reading a File In LIBSVM Format
  • Life-cycles of Machine Learning Development
  • Regression Analysis
  • Regression vs Correlation
  • Regression vs Classification
  • Simple Linear Regression Model
  • Linear Regression Illustration
  • Least-Squares Method (LSM)
  • Gradient Descent Optimization
  • Locally Weighted Linear Regression
  • Regression Models in Excel
  • Multiple Regression Analysis
  • Evaluating Regression Model Accuracy
  • The R² Model Score
  • The MSE Model Score
  • Hands-on Exercise
  • Linear Logistic (Logit) Regression
  • Interpreting Logistic Regression Results
  • Hands-on Exercise
  • Naive Bayes Classifier (SL)
  • Naive Bayesian Probabilistic Model in a Nutshell
  • Bayes Formula
  • Classification of Documents with Naive Bayes
  • Hands-on Exercise
  • Decision Trees
  • Decision Tree Terminology
  • Properties of Decision Trees
  • Decision Tree Classification in the Context of Information Theory
  • The Simplified Decision Tree Algorithm
  • Using Decision Trees
  • Random Forests
  • Hands-on Exercise
  • Support Vector Machines (SVMs)
  • Hands-on Exercise
  • Unsupervised Learning Type: Clustering
  • k-Means Clustering (UL)
  • k-Means Clustering in a Nutshell
  • k-Means Characteristics
  • Global vs Local Minimum Explained
  • Hands-on Exercise
  • Time-Series Analysis
  • Decomposing Time-Series
  • A Better Algorithm or More Data?
  • Summary
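
The two regression scores named in this chapter, MSE and R², can be sketched in plain Python as follows. This is an illustrative sketch using the standard textbook definitions, not the courseware's code and not the Spark ML API; the function names and the sample values are assumptions.

```python
def mse(actual, predicted):
    """Mean squared error: the average of the squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical observed targets and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]

print(mse(actual, predicted))        # 0.125
print(r_squared(actual, predicted))  # approximately 0.975
```

A lower MSE and an R² closer to 1 both indicate a closer fit on the data being scored; in practice the scores are computed on held-out test data rather than on the training set.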

Lab Exercises

Lab 1. Learning the Lab Environment
Lab 2. Elements of Functional Programming with Python
Lab 3. The Spark Shell
Lab 4. Using the spark-submit Tool
Lab 5. Understanding the DataFrame Object
Lab 6. Data Transformation with PySpark
Lab 7. Switching to PySpark Jupyter Notebooks
Lab 8. Data Visualization with matplotlib
Lab 9. Descriptive Statistics and EDA
Lab 10. Data Repair and Normalization in PySpark
Lab 11. Understanding Linear Regression
Lab 12. Logistic Regression
Lab 13. Classification with Naive Bayes
Lab 14. Random Forest Classification
Lab 15. Support Vector Machine Classification
Lab 16. Using the k-Means Algorithm

We regularly offer classes in these and other cities: Atlanta, Austin, Baltimore, Calgary, Chicago, Cleveland, Dallas, Denver, Detroit, Houston, Jacksonville, Miami, Montreal, New York City, Orlando, Ottawa, Philadelphia, Phoenix, Pittsburgh, Seattle, Toronto, Vancouver, Washington DC.