Course #: WA2845

Practical Machine Learning with Apache Spark Training

Duration

Three days.

Outline of Practical Machine Learning with Apache Spark Training

Chapter 1. Machine Learning (ML) and Data Science Overview

  • Data Science vs ML vs AI
  • Machine Learning landscape
  • Supervised and unsupervised ML algorithms
  • ML at scale

Chapter 2. ML with Spark Python

  • Spark ML Overview
  • Introduction to Jupyter notebooks
  • Lab: Working with Jupyter + Python + Spark

Chapter 3. Machine Learning (ML) and Statistical Concepts

  • Terminology
  • Descriptive statistics
  • Errors and Residuals
  • ML Model Overfitting / Underfitting
  • Training and testing
  • Cross-validation, bootstrapping
  • Confusion Matrix
  • ROC curve, Area Under Curve (AUC)
  • Lab: Basic stats
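
As a rough illustration of the Confusion Matrix topic above, here is a minimal sketch in plain Python. It is not taken from the course materials: the function names (`confusion_counts`, `precision_recall`) and the sample labels are made up for this example, and it assumes a binary classifier whose positive class is labeled 1.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct if the actual label is also positive.
            counts["tp" if a == positive else "fp"] += 1
        else:
            # Predicted negative: a miss if the actual label was positive.
            counts["fn" if a == positive else "tn"] += 1
    return counts


def precision_recall(counts):
    """Derive precision and recall from the four confusion-matrix cells."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels, invented for illustration only.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
precision, recall = precision_recall(counts)
print(dict(counts), precision, recall)
```

The four counts are the cells of the confusion matrix. An ROC curve is built from the same counts, recomputed at a range of decision thresholds.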

Chapter 4. Feature Engineering (FE)

  • Preparing data for ML
  • Dealing with multicollinearity
  • Feature extraction and synthesis
  • Data normalization and scaling
  • Dealing with missing data
  • Data Visualization
  • Lab: Repairing and normalizing data
  • Lab: Data visualization

Chapter 5. Linear Regression (LR)

  • Understanding LR
  • Simple LR
  • Multiple LR
  • Lab: Linear Regression with Spark ML

Chapter 6. Logistic Regression

  • The sigmoid function
  • Understanding Logistic Regression
  • Lab: Applying the Logistic Regression Algorithm
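
For orientation, the sigmoid function listed above maps any real number to a value between 0 and 1, which logistic regression reads as a class probability. The sketch below is plain Python rather than Spark ML. The helper `predict_proba` and the weights passed to it are hypothetical, chosen only to show the idea.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^-z), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z is negative, so exp(z) is at most 1
    return ez / (1.0 + ez)


def predict_proba(weights, bias, features):
    """Probability of the positive class for one example (hypothetical helper)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Made-up coefficients and feature values, for illustration only.
p = predict_proba(weights=[0.8, -0.4], bias=0.1, features=[2.0, 1.0])
print(round(p, 3))
```

A probability above a chosen threshold (commonly 0.5) is then read as the positive class.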

Chapter 7. Classification: SVM (Support Vector Machines)

  • SVM concepts and theory
  • SVM kernels
  • Lab: Getting started with SVM
  • Classification: Decision Trees & Random Forests
  • Classification and Regression Trees (CART) introduction
  • Decision Tree concepts
  • Gini index
  • Information entropy
  • Dealing with model overfitting
  • Bias-Variance Tradeoff
  • Ensemble of weak learners
  • The Random Forest algorithm
  • Lab: Using Random Forest Classifier
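
The Gini index and information entropy topics above both measure how mixed the class labels in a tree node are. A minimal plain-Python sketch follows. The function names are illustrative rather than a Spark API, and a non-empty list of labels is assumed.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of the node's examples that belong to each class."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini_index(labels):
    """Gini impurity: 1 - sum(p_i^2). Zero for a pure node."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)). Zero for a pure node."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


# Hypothetical node contents, for illustration only.
mixed = ["yes", "yes", "no", "no"]
pure = ["yes", "yes", "yes"]
print(gini_index(mixed), entropy(mixed))
print(gini_index(pure), entropy(pure))
```

A decision tree picks the split that reduces the chosen impurity measure the most across the resulting child nodes.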

Chapter 8. Classification: Naive Bayes

  • Naïve Bayes theory
  • Naïve Bayes use cases
  • Lab: Using Naïve Bayes Classifier

Chapter 9. Clustering (K-Means)

  • Understanding K-Means algorithm
  • Running K-Means algorithm in Spark
  • Estimating accuracy of K-Means models
  • Lab: Unsupervised ML with K-Means

Chapter 10. Principal Component Analysis (PCA)

  • Understanding PCA concepts
  • PCA applications
  • Running a PCA algorithm in Spark
  • Evaluating results
  • Lab: Reducing Data Dimensionality with PCA

Chapter 11. Recommendations (Collaborative filtering)

  • Understanding recommender systems
  • Collaborative Filtering concepts

Labs

Lab 1 - Learning the Lab Environment
Lab 2 - Elements of Functional Programming with Python
Lab 3 - The Spark Shell
Lab 4 - Using the spark-submit Tool
Lab 5 - Understanding the DataFrame Object
Lab 6 - Data Transformation with PySpark
Lab 7 - Switching to PySpark Jupyter Notebooks
Lab 8 - Data Visualization with matplotlib
Lab 9 - Descriptive Statistics and EDA
Lab 10 - Data Repair and Normalization in PySpark
Lab 11 - Understanding Linear Regression
Lab 12 - Logistic Regression
Lab 13 - Classification with Naive Bayes
Lab 14 - Random Forest Classification
Lab 15 - Support Vector Machine Classification
Lab 16 - Using the K-Means Algorithm
