Course #: WA2845

Practical Machine Learning with Apache Spark Training

Duration

Three days.

Outline of Practical Machine Learning with Apache Spark Training

Chapter 1. Machine Learning (ML) and Data Science Overview

  • Data Science vs ML vs AI
  • Machine Learning landscape
  • Supervised and unsupervised ML algorithms
  • ML at scale

Chapter 2. ML with Spark Python

  • Spark ML Overview
  • Introduction to Jupyter notebooks
  • Lab: Working with Jupyter + Python + Spark

Chapter 3. Machine Learning (ML) and Statistical Concepts

  • Terminology
  • Descriptive statistics
  • Errors and Residuals
  • ML Model Overfitting / Underfitting
  • Training and testing
  • Cross-validation, bootstrapping
  • Confusion Matrix
  • ROC curve, Area Under Curve (AUC)
  • Lab: Basic stats

Chapter 4. Feature Engineering (FE)

  • Preparing data for ML
  • Dealing with multicollinearity
  • Feature extraction and synthesis
  • Data normalization and scaling
  • Dealing with missing data
  • Data Visualization
  • Lab: Repairing and normalizing data
  • Lab: Data visualization

Chapter 5. Linear Regression (LR)

  • Understanding LR
  • Simple LR
  • Multiple LR
  • Lab: Linear Regression with Spark ML

Chapter 6. Logistic Regression

  • The sigmoid function
  • Understanding Logistic Regression
  • Lab: Applying the Logistic Regression Algorithm

Chapter 7. Classification: SVM (Support Vector Machines)

  • SVM concepts and theory
  • SVM kernels
  • Lab: Getting started with SVM

Classification: Decision Trees & Random Forests

  • Classification and Regression Trees (CART) introduction
  • Decision Tree concepts
  • Gini index
  • Information entropy
  • Dealing with model overfitting
  • Bias-Variance Tradeoff
  • Ensemble of weak learners
  • The Random Forest algorithm
  • Lab: Using Random Forest Classifier

Chapter 8. Classification: Naïve Bayes

  • Naïve Bayes theory
  • Naïve Bayes use cases
  • Lab: Using Naïve Bayes Classifier

Chapter 9. Clustering (K-Means)

  • Understanding K-Means algorithm
  • Running K-Means algorithm in Spark
  • Estimating accuracy of K-Means models
  • Lab: Unsupervised ML with K-Means

Chapter 10. Principal Component Analysis (PCA)

  • Understanding PCA concepts
  • PCA applications
  • Running a PCA algorithm in Spark
  • Evaluating results
  • Lab: Reducing Data Dimensionality with PCA

Chapter 11. Recommendations (Collaborative filtering)

  • Understanding recommender systems
  • Collaborative Filtering concepts