AC3350

Comprehensive Machine Learning with Python Training

This Machine Learning (ML) course begins with a review of core Python concepts. Students then learn how to use Python to import and manipulate data, perform exploratory data analysis, build machine learning models, and evaluate their performance. The training also introduces natural language processing (NLP) and covers H2O, an open-source ML platform.

Course Details

Duration

4 days

Prerequisites

All attendees should have completed the Comprehensive Data Science with Python class or have equivalent experience.

Target Audience

  • Programmers
  • Software Engineers
  • Computer Scientists
  • Data Scientists
  • Data Engineers
  • Data Analysts

Skills Gained

  • Review core Python concepts and learn how to use the Anaconda computing environment
  • Import and manipulate data with Pandas, and perform exploratory data analysis with Pandas and Seaborn
  • Learn the theory behind machine learning and how to apply it to supervised and unsupervised learning tasks
  • Build machine learning models for regression and classification, and evaluate their performance using cross-validation
  • Learn how to use H2O for data munging and machine learning, and explore natural language processing (NLP) techniques for text data

Course Outline

  • Review of Core Python Concepts
    • Anaconda Computing Environment
    • Importing and Manipulating Data with Pandas
    • Exploratory Data Analysis with Pandas and Seaborn
    • NumPy ndarrays Versus Pandas DataFrames
  • An Overview of Machine Learning
    • Machine Learning Theory
    • Data Pre-processing
      • Missing Data
      • Dummy Coding
      • Standardization
      • Data Validation Strategies
    • Supervised Versus Unsupervised Learning
  • Supervised Learning: Regression
    • Linear Regression
    • Penalized Linear Regression
    • Stochastic Gradient Descent
    • Decision Tree Regressor
    • Random Forest Regression
    • Gradient Boosting Regressor
    • Scoring New Data Sets
    • Cross Validation
    • Bias-Variance Tradeoff
    • Feature Importance
  • Supervised Learning: Classification
    • Logistic Regression
    • LASSO
    • Support Vector Machine
    • Random Forest
    • Ensemble Methods
    • Feature Importance
    • Scoring New Data Sets
    • Cross Validation
  • Unsupervised Learning: Clustering
    • Preparing Data for Ingestion
    • K-Means Clustering
    • Visualizing Clusters
    • Comparison of Clustering Methods
    • Agglomerative Clustering and DBSCAN
    • Evaluating Cluster Performance with Silhouette Scores
    • Scaling
    • Mean Shift, Affinity Propagation and Birch
    • Scaling Clustering with Mini-Batch Approaches
  • Clustering for Treatment Effect Heterogeneity
    • Understanding Average Versus Conditional Treatment Effects
    • Estimating Conditional Average Treatment Effects for a Sample
    • Summarizing and Interpreting
  • Data Munging and Machine Learning via H2O
    • Intro to H2O
    • Launching the Cluster and Checking Status
    • Data Import and Manipulation in H2O
    • Fitting Models in H2O
    • Generalized Linear Models
    • Naïve Bayes
    • Random Forest
    • Gradient Boosting Machine (GBM)
    • Ensemble Model Building
    • AutoML
    • Data Preparation
    • Leaderboards
    • Methods for Explaining Modeling Output
  • Introduction to Natural Language Processing (NLP)
    • Transforming Raw Text Data into a Corpus of Documents
    • Identifying Methods for Representing Text Data
    • Transformations of Text Data
    • Summarizing a Corpus into a TF-IDF Matrix
    • Visualizing Word Frequencies
  • NLP Normalization, Parts-of-Speech and Topic Modeling
    • Installing and Accessing Sample Text Corpora
    • Tokenizing Text
    • Cleaning/Processing Tokens
    • Segmentation
    • Tagging and Categorizing Tokens
    • Stopwords
    • Vectorization Schemes for Representing Text
    • Parts-of-Speech (POS) Tagging
    • Sentiment Analysis
    • Topic Modeling with Latent Semantic Analysis
  • NLP and Machine Learning
    • Unsupervised Machine Learning and Text Data
    • Topic Modeling via Clustering
    • Supervised Machine Learning Applications in NLP
  • Conclusion