Web Age Solutions Inc
Providing Technology Training and Mentoring For Modern Technology Adoption
US Inquiries / 1.877.517.6540
Canadian Inquiries / 1.877.812.8887
Course #: WA3057

Data Science and Data Engineering for Architects Training

This intensive training course covers both the theory and the practice of Data Science and Data Engineering. Students are introduced to the relevant concepts, terminology, and tools used in the field. Hands-on exercises complement the lectures and help attendees reinforce their understanding of the material.

TOPICS

Applied data science, business analytics, and data engineering

Common data science/machine learning algorithms for supervised and unsupervised machine learning

NumPy, pandas, matplotlib, seaborn, scikit-learn

Python REPLs

Jupyter notebooks

Data analytics life-cycle phases

Data repairing and normalizing

Data aggregation and grouping

Data visualization and EDA

Operational data analytics

Distributed and scalable data processing

Cloud machine learning and data engineering capabilities

AUDIENCE

IT architects and technical managers

PREREQUISITES

Participants should have a working knowledge of Python (or the programming background to pick up Python’s syntax quickly) and be familiar with core statistical concepts such as variance and correlation.

DURATION

4 days

Outline of Data Science and Data Engineering for Architects Training

Chapter 1. Defining Data Science

  • What is Data Science?
  • Data Science, Machine Learning, AI?
  • The Data-Related Roles
  • The Data Science Ecosystem
  • Tools of the Trade
  • Who is a Data Scientist?
  • The Data Science Skill Sets
  • Data Scientists at Work
  • Examples of Data Science Projects
  • An Example of a Data Product
  • Applied Data Science at Google
  • Data Science Gotchas
  • Summary

Chapter 2. Defining Data Engineering

  • Data is King
  • Translating Data into Operational and Business Insights
  • What is Data Engineering
  • The Data Engineer Role
  • Core Skills and Competencies
  • The Data Exchange Interoperability Options
  • Summary

Chapter 3. Distributed Computing Concepts

  • The Traditional Client–Server Processing Pattern
  • Enter Distributed Computing
  • Data Physics
  • Data Locality (Distributed Computing Economics)
  • The CAP Theorem
  • Mechanisms to Guarantee a Single CAP Property
  • Eventual Consistency
  • The NoSQL Systems CAP Triangle
  • Summary

Chapter 4. Data Processing Phases

  • Typical Data Processing Pipeline
  • Data Discovery Phase
  • Data Harvesting Phase
  • Data Priming Phase
  • Exploratory Data Analysis
  • Model Planning Phase
  • Model Building Phase
  • Communicating the Results
  • Production Roll-out
  • Data Logistics and Data Governance
  • Data Processing Workflow Engines
  • Apache Airflow
  • Data Lineage and Provenance
  • Apache NiFi
  • Summary

Chapter 5. Practical Introduction to NumPy

  • SciPy
  • NumPy
  • The First Take on NumPy Arrays
  • Getting Help
  • Understanding Axes
  • Indexing Elements in a NumPy Array
  • NumPy Arrays
  • Understanding Types
  • Re-Shaping
  • Commonly Used Array Metrics
  • Commonly Used Aggregate Functions
  • Sorting Arrays
  • Vectorization
  • Broadcasting
  • Filtering
  • Array Arithmetic Operations
  • Array Slicing
  • 2-D Array Slicing
  • The Linear Algebra Functions
  • Summary

Chapter 6. Practical Introduction to pandas

  • What is pandas?
  • The Series Object
  • Accessing Values and Indexes in Series
  • Setting Up Your Own Index
  • Using the Series Index as a Lookup Key
  • Can I Pack a Python Dictionary into a Series?
  • The DataFrame Object
  • The DataFrame's Value Proposition
  • Creating a pandas DataFrame
  • Getting DataFrame Metrics
  • Accessing DataFrame Columns
  • Accessing DataFrame Rows
  • Accessing DataFrame Cells
  • Using iloc
  • Using loc
  • Examples of Using loc
  • DataFrames are Mutable via Object Reference!
  • Deleting Rows and Columns
  • Adding a New Column to a DataFrame
  • Appending / Concatenating DataFrame and Series Objects
  • Example of Appending / Concatenating DataFrames
  • Re-indexing Series and DataFrames
  • Getting Descriptive Statistics of DataFrame Columns
  • Getting Descriptive Statistics of DataFrames
  • Applying a Function
  • Sorting DataFrames
  • Reading From CSV Files
  • Writing to the System Clipboard
  • Writing to a CSV File
  • Fine-Tuning the Column Data Types
  • Changing the Type of a Column
  • What May Go Wrong with Type Conversion
  • Summary

Chapter 7. Data Grouping and Aggregation with pandas

  • Data Aggregation and Grouping
  • Sample Data Set
  • The pandas.core.groupby.SeriesGroupBy Object
  • Grouping by Two or More Columns
  • Emulating SQL's WHERE Clause
  • The Pivot Tables
  • Cross-Tabulation
  • Summary

Chapter 8. Descriptive Statistics Computing Features in Python

  • Descriptive Statistics
  • Non-uniformity of a Probability Distribution
  • Using NumPy for Calculating Descriptive Statistics Measures
  • Finding Min and Max in NumPy
  • Using pandas for Calculating Descriptive Statistics Measures
  • Correlation
  • Regression and Correlation
  • Covariance
  • Getting Pairwise Correlation and Covariance Measures
  • Finding Min and Max in pandas DataFrame
  • Summary

Chapter 9. Repairing and Normalizing Data

  • Repairing and Normalizing Data
  • Dealing with the Missing Data
  • Sample Data Set
  • Getting Info on Null Data
  • Dropping a Column
  • Interpolating Missing Data in pandas
  • Replacing the Missing Values with the Mean Value
  • Scaling (Normalizing) the Data
  • Data Preprocessing with scikit-learn
  • Scaling with the scale() Function
  • The MinMaxScaler Object
  • Summary

Chapter 10. Data Visualization in Python

  • Data Visualization
  • EDA (Exploratory Data Analysis)
  • Data Visualization in Python
  • Matplotlib
  • Seaborn
  • The matplotlib.pyplot.plot() Function
  • The matplotlib.pyplot.bar() Function
  • The matplotlib.pyplot.pie() Function
  • Subplots
  • The matplotlib.pyplot.subplot() Function
  • Figures
  • Saving Figures to a File
  • Understanding boxplots
  • Histograms and KDE
  • Plotting Bivariate Distributions
  • Heatmaps
  • ggplot
  • Summary

Chapter 11. Data Science and ML Algorithms

  • In-Class Discussion
  • Types of Machine Learning
  • Terminology: Features and Observations
  • Representing Observations
  • Terminology: Labels
  • Terminology: Continuous and Categorical Features
  • Continuous Features
  • Categorical Features
  • Common Distance Metrics
  • The Euclidean Distance
  • What is a Model
  • Supervised vs Unsupervised Machine Learning
  • Supervised Machine Learning Algorithms
  • Classification (Supervised ML) Examples
  • Unsupervised Machine Learning Algorithms
  • Unsupervised Learning: Clustering
  • Clustering Examples
  • The scikit-learn Package
  • Terminology: Estimators, Models, and Predictors
  • Model Evaluation
  • The Error Rate
  • Confusion Matrix
  • The Binary Classification Confusion Matrix
  • Multi-class Classification Confusion Matrix Example
  • Feature Engineering
  • Scaling of the Features
  • Feature Blending (Creating Synthetic Features)
  • The 'One-Hot' Encoding Scheme
  • Example of 'One-Hot' Encoding Scheme
  • Bias-Variance (Underfitting vs Overfitting) Trade-off
  • The Modeling Error Factors
  • One Way to Visualize Bias and Variance
  • Underfitting vs Overfitting Visualization
  • Balancing Off the Bias-Variance Ratio
  • Regularization
  • Dimensionality Reduction
  • The Advantages of Dimensionality Reduction
  • PCA and isomap
  • The LIBSVM format
  • Life-cycles of Machine Learning Development
  • Data Splitting into Training and Test Datasets
  • ML Model Tuning Visually
  • Cross-Validation Technique
  • Classifying with k-Nearest Neighbors
  • k-Nearest Neighbors Algorithm
  • Regression Analysis
  • Simple Linear Regression Model
  • Linear Regression Illustration
  • Least-Squares Method (LSM)
  • Gradient Descent Optimization
  • Multiple Regression Analysis
  • Evaluating Regression Model Accuracy
  • The R²
  • The MSE Model Score
  • Logistic Regression (Logit)
  • Linear Logistic Regression Results
  • Decision Trees
  • Decision Tree Terminology
  • Properties of Decision Trees
  • Decision Tree Classification in the Context of Information Theory
  • The Simplified Decision Tree Algorithm
  • Using Decision Trees
  • Random Forests
  • Support Vector Machines (SVMs)
  • Naive Bayes Classifier (SL)
  • Naive Bayesian Probabilistic Model in a Nutshell
  • Bayes Formula
  • Classification of Documents with Naive Bayes
  • k-Means Clustering (UL)
  • k-Means Clustering in a Nutshell
  • k-Means Characteristics
  • Global vs Local Minimum Explained
  • XGBoost
  • The Typical ML Workflow
  • Which Algorithm to Choose?
  • A Better Algorithm or More Data?
  • Neural Networks and Deep Learning
  • Deep Learning vs Traditional ML
  • Summary

Chapter 12. Parallel Data Processing with PySpark

  • What is Apache Spark
  • The Spark Platform
  • Languages Supported by Spark
  • Running Spark on a Cluster
  • The Spark Shell
  • The High-Level Execution Flow in Stand-alone Spark Cluster
  • The Spark Application Architecture
  • The Resilient Distributed Dataset (RDD)
  • The Lineage Concept
  • Datasets and DataFrames
  • Data Partitioning
  • Data Partitioning Diagram
  • Finding the Most Frequently Used Words in PySpark
  • Summary

Chapter 13. Operational Data Analytics with Splunk

  • Splunk Defined
  • Splunk Products
  • Splunk Editions
  • Deployment Options
  • Common Components
  • Splunk Admin Dashboard (Web UI)
  • Events
  • Data Indexing
  • Web UI for Adding Data to Indexer
  • Distributed Splunk Indexing and Searching
  • Architecture for a Multi-Tier Splunk Enterprise Deployment
  • Data Source Types
  • The Source Types Automatically Recognized by Splunk
  • The "Pre-trained" Data Source Types
  • Windows® Data Sources
  • Custom Event Format
  • Web UI: Adding Data Flow for Local File Upload
  • Web UI: Add Data for Monitoring
  • Data Searching
  • Search Processing Language (SPL)
  • Searching and Reporting Activities
  • The Search Page
  • Core Search Concepts
  • The Search Basics
  • Search Command Categories
  • Command Examples
  • More Examples of Search Commands
  • Statistical and Time Functions
  • From SQL to SPL - the Translation Table
  • Visualizations
  • Save Your Searches as Dashboards
  • Summary

Chapter 14. Python as a Cloud Scripting Language

  • Python's Value
  • Python on AWS
  • AWS SDK For Python (boto3)
  • What is Serverless Computing?
  • How Functions Work
  • The AWS Lambda Event Handler
  • What is AWS Glue?
  • PySpark on Glue - Sample Script
  • Summary

Chapter 15. Amazon SageMaker

  • What is SageMaker
  • ML with SageMaker
  • The ML Phases Diagram
  • Supported Systems and Frameworks
  • ML Algorithms Supported by SageMaker
  • SageMaker in the AWS Management Console
  • Ground Truth
  • Notebooks
  • Training
  • Training Options
  • The Model Training Flow Diagram
  • Inference
  • Deployment of Models to the SageMaker Hosting Service
  • The SageMaker Hosting Service Architecture
  • Improving Your ML Models
  • The AWS Marketplace of ML Algorithms
  • EC2 P3 Instances
  • SageMaker Pricing
  • Summary

Chapter 16. Introduction to AWS Glue

  • What is AWS Glue?
  • AWS Glue Components
  • AWS Glue Components (Cont'd)
  • Managing Notebooks
  • AWS Glue Components (Cont'd)
  • Putting it Together: The AWS Glue Environment Architecture
  • AWS Glue Main Activities
  • Additional Glue Services
  • When To Use AWS Glue?
  • Integration with other AWS Services
  • Summary
We regularly offer classes in these and other cities: Atlanta, Austin, Baltimore, Calgary, Chicago, Cleveland, Dallas, Denver, Detroit, Houston, Jacksonville, Miami, Montreal, New York City, Orlando, Ottawa, Philadelphia, Phoenix, Pittsburgh, Seattle, Toronto, Vancouver, Washington DC.