This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3057datascienceanddataengineeringforarchitects.
1.1 What is Data Science?
 Data science focuses on the extraction of knowledge and business insights from data
 It does so by leveraging techniques and theories from many applied and pure science fields such as statistics, pattern recognition, machine learning, data warehousing, data visualization, scalable and high-performance computing, etc.
1.2 Data Science, Machine Learning, AI?
 Machine learning (ML) is a subset of data science that uses existing data to train ML algorithms to make predictions or take action on new (never seen before) data
 Existing (training) data can be either labeled (classified by humans) or unlabeled
 ML is also sometimes referred to as data mining or predictive analytics
 Data science includes, in addition to ML, statistics, advanced data analysis, data visualization, data engineering, etc.
 Artificial Intelligence (AI) aims at automating/augmenting/substituting complex human activities through a number of specialized computer-assisted solutions
 Some of the solutions are based on deep learning through neural networks
1.3 The Data Science Ecosystem
Source: http://en.wikipedia.org/wiki/File:DataScienceDisciplines.png
Notes:
Another take on the Data Science Skill Sets
The Data Science Skill Sets Venn Diagram
Source: http://drewconway.com/zia/2013/3/26/thedatasciencevenndiagram
1.4 Tools of the Trade
 Standalone Python:
 Modules and libraries: scikit-learn, NumPy, pandas, matplotlib, seaborn
 Dev tools: Jupyter notebooks, Visual Studio Code, PyCharm
 The Apache Spark scalable platform:
 A choice of programming languages: Python (called PySpark), Scala, and Java
 Spark ML module
 Dev tools: Spark Shell, Jupyter notebooks
 R statistical programming language
 Deep Learning:
 TensorFlow with its high-level Python API called Keras; PyTorch
1.5 The Data-Related Roles
 Data-driven organizations establish the following three data-related roles, which are highly interconnected:
 Data Scientist
 Someone who uses existing data to train machine learning (ML) algorithms to make predictions and/or generalize (take actions) on new (never seen before) data; practitioners in the field apply scientific experimentation techniques when working with data, trying out different ML models
 Data Analyst
 Someone who uses traditional business intelligence (BI) tools to understand, describe, categorize, and report on the existing data
 Data Engineer
 Most of a data engineer’s activities fall under the category of ETL (Extract, Transform, and Load) processes and are carried out in support of the data needs of the above two roles
1.6 Data Scientists at Work
 Jeff Hammerbacher, who built the “Data” team at Facebook, described the work done by their data science group as follows:
 “… on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analysis to other members of the organization.”
Notes:
Data analysts and data scientists will do themselves and their organizations a big favor by learning basic data engineering skills.
As Maxime Beauchemin wrote in his article [ http://bit.ly/DATENG2019 ]:
“I joined Facebook in 2011 as a business intelligence engineer. By the time I left in 2013, I was a data engineer.
I wasn’t promoted or assigned to this new role. Instead, Facebook came to realize that the work we were doing transcended classic business intelligence. The role we’d created for ourselves was a new discipline entirely.
My team was at forefront of this transformation. We were developing new skills, new ways of doing things, new tools, and — more often than not — turning our backs to traditional methods.
We were pioneers. We were data engineers!”
1.7 Examples of Data Science Projects
 Build correlation models based on user requests/searches/product reviews (or any other data collected from users) to predict users’ choices
 Engage user data in a feedback loop in which it contributes to improving the company’s products and services
 Develop a new customer segmentation model for the marketing department
 Recommendation systems (to facilitate cross-selling)
 Sentiment analysis
 Fraud detection
1.8 The Concept of a Data Product
 One of the facets of data science as a discipline is to identify the data aspect of user activities
 In some cases, a separate data product needs to be created that would help gain insight into user activities
 An early data product on the Web was the CDDB database (http://en.wikipedia.org/wiki/CDDB) built by Gracenote for the task of identifying CDs
 Problem: The audio CD format does not include metadata about the CD (the disc or track titles, performing artists, etc.)
 Solution: A digital “fingerprint” of a CD is created by performing calculations on the CD’s track lengths which then can be used as an index to the CD metadata stored in the CDDB online database
 Now, with this data product in place, a whole range of usage/business analytics can be performed using it
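The fingerprint-as-index idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual CDDB disc ID algorithm: the `fingerprint` function, the choice of SHA-1, and the track lengths and metadata are all made up for the example.

```python
import hashlib

def fingerprint(track_lengths_sec):
    """Derive a lookup key from a CD's track lengths (hypothetical scheme)."""
    # Join the track lengths into a canonical string, e.g. "215,187,240"
    canonical = ",".join(str(n) for n in track_lengths_sec)
    # Hash the string so the key has a fixed size regardless of track count
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:8]

# A toy in-memory stand-in for the online metadata database (made-up data)
metadata_db = {
    fingerprint([215, 187, 240]): {"title": "Example Album", "artist": "Example Artist"},
}

# Looking up a disc: compute its fingerprint, then use it as the index
disc = [215, 187, 240]
print(metadata_db.get(fingerprint(disc)))
```

Once such lookups are logged, the log itself becomes the raw material for the usage analytics mentioned above.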
1.9 Applied Data Science at Google
 Google’s PageRank algorithm was among the first to rank websites in their search engine results based on the number and quality of links pointing to a page
 Google built their infrastructure around this concept
 During the Swine Flu epidemic of 2009, Google used their search data to predict flu trends around the world
 Google identified a correlation between how many people search for flu-related topics and how many people actually have flu symptoms
1.10 Data Science and ML Terminology: Features and Observations
 In data science, machine learning (ML), and statistics, features are variables (like the year a house was built, number of rooms in a house, presence of a pool, etc.) that are used in making predictions (e.g. the price of the house); they are also called predictors or independent variables
 A feature is similar to a relational table’s column (entity attribute, or property)
 Features are the inputs for an ML model
 The value that you predict using features is referred to as the response, outcome, predicted variable, or dependent variable
 An observation is a data point, a single recorded instance of a phenomenon in a problem domain, a.k.a. a sample or example
 An observation is like a table’s row or record
Notes:
For more terminology used in data science and ML, visit
https://mlcheatsheet.readthedocs.io/en/latest/glossary.html#glossaryinstance
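As a minimal sketch of this terminology, the house example can be laid out as a pandas DataFrame in which each row is an observation and each column is either a feature or the response. The column names and values below are made up for illustration.

```python
import pandas as pd

# Each row is one observation (a house); the values are made up for illustration
houses = pd.DataFrame({
    "year_built": [1995, 2008, 1978],
    "rooms": [5, 7, 4],
    "has_pool": [0, 1, 0],
    "price": [310_000, 525_000, 240_000],
})

# Features (predictors, independent variables): the model's inputs
X = houses[["year_built", "rooms", "has_pool"]]
# Response (dependent variable): the value we want to predict
y = houses["price"]

print(X.shape)  # (number of observations, number of features)
print(y.shape)
```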
1.11 Terminology: Labels and Ground Truth
 A label is a type/class of object that we assign to an observation
 You can have labeled and unlabeled observations (examples). Labeled observations are mostly used in classification, and their labels are referred to as the “ground truth”. Unlabeled observations occur when no labels are available; in so-called unsupervised ML, a machine learning algorithm groups those observations using some sort of data grouping/clustering mechanism
 Labels are also used in linear regression models to denote the numeric values we are trying to predict (e.g. the global temperature in the year 2051)
1.12 Label Examples
 Label examples:
 Trading recommendation: Buy, Sell, Hold
 Email category: Spam, Non-spam
 Disease outbreak category: Outbreak, Endemic, Epidemic, Pandemic
 House sale price: A numeric value
 Labels are usually encoded with some numeric values, e.g. the trading recommendations Buy, Sell, and Hold could be encoded as 0, 1, and 2
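A minimal sketch of such an encoding, using a hand-written mapping so that the codes match the ones given above (the variable names are illustrative):

```python
# Explicit mapping that matches the encoding described above
label_to_code = {"Buy": 0, "Sell": 1, "Hold": 2}
code_to_label = {code: label for label, code in label_to_code.items()}

recommendations = ["Buy", "Hold", "Sell", "Buy"]
encoded = [label_to_code[r] for r in recommendations]
decoded = [code_to_label[c] for c in encoded]

print(encoded)  # [0, 2, 1, 0]
print(decoded)  # ['Buy', 'Hold', 'Sell', 'Buy']
```

Library encoders such as scikit-learn's `LabelEncoder` choose the codes themselves (in sorted label order), so their numbers may differ from this hand-written mapping.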
1.13 Terminology: Continuous and Categorical Features
 Features can be of two types:
 Continuous: something that can be physically or theoretically measured in numeric values, e.g. blood pressure, size of a black hole, plane speed, humidity, etc.
 Categorical: discrete, enumerated values like hurricane category, day of the week, car type, etc.; this feature type, in turn, is divided into nominal and ordinal features:
 Nominal categories have no ordering, e.g. card suits: hearts, diamonds, spades, and clubs (ordering may be card game-specific)
 Ordinal categories imply some sort of ordering, e.g. the ranks in each suit of playing cards: Ace, 2, 3, 4, …, J, Q, K
Notes:
Feature types visually
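The nominal/ordinal distinction can be sketched with pandas categoricals. This is an illustration only: it assumes the Ace-low rank order from the example above, and the sample values are made up.

```python
import pandas as pd

# Nominal feature: categories without an order
suits = pd.Categorical(["hearts", "spades", "clubs"],
                       categories=["hearts", "diamonds", "spades", "clubs"])

# Ordinal feature: categories with an explicit order (Ace low, as in the text)
rank_order = ["Ace", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
ranks = pd.Categorical(["Q", "2", "Ace"], categories=rank_order, ordered=True)

print(suits.ordered)             # False
print(ranks.ordered)             # True
print(list(ranks.codes))         # position of each value in rank_order
print(ranks.min(), ranks.max())  # comparisons are meaningful only when ordered
```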
1.14 Encoding Categorical Features using One-Hot Encoding Scheme
 When dealing with categorical features (nominal or ordinal), e.g. trading recommendations made: buy, sell, or hold, you need to have a way to encode the input examples for further processing
 The common encoding technique is the “One-Hot” scheme, which works as follows (see the next section for an example):
 Introduce as many variables/features as there are distinct values in the categorical feature; the variables are usually named using the pattern <feature name>_<categorical value>, e.g. trade_buy, trade_sell, and trade_hold
 Initialize the variables using the one-hot encoding scheme:
 Assign 1 to the variable if the observation has the matching category name; 0 otherwise
 In the one-hot transformation, you essentially end up with as many new features as there are levels in that categorical variable
1.15 Example of ‘One-Hot’ Encoding Scheme
 In our go-to trading actions example, the one-hot encoding will create these three new variables/features (that may be named trade_buy, trade_sell, and trade_hold) holding either 0 or 1:
trade_buy  trade_sell  trade_hold
1          0           0           # the mapping of 'buy'
0          0           1           # the mapping of 'hold'
1          0           0           # the mapping of 'buy' again
0          1           0           # the mapping of 'sell'
 As you can see, the new three-feature set forms a sparse matrix
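One way to produce this encoding in Python is with pandas. The sketch below assumes the input column is named `trade`; `get_dummies` then derives the new column names from that name.

```python
import pandas as pd

trades = pd.DataFrame({"trade": ["buy", "hold", "buy", "sell"]})

# One new 0/1 column per distinct category, named <feature name>_<category value>
one_hot = pd.get_dummies(trades, columns=["trade"], dtype=int)
# Reorder the columns to match the table above (get_dummies sorts them by category)
one_hot = one_hot[["trade_buy", "trade_sell", "trade_hold"]]

print(one_hot)
```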
1.16 Gartner’s Magic Quadrant for Data Science and Machine Learning Platforms (a Labeling Example)
Notes:
Gartner Research developed a process for assessing the ability of companies in a specific industry to innovate and deliver value to their customers. In addition to labeling companies with one of four categories (Niche Players, Visionaries, Challengers, and Leaders) according to their ability to execute and the completeness of their strategy and vision, Gartner also uses the spatial position of each company within the Magic Quadrant.
1.17 Machine Learning in a Nutshell
 At the core of ML lies the concept of distance between observations (data points) that helps measure the degree of proximity/affinity/similarity between them
 During the model training phase, ML algorithms infer a decision boundary (a function) for grouping/classifying objects; the boundary discriminates between objects using distances between observations
 Regression ML models predict a dependent (response) variable based on the historical/known values of at least one independent (explanatory/predictor) variable
1.18 Common Distance Metrics
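 For continuous numeric variables, the Minkowski distance is used, which has this generic form for two n-dimensional points x and y and order p ≥ 1:
$$D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$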
 The Minkowski distance has three special cases:
 For p=1, the distance is known as the Manhattan distance (a.k.a. the L1 norm)
 For p=2, the distance is known as the Euclidean distance (a.k.a. the L2 norm)
 When p → +infinity, the distance is known as the Chebyshev distance
 In text classification scenarios, a commonly used distance metric is the Hamming distance, which counts the positions at which two equal-length sequences differ
Notes:
Calculating the L2 norm of a two-feature vector defined as [3, 4] (essentially, the hypotenuse of a right-angled triangle with sides 3 and 4, per the Pythagorean theorem) using the NumPy API (two options):
The “hard” way:
import numpy as np
x = np.array([3, 4])
np.sqrt(np.sum(x * x))  # 5.0
The “easy” way:
from numpy.linalg import norm
norm(x, ord=2)  # 5.0
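The three special cases of the Minkowski distance listed above can be checked with a small NumPy helper. The `minkowski` function below is a sketch written for this example, not a library API.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        # Limit p -> +infinity: the largest coordinate difference (Chebyshev)
        return diff.max()
    return np.sum(diff ** p) ** (1.0 / p)

a, b = [0, 0], [3, 4]
print(minkowski(a, b, 1))       # Manhattan: 7.0
print(minkowski(a, b, 2))       # Euclidean: 5.0
print(minkowski(a, b, np.inf))  # Chebyshev: 4.0
```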
1.19 The Euclidean Distance
 The most commonly used distance in ML for continuous numeric variables is the Euclidean distance
 In Cartesian coordinates, if we have two data points in Euclidean n-space, p and q, the distance from p to q (or from q to p) is given by the Pythagorean formula:
$$d(p, q) = d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$
1.20 Decision Boundary Examples (Object Classification)
We have a two-feature (plotted along the X and Y coordinates) dataset of objects of two classes (depicted as red circles and golden triangles).
Adapted from https://www.semanticscholar.org/
1.21 What is a Model?
 In data science and ML, a model is a formula, an algorithm, or a prediction function that establishes a relationship between
 features (predictors) that act as the model’s input and
 labels (the output/predicted variable) that act as the model’s output
 A model is trained to predict (make an inference of) the labels or predict continuous values
1.22 Training a Model to Make Predictions
 There are two major lifecycle phases of an ML model:
 Model training (fitting)
 You train or let your model learn on labeled observations (examples) fed into the model
 During training, the model seeks a set of weights — variable coefficients — (and biases, if applicable) that minimize loss
 Loss is a quantitative measure of error, e.g. the mean squared error
 Inference (predicting)
 Here you use your trained model to calculate/predict the labels of unlabeled observations (examples) or numeric values
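The two phases can be sketched with scikit-learn's `LinearRegression`. The data below is synthetic (generated from y = 3x + 2 plus noise), so the "true" weight and bias are assumptions of the example rather than anything from a real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic labeled observations: y = 3x + 2 plus a little noise (made-up data)
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(50, 1))
y_train = 3 * X_train[:, 0] + 2 + rng.normal(0, 0.5, size=50)

# Training (fitting): find the weight and bias that minimize squared error
model = LinearRegression()
model.fit(X_train, y_train)
print(model.coef_, model.intercept_)  # should be close to 3 and 2

# Loss on the training data, measured as the mean squared error
train_mse = mean_squared_error(y_train, model.predict(X_train))
print(train_mse)

# Inference (predicting): apply the trained model to new, unlabeled observations
X_new = np.array([[4.0], [7.5]])
print(model.predict(X_new))
```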
1.23 Types of Machine Learning
 There are three main types of machine learning (ML):
 unsupervised learning
 supervised learning, and
 reinforcement learning
 In this course, we will be dealing only with the first two types: unsupervised and supervised learning
 FYI: The goal of reinforcement learning is to instruct computer-based algorithms to select actions that maximize a domain-specific gain or minimize a cost (which, essentially, emulates the way humans learn)
1.24 Supervised vs Unsupervised Machine Learning
Supervised learning (SL):
 SL defines a target variable that needs to be predicted/estimated by applying an SL algorithm using predictor (independent) variables (features)
 SL algorithms are built on top of mathematical formulas with predictive capacity
 SL uses labeled examples
 Classification and regression are examples of SL algorithms
Unsupervised learning (UL):
 UL is the opposite of SL: UL does not have the concept of a target value that needs to be found or estimated; rather, a UL algorithm can, for example, deal with the task of grouping (forming clusters of) similar items together based on some automatically defined or discovered criteria of data elements’ affinity (an automatic classification technique)
 UL uses unlabeled examples
 In essence, UL attempts to extract patterns without much human intervention
Notes:
Some classification systems are referred to as expert systems; they are created to let computers take much of the technical drudgery out of data processing, leaving humans with the authority, in most cases, to make the final decision.
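The contrast between the two approaches can be sketched on a single toy dataset with scikit-learn. The data points and the choice of k-nearest neighbors and k-means are assumptions made for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated groups of 2-feature observations (made-up data)
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
              [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, available only in the supervised case

# Supervised: learn from the labeled examples, then predict a label for a new point
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[4.9, 5.2]]))

# Unsupervised: no labels are given; the algorithm groups similar points itself
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster ids are arbitrary: 0/1 may be swapped relative to y
```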
1.25 Supervised Machine Learning Algorithms
 Some of the more popular supervised ML algorithms are:
 Decision Trees/Random Forest
 k-Nearest Neighbors (kNN)
 Naive Bayes
 Regression (simple linear, multiple, locally weighted, etc.)
 Support Vector Machines (SVMs)
 Logistic Regression
1.26 Unsupervised Machine Learning Algorithms
 Some of the more popular unsupervised ML algorithms are:
 k-Means
 Hierarchical clustering
 Gaussian mixture models
 Dimensionality reduction falls into the realm of unsupervised learning:
 PCA, Isomap, t-SNE (2D visualizations of high-dimensional datasets)
1.27 Which ML Algorithm to Choose?
Notes:
The rules below may help point you in the right direction, but they are not set in stone.
If you are trying to find the probability of an event or predict a value based on existing historical observations, look at the supervised learning (SL) algorithms. Otherwise, refer to unsupervised learning (UL).
If you are dealing with discrete (categorical) values like TRUE:FALSE, bad:good:excellent, buy:hold:sell, etc., you need to go with the classification algorithms of SL.
If you are dealing with continuous numerical values, you need to go with the regression algorithms of SL.
If you want to let the machine categorize data into a number of groups, you need to go with the clustering algorithms of UL.
1.28 Bias-Variance (Underfitting vs Overfitting) Tradeoff
 Underfitting is a property of a model that makes it less accurate by virtue of being too generic, or biased
 Such a model is too simple: it fails to account for some important regularities in the training data, and it has low variance in its predictions
 Overfitting is the opposite of underfitting: it makes your model too sensitive to noise/variance in your training data
 Usually, this property is exhibited by more complex models that try to describe the training data as closely as possible
 A good model strikes a balance between bias and overreaction to variance (the bias-variance balance/tradeoff)
 The bias-variance tradeoff applies to classification and regression models (supervised learning)
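A rough way to see underfitting and overfitting is to fit polynomials of increasing degree to a small noisy sample and compare the error on the training data with the error on held-out data. The sketch below uses made-up data (a sine wave plus noise) and NumPy's `Polynomial.fit`; the degrees 1, 3, and 9 are arbitrary choices for the illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def noisy_sine(n):
    """Made-up data: a sine wave plus Gaussian noise."""
    x = rng.uniform(0, 1, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=n)

x_train, y_train = noisy_sine(15)
x_test, y_test = noisy_sine(200)

train_mse, test_mse = {}, {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit of the given degree
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse[degree] = np.mean((model(x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((model(x_test) - y_test) ** 2)
    print(degree, train_mse[degree], test_mse[degree])
```

Training error can only go down as the degree grows, so it is the error on held-out data that reveals overfitting; the exact numbers depend on the random seed.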
Notes:
Balancing Off the Bias-Variance Ratio
The common techniques to balance off the bias-variance ratio are
 Dimensionality reduction, feature selection, and regularization
Dimensionality reduction is the process of transforming the original feature set into another one with fewer features: features may be dropped or combined using some inter-feature relationships.
Examples of dimensionality reduction:
 Compressing a video stream by reducing the number of colors and/or pixels
 Creating a digest (executive summary) of some textual material
Regularization techniques introduce a penalty (a sort of dial knob) that can programmatically decrease high variance by increasing the model’s bias (and vice versa); generally, this leads to smoother decision boundaries and simpler ML models.
Another way to decrease variance is by getting larger training sets.
Many ML algorithms offer some configuration mechanisms (called hyperparameters) to control bias and variance.
scikit-learn’s Ridge regression algorithm improves on ordinary linear regression models by introducing the alpha hyperparameter, which is a penalty on the size of the regression coefficients, e.g.:
from sklearn import linear_model
# X (feature matrix) and y (target values) are assumed to be defined already
regModel = linear_model.Ridge(alpha=0.01)
regModel.fit(X, y)
…
To learn more about regularization support in scikit-learn, visit http://scikit-learn.org/stable/modules/linear_model.html
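The effect of the alpha penalty can be observed by fitting Ridge models with increasing alpha values and watching the coefficient vector shrink. The data and the alpha values below are made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Made-up data: 20 observations, 5 features, only the first feature matters
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
y = 4 * X[:, 0] + rng.normal(0, 1.0, size=20)

norms = {}
for alpha in (0.01, 1.0, 100.0):
    model = Ridge(alpha=alpha).fit(X, y)
    # Size of the coefficient vector: the quantity that alpha penalizes
    norms[alpha] = np.linalg.norm(model.coef_)
    print(alpha, norms[alpha])
```

A larger alpha trades variance for bias: the coefficients are pulled toward zero, which gives a simpler, smoother model.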
1.29 Underfitting vs Overfitting (a Regression Model Example) Visually
1.30 ML Model Evaluation
Notes:
There are a number of other model evaluation metrics, e.g. the ROC (Receiver Operating Characteristic) curve, that we are not discussing here.
1.31 Mean Squared Error (MSE) and Mean Absolute Error (MAE)
