
Defining Data Science for Architects


This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3057-data-science-and-data-engineering-for-architects.

1.1 What is Data Science?

  • Data science focuses on the extraction of knowledge and business insights from data
    • It does so by leveraging techniques and theories from many applied and pure science fields such as statistics, pattern recognition, machine learning, data warehousing, data visualization, scalable and high-performance computing, etc.

1.2 Data Science, Machine Learning, AI?

  • Machine learning (ML) is a subset of data science that uses existing data to train ML algorithms to make predictions or take action on new (never seen before) data
    • Existing (training) data can be either labeled (classified by humans) or unlabeled
  • ML is also sometimes referred to as data mining or predictive analytics
  • Data science includes, in addition to ML, statistics, advanced data analysis, data visualization, data engineering, etc.
  • Artificial Intelligence (AI) aims at automating/augmenting/substituting complex human activities through a number of specialized computer-assisted solutions
    • Some of the solutions are based on deep learning through neural networks

1.3 The Data Science Ecosystem

Source: http://en.wikipedia.org/wiki/File:DataScienceDisciplines.png


Another take on the Data Science Skill Sets

The Data Science Skill Sets Venn Diagram

Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

1.4 Tools of the Trade

  • Stand-alone Python:
    • Modules and libraries: scikit-learn, NumPy, pandas, matplotlib, seaborn
    • Dev tools: Jupyter notebooks, Visual Studio Code, PyCharm
  • The Apache Spark scalable platform:
    • A choice of programming languages: Python (called PySpark), Scala, and Java
    • Spark ML module
    • Dev tools: Spark Shell, Jupyter notebooks
  • R statistical programming language
  • Deep Learning:
    • TensorFlow with its high-level Python API called Keras; PyTorch

1.5 The Data-Related Roles

  • Data-driven organizations establish the following three data-related roles which are highly interconnected:
    • Data Scientist
      • Someone who uses existing data to train machine learning (ML) algorithms to make predictions and/or generalize (take actions) on new (never seen before) data; practitioners in the field apply scientific experimentation techniques when working with data, trying out different ML models
    • Data Analyst
      • Someone who uses traditional business intelligence (BI) tools to understand, describe, categorize, and report on the existing data
    • Data Engineer
      • Someone whose activities mostly fall under the category of ETL (Extract, Transform, and Load) processes; these are carried out in support of the above two roles and their data needs

1.6 Data Scientists at Work

  • Jeff Hammerbacher, who built the “Data” team at Facebook, described the work done by their data science group as follows:
    • “… on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analysis to other members of the organization.”


Data analysts and data scientists will do themselves and their organizations a big favor by learning basic data engineering skills.

As Maxime Beauchemin wrote in his article [ http://bit.ly/DATENG2019 ]:

“I joined Facebook in 2011 as a business intelligence engineer. By the time I left in 2013, I was a data engineer.

I wasn’t promoted or assigned to this new role. Instead, Facebook came to realize that the work we were doing transcended classic business intelligence. The role we’d created for ourselves was a new discipline entirely.

My team was at forefront of this transformation. We were developing new skills, new ways of doing things, new tools, and — more often than not — turning our backs to traditional methods.

We were pioneers. We were data engineers!”

1.7 Examples of Data Science Projects

  • Build correlation models based on user requests/searches/product reviews (or any other data collected from users) to predict users’ choices
  • Engage user data in a feedback loop in which it contributes to improving the company’s products and services
  • Develop a new customer segmentation model for the marketing department
  • Recommendation systems (to facilitate cross-selling)
  • Sentiment analysis
  • Fraud detection

1.8 The Concept of a Data Product

  • One of the facets of data science as a discipline is to identify the data aspect of user activities
  • In some cases, a separate data product needs to be created that would help gain insight into user activities
  • An early data product on the Web was the CDDB database (http://en.wikipedia.org/wiki/CDDB) built by Gracenote for the task of identifying CDs
    • Problem: The audio CD format does not include metadata about the CD (the disc or track titles, performing artists, etc.)
    • Solution: A digital “fingerprint” of a CD is created by performing calculations on the CD’s track lengths which then can be used as an index to the CD metadata stored in the CDDB online database
  • Now, with this data product in place, a whole range of usage/business analytics can be performed using it

1.9 Applied Data Science at Google

  • Google’s PageRank algorithm was among the first to rank websites in their search engine results based on the number and quality of links pointing to a page
    • Google built their infrastructure around this concept
  • During the Swine Flu epidemic of 2009, Google used their search data to predict flu trends around the world
    • Google identified a correlation between how many people search for flu-related topics and how many people actually have flu symptoms

1.10 Data Science and ML Terminology: Features and Observations

  • In data science, machine learning (ML), and statistics, features are variables (like the year a house was built, number of rooms in a house, presence of a pool, etc.) that are used in making predictions (e.g. the price of the house); they are also called predictors or independent variables
    • A feature is similar to a relational table’s column (entity attribute, or property)
    • Features are the inputs for an ML model
  • The value that you predict using features is referred to as the response, outcome, predicted variable, or dependent variable
  • An observation is a data point, a single recorded instance of a phenomenon in a problem domain, a.k.a. a sample or an example
    • An observation is like a table’s row or record


For more terminology used in data science and ML, visit https://ml-cheatsheet.readthedocs.io/en/latest/glossary.html#glossary-instance


1.11 Terminology: Labels and Ground Truth

  • A label is a type/class of object that we assign to an observation
    • Observations (examples) can be labeled or unlabeled
      • Labeled observations are mostly used in classification and are referred to as the “ground truth”
      • Unlabeled observations occur when we have no recourse to labels; in so-called unsupervised ML, a machine learning algorithm labels those observations using some sort of data grouping/clustering mechanism
  • Labels are also used in linear regression models to denote the numeric values we are trying to predict (e.g. the global temperature in the year 2051)

1.12 Label Examples

  • Label examples:
    • Trading recommendation: Buy, Sell, Hold
    • E-mail category: Spam, Non-spam
    • Disease outbreak category: Outbreak, Endemic, Epidemic, Pandemic
    • House sale price: A numeric value
  • Labels are usually encoded with some numeric values, e.g. the trading recommendations Buy, Sell, and Hold could be encoded as 0, 1, and 2

1.13 Terminology: Continuous and Categorical Features

  • Features can be of two types:
    • Continuous: something that can be physically or theoretically measured in numeric values, e.g. blood pressure, size of a black hole, plane speed, humidity, etc.
    • Categorical: discrete, enumerated values like hurricane category, day of the week, car type, etc.; this feature type, in turn, is divided into nominal and ordinal features:
      • Nominal categories have no ordering, e.g. card suits: hearts, diamonds, spades, and clubs (ordering may be card game-specific)
      • Ordinal categories imply some sort of ordering, e.g. the ranks in each suit of playing cards: Ace, 2, 3, 4, …, J, Q, K


Feature types visually

1.14 Encoding Categorical Features using One-Hot Encoding Scheme

  • When dealing with categorical features (nominal or ordinal), e.g. trading recommendations made: buy, sell, or hold, you need to have a way to encode the input examples for further processing
  • The common encoding technique is the “One-Hot” scheme, which works as follows (an example follows in the next section):
    • Introduce as many variables/features as there are distinct values in the categorical feature; the variables are usually named after the <feature name>_<categorical value>, e.g. trade_buy, trade_sell, and trade_hold
    • Initialize the variables using the one-hot encoding scheme:
      • Assign 1 to the variable if the observation has the matching category name; 0 otherwise
  • In the one-hot transformation, you, essentially, end up with as many new features as there are levels in that categorical variable

1.15 Example of ‘One-Hot’ Encoding Scheme

  • In our go-to trading actions example, the one-hot encoding will create these three new variables/features (that may be named trade_buy, trade_sell, and trade_hold) holding either 0 or 1:
trade_buy  trade_sell  trade_hold
1          0           0           # the mapping of 'buy'
0          0           1           # the mapping of 'hold'
1          0           0           # the mapping of 'buy' again
0          1           0           # the mapping of 'sell'
  • As you can see, the new three-feature set forms a sparse matrix
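
The encoding above can be reproduced with the pandas get_dummies function. This is a minimal sketch, assuming pandas is installed; the DataFrame name trades and the column name trade are illustrative choices made so that the generated columns match the trade_buy, trade_sell, and trade_hold names used above.

```python
import pandas as pd

# Four observations of the categorical "trade" feature, in the order shown above.
trades = pd.DataFrame({"trade": ["buy", "hold", "buy", "sell"]})

# get_dummies creates one 0/1 column per distinct value, named <feature>_<value>.
encoded = pd.get_dummies(trades, columns=["trade"], dtype=int)

# The generated columns are sorted alphabetically; reorder them to match the table.
encoded = encoded[["trade_buy", "trade_sell", "trade_hold"]]
print(encoded)
```

Each row contains exactly one 1, which is where the "one-hot" name comes from.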

1.16 Gartner’s Magic Quadrant for Data Science and Machine Learning Platforms (a Labeling Example)


Gartner Research developed a process for assessing the ability of companies in a specific industry to innovate and deliver value to their customers. Companies are labeled according to their ability to execute and the completeness of their strategy and vision using four categories, one per square of the quadrant: Niche Players, Visionaries, Challengers, and Leaders. In addition to this square labeling, Gartner conveys further information through the spatial positioning of individual companies within the magic quadrant.

1.17 Machine Learning in a Nutshell

  • At the core of many ML algorithms lies the concept of distance between observations (data points), which helps measure the degree of proximity/affinity/similarity between them
  • During the model training phase, such ML algorithms infer a grouping/classification decision boundary (a function) that discriminates between objects using the distances between observations
  • Regression ML models predict a dependent (response) variable based on the historical/known values of at least one independent (explanatory/predictor) variable

1.18 Common Distance Metrics

  • For continuous numeric variables, the Minkowski distance is used; for two n-dimensional points x and y it has this generic form:
    $$ D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} $$
  • The Minkowski distance has three special cases:
    • For p=1, the distance is known as the Manhattan distance (a.k.a. the L1 norm)
    • For p=2, the distance is known as the Euclidean distance (a.k.a. the L2 norm)
    • When p → +infinity, the distance is known as the Chebyshev distance
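  • For comparing equal-length strings or vectors of categorical values (e.g. in some text classification scenarios), a commonly used distance metric is the Hamming distance, which counts the positions at which the two differ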


Calculating an L2 norm of a two-feature vector defined as [3, 4] — essentially, a hypotenuse of a right-angled triangle with sides 3 and 4 (the Pythagorean theorem) — using the NumPy API (two options):

The “hard” way:

import numpy as np
x = np.array([3,4])
np.sqrt(np.sum(x * x))  # 5.0 

The “easy” way:

from numpy.linalg import norm
norm(x, ord=2)    # 5.0 
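
The three special cases of the Minkowski distance listed above can be compared side by side. The helper below is a minimal sketch written for illustration (the function name minkowski is hypothetical and not part of NumPy); it measures the distance between the origin and the point [3, 4] used in the example above.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance between points a and b; p=np.inf gives the Chebyshev case."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    if np.isinf(p):
        # The limit p -> +infinity reduces to the largest coordinate difference.
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

a, b = [0, 0], [3, 4]
manhattan = minkowski(a, b, 1)        # |3| + |4|
euclidean = minkowski(a, b, 2)        # sqrt(3^2 + 4^2)
chebyshev = minkowski(a, b, np.inf)   # max(|3|, |4|)
print(manhattan, euclidean, chebyshev)
```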

1.19 The Euclidean Distance

  • The most commonly used distance in ML for continuous numeric variables is the Euclidean distance
  • In Cartesian coordinates, if we have two data points in Euclidean n-space, p and q, the distance from p to q (or from q to p) is given by the Pythagorean formula:
    $$ d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} $$

1.20 Decision Boundary Examples (Object Classification)

We have a two-feature dataset (plotted along the X and Y coordinates) of objects of two classes (depicted as red circles and golden triangles).

Adapted from https://www.semanticscholar.org/

1.21 What is a Model?

  • In data science and ML, a model is a formula, an algorithm, or a prediction function that establishes a relationship between
    • features (predictors)
      • that act as the model’s input and
    • labels (the output/predicted variable)
      • that act as the model’s output
  • A model is trained to predict (make an inference of) the labels or predict continuous values

1.22 Training a Model to Make Predictions

  • There are two major life-cycle phases of an ML model:
    • Model training (fitting)
      • You train or let your model learn on labeled observations (examples) fed into the model
      • During training, the model seeks a set of weights — variable coefficients — (and biases, if applicable) that minimize loss
        • Loss is a quantitative measure of error, e.g. the mean squared error
    • Inference (predicting)
      • Here you use your trained model to calculate/predict the labels of unlabeled observations (examples) or numeric values
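
The two phases can be sketched with scikit-learn's LinearRegression. The data below is made up for illustration (a single feature, the number of rooms, and a price label that follows price = 50 * rooms + 60 exactly), so the fitted weight and bias are easy to check by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Made-up labeled observations: one feature (rooms) and a numeric label (price).
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([110.0, 160.0, 210.0, 260.0])

# Training (fitting): find the weight and bias that minimize the squared error.
model = LinearRegression().fit(X_train, y_train)
weight, bias = model.coef_[0], model.intercept_

# Loss on the training data, measured as the mean squared error.
train_loss = mean_squared_error(y_train, model.predict(X_train))

# Inference (predicting): apply the trained model to an unseen observation.
X_new = np.array([[5.0]])
predicted = model.predict(X_new)[0]
print(weight, bias, train_loss, predicted)
```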

1.23 Types of Machine Learning

  • There are three main types of machine learning (ML):
    • unsupervised learning
    • supervised learning, and
    • reinforcement learning
  • In this course, we will be dealing only with the first two types: unsupervised and supervised learning
  • FYI: The goal of reinforcement learning is to instruct computer-based algorithms to select actions that maximize a domain-specific gain or minimize a cost (which, essentially, emulates the way humans learn)

1.24 Supervised vs Unsupervised Machine Learning

  • Supervised learning (SL) defines a target variable that needs to be predicted/estimated by applying an SL algorithm using predictor (independent) variables (features)
    • SL algorithms are built on top of mathematical formulas with predictive capacity
    • SL uses labeled examples
    • Classification and regression are examples of SL algorithms
  • Unsupervised learning (UL) is the opposite of SL: UL does not have the concept of a target value that needs to be found or estimated; rather, a UL algorithm, for example, can deal with the task of grouping (forming a cluster of) similar items together based on some automatically defined or discovered criteria of data elements’ affinity (automatic classification technique)
    • UL uses unlabeled examples
    • In essence, UL attempts to extract patterns without much human intervention


Some classification systems are referred to as expert systems that are created in order to let computers take much of the technical drudgery out of data processing leaving humans with the authority, in most cases, to make the final decision.

1.25 Supervised Machine Learning Algorithms

  • Some of the more popular supervised ML algorithms are:
    • Decision Trees/Random Forest
    • k-Nearest Neighbors (kNN)
    • Naive Bayes
    • Regression (simple linear, multiple, locally weighted, etc.)
    • Support Vector Machines (SVMs)
    • Logistic Regression
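
As an illustration of one of the algorithms listed above, here is a minimal k-Nearest Neighbors sketch using scikit-learn. The six labeled points are made up; with its default settings, KNeighborsClassifier measures proximity using the Euclidean distance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up labeled observations: two features, two classes (0 and 1).
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Each new point gets the majority label of its 3 nearest training points.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

predictions = knn.predict(np.array([[1.2, 1.4], [8.7, 8.2]]))
print(predictions)
```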

1.26 Unsupervised Machine Learning Algorithms

  • Some of the more popular unsupervised ML algorithms are:
    • k-Means
    • Hierarchical clustering
    • Gaussian mixture models
    • Dimensionality reduction falls into the realm of unsupervised learning:
      • PCA, Isomap, t-SNE (2-D visualizations of high-dimensional datasets)
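
For contrast with supervised learning, the sketch below runs k-Means from scikit-learn on made-up, unlabeled points. No labels are supplied; the algorithm assigns each point to one of two clusters. The cluster numbers themselves (0 or 1) are arbitrary, so only the grouping is meaningful.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up unlabeled observations forming two visibly separate groups.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

# Ask for two clusters; random_state makes the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
print(labels)
print(kmeans.cluster_centers_)
```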

1.27 Which ML Algorithm to Choose?


The rules below may help you find your direction, but they are not set in stone.

If you are trying to find the probability of an event or predict a value based on existing historical observations, look at the supervised learning (SL) algorithms. Otherwise, refer to unsupervised learning (UL).

If you are dealing with discrete (categorical) values like TRUE:FALSE, bad:good:excellent, buy:hold:sell, etc., you need to go with the classification algorithms of SL.

If you are dealing with continuous numerical values, you need to go with the regression algorithms of SL.

If you want to let the machine categorize data into a number of groups, you need to go with the clustering algorithms of UL.

1.28 Bias-Variance (Underfitting vs Overfitting) Trade-off

  • Underfitting is a property of your model which makes it less accurate by virtue of being too generic, or biased
    • Such a model is rather simple: it fails to account for some important regularities in the training data and has low variance in its predictions
  • Overfitting is the opposite of underfitting: it makes your model too sensitive to information noise/variance in your training data
    • Usually, this property is exhibited in more complex models which try to describe your training data as closely as possible
  • A good model strikes a balance between bias and its overreaction to variance (a bias-variance balance/trade-off)
  • The bias-variance trade-off applies to classification and regression models (supervised learning)


Balancing Off the Bias-Variance Ratio

The common techniques for balancing bias against variance are dimensionality reduction, feature selection, and regularization.

Dimensionality reduction is the process of transforming the original feature set into another one with fewer features: features may be dropped or combined using some inter-feature relationships.

Examples of dimensionality reduction:

  • Compressing a video stream by reducing the number of colors and/or pixels
  • Creating a digest (executive summary) of some textual material

Regularization techniques introduce a penalty (a sort of dial knob) that can programmatically decrease high variance by increasing the model’s bias (and vice versa); generally, this leads to smoother decision boundaries and simpler ML models.

Another way to decrease variance is by getting larger training sets.

Many ML algorithms offer some configuration mechanisms (called hyperparameters) to control bias and variance.

The scikit-learn Ridge regression algorithm improves on ordinary linear regression models by introducing the alpha hyperparameter, which is a penalty on the size of the regression coefficients, e.g.:

from sklearn import linear_model
# X (feature matrix) and y (target values) are assumed to be defined already
regModel = linear_model.Ridge(alpha=0.01)  # a larger alpha means a stronger penalty
regModel.fit(X, y)

To learn more about regularization support in scikit-learn, visit http://scikit-learn.org/stable/modules/linear_model.html
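
To see the effect of the alpha penalty, the sketch below fits ordinary linear regression and two Ridge models on synthetic data (generated here purely for illustration) and compares the overall size of the learned coefficients. The specific alpha values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data: three features, a known linear relationship, and a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

ols = LinearRegression().fit(X, y)       # no penalty
weak = Ridge(alpha=0.01).fit(X, y)       # mild penalty
strong = Ridge(alpha=100.0).fit(X, y)    # heavy penalty

# The L2 norm of the coefficient vector shrinks as the penalty grows.
ols_norm = np.linalg.norm(ols.coef_)
weak_norm = np.linalg.norm(weak.coef_)
strong_norm = np.linalg.norm(strong.coef_)
print(ols_norm, weak_norm, strong_norm)
```

A heavier penalty pulls the coefficients toward zero, which trades some bias for lower variance.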

1.29 Underfitting vs Overfitting (a Regression Model Example) Visually

1.30 ML Model Evaluation

  • The quality of ML models — their predictive capability — is commonly evaluated using the following metrics:
    • Mean Squared Error (MSE)
    • Mean Absolute Error (MAE)
    • Coefficient of determination, denoted by R²
    • Confusion matrix
  • Whatever metric you are using, you need to assess the ability of your model to make accurate predictions (generalize) on new data, that is, data not seen during training, where the model can simply memorize all the signals, quirks, and patterns that exist in the data
  • Note: ML practitioners use the term performance to mean how correct a model is in making predictions


There are a number of other model evaluation metrics, e.g. the ROC (Receiver Operating Characteristic) curve, that we are not discussing here.

1.31 Mean Squared Error (MSE) and Mean Absolute Error (MAE)
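
Using the standard definitions, MSE is the average of the squared differences between the actual and predicted values, and MAE is the average of their absolute differences. The sketch below computes both by hand with NumPy on made-up values and cross-checks the results against scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual and predicted values for four observations.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred            # [1, 0, -2, -1]
mse = np.mean(errors ** 2)          # (1 + 0 + 4 + 1) / 4
mae = np.mean(np.abs(errors))       # (1 + 0 + 2 + 1) / 4

# The library functions should agree with the hand calculation.
sk_mse = mean_squared_error(y_true, y_pred)
sk_mae = mean_absolute_error(y_true, y_pred)
print(mse, mae)
```

Because the errors are squared, MSE penalizes large errors more heavily than MAE does.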


1.32 Coefficient of Determination

  • One of the most popular scoring metrics in statistics that is widely used in evaluating regression (forecasting) models
  • Denoted by R²
  • It shows the proportion of the variance in the dependent variable that is predictable from the independent variable(s)
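    • With SS_res being the sum of squared residuals and SS_tot being the total sum of squares around the mean of the observed values: $$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $$
  • R² usually falls between 0 and 1 (it can be negative when a model fits worse than always predicting the mean), where
    • 0 (or a value close to zero) indicates that the model explains none of the variability in the dependent variable, and
    • 1 (or, more practically, a value close to one) indicates that your model is a good fit and can explain most of the data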
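
A minimal sketch of the R² calculation, assuming the standard definition of one minus the ratio of the residual sum of squares to the total sum of squares. The values are made up, and the result is cross-checked against scikit-learn's r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values for four observations.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # variation the model fails to explain
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

sk_r2 = r2_score(y_true, y_pred)
print(r2)
```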


1.33 Confusion Matrix

  • A confusion matrix is used to assess the accuracy of a classification model
  • It is a square table with columns representing the predicted class and rows representing the actual class
  • The confusion matrix for the binary classifier (e.g. a predictor of the presence (yes or no) of a disease in a group of patients) would hold the following values in its cells:
    • TP – true positives, which are the cases where ill patients have been correctly classified as having the disease (these lie at the intersection of the matching predicted and actual classes)
    • TN – true negatives, which are the cases of patients correctly classified as healthy (not afflicted by the disease)
    • FP – false positives (a.k.a. “Type I error”), which indicates that some healthy patients have been wrongfully classified as being ill
    • FN – false negatives (a.k.a. “Type II error”), which is often the most undesirable case: patients having the disease were classified as healthy
  • In multi-class classifications (involving more than 2 classes), the confusion matrix simply holds the counts of correctly and incorrectly classified instances

1.34 The Binary Classification Confusion Matrix

  • The Yes class label (the first column’s caption) defines the (True or False) positive classification outcomes
  • The No class label (the second column’s caption) defines the (False or True) negative classification outcomes
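
The layout described above (rows for the actual class, columns for the predicted class, with Yes first) can be reproduced with scikit-learn's confusion_matrix. The eight actual and predicted labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions for eight patients.
y_actual    = ["yes", "yes", "yes", "no", "no",  "no", "no", "yes"]
y_predicted = ["yes", "no",  "yes", "no", "yes", "no", "no", "yes"]

# labels=["yes", "no"] puts the Yes class in the first row and first column.
cm = confusion_matrix(y_actual, y_predicted, labels=["yes", "no"])

# Row 0 is actual Yes: [TP, FN]; row 1 is actual No: [FP, TN].
(tp, fn), (fp, tn) = cm
accuracy = (tp + tn) / cm.sum()
print(cm)
print(accuracy)
```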

1.35 The Typical Machine Learning Process

1.36 A Better Algorithm or More Data?

  • If your model does not yield the expected level of accuracy, you have a choice between:
    • Tuning the model using hyperparameters
      • Hyperparameters are parameters whose values are set before the model training step
    • Another (and, hopefully, better) learning algorithm
    • More training data
      • You may want to add more features (feature engineering is important) but be aware of the Curse of Dimensionality
  • Generally, ML practitioners have this rule of thumb:
    • A dumb algorithm with enough data to feed it beats a smart algorithm that is starved of data

1.37 The Typical Data Processing Pipeline in Data Science

Adapted from http://www.infoq.com/presentations/big-data-agile-analytics

1.38 Data Discovery Phase

  • Identify data important for your data (analytics) project
    • Align the discovery phase targets with strategic business goals
    • The data scientist must work with the business to learn about the problem domain
  • Identify the source(s) and the size of datasets
  • Perform capacity planning for computing/data processing/storage needs
    • The number of computers, hardware specs, etc.
  • Note: This phase is not reflected on the Data Processing Pipeline diagram

1.39 Data Harvesting Phase

  • Acquire data from the identified source(s)
    • If necessary, ETL (Extract, Transform, and Load) activities are performed
    • Note: This phase is mapped into the Data Integration box on the Data Processing Pipeline diagram

1.40 Data Cleaning/Priming/Enhancing Phase

    • Notes:
      • Steps listed below are sometimes referred to as data wrangling or munging
      • Activities at this phase may take up a significant portion of your data science and machine learning projects and are critical to their success
    • Typical activities:
      • Sensitive data is either removed or replaced with opaque tokens (like secure hashes) that can prevent sensitive data leakage
      • Obvious outliers are trimmed off
      • Corrupt/duplicate/missing/incomplete data records identified, fixed, or removed
      • Data is aggregated/augmented/enhanced for better data quality
    • Note: This phase is mapped into the Clean Data box on the Data Processing Pipeline diagram
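
A few of the typical activities above can be sketched with pandas. Everything here is made up for illustration: the column names, the records, and the 1,000 cut-off used to trim an obvious outlier. The plain SHA-256 hash only demonstrates the idea of an opaque token; a production system would likely need a salted or keyed scheme.

```python
import hashlib
import pandas as pd

# Made-up raw records with a duplicate, a missing value, and an obvious outlier.
raw = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com", "d@x.com"],
    "age": [34, 34, None, 29, 41],
    "purchase": [20.0, 20.0, 35.0, 10_000.0, 25.0],
})

clean = raw.drop_duplicates()                   # remove duplicate records
clean = clean.dropna(subset=["age"])            # remove records with a missing age
clean = clean[clean["purchase"] <= 1_000.0]     # trim the outlier (assumed threshold)

# Replace the sensitive e-mail address with an opaque token.
clean = clean.assign(
    email=clean["email"].map(lambda e: hashlib.sha256(e.encode("utf-8")).hexdigest())
)
print(clean)
```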

1.41 Exploratory Data Analysis and Feature Selection

    • Exploratory Data Analysis (EDA) employs a variety of techniques for better understanding data (the datasets are usually used in raw format)
    • EDA methods include data visualization, descriptive statistics, and other quantitative analysis techniques
    • This phase is a place for active collaboration between data engineers and data scientists
    • Note: This phase is mapped into two elements on the Data Processing Pipeline diagram:
      • Feature Selection/Data Sampling, and
      • (partially) Modeling Data

1.42 Exploratory Data Analysis and Feature Selection Cont’d

    • EDA is usually performed before the statistical/machine learning models are built and includes such activities as:
      • Uncover underlying data structure and patterns
        • In this activity, you can, for example, reduce your “Big Data” to smaller, yet statistically significant and representative, datasets
      • Determine and engineer important variables (features)
        • This activity helps with dimensionality reduction in your datasets, which can dramatically improve such processing metrics of your models as processing time, CPU utilization, memory footprint, etc.
      • Data normalization (where the values are vastly different in scale or may affect the accuracy of ML models) to improve the computing speed and accuracy of your ML models
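
Data normalization can be sketched with two scikit-learn scalers. The two made-up features below (a floor area in the thousands and a room count in single digits) differ vastly in scale, which is the situation described above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up features on very different scales: floor area and number of rooms.
X = np.array([[1200.0, 2.0], [1800.0, 3.0], [2400.0, 4.0], [3000.0, 5.0]])

# Standardization: each column gets a mean of 0 and a standard deviation of 1.
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: each column is rescaled to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_minmax)
```

In a real project the scaler would be fitted on the training data only and then applied to the test data, so that no information from the test set leaks into training.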

1.43 ML Model Planning Phase

    • This phase includes the following activities:
      • Select a statistical/ML model that fits the data and the project objectives
      • An important activity here is to ensure that the features are independent (you need to remove strongly correlated variables and solve the multicollinearity problems, if those exist), discriminating, and informative; you can also generate new features based on the existing ones using feature engineering techniques
      • Perform a quick PoC on a data sample
        • Ideally, the business should get involved here to validate the first results so you can gain confidence that you are doing the “right” thing
        • Step back to Feature Selection/Data Sampling, if corrections are needed
    • Note: Logically, this phase spans two types of activities on the Data Processing Pipeline diagram:
      • Modeling Data and
      • Analytical Modeling

1.44 Feature Engineering

    • The process of creating/transforming predictors from the raw data is called feature engineering or feature extraction
    • Common operations here include: scaling, creating (combining) additional (synthetic) features based on the original features, dropping features that might correlate with the ones you have selected, etc.
      • Synthetic features are made of two or more raw features
      • Feature crossing (multiplication), like X1 * X2 can enable linear models to work in nonlinear problem domains
      • An example of a synthetic feature is the BMI (the Body Mass Index)


    The formula for a person’s BMI is (the person’s weight [kg])/(person’s height [m])^2

    A BMI of 25.0 or more indicates an overweight problem; the healthy range is 18.5 to 24.9.
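
The BMI formula above translates directly into a synthetic feature. In this minimal pandas sketch, the two people and the column names (weight_kg, height_m, bmi, overweight) are made up for illustration.

```python
import pandas as pd

# Made-up raw features: weight in kilograms and height in meters.
people = pd.DataFrame({"weight_kg": [70.0, 90.0], "height_m": [1.75, 1.80]})

# Synthetic feature built from two raw features: weight / height^2.
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2

# Flag derived from the 25.0 threshold mentioned above.
people["overweight"] = people["bmi"] >= 25.0
print(people)
```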

    Feature Blending (Creating Synthetic Features)

    • If variables X1 and X2 share variance (they are correlated), you may try to introduce a new feature that blends both variables using some form of relationship
    • There may be more than two variables involved
    • You need to come up with the importance of each variable and assign their weights accordingly
    • For example, you can create a new variable X3 by combining variables X1 and X2 like so:
    X3 = w1 * X1 + w2 * X2
    where w1 + w2 = 1.0 (you keep the weights scaled)

    Note: X1 and X2 should be normalized before being blended

    You may want to calculate the w1 and w2 weights using the Singular Value Decomposition (SVD) technique [ http://bit.ly/2rEu6QE]
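
A minimal sketch of the blending formula with NumPy. The feature values are made up, and the weights 0.7 and 0.3 are simply assumed for illustration rather than derived with SVD; min-max scaling is used here as one possible way to normalize the inputs first.

```python
import numpy as np

# Made-up values of two correlated features on different scales.
x1 = np.array([10.0, 20.0, 30.0, 40.0])
x2 = np.array([1.0, 3.0, 2.0, 5.0])

def min_max(v):
    """Rescale a vector to the [0, 1] range."""
    return (v - v.min()) / (v.max() - v.min())

# Assumed importance weights; they sum to 1.0 to keep the result scaled.
w1, w2 = 0.7, 0.3
x3 = w1 * min_max(x1) + w2 * min_max(x2)
print(x3)
```

Because both inputs are in the [0, 1] range and the weights sum to 1.0, the blended feature stays in that range as well.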

1.45 ML Model Building Phase

    • Once you have decided on the ML model you want to use for your data, the following activities need to happen:
      • Your dataset(s) need to be split into the training and test data sets
        • The training dataset is used to train your ML model, and
        • The test dataset (that contains the data your model has not seen before) is needed to verify the accuracy of your model and its ability to generalize (beyond the training dataset)
        • In some cases, before you do a test run, you may want to use a model validation step
      • You iterate between the model training (optionally, validating) and testing steps until you get the acceptable performance of the model
      • The business reviews the modeling results
    • Note: Logically, ML model building spans these activities on the Data Processing Pipeline diagram:
      • Data Partitioning,
      • Analytical Modeling,
      • (Nominating) Candidate Model, and
      • Modeling Validation
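
The train/test split can be sketched with scikit-learn's train_test_split. The two-class dataset is synthetic (generated here for illustration), and logistic regression stands in for whichever model you have selected; the 80/20 split is an assumed, commonly used ratio.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-feature dataset: 50 points around (0, 0) and 50 around (4, 4).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(4.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Hold out 20% of the data; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Train on the training set only, then measure accuracy on unseen test data.
model = LogisticRegression().fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```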

1.46 Capacity Planning and Resource Provisioning

    • During the Model Building Phase, capture the following run-time metrics to help with production capacity planning and resource provisioning:
      • Prediction (and potentially model re-training) time
      • Compute and storage resources required
      • New training data acquisition speed in case more data is required

1.47 Communicating the Results

    • Communicate results to the business and other non-technical and technical stakeholders
    • Document results in graphs, charts, diagrams, and other suitable visual artifacts
      • Ideally, try to tell a data-driven story
    • Avoid using data science jargon; be prepared to act as an educator

1.48 Production Roll-out

    • Upon all stakeholders’ acceptance, proceed to deployment in production
    • Monitor the running of the system to ensure it runs within established run-time parameters
      • Provide additional resources, if needed
    • Collect information for possible system enhancement
      • Log deviations of input data from expected values

1.49 Data Science Gotchas

    • Not understanding the nature of data at hand
      • Not filtering out the information “noise” in it
      • Falling prey to the GIGO (“Garbage data” In, Garbage Out) principle
    • Going too deep on the statistical side of things and/or creating unnecessarily complex models
      • Not seeing the big picture (not seeing the forest for the trees)
      • Simple models are likely to be more appropriate, effective, and “productionalizable”, even at the expense of accuracy
    • Not having open communication channels with business and other stakeholders
      • Not doing the “right” thing
      • Not validating (intermediate) results with business early in the project


    Let’s consider the following scenario: the data science team at Company X is looking around for a proof-of-concept project and decides to use k-means clustering in order to develop a new customer segmentation model for the marketing department. The team gathers a large quantity of data, and then spends a month cleansing it, vectorizing it, normalizing it, and experimenting with different choices of distance metrics and values of K. At the end of the month, they present their best clustering to the marketing department, who proceed to respond in one of two ways:

    Response #1: “Yep, that’s exactly how we think about our customers. Thank you for telling us something we already know.”

    Response #2: “No, that doesn’t really match up with how we think about our customers. You must have done something wrong– go back and try it again.” (The data science team that receives this response spends several months iterating on distance metrics and values of K until they manage to converge to Response #1.)

    Don’t let this situation happen to you: always start a new project by taking the time to understand and document how the performance of your modeling and analysis will be evaluated, and avoid situations where personal opinions matter more than business metrics. A data scientist’s time is a terrible thing to waste.

    (Source: http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/ )

1.50 Summary

    • In this tutorial, we
      • Defined what data science is
      • Compared data mining with the emerging data science discipline
      • Defined who a data scientist is
      • Described the data scientist’s typical activities
      • Introduced data science and ML terminology
      • Reviewed the typical ML process and data processing phases