Comparing Hive with Spark SQL

This tutorial is adapted from Web Age course Hadoop Programming on the Cloudera Platform.

In this tutorial, you will work through two functionally equivalent examples / demos – one written in Hive (v. 1.1) and the other written using the PySpark API for the Spark SQL module (v. 1.6) – to see the differences between the command syntax of these popular Big Data processing systems. Both examples / demos were prepared using CDH 5.13.0, which is what you are going to use in this exercise.

We will work against the same input file in both examples, running functionally equivalent query statements against that file.

Both the Hive and PySpark shells support the Unix command-line shortcuts that allow you to quickly navigate along a single command line:

Ctrl-a Moves the cursor to the start of the line.

Ctrl-e Moves the cursor to the end of the line.

Ctrl-k Deletes (kills) the line contents to the right of the cursor.

Ctrl-u Deletes the line contents to the left of the cursor.

Part 1 – Connect to the Environment

  • Download and install Cloudera’s Quickstart VM. Follow this link for instructions.

Part 2 – Setting up the Working Environment

All the steps in this tutorial will be performed in the /home/cloudera/Works directory. Create this directory if it does not exist.

1. In a new terminal window, type in the following command:

cd ~/Works

Before we begin, we need to make sure that Hive-related services are up and running.

2. Enter the following commands one after another:

sudo /etc/init.d/hive-metastore status  
sudo /etc/init.d/hive-server2 status 

If you see their status as not running, start the stopped service(s) using the following commands:

sudo /etc/init.d/hive-metastore start 
sudo /etc/init.d/hive-server2 start 

3. Enter the following command:

hive --version

You should see the following output:

Hive 1.1.0-cdh5.13.0
Subversion file:///data/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hive-1.1.0-cdh5.13.0 -r Unknown
Compiled by jenkins on Wed Oct 4 11:06:55 PDT 2017
From source with checksum 4c9678e964cc1d15a0190a0a1867a837

4. Enter the following command to download the input file:

wget 'http://bit.ly/36fGR32'  -O files.csv

The file is in the CSV format with a header row; it has the following comma-delimited fields:

1. File name
2. File size 
3. Month of creation 
4. Day of creation

And the first four rows of the file are:

FNAME,FSIZE,MONTH,DAY
a2p,112200,Feb,21
abrt-action-analyze-backtrace,13896,Feb,22
abrt-action-analyze-c,12312,Feb,22

5. Enter the following commands to put the file on HDFS:

hadoop fs -mkdir hive_demo
hadoop fs  -put files.csv hive_demo/

The file is now in the hive_demo directory on HDFS – that’s where we are going to load it from when working with both Hive and Spark.

Part 3 – The Hive Example / Demo

We will use the hive tool to start the interactive Hive shell (REPL) instead of the now-recommended beeline tool, which is a bit more ceremonial; here we are going to ignore its client-server architecture advantages in favor of the simplicity of hive.

1. Enter the following command to start the Hive REPL:

hive 

Note: You start the Beeline shell in embedded mode by running this command:

beeline -u jdbc:hive2://

To connect to a remote Hive Server (that is the primary feature of Beeline), use this command:

beeline -n hive -u jdbc:hive2://<hive server IP>:<port>

e.g. beeline -n hive -u jdbc:hive2://quickstart:10000

Now, let’s create a table using the Hive Data Definition Language (DDL).

2. Enter the following command:

CREATE EXTERNAL TABLE xfiles
(fileName STRING,fileSize INT,month STRING,day INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/cloudera/hive_demo/'
    tblproperties ("skip.header.line.count"="1");

Note that we have created it as an external Hive table.

3. Enter the following command to print the basic table schema info:

describe xfiles;

You should see the following output:

filename            	string              	                    
filesize            	int                 	                    
month               	string              	                    
day                 	int 

4. Enter the following command to print the extended table metadata:

desc formatted xfiles;

You should see the following output:

# col_name            	data_type           	comment             
	 	 
filename            	string              	                    
filesize            	int                 	                    
month               	string              	                    
day                 	int                 	                    
	 	 
# Detailed Table Information	 	 
Database:           	default             	 
Owner:              	cloudera            	 
CreateTime:         	Mon Feb 24 07:35:26 PST 2020	 
LastAccessTime:     	UNKNOWN             	 
Protect Mode:       	None                	 
Retention:          	0                   	 
Location:           	hdfs://quickstart.cloudera:8020/user/cloudera/hive_demo 
Table Type:         	EXTERNAL_TABLE      	 
Table Parameters:	 	 
	EXTERNAL            	TRUE                
	skip.header.line.count	1                   
	transient_lastDdlTime	1582558526          
	 	 
# Storage Information	 	 
SerDe Library:      	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	 
InputFormat:        	org.apache.hadoop.mapred.TextInputFormat	 
OutputFormat:       	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	 
Compressed:         	No                  	 
Num Buckets:        	-1                  	 
Bucket Columns:     	[]                  	 
Sort Columns:       	[]                  	 
Storage Desc Params:	 	 
	field.delim         	,                   
	serialization.format	,   

5. Enter the following command to fetch the first ten rows of the table:

select * from xfiles limit 10;

6. Enter the following basic SQL command:

select month, count(*) from xfiles group by month;

You should see the following output:

Apr	305
Aug	93
Dec	103
Feb	279
Jan	5
Jul	63
Jun	129
Mar	2
May	5
Nov	221
Oct	9
Sep	66

7. Enter the following more advanced SQL command:

select month, count(*) as fc from xfiles group by month having fc > 100;

You should see the following output:

Apr	305
Dec	103
Feb	279
Jun	129
Nov	221

8. Enter the following Hive DDL and DML commands (we are creating and populating a table in PARQUET columnar store format):

CREATE TABLE xfiles_prq LIKE xfiles STORED AS PARQUET;
INSERT OVERWRITE TABLE xfiles_prq SELECT * FROM xfiles;

9. Enter the following command:

desc formatted xfiles_prq;

In the output of the above command, locate the location of the xfiles_prq Hive-managed table; it should be:

hdfs://quickstart.cloudera:8020/user/hive/warehouse/xfiles_prq

It is a directory that contains a file in the PARQUET format:

/user/hive/warehouse/xfiles_prq/000000_0

Part 4 – The Spark Example / Demo

1. Start a new terminal and navigate to ~/Works

2. Start a new PySpark session by entering the pyspark command

Notice that PySpark v.1.6 uses Python v.2.6.6.

3. Enter the following command:

xfiles = sqlContext.read.load('/user/hive/warehouse/xfiles_prq/')

We are reading the file in the PARQUET format, which is the default for the load method.
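If you prefer to make the input format explicit, the same DataFrame can be loaded by naming the format (a small sketch reusing the path above; both forms are available in Spark 1.6):

# equivalent calls with the format spelled out explicitly
xfiles = sqlContext.read.format('parquet').load('/user/hive/warehouse/xfiles_prq/')
xfiles = sqlContext.read.parquet('/user/hive/warehouse/xfiles_prq/')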

4. Enter the following command:

type(xfiles)

You should see the following output:

<class 'pyspark.sql.dataframe.DataFrame'>

5. Enter the following command to print the xfiles DataFrame's schema (it is somewhat equivalent to Hive's describe xfiles command):

xfiles.printSchema()

You should see the following output (notice that Spark accurately recreated the schema of the file. From where do you think it got the clues?)

root
 |-- filename: string (nullable = true)
 |-- filesize: integer (nullable = true)
 |-- month: string (nullable = true)
 |-- day: integer (nullable = true)

Now we will repeat all the queries from the Hive shell lab part.

In PySpark you have two primary options for querying structured data:

  • using the sqlContext’s sql method, or

  • using the Spark DataFrame API

We will illustrate both.

To use the sqlContext’s sql method, you need to first register the DataFrame as a temp table that is associated with the active sqlContext; this temp table’s lifetime is tied to that of your current Spark session.

6. Enter the following command:

xfiles.registerTempTable('xfiles_tmp')

7. Enter the following command to fetch the first ten rows of the table:

sqlContext.sql('select * from xfiles_tmp').show(10)

You should see the following output (abridged for space below):

+--------------------+--------+-----+---+
|            filename|filesize|month|day|
+--------------------+--------+-----+---+
|                 a2p|  112200|  Feb| 21|
|abrt-action-analy...|   13896|  Feb| 22|
|abrt-action-analy...|   12312|  Feb| 22|
|abrt-action-analy...|    6676|  Feb| 22|
|abrt-action-analy...|   10720|  Feb| 22|
|abrt-action-analy...|   11016|  Feb| 22|
. . . 

8. Enter the following command:

xfiles.select ('*').show(10)

9. Enter the following command:

sqlContext.sql('select month, count(*) from xfiles group by month').show(10)

You should see the following output:

+-----+---+                                                                     
|month|_c1|
+-----+---+
|  Jul| 63|
|  Jun|129|
|  Apr|305|
|  Feb|279|
|  Oct|  9|
|  Nov|221|
|  Mar|  2|
|  May|  5|
|  Dec|103|
|  Aug| 93|
+-----+---+

10. Enter the following command:

 xfiles.groupBy('month').count().show(10)

You should see output similar to the above:

+-----+-----+
|month|count|
+-----+-----+
|  Jul|   63|
|  Jun|  129|
|  Apr|  305|
|  Feb|  279|
|  Oct|    9|
|  Nov|  221|
|  Mar|    2|
|  May|    5|
|  Dec|  103|
|  Aug|   93|
+-----+-----+

11. Enter the following command:

sqlContext.sql('select month, count(*) as fc from xfiles group by month having fc > 100').show()

You should see the following output:

+-----+---+
|month| fc|
+-----+---+
|  Jun|129|
|  Apr|305|
|  Feb|279|
|  Nov|221|
|  Dec|103|
+-----+---+

12. Enter the following command:

xfiles.groupBy('month').count()\
.rdd.filter(lambda r: r['count'] > 100)\
.toDF().show()

You should see the output matching the one generated by the previous command.

Alternatively, you can use this more verbose but more Pythonic type of command:

map(lambda r: (r['month'], r['count']), filter(lambda r: r['count'] > 100, xfiles.groupBy('month').count().collect()))

The output will be a bit more difficult to read, though:

[(u'Jun', 129), (u'Apr', 305), (u'Feb', 279), (u'Nov', 221), (u'Dec', 103)] 
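If you prefer to stay within the DataFrame API and avoid the round trip through the RDD, the same filtering can be expressed directly on the aggregated DataFrame (a sketch; count() names the aggregated column 'count'):

from pyspark.sql.functions import col

# group, count, and keep only the months with more than 100 files
xfiles.groupBy('month').count()\
    .filter(col('count') > 100)\
    .show()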

Part 5 – Creating a Spark DataFrame from Scratch

You can create a Spark DataFrame from raw data sitting in memory and have Spark infer the schema from the data itself. The choices for column types are somewhat limited: PySpark does not support dates or booleans, but in most practical cases what PySpark supports is more than enough.

1. Enter the following commands to create a list of tuples (records) representing our data (we simulate data coming from different sources here):

txnids = [1,2,3]
items = ['A', 'B', 'C']
dates = ['2020-02-03', '2020-02-10', '2020-02-17']
prices = [123.99, 3.50, 45.67]
records = [(t,d,i,p) for t,d,i,p in zip(txnids, dates, items, prices)]

2. Enter the following command to create a list of column names:

col_names = ['txnid', 'Date', 'Item', 'Price']

3. Enter the following command to create a DataFrame:

df = sqlContext.createDataFrame(records, col_names)

If you print the DataFrame’s schema and its data, you will see these outputs:

root
 |-- txnid: long (nullable = true)
 |-- Date: string (nullable = true)
 |-- Item: string (nullable = true)
 |-- Price: double (nullable = true)

+-----+----------+----+------+
|txnid|      Date|Item| Price|
+-----+----------+----+------+
|    1|2020-02-03|   A|123.99|
|    2|2020-02-10|   B|   3.5|
|    3|2020-02-17|   C| 45.67|
+-----+----------+----+------+

You can apply date-time related transformations using the functions from the pyspark.sql.functions module.

4. Enter the following commands:

from pyspark.sql.functions import date_format
df2 = df.select (df.txnid, df.Item, date_format('Date', 'MM-dd-yyyy').alias('usdate'))

This command is functionally equivalent to the INSERT INTO … SELECT … FROM SQL data transfer idiom. We also apply the date formatting function date_format().

The df2 DataFrame has the following data:

+-----+----+----------+
|txnid|Item|    usdate|
+-----+----+----------+
|    1|   A|02-03-2020|
|    2|   B|02-10-2020|
|    3|   C|02-17-2020|
+-----+----+----------+
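If you want to persist the transformed DataFrame, you can write it back to HDFS, for example in the PARQUET format (a sketch; the output path below is hypothetical):

# write df2 out as Parquet; mode('overwrite') replaces the directory if it already exists
df2.write.mode('overwrite').parquet('/user/cloudera/hive_demo/txn_usdate_prq')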

 

Data Visualization with matplotlib and seaborn in Python

This tutorial is adapted from Web Age course Advanced Data Analytics with Pyspark.

1.1 Data Visualization

The common wisdom states that 'Seeing is believing and a picture is worth a thousand words'. Data visualization techniques help users understand the data and its underlying trends and patterns by displaying it in a variety of graphical forms (heat maps, scatter plots, charts, etc.). Data visualization is also a great vehicle for communicating analysis results to stakeholders, and it is an indispensable activity in exploratory data analysis (EDA). Business intelligence software vendors usually bundle data visualization tools into their products, and there are a number of free tools that may offer similar capabilities in certain areas of data visualization.

1.2 Data Visualization in Python

The three most popular data visualization libraries with Python developers are:

  • matplotlib,
  • seaborn, and
  • ggplot

seaborn is built on top of matplotlib, so you still need to perform the required matplotlib imports.

1.3 Matplotlib

Matplotlib [https://matplotlib.org/] is a Python graphics library for data visualization. The project dates back to 2002 and offers Python developers a MATLAB-like plotting interface. It depends on NumPy. You can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a few lines of code. Matplotlib's main focus is 2D plotting; 3D plots are supported through the mplot3d toolkit. It supports different graphics platforms and toolkits, as well as all the common vector and raster graphics formats (JPG, PNG, GIF, SVG, PDF, etc.). Matplotlib can be used in Python scripts, the IPython REPL, and Jupyter notebooks.

1.4 Getting Started with matplotlib

In your Python program, you start by importing the matplotlib.pyplot module and aliasing it like so:

import matplotlib.pyplot as plt

In Jupyter notebooks, you can instruct the graphics rendering engine to embed the generated graphs within the notebook page with this "magic" command:

%matplotlib inline

The generated graphics will be in-lined in your notebook and there will be no plotting window popping up as in stand-alone Python (including IPython). You can now use the matplotlib.pyplot object to draw your plots using its graphics functions. When done, invoke the plt.show() command to render your plot. The show() function discards the object when you close the plot window (you cannot run plt.show() again on the same object). In a Jupyter notebook you are not required to call show(); also, to suppress some diagnostic messages, simply add ';' at the end of the last graph rendering command.
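As a minimal end-to-end illustration (a sketch using made-up data), the following stand-alone script draws a single line plot and renders it:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]            # made-up sample data
y = [1, 4, 9, 16, 25]

plt.plot(x, y)                 # draw a line plot on the current figure
plt.xlabel('x')
plt.ylabel('x squared')
plt.title('A minimal matplotlib example')
plt.show()                     # render the plot (opens a window outside Jupyter)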

1.5 Figures

The matplotlib.pyplot.figure() method call will launch the plotting window and render the image there. You can create multiple figures before the final call to show(), upon which all the images will be rendered in their respective plotting windows. You can optionally pass the function a number or a string as a parameter identifying the figure, to help you move back and forth between figures. An important function parameter is figsize, which holds a tuple of the figure width and height in inches, e.g. plt.figure(figsize=[12,8]). The default figsize values are 6.4 and 4.8 inches.

Example of using the figure() function in stand-alone Python (sample data has been added so that the snippet runs as-is):

import matplotlib.pyplot as plt

plt.figure(1)                    # subsequent plotting commands target the first window
plt.subplot(211)                 # set the figure's grid layout: 2 rows, 1 column, panel 1
plt.plot([1, 2, 3], [1, 4, 9])
plt.subplot(212)                 # panel 2 of the same figure
plt.plot([1, 2, 3], [9, 4, 1])

plt.figure(2)                    # subsequent plotting commands target a second window
plt.plot([1, 2, 3], [2, 3, 5])

plt.figure(1)                    # you can go back to figure #1
plt.show()                       # two stacked-up plotting windows will be rendered

Note: You can drop the figure() parameters in case you do not plan to alternate between the figures.

1.6 Saving Figures to a File

Use the matplotlib.pyplot.savefig() function to save the generated figure to a file.  Matplotlib will try to figure out the file’s format using the file’s extension.  Supported formats are eps, jpeg, jpg, pdf, pgf, png, ps, raw, rgba, svg, svgz, tif, tiff.

gif is not supported.

Example:

plt.plot(range(20), 'rx')
plt.savefig('img/justRedLineToX.jpeg', dpi=600)

The destination directory must exist.  No show() call is needed. For more details, visit: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html#matplotlib.pyplot.savefig

1.7 Seaborn

seaborn is a popular data visualization and EDA library [https://seaborn.pydata.org/]. It is based on matplotlib and is closely integrated with pandas data structures. It has a number of attractive features: a dataset-oriented API for examining relationships between multiple variables, convenient views of complex datasets, high-level abstractions for structuring multi-plot grids, and concise control over matplotlib figure styling with several built-in themes.

1.8 Getting Started with seaborn

The required imports are as follows:

%matplotlib inline

import matplotlib.pyplot as plt

import seaborn as sns

Optionally, you can start your data visualization session by resetting the rendering engine settings to seaborn’s default theme and color palette using this command:

sns.set()

1.9 Histograms and KDE

You can render histogram plots along with the fitted kernel density estimate (KDE) line with the distplot() function, e.g.

sns.distplot(pandas_df.column_name)
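For example, assuming you have a pandas DataFrame with a numeric column (the DataFrame and its file_size column below are made up for illustration), the histogram with the fitted KDE line can be produced like this:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# made-up numeric data for illustration
pandas_df = pd.DataFrame({'file_size': np.random.lognormal(mean=9, sigma=1, size=1000)})

sns.distplot(pandas_df.file_size)   # histogram with the fitted KDE line
plt.show()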

1.10 Plotting Bivariate Distributions

In addition to plotting univariate distributions (using the distplot() function), seaborn offers a way to plot bivariate distributions using the jointplot() function:

sns.jointplot(x="col_nameA", y="col_nameB", data=DF, kind="kde");

1.11 Scatter plots in seaborn

Scatter plots are rendered using the scatterplot() function, for example:

sns.scatterplot(x, y, hue=[list of color levels]);

1.12 Pair plots in seaborn

The pairplot() function automatically plots pairwise relationships between variables in a dataset.

Note: Trying to plot too many variables (stored as columns in your DataFrame) in one go may clutter the resulting pair plot.
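A minimal sketch (using a small made-up DataFrame) looks like this:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# made-up numeric data for illustration
df = pd.DataFrame(np.random.randn(200, 3), columns=['a', 'b', 'c'])

sns.pairplot(df)   # pairwise scatter plots, with histograms on the diagonal
plt.show()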

1.13 Heatmaps

Heatmaps, popularized by Microsoft Excel, are supported in seaborn through its heatmap() function.

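A common use is visualizing a correlation matrix; a minimal sketch with made-up data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# made-up numeric data for illustration
df = pd.DataFrame(np.random.randn(200, 3), columns=['a', 'b', 'c'])

sns.heatmap(df.corr(), annot=True)   # color-coded Pearson correlation matrix
plt.show()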

1.14 Summary

In this tutorial, we reviewed two main data visualization packages in Python:

  • matplotlib
  • seaborn

Data Science and ML Algorithms with PySpark

This tutorial is adapted from Web Age course Practical Machine Learning with Apache Spark.

8.1 Types of Machine Learning

There are three main types of machine learning (ML): unsupervised learning, supervised learning, and reinforcement learning. We will be learning only about the unsupervised and supervised learning types.

8.2 Supervised vs Unsupervised Machine Learning

In essence, unsupervised learning (UL) attempts to extract patterns without much human intervention; supervised learning (SL) tries to fit rules and equations.  SL defines a target variable that needs to be predicted / estimated by applying an SL algorithm using predictor (independent) variables (features).  Classification and regression are examples of SL algorithms.  SL uses labeled examples.  SL algorithms are built on top of mathematical formulas with predictive capacity.  UL is the opposite of SL.  UL does not have a concept of a target value that needs to be found or estimated.  Rather, a UL algorithm, for example, can deal with the task of grouping (forming a cluster of) similar items together based on some automatically defined or discovered criteria of data elements’ affinity (automatic classification technique). UL uses unlabeled examples.
Notes:
Some classification systems are referred to as expert systems; they are created to let computers take much of the technical drudgery out of data processing, leaving humans with the authority, in most cases, to make the final decision.

8.3 Supervised Machine Learning Algorithms

Some of the most popular supervised ML algorithms are Decision Trees, Random Forest, k-Nearest Neighbors (kNN), Naive Bayes, Regression (simple linear, multiple, locally weighted, etc.), and Support Vector Machines (SVMs).

8.4 Classification (Supervised ML) Examples

Examples include the identification of prospective borrowers who are likely to default on their loans (based on historical observations), spam detection, and image recognition (a smiling face, a type of musical instrument).

8.5 Unsupervised Machine Learning Algorithms

Some of the most popular unsupervised ML algorithms are k-Means, hierarchical clustering, and Gaussian mixture models. Dimensionality reduction also falls into the realm of unsupervised learning, e.g. PCA and Isomap.

8.6  Clustering (Unsupervised ML) Examples

Some examples are building groups of genes with related expression patterns, customer segmentation, grouping experiment outcomes, and differentiating social network communities.

8.7 Choosing the Right Algorithm

The rules below may help you choose your direction, but they are not written in stone:

  • If you are trying to find the probability of an event or predict a value based on existing historical observations, look at the supervised learning (SL) algorithms. Otherwise, refer to unsupervised learning (UL).
  • If you are dealing with discrete (nominal) values like TRUE:FALSE, bad:good:excellent, etc., you need to go with the classification algorithms of SL.
  • If you are dealing with continuous numerical values, you need to go with the regression algorithms of SL.
  • If you want to let the machine categorize data into a number of groups, you need to go with the clustering algorithms of UL.

8.8 Terminology: Observations, Features, and Targets

 In data science, machine learning (ML), and statistics, features are variables that are used in making predictions; they are also called predictors or independent variables.  They are the inputs of your model.  A feature is similar to a relational table’s column (entity attribute, or property).  The value that you predict using features is referred to as a target, or response, or outcome, or predicted variable, or dependent variable.  This is the output of your model.  An observation is the registered value of a particular variable (feature) like temperature, an individual product’s price, etc.  An observation is like a table’s row or record. In ML, observations are often referred to as examples.

8.9 Representing Observations

Vector notation is widely used to represent observations.  The observation vector’s elements are features that are usually denoted as x1, x2, x3, … xN.  Generally, those elements may be vectors themselves, so X below may, in fact, reference a matrix – an array (vector) of arrays:
X = {x1, x2, x3, … xN}
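For instance (a sketch using NumPy with made-up numbers), three observations with four features each can be held in a 3 x 4 matrix, one row per observation:

import numpy as np

# 3 observations (rows) x 4 features (columns), made-up values
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [6.2, 3.4, 5.4, 2.3]])
print(X.shape)   # (3, 4)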

8.10 Terminology: Labels

A label is a type of object or a value that we assign to an observation or what we are trying to predict.  You can have labeled and unlabeled observations (examples); the former are mostly used in classification, the latter are found in clustering (unsupervised learning). In classification, labeled examples are used to train the model; after that the trained model is fed unlabeled observations (examples) to have the model infer (intelligently guess) the labels of the observations. 
Label examples:

  •  Software defect severity levels: Blocker, Critical, Major, Minor
  •  Trading recommendations: Buy, Sell, Hold
  •  E-mail categories: Spam, Non-spam

8.11 Terminology: Continuous and Categorical Features

Features can be of two types: continuous or categorical. Categorical features, in turn, are divided into nominal and ordinal.

8.12 Continuous Features

Continuous features represent something that can be physically or theoretically measured in numeric values, e.g. blood pressure, the size of a black hole, plane speed, humidity, etc. Regression models work with continuous features for learning and predicting, and help answer questions of this type: "Given our past sales, what is the expected sales figure for the next month?"

8.13 Categorical Features

Categorical variables are discrete, enumerated types that can be ordinal or nominal, like hurricane category, gender, security threat level, city regions, car types, etc. The nominal and ordinal categories can be illustrated using playing cards. Nominal categories are represented by suits: hearts, diamonds, spades, and clubs (generally, there is no ordering in suits, and if one exists, it is game-specific). Ordinal categories are (with some variations) represented by the ranks in each suit (Ace, 2, 3, 4, …, J, Q, K).

8.14 Common Distance Metrics

A data point is a value at the intersection of a feature (column) and an observation (row). Data science (including ML) uses the concept of a distance between data points as a measure of object similarity. For continuous numeric variables, the Minkowski distance is used, which has this generic form:

d(x, y) = ( |x1 – y1|^p + |x2 – y2|^p + … + |xn – yn|^p )^(1/p)
The Minkowski distance has three special cases:

  • For p=1, the distance is known as the Manhattan distance (a.k.a the L1 norm)
  • For p=2, the distance is known as the Euclidean distance (a.k.a. the L2 norm)
  • When p → +infinity, the distance is known as the Chebyshev distance 

In text classification scenarios, the most commonly used distance metric is Hamming distance.
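The sketch below (plain Python/NumPy with made-up points) computes the three Minkowski special cases and a Hamming distance:

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

manhattan = np.sum(np.abs(p - q))          # p=1, L1 norm  -> 7.0
euclidean = np.sqrt(np.sum((p - q) ** 2))  # p=2, L2 norm  -> 5.0
chebyshev = np.max(np.abs(p - q))          # p -> infinity -> 4.0

# Hamming distance: the number of positions at which two sequences differ
a, b = 'karolin', 'kathrin'
hamming = sum(c1 != c2 for c1, c2 in zip(a, b))   # -> 3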

8.15 The Euclidean Distance

The most commonly used distance in ML for continuous numeric variables is the Euclidean distance. In Cartesian coordinates, if we have two points p and q in Euclidean n-space, the distance from p to q (or from q to p) is given by the Pythagorean formula:

d(p, q) = sqrt( (q1 – p1)^2 + (q2 – p2)^2 + … + (qn – pn)^2 )

8.16 What is a Model

A model is a formula, an algorithm, or a prediction function that establishes a relationship between features (predictors) and labels (the output / predicted variable). A model is trained to infer labels or predict values. There are two major life-cycle phases of a model:

Model training (fitting) – You train, or let your model learn, on labeled observations (examples) fed to the model.
Inference (predicting) – Here you use your trained model to calculate / predict the labels of unlabeled observations (examples) or numeric values.

8.17 Model Evaluation

Once you have your ML model built, you can (and should) evaluate the quality of your model, or its predictive capability (i.e. how well it can do predictions). Your model should have the ability to make accurate predictions (or generalize) on new data (not seen during training). The model evaluation is specific to the estimator you use and you need to refer to the estimator’s documentation page. In this module, we will demonstrate how to evaluate various models.

8.18 The Classification Error Rate

A common measure of a classification model's accuracy is the error rate, which is the number of wrong predictions (e.g. misclassified test observations) divided by the total number of tests. In the ideal world (when you have a perfect training set and your test objects have strong affinity with some classes), your model makes predictions with no errors (the error rate is 0). On the other side of the spectrum, an error rate of 1.0 indicates a major problem with the training set and/or ambiguous test objects. Error tolerance levels depend on the type of the model.

8.19 Data Split for Training and Test Data Sets

In ML, the common practice is not to use the same data for both training and testing your model. If you must use the same data set for training and testing your ML model (e.g. due to limited availability of the source data), you need to split the data into two parts so that you have one data set for training your model, and one for testing. In this case, your trained model is validated on data it has not previously seen, which gives you an estimate of how well your model can generalize (fit) to new data. There are many variations of the splitting techniques, but the most common (and simple) one is to allocate about 70-80% of the initial data for training and 20-30% for testing.

8.20 Data Splitting in PySpark

There are a number of considerations when performing a data split operation.  Observations (records) must be selected randomly to prevent record sequence bias.  Every labeled record (for classification models) should have a fair share in both training and testing data sets. PySpark simplifies this operation through the randomSplit() method of the DataFrame object:
train_df, test_df = input_df.randomSplit([0.8,0.2])
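For reproducible splits you can also pass a seed (a sketch; input_df stands for whatever DataFrame you are splitting):

# an 80/20 split that produces the same partitioning on every run
train_df, test_df = input_df.randomSplit([0.8, 0.2], seed=42)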

8.21 Hold-Out Data

Observations that are set aside for testing and not used during training are called "holdout" data. Holdout data is used to evaluate your model's predictive capability and its ability to generalize to data other than the data the model saw during the training step.

8.22 Cross-Validation Technique

Cross-validation is a popular model validation technique used in cases when it is not practical to split the source dataset into training and test parts (e.g. due to the small size of the data set). The most commonly used variation of this technique is called k-fold cross-validation, which is an iterative process that works as follows (a PySpark sketch follows the list):

  • First you divide the source dataset into k equal folds (parts)
  • Each of the folds takes its turn to become the hold-out validation set (test data), while the remaining k-1 folds are merged to form the training data set
  • The process repeats for all k folds
  • When done, calculate the average accuracy across all k folds
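Spark ML ships a built-in implementation of this procedure; the sketch below (assuming a training DataFrame train_df with the conventional 'features' and 'label' columns, and a hypothetical parameter grid) shows 3-fold cross-validation around a logistic regression estimator:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(maxIter=10)
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)            # k = 3 folds

cv_model = cv.fit(train_df)                # the metric is averaged across the folds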

8.23 Spark ML Overview

Spark ML offers the following modalities:

  • ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
  • Featurization: feature extraction, transformation, dimensionality reduction, and selection
  • Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
  • Persistence: saving and loading algorithms, models, and Pipelines
  • Utilities: linear algebra, statistics, data handling, etc.

8.24 DataFrame-based API is the Primary Spark ML API

As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The RDD-based API in spark.mllib will still be supported with bug fixes, but no new features will be added to it, and it is expected to be removed in Spark 3.0. The primary ML API for Spark is now the DataFrame-based API in the spark.ml package. DataFrames provide a uniform and more developer-friendly API than RDDs. DataFrames support feature transformations and building ML pipelines.

8.25 Estimators, Models, and Predictors

ML systems use the terms model, estimator, and predictor in most cases interchangeably.

8.26 Descriptive Statistics

Descriptive statistics helps quantitatively describe data sets. This discipline provides numerical measures to quantify data. The typical measures of continuous variables are: mean, median, and mode (central tendency measures); the minimum and maximum values of the variables; and standard deviation (or variance), kurtosis, and skewness (measures of variability). Essentially, here you find the central tendency and spread of each continuous feature (variable). To describe categorical variables, you can use a frequency or percentage table to find the distribution of each category.
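In PySpark, a quick way to get several of these measures for a numeric column is the describe() method; a sketch, assuming the xfiles DataFrame from the earlier Spark demo is still loaded:

# count, mean, stddev, min, and max of the filesize column
xfiles.describe('filesize').show()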

8.27 Data Visualization and EDA

Descriptive statistics is supplemented by such visualization modalities and graphical forms as Histograms, Heat Maps, Charts, Box Plots, etc.
Data visualization is an activity in exploratory data analysis (EDA) aimed at discovering the underlying trends and patterns, as well as communicating the results of the analysis. Business intelligence software vendors usually bundle data visualization tools with their products.

8.29 Correlations

Correlation describes the relationship, or association, between two variables. Correlation coefficients are used to measure the degree to which two variables are associated. The sign (+ or -) of the correlation coefficient indicates the direction of the relationship, with the '+' sign indicating a positive relationship and '–' an inverse one. The absolute value of the coefficient indicates the strength of the relationship: correlation coefficients range (in absolute value) from 0, indicating no observable relationship, up to 1 for a perfectly strong relationship.
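In PySpark, the Pearson correlation between two numeric DataFrame columns can be computed with the DataFrame's stat.corr() method; a sketch with hypothetical column names:

# Pearson correlation coefficient between two numeric columns
# ('colA' and 'colB' are hypothetical names)
r = df.stat.corr('colA', 'colB')
print(r)   # a value between -1.0 and +1.0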