Comparing Hive with Spark SQL

This tutorial is adapted from Web Age course Hadoop Programming on the Cloudera Platform.

In this tutorial, you will work through two functionally equivalent examples / demos – one written in Hive (v. 1.1) and the other written using the PySpark API for the Spark SQL module (v. 1.6) – to see the differences between the command syntax of these popular Big Data processing systems. Both have been prepared using CDH 5.13.0, which is what you are going to use in this exercise.

We will work against the same input file in both examples, running functionally equivalent query statements against that file.

Both the Hive and PySpark shells support the usual Unix command-line shortcuts for quickly navigating along a single command line:

Ctrl-a Moves the cursor to the start of the line.

Ctrl-e Moves the cursor to the end of the line.

Ctrl-k Deletes (kills) the line contents to the right of the cursor.

Ctrl-u Deletes the line contents to the left of cursor.

Part 1 – Connect to the Environment

  • Download and install Cloudera’s Quickstart VM. Follow this link for instructions.

Part 2 – Setting up the Working Environment

All the steps in this tutorial will be performed in the /home/cloudera/Works directory. Create this directory if it does not exist.

1. In a new terminal window, type in the following command:

cd ~/Works
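If the Works directory does not exist yet, create it first; a one-liner that is safe to run either way:

mkdir -p ~/Works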

Before we begin, we need to make sure that Hive-related services are up and running.

2. Enter the following commands one after another:

sudo /etc/init.d/hive-metastore status  
sudo /etc/init.d/hive-server2 status 

If you see their status as not running, start the stopped service(s) using the following commands:

sudo /etc/init.d/hive-metastore start 
sudo /etc/init.d/hive-server2 start 

3. Enter the following command:

hive --version

You should see the following output:

Hive 1.1.0-cdh5.13.0
Subversion file:///data/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hive-1.1.0-cdh5.13.0 -r Unknown
Compiled by jenkins on Wed Oct 4 11:06:55 PDT 2017
From source with checksum 4c9678e964cc1d15a0190a0a1867a837

4. Enter the following command to download the input file:

wget 'http://bit.ly/36fGR32'  -O files.csv

The file is in the CSV format with a header row; it has the following comma-delimited fields:

1. File name
2. File size 
3. Month of creation 
4. Day of creation

And the first four rows of the file are:

FNAME,FSIZE,MONTH,DAY
a2p,112200,Feb,21
abrt-action-analyze-backtrace,13896,Feb,22
abrt-action-analyze-c,12312,Feb,22

5. Enter the following commands to put the file on HDFS:

hadoop fs -mkdir hive_demo
hadoop fs -put files.csv hive_demo/

The file is now in the hive_demo directory on HDFS – that’s where we are going to load it from when working with both Hive and Spark.
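You can optionally verify the upload by listing the directory:

hadoop fs -ls hive_demo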

Part 3 – The Hive Example / Demo

We will use the hive tool to start the interactive Hive shell (REPL) rather than the now recommended beeline tool, which is a bit more ceremonial; here we are going to ignore Beeline's client-server architecture advantages in favor of the simplicity of hive.

1. Enter the following command to start the Hive REPL:

hive 

Note: You start the Beeline shell in embedded mode by running this command:

beeline -u jdbc:hive2://

To connect to a remote Hive Server (that is the primary feature of Beeline), use this command:

beeline -n hive -u jdbc:hive2://<hive server IP>:<port>

e.g. beeline -n hive -u jdbc:hive2://quickstart:10000

Now, let’s create a table using the Hive Data Definition Language (DDL).

2. Enter the following command:

CREATE EXTERNAL TABLE xfiles
(fileName STRING,fileSize INT,month STRING,day INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/cloudera/hive_demo/'
    tblproperties ("skip.header.line.count"="1");

Note that we have created it as an external Hive table: when an external table is dropped, Hive removes only the table metadata and leaves the underlying files in /user/cloudera/hive_demo/ intact.

3. Enter the following command to print the basic table schema info:

describe xfiles;

You should see the following output:

filename            	string              	                    
filesize            	int                 	                    
month               	string              	                    
day                 	int 

4. Enter the following command to print the extended table metadata:

desc formatted xfiles;

You should see the following output:

# col_name            	data_type           	comment             
	 	 
filename            	string              	                    
filesize            	int                 	                    
month               	string              	                    
day                 	int                 	                    
	 	 
# Detailed Table Information	 	 
Database:           	default             	 
Owner:              	cloudera            	 
CreateTime:         	Mon Feb 24 07:35:26 PST 2020	 
LastAccessTime:     	UNKNOWN             	 
Protect Mode:       	None                	 
Retention:          	0                   	 
Location:           	hdfs://quickstart.cloudera:8020/user/cloudera/hive_demo 
Table Type:         	EXTERNAL_TABLE      	 
Table Parameters:	 	 
	EXTERNAL            	TRUE                
	skip.header.line.count	1                   
	transient_lastDdlTime	1582558526          
	 	 
# Storage Information	 	 
SerDe Library:      	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	 
InputFormat:        	org.apache.hadoop.mapred.TextInputFormat	 
OutputFormat:       	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	 
Compressed:         	No                  	 
Num Buckets:        	-1                  	 
Bucket Columns:     	[]                  	 
Sort Columns:       	[]                  	 
Storage Desc Params:	 	 
	field.delim         	,                   
	serialization.format	,   

5. Enter the following command to fetch the first ten rows of the table:

select * from xfiles limit 10;

6. Enter the following basic SQL command:

select month, count(*) from xfiles group by month;

You should see the following output:

Apr	305
Aug	93
Dec	103
Feb	279
Jan	5
Jul	63
Jun	129
Mar	2
May	5
Nov	221
Oct	9
Sep	66

7. Enter the following more advanced SQL command:

select month, count(*) as fc from xfiles group by month having fc > 100;

You should see the following output:

Apr	305
Dec	103
Feb	279
Jun	129
Nov	221

8. Enter the following Hive DDL and DML commands (we are creating and populating a table in the PARQUET columnar storage format):

CREATE TABLE xfiles_prq LIKE xfiles STORED AS PARQUET;
INSERT OVERWRITE TABLE xfiles_prq SELECT * FROM xfiles;
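As an optional sanity check, you can confirm that the new table has been populated; the row count should match the number of data rows in files.csv:

select count(*) from xfiles_prq;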

9. Enter the following command:

desc formatted xfiles_prq;

In the output of the above command, locate the location of the xfiles_prq managed Hive table; it should be:

hdfs://quickstart.cloudera:8020/user/hive/warehouse/xfiles_prq

It is a directory that contains a file in the PARQUET format:

/user/hive/warehouse/xfiles_prq/000000_0
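You can list that directory from a regular terminal window to confirm, for example:

hadoop fs -ls /user/hive/warehouse/xfiles_prq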

Part 4 – The Spark Example / Demo

1. Start a new terminal and navigate to ~/Works

2. Start a new PySpark session

Notice that PySpark v.1.6 uses Python v.2.6.6.
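On the Quickstart VM the PySpark shell is normally launched with the pyspark command; the shell creates the sc (SparkContext) and sqlContext objects used below for you:

pyspark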

3. Enter the following command:

xfiles = sqlContext.read.load('/user/hive/warehouse/xfiles_prq/')

We are reading the file in the PARQUET format, which is the default for the load method.
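If you prefer to make the format explicit, an equivalent call (just a variation; the rest of the tutorial keeps using load) is:

xfiles = sqlContext.read.parquet('/user/hive/warehouse/xfiles_prq/')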

4. Enter the following command:

type(xfiles)

You should see the following output:

<class 'pyspark.sql.dataframe.DataFrame'>

5. Enter the following command to print the xfiles DataFrame's schema (it is somewhat equivalent to Hive's describe xfiles command):

xfiles.printSchema()

You should see the following output (notice that Spark accurately recreated the schema of the file; where do you think it got the clues?):

root
 |-- filename: string (nullable = true)
 |-- filesize: integer (nullable = true)
 |-- month: string (nullable = true)
 |-- day: integer (nullable = true)

Now we will repeat all the queries from the Hive shell lab part.

In PySpark you have two primary options for querying structured data:

  • using the sqlContext’s sql method, or

  • using the Spark DataFrame API

We will illustrate both.

To use the sqlContext’s sql method, you need to first register the DataFrame as a temp table that is associated with the active sqlContext; this temp table’s lifetime is tied to that of your current Spark session.

6. Enter the following command:

xfiles.registerTempTable('xfiles_tmp')

7. Enter the following command to fetch the first ten rows of the table:

sqlContext.sql('select * from xfiles_tmp').show(10)

You should see the following output (abridged for space below):

+--------------------+--------+-----+---+
|            filename|filesize|month|day|
+--------------------+--------+-----+---+
|                 a2p|  112200|  Feb| 21|
|abrt-action-analy...|   13896|  Feb| 22|
|abrt-action-analy...|   12312|  Feb| 22|
|abrt-action-analy...|    6676|  Feb| 22|
|abrt-action-analy...|   10720|  Feb| 22|
|abrt-action-analy...|   11016|  Feb| 22|
. . . 

8. Enter the following command:

xfiles.select('*').show(10)

9. Enter the following command:

sqlContext.sql('select month, count(*) from xfiles_tmp group by month').show(10)

You should see the following output:

+-----+---+                                                                     
|month|_c1|
+-----+---+
|  Jul| 63|
|  Jun|129|
|  Apr|305|
|  Feb|279|
|  Oct|  9|
|  Nov|221|
|  Mar|  2|
|  May|  5|
|  Dec|103|
|  Aug| 93|
+-----+---+

10. Enter the following command:

xfiles.groupBy('month').count().show(10)

You should see output similar to the above:

+-----+-----+
|month|count|
+-----+-----+
|  Jul|   63|
|  Jun|  129|
|  Apr|  305|
|  Feb|  279|
|  Oct|    9|
|  Nov|  221|
|  Mar|    2|
|  May|    5|
|  Dec|  103|
|  Aug|   93|
+-----+-----+

11. Enter the following command:

sqlContext.sql('select month, count(*) as fc from xfiles_tmp group by month having fc > 100').show()

You should see the following output:

+-----+---+
|month| fc|
+-----+---+
|  Jun|129|
|  Apr|305|
|  Feb|279|
|  Nov|221|
|  Dec|103|
+-----+---+

12. Enter the following command:

xfiles.groupBy('month').count()\
.rdd.filter(lambda r: r['count'] > 100)\
.toDF().show()

You should see the output matching the one generated by the previous command.
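If you prefer to stay entirely within the DataFrame API (avoiding the round trip through the RDD API), a sketch of an equivalent command using the col() function from pyspark.sql.functions is:

from pyspark.sql.functions import col
xfiles.groupBy('month').count().filter(col('count') > 100).show()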

Alternatively, you can use this more verbose but more Pythonic type of command:

map(lambda r: (r['month'], r['count']), filter(lambda r: r['count'] > 100, xfiles.groupBy('month').count().collect()))

The output will be a bit more difficult to read, though:

[(u'Jun', 129), (u'Apr', 305), (u'Feb', 279), (u'Nov', 221), (u'Dec', 103)] 

Part 5 – Creating a Spark DataFrame from Scratch

You can create a Spark DataFrame from raw data sitting in memory and have Spark infer the schema from the data itself. The set of column types that can be inferred this way is somewhat limited (in the example below, for instance, dates are simply passed in as strings), but in most practical cases it is more than enough.

1. Enter the following commands to create a list of tuples (records) representing our data (we simulate data coming from different sources here):

txnids = [1,2,3]
items = ['A', 'B', 'C']
dates = ['2020-02-03', '2020-02-10', '2020-02-17']
prices = [123.99, 3.50, 45.67]
records = [(t,d,i,p) for t,d,i,p in zip(txnids, dates, items, prices)]

2. Enter the following command to create a list of column names:

col_names = ['txnid', 'Date', 'Item', 'Price']

3. Enter the following command to create a DataFrame:

df = sqlContext.createDataFrame(records, col_names)

If you print the DataFrame's schema and its data (e.g. with df.printSchema() and df.show()), you will see these outputs:

root
 |-- txnid: long (nullable = true)
 |-- Date: string (nullable = true)
 |-- Item: string (nullable = true)
 |-- Price: double (nullable = true)

+-----+----------+----+------+
|txnid|      Date|Item| Price|
+-----+----------+----+------+
|    1|2020-02-03|   A|123.99|
|    2|2020-02-10|   B|   3.5|
|    3|2020-02-17|   C| 45.67|
+-----+----------+----+------+

You can apply date-time related transformations using the functions from the pyspark.sql.functions module.

4. Enter the following commands:

from pyspark.sql.functions import date_format
df2 = df.select(df.txnid, df.Item, date_format('Date', 'MM-dd-yyyy').alias('usdate'))

This command is functionally equivalent to the SQL INSERT INTO … SELECT … FROM data transfer idiom. We also apply the date formatting function date_format().

The df2 DataFrame has the following data:

+-----+----+----------+
|txnid|Item|    usdate|
+-----+----+----------+
|    1|   A|02-03-2020|
|    2|   B|02-10-2020|
|    3|   C|02-17-2020|
+-----+----+----------+

 

Data Visualization with matplotlib and seaborn in Python

This tutorial is adapted from Web Age course Advanced Data Analytics with Pyspark.

1.1 Data Visualization

The common wisdom states that 'seeing is believing' and 'a picture is worth a thousand words'. Data visualization techniques help users understand data, its underlying trends and patterns, by displaying it in a variety of graphical forms (heat maps, scatter plots, charts, etc.). Data visualization is also a great vehicle for communicating analysis results to stakeholders and is an indispensable activity in exploratory data analysis (EDA). Business intelligence software vendors usually bundle data visualization tools into their products, and there are a number of free tools that offer similar capabilities in certain areas of data visualization.

1.2 Data Visualization in Python

The three most popular data visualization libraries with Python developers are:

  • matplotlib,
  • seaborn, and
  • ggplot

seaborn is built on top of matplotlib, so you still need to perform the required matplotlib imports.

1.3 Matplotlib

Matplotlib [https://matplotlib.org/] is a Python graphics library for data visualization. The project dates back to 2002 and offers Python developers a MATLAB-like plotting interface; it depends on NumPy. You can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a few lines of code. Matplotlib's main focus is 2D plotting; 3D plotting is supported through the mplot3d toolkit. It supports different graphics platforms and toolkits, as well as all the common vector and raster graphics formats (JPG, PNG, GIF, SVG, PDF, etc.). Matplotlib can be used in Python scripts, the IPython REPL, and Jupyter notebooks.

1.4 Getting Started with matplotlib

In your Python program, you start by importing the matplotlib.pyplot module and aliasing it like so:

import matplotlib.pyplot as plt

In Jupyter notebooks, you can instruct the graphics rendering engine to embed the generated graphs in the notebook page with this “magic” command:

%matplotlib inline

The generated graphics will be in-lined in your notebook and there will be no plotting window popping up as in stand-alone Python (including IPython). You can now use the matplotlib.pyplot object to draw your plots using its graphics functions; when done, invoke the plt.show() command to render your plot. The show() function discards the figure object when you close the plot window, so you cannot run plt.show() again on the same figure. In Jupyter notebooks you are not required to call show(); also, to suppress some diagnostic messages, simply add ';' at the end of the last graph rendering command.
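As a minimal illustration of the workflow just described (the data values below are made up), here is a stand-alone script that draws a simple line plot and displays it:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]          # illustrative data

plt.plot(x, y)                 # draw a line plot of y against x
plt.xlabel('x')
plt.ylabel('y')
plt.title('A minimal matplotlib example')
plt.show()                     # render the plot; with %matplotlib inline in Jupyter this is optional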

1.5 Figures

The matplotlib.pyplot.figure() method call will launch the plotting window and render the image there. You can create multiple figures before the final call to show(), upon which all the images will be rendered in their respective plotting windows. You can optionally pass the function a number or a string that identifies the figure, which helps when moving back and forth between figures. An important parameter is figsize, which holds a tuple of the figure width and height in inches, e.g. plt.figure(figsize=[12,8]); the default figsize values are 6.4 and 4.8 inches.

Examples of using the figure() function in stand-alone Python:

plt.figure(1)              # subsequent graphics commands will be rendered in the first plotting window
plt.subplot(211)           # you can set the figure's grid layout
plt.plot(range(10))        # illustrative data
plt.subplot(212)
plt.plot(range(10, 0, -1)) # illustrative data

plt.figure(2)              # now all the subsequent graphics will be rendered in a second window
plt.plot(range(5))         # illustrative data

plt.figure(1)              # you can go back to figure #1

plt.show()                 # two stacked-up plotting windows will be generated

Note: You can drop the figure() parameters in case you do not plan to alternate between the figures.

1.6 Saving Figures to a File

Use the matplotlib.pyplot.savefig() function to save the generated figure to a file. Matplotlib will try to figure out the file's format from the file's extension. Supported formats are eps, jpeg, jpg, pdf, pgf, png, ps, raw, rgba, svg, svgz, tif, and tiff; gif is not supported.

Example:

plt.plot(range(20), 'rx')
plt.savefig('img/justRedLineToX.jpeg', dpi=600)

The destination directory must exist; no show() call is needed. For more details, visit: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html#matplotlib.pyplot.savefig

1.7 Seaborn

seaborn is a popular data visualization and EDA library [https://seaborn.pydata.org/]. It is based on matplotlib and is closely integrated with pandas data structures. Among its attractive features are a dataset-oriented API for examining relationships between multiple variables, convenient views of complex datasets, high-level abstractions for structuring multi-plot grids, and concise control over matplotlib figure styling with several built-in themes.

1.8 Getting Started with seaborn

The required imports are as follows:

%matplotlib inline

import matplotlib.pyplot as plt

import seaborn as sns

Optionally, you can start your data visualization session by resetting the rendering engine settings to seaborn’s default theme and color palette using this command:

sns.set()

1.9 Histograms and KDE

You can render histogram plots along with the fitted kernel density estimate (KDE) line with the distplot() function, e.g.

sns.distplot(pandas_df.column_name)
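Here is a small, self-contained sketch; the DataFrame and its measurement column are made up for illustration:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# a pandas DataFrame with a single normally distributed column (illustrative data)
pandas_df = pd.DataFrame({'measurement': np.random.normal(loc=100, scale=15, size=1000)})

# histogram of the column with the fitted KDE line overlaid
sns.distplot(pandas_df.measurement)
plt.show()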

1.10 Plotting Bivariate Distributions

In addition to plotting univariate distributions (using the distplot() function), seaborn offers a way to plot bivariate distributions using the jointplot() function:

sns.jointplot(x="col_nameA", y="col_nameB", data=DF, kind="kde");

1.11 Scatter plots in seaborn

Scatter plots are rendered using the scatterplot() function, for example:

sns.scatterplot(x, y, hue=[list of color levels]);

1.12 Pair plots in seaborn

The pairplot() function automatically plots pairwise relationships between variables in a dataset; a minimal example is sketched below.
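A minimal sketch, using randomly generated illustrative data:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# three numeric columns of random data (illustrative)
df = pd.DataFrame(np.random.randn(200, 3), columns=['x', 'y', 'z'])

# pairwise scatter plots off the diagonal, univariate distributions on the diagonal
sns.pairplot(df)
plt.show()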

Note: Trying to plot too many variables (stored as columns in your DataFrame) in one go may clutter the resulting pair plot.

1.13 Heatmaps

Heatmaps, popularized by Microsoft Excel, are supported in seaborn through its heatmap() function.

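A small, self-contained sketch that renders a correlation heatmap (the random data and column names are made up for illustration; annot and cmap are optional styling arguments):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# four columns of random data (illustrative)
df = pd.DataFrame(np.random.rand(100, 4), columns=['a', 'b', 'c', 'd'])

# heatmap of the pairwise correlations between the columns
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()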

1.14 Summary

In this tutorial, we reviewed two main data visualization packages in Python:

  • matplotlib
  • seaborn

Python Modules and Code Reuse

This tutorial is adapted from Web Age course Introduction to Python Development.

1.1 Code Organization in Python

Several organizational terms are used when referring to Python code:
  • Module – A file with some Python code in it
  • Library – A collection of modules
  • Package – A directory that can include an individual module, a library of modules, an __init__.py file or sub-package(s).

1.2 Python Modules

A Module is a file containing Python code. Modules that are meant to be executed are sometimes called Scripts; modules meant to be imported and used by other modules may be referred to as Libraries. Module filenames have the extension .py.
Python module code can include variables, functions, classes and runnable code (code not in a function).

1.3 Python Module Example

The following file is an example of a Python module:
# my_utils.py
def get_mod_name():
    return __name__

def halved(a):
    return a / 2

def doubled(a):
    return 2 * a

def squared(a):
    return a * a

1.4 Using Modules

Modules must be imported before they can be used:
import module_name

Functions are called with the following syntax:
module_name.function_name()

Here’s a script that uses our util module:
# my_app.py
import my_utils
print(my_utils.get_mod_name())
print(my_utils.doubled(7))
print(my_utils.squared(5))

The above script outputs:
my_utils
14
25
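To reproduce this output, run the script from the directory that contains both files, for example:

python my_app.py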

1.5 Import Statements

Import statements can be used to import a module or an individual function. Imports can also provide access via an alias, as the following table shows.

Import statement               Allows use of
import mod1                    mod1.func_name()
import mod1 as m               m.func_name()
import dir1.mod1               dir1.mod1.func_name()
import dir1.mod1 as dm1        dm1.func_name()
from mod1 import func_name     func_name()

1.6 Using Modules in Multiple Projects

To use a Module in multiple projects:

  • Create the module in its own project.
  • Place a copy of the module in a location where Python can access it.
  • Import the module into the code where it is used.

1.7 How Does Python Find Modules?

Upon reading the import statement below, Python looks in various directories for my_util.py; an error is thrown if the module is not found.
import my_util

The directories Python looks in are defined in its sys.path variable. The sys.path variable can be accessed from the Python console like this:
>>> import sys
>>> print(sys.path)
['',
'/usr/lib/python36.zip',
'/usr/lib/python3.6',
'/usr/lib/python3.6/lib-dynload',
'/usr/local/lib/python3.6/dist-packages',
'/usr/lib/python3/dist-packages']

Placing your module in any of the listed directories will allow Python to find it when it processes the import statement.

1.8 Adding Directories to Sys.Path

If the directory where you are keeping your module is not one of the directories in sys.path, you have two options:

  • Copy your module file to a directory listed in sys.path.
  • Add the directory to sys.path.

You can add directories to sys.path by :

  • Setting the PYTHONPATH environment variable before running your app (or before running the shell if you are working there), e.g.:
    export PYTHONPATH="$PWD/the_module_dir"
  • Adding the following code in your script (before the module import statement):

import sys
sys.path.append('the_module_dir')
import module_name
module_name.some_function()

1.9 Packages in Python

A Python package is an organized collection of individual python files (modules).
To create a Python package:

  • Create a directory for the package.
  • Add python module files to the package directory.
  • Create a python file named __init__.py in the package directory.

Code in Python packages can be accessed using the same type of import statements that are used for modules.

1.10 Example Package

This package directory will be named: util
The util directory will hold two modules – each with two functions – plus an empty __init__.py file:
module01.py

  • halved()
  • doubled()

module02.py

  • squared()
  • cubed()

__init__.py
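The resulting directory layout looks like this:

util/
    __init__.py
    module01.py
    module02.py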

1.11 Accessing the Package Functions

The following code will access the package's functions:
# my_app.py
import util.module01
import util.module02
print(util.module01.doubled(2))
print(util.module02.squared(5))

This code is not optimal in that:

  • It requires two import statements.
  • All function names must be fully qualified.

1.12 Using the Package’s Initialization File

We can improve our code by adding the following to the __init__.py file:
from .module01 import doubled
from .module01 import halved
from .module02 import squared
from .module02 import cubed

This allows us to write our application like this:
# my_app.py
import util
print(util.doubled(2))
print(util.squared(5))

1.13 Summary

In this tutorial, we covered:
  • Code Organization in Python
  • Python Modules
  • Python Packages
  • Import Statements
  • The Package Initialization File