This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3057-data-science-and-data-engineering-for-architects.

1.1 Why Do I Need Data Visualization?

  • The common wisdom states that:
    • Seeing is believing and a picture is worth a thousand words
  • Data visualization techniques help users understand the data, underlying trends and patterns by displaying it in a variety of graphical forms (heat maps, scatter plots, charts, etc.)
  • Data visualization is also a great vehicle for communicating analysis results to stakeholders
  • Data visualization is an indispensable activity in exploratory data analysis (EDA)
  • Business intelligence software vendors usually bundle data visualization tools into their products
  • There are a number of free tools that may offer similar capabilities in certain areas of data visualization

1.2 Data Visualization in Python

  • The two most popular data visualization libraries with Python developers are:
    • matplotlib, and
    • seaborn

Notes:

Visualization with pandas

The pandas module leverages the matplotlib module for creating an easy-to-use visualization API (https://pandas.pydata.org/pandas-docs/stable/visualization.html) against the target DataFrame – you just need to import the matplotlib.pyplot module first.

Matplotlib

  • Matplotlib [https://matplotlib.org/] is a Python graphics library for data visualization.
  • The project dates back to 2002 and offers Python developers a MATLAB-like plotting interface.
  • Depends on NumPy.
  • You can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a few lines of code.
  • Matplotlib’s main focus is 2D plotting; 3D plotting is possible with the mplot3d package.
  • The 2D and 3D desktop plotting package for Python.
  • 3D plots are supported through the mtplot3d toolkit.
  • It supports different graphics platforms and toolkits, as well as all the common vector and raster graphics formats (JPG, PNG, GIF, SVG, PDF, etc.)
  • Matplotlib can be used in Python scripts, IPython REPL, and Jupyter notebooks.

1.3 Getting Started with matplotlib

  • In your Python program or Jupyter notebook, you start by importing the matplotlib.pyplot module and aliasing it like so:
	import matplotlib.pyplot as plt
  • In Jupyter notebooks, you can instruct the graphics rendering engine to embed the generated graphs with the notebook page with this “magic” command:
%matplotlib inline
  • The generated graphics will be in-lined in your notebook and there will be no plotting window popping up as in stand-alone Python (including IPython)
  • You can now use the matplotlib.pyplot object to draw your plots using its graphics functions
    • Note: Any methods mentioned in the text below are those of the matplotlib.pyplot object
  • When done, invoke the show() method (of the matplotlib.pyplot object) to render your plot

1.4A Basic Plot

%matplotlib inline
import matplotlib.pyplot as plt
x = [i for i in range(10)]
y = [x[i] * 2 for i in x]
plt.grid()
plt.plot(x,y)
plt.xticks(x)
plt.yticks(y)
plt.show()

Notes:

Without the xticks() and yticks() methods, the plot will use some calculated values for the x and y coordinates that may not be quite user-friendly:

1.5 Scatter Plots

import random
random.seed(1011)
dots_count = 10
x = [i for i in range(dots_count)]
y = [random.randint(0,dots_count)
for i in x]
plt.grid()
plt.scatter(x,y)
plt.xticks(range(dots_count))
plt.yticks(range(dots_count + 1))
for i in range(dots_count):
plt.annotate(“#” + str(i), (x[i], y[i]))
plt.show()

Notes:

Here is the unwrapped code from the slide:

import random 
random.seed(1011)
dots_count = 10
x = [i for i in range(dots_count)]
y = [random.randint(0,dots_count) for i in x]

plt.grid()    
plt.scatter(x,y)
plt.xticks(range(dots_count))
plt.yticks(range(dots_count + 1))

for i in range(dots_count):
    plt.annotate("#" + str(i), (x[i], y[i]))

plt.show()

The y list holds the following random numbers:

[5, 4, 0, 6, 2, 10, 7, 6, 10, 7]

The plt.annotate() method annotates a single data point on the plot.

1.6 Figures

  • The figure() method call will launch the plotting window and render the image there
  • You can create multiple figures before the final call to show(), upon which all the images will be rendered in their respective plotting windows
  • You can optionally pass the function a number or a string as a parameter representing the figure coordinates to help the moving back and forth between the figures
  • An important function parameter is figsize which holds a tuple of the figure width and height in inches, e.g. plt.figure(figsize=[12,8])
    • the default figsize values are 6.4 and 4.8 inches

1.7 Saving Figures to a File

  • Use the savefig() method to save the generated figure to a file
  • Matplotlib will try to figure out the file’s format using the file’s extension
  • Supported formats are:
    • eps, jpeg, jpg, pdf, pgf, png, ps, raw, rgba, svg, svgz, tif, tiff
      • No, gif is not supported
  • Example:
    # Build a plot ... 
    # Then save it to a local file ...
    plt.savefig('img/justRedLineToX.jpeg', dpi=600)
	

1.8 Seaborn

  • The seaborn data visualization package [https://seaborn.pydata.org/] is based and operates on top of matplotlib; it is tightly integrated with pandas data structures
  • It has a number of attractive features, including:
    • A dataset-oriented API for examining relationships between multiple variables
    • Convenient views of complex datasets
    • High-level abstractions for structuring multi-plot grids
    • Concise control over matplotlib figure styling with several built-in themes

1.9 Getting Started with seaborn

  • The required imports are as follows:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
  • Optionally, you can start your data visualization session by resetting the rendering engine settings to seaborn’s default theme and color palette using this command:
sns.set()	

1.10 Histograms and KDE

  • You can render histogram plots along with the fitted kernel density estimate (KDE) line with the distplot() function, e.g.
sns.distplot (pandas_df.column_name)


1.11 Plotting Bivariate Distributions

  • In addition to plotting univariate distributions (using the distplot() function), seaborn offers a way to plot bivariate distributions using the joinplot() function:
sns.jointplot(x="col_nameA", y="col_nameB", data=DF, kind="kde");

1.12 Scatter Plots in seaborn

  • Scatter plots are rendered using the scatterplot() function, for example:
sns.scatterplot(x, y, hue=[list of color levels]);

1.13 Pair plots in seaborn





  • The pairplot() function automatically plots pairwise relationships between variables in a dataset
  • A sample output of the function is shown below
    • Note: Trying to plot too many variables (stored as columns in your DataFrame) in one go may clutter the resulting pair plot

1.14 Heatmaps

  • Heatmaps, popularized by Microsoft Excel, are supported in seaborn through its heatmap() function
  • A sample output of the function is shown below

1.15A Seaborn Scatterplot with Varying Point Sizes and Hues

Notes:

The plot on the slide is borrowed from seaborn’s documentation: https://seaborn.pydata.org/examples/scatter_bubbles.html?highlight=hue

The relevant code is shown below; note that seaborn uses a pandas DataFrame to help with data rendering aspects organization (the mpg object is a DataFrame; values for hue, size, and other named parameters are taken from mpg’s columns referenced by their column index names):

import seaborn as sns
sns.set_theme(style="white")

# Load the example mpg dataset
mpg = sns.load_dataset("mpg")

# Plot miles per gallon against horsepower with other semantics
sns.relplot(x="horsepower", y="mpg", hue="origin", size="weight",
            sizes=(40, 400), alpha=.5, palette="muted",
            height=6, data=mpg) 

1.16 Summary

  • In this tutorial, we reviewed some of the graphing capabilities of the two main data visualization packages in Python:
    • matplotlib
    • seaborn

Leave a Reply

Your email address will not be published. Required fields are marked *