Data Visualization with matplotlib and seaborn in Python

This tutorial is adapted from the Web Age course Advanced Data Analytics with Pyspark.

1.1 Data Visualization

Common wisdom holds that ‘seeing is believing’ and ‘a picture is worth a thousand words’. Data visualization techniques help users understand data and its underlying trends and patterns by displaying it in a variety of graphical forms (heat maps, scatter plots, charts, etc.). Data visualization is also a great vehicle for communicating analysis results to stakeholders, and it is an indispensable activity in exploratory data analysis (EDA). Business intelligence software vendors usually bundle data visualization tools into their products, and a number of free tools offer similar capabilities in certain areas of data visualization.

1.2 Data Visualization in Python

The three most popular data visualization libraries among Python developers are:

  • matplotlib,
  • seaborn, and
  • ggplot

seaborn is built on top of matplotlib, so you typically import matplotlib.pyplot alongside it.

1.3 Matplotlib

Matplotlib [https://matplotlib.org/] is a Python graphics library for data visualization. The project dates back to 2002 and offers Python developers a MATLAB-like plotting interface. It depends on NumPy. You can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a few lines of code. Matplotlib’s main focus is 2D plotting; 3D plotting is supported through the mplot3d toolkit. It supports different graphics platforms and toolkits, as well as the common vector and raster graphics formats (JPG, PNG, SVG, PDF, etc.). Matplotlib can be used in Python scripts, the IPython REPL, and Jupyter notebooks.

1.4 Getting Started with matplotlib

In your Python program, you start by importing the matplotlib.pyplot module and aliasing it like so:

import matplotlib.pyplot as plt

In Jupyter notebooks, you can instruct the graphics rendering engine to embed the generated graphs in the notebook page with this “magic” command:

%matplotlib inline

The generated graphics will be inlined in your notebook, and no plotting window will pop up as it does in stand-alone Python (including IPython). You can now use the functions of the matplotlib.pyplot module to draw your plots. When done, call plt.show() to render your plot. Once you close the plot window, show() discards the figure, so you cannot call plt.show() again to display the same figure. In a Jupyter notebook you are not required to call show(). To suppress the textual output that the last plotting command would otherwise print, add ‘;’ at the end of that command.
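The following is a minimal sketch of that workflow, not part of the original course material. The data (squares of 0 to 9), the labels, and the variable names are illustrative assumptions, and the Agg backend line is only there so the sketch also runs on a machine without a display.

```python
import matplotlib
# Assumption: a non-interactive backend so the sketch runs without a display;
# remove this line in a notebook or desktop session.
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Illustrative data (not from the tutorial): squares of 0..9
xs = list(range(10))
ys = [x ** 2 for x in xs]

# Draw on the current figure using pyplot's graphics functions
plt.plot(xs, ys, "bo-", label="x squared")  # blue circles joined by a line
plt.xlabel("x")
plt.ylabel("y")
plt.title("A first matplotlib plot")
plt.legend()

# Keep handles to the current figure and axes for later inspection
fig = plt.gcf()
ax = plt.gca()

# In a stand-alone interactive session you would now call:
# plt.show()
```

In a Jupyter notebook with %matplotlib inline, the plot would appear under the cell without the show() call.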

1.5 Figures

The matplotlib.pyplot.figure() function creates a new figure; in stand-alone Python, each figure is rendered in its own plotting window. You can create multiple figures before the final call to show(), upon which all the images will be rendered in their respective plotting windows. You can optionally pass the function a number or a string that identifies the figure, which helps you move back and forth between figures. An important function parameter is figsize, which holds the figure width and height in inches, e.g. plt.figure(figsize=[12,8]). The default figsize values are 6.4 and 4.8 inches.

Examples of using the figure() function in stand-alone Python

plt.figure(1) # Subsequent graphics commands will be rendered in the first plotting window

plt.subplot(211) # You can set the figure’s grid layout

plt.plot( …

plt.subplot(212)

plt.plot( …

plt.figure(2) # Now all the subsequent graphics will be

# rendered in a second window

plt.plot( …

plt.figure(1) # You can go back to figure #1

plt.show() # Two stacked-up plotting windows will be generated

Note: You can drop the figure() parameters in case you do not plan to alternate between the figures.
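As a hedged, runnable variant of the figure() walkthrough, the sketch below fills in the plot calls with made-up data. The data values, the figsize, and the variable names are assumptions chosen for illustration only.

```python
import matplotlib
# Assumption: a non-interactive backend so the sketch runs without a display;
# remove this line in a notebook or desktop session.
matplotlib.use("Agg")
import matplotlib.pyplot as plt

plt.close("all")                  # start from a clean slate
x = list(range(10))               # illustrative data

fig1 = plt.figure(1, figsize=(8, 6))   # figure #1, 8 x 6 inches
plt.subplot(211)                       # grid of 2 rows x 1 column, first cell
plt.plot(x, [2 * v for v in x])
plt.subplot(212)                       # same grid, second cell
plt.plot(x, [v ** 2 for v in x])

fig2 = plt.figure(2)                   # figure #2 becomes the current figure
plt.plot(x, x)

plt.figure(1)                          # switch back to figure #1
plt.suptitle("Figure 1: two stacked subplots")
current = plt.gcf().number             # number of the figure now in focus

# In a stand-alone interactive session, plt.show() would now open both windows.
```

The number passed to figure() acts purely as an identifier: calling plt.figure(1) a second time selects the existing figure rather than creating a new one.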

1.6 Saving Figures to a File

Use the matplotlib.pyplot.savefig() function to save the generated figure to a file. Matplotlib will try to infer the file’s format from the file’s extension. Supported formats are eps, jpeg, jpg, pdf, pgf, png, ps, raw, rgba, svg, svgz, tif, tiff.

gif is not supported.

Example:

plt.plot(range(20), 'rx')

plt.savefig('img/justRedLineToX.jpeg', dpi=600)

The destination directory must already exist; savefig() does not create it. No show() call is needed. For more details, visit: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html#matplotlib.pyplot.savefig
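A small sketch of saving the same plot in two formats is shown here. The temporary output directory, the file names, and the dpi value are hypothetical choices made for the example; the directory is created explicitly before saving.

```python
import os
import tempfile

import matplotlib
# Assumption: a non-interactive backend so the sketch runs without a display;
# remove this line in a notebook or desktop session.
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Hypothetical output location: an "img" folder inside a temporary directory
out_dir = os.path.join(tempfile.mkdtemp(), "img")
os.makedirs(out_dir, exist_ok=True)   # make sure the destination exists

plt.figure()
plt.plot(range(20), "rx")             # red 'x' markers

# The format is picked from the file extension
png_path = os.path.join(out_dir, "red_markers.png")
pdf_path = os.path.join(out_dir, "red_markers.pdf")
plt.savefig(png_path, dpi=150)        # raster output; dpi controls resolution
plt.savefig(pdf_path)                 # vector output

plt.close()                           # release the figure once it is saved
```

Saving to a vector format such as PDF or SVG tends to suit print, while PNG is a common choice for web pages.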

1.7 Seaborn

seaborn [https://seaborn.pydata.org/] is a popular data visualization and EDA library. It is based on matplotlib and is closely integrated with pandas data structures. Its attractive features include:

  • a dataset-oriented API for examining relationships between multiple variables,
  • convenient views of complex datasets,
  • high-level abstractions for structuring multi-plot grids, and
  • concise control over matplotlib figure styling with several built-in themes.

1.8 Getting Started with seaborn

The required imports are as follows (the %matplotlib inline line applies only to Jupyter notebooks):

%matplotlib inline

import matplotlib.pyplot as plt

import seaborn as sns

Optionally, you can start your data visualization session by resetting the rendering engine settings to seaborn’s default theme and color palette using this command:

sns.set()

1.9 Histograms and KDE

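You can render histogram plots along with the fitted kernel density estimate (KDE) line with the distplot() function, e.g.

sns.distplot(pandas_df.column_name)

Note: distplot() has been deprecated since seaborn 0.11; in newer versions, histplot() with kde=True draws a comparable plot.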

1.10 Plotting Bivariate Distributions

In addition to plotting univariate distributions (using the distplot() function), seaborn offers a way to plot bivariate distributions using the jointplot() function:

sns.jointplot(x="col_nameA", y="col_nameB", data=DF, kind="kde");

1.11 Scatter plots in seaborn

Scatter plots are rendered using the scatterplot() function, for example:

sns.scatterplot(x=x, y=y, hue=color_levels);

Here color_levels is a list holding one color level (category) per data point.
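A runnable sketch of a scatter plot colored by category follows. The six data points and the group labels "a" and "b" are invented for the example, and x and y are passed as keyword arguments, which recent seaborn releases expect.

```python
import matplotlib
# Assumption: a non-interactive backend so the sketch runs without a display;
# remove this line in a notebook or desktop session.
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import seaborn as sns

# Illustrative data: six points split into two groups
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
group = ["a", "a", "a", "b", "b", "b"]   # one color level per point

plt.figure()
ax = sns.scatterplot(x=x, y=y, hue=group)

# Collect the legend labels seaborn generated for the hue levels
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
```

Each distinct value in the hue list gets its own color and its own legend entry.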

1.12 Pair plots in seaborn

The pairplot() function automatically plots pairwise relationships between variables in a dataset. A sample output of the function is shown below. 

Note: Trying to plot too many variables (stored as columns in your DataFrame) in one go may clutter the resulting pair plot.

1.13 Heatmaps

Heatmaps, popularized by Microsoft Excel, are supported in seaborn through its heatmap() function.

A sample output of the function is shown below.
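As a hedged illustration of the call itself, the sketch below draws a heatmap of a correlation matrix. The four random columns, the color map, and the annotation format are assumptions chosen for the example rather than anything prescribed by the course.

```python
import matplotlib
# Assumption: a non-interactive backend so the sketch runs without a display;
# remove this line in a notebook or desktop session.
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: 100 rows, 4 hypothetical numeric columns
rng = np.random.default_rng(seed=3)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))

# Pairwise Pearson correlations between the columns
corr = df.corr()

plt.figure(figsize=(6, 5))
ax = sns.heatmap(
    corr,
    annot=True,        # write the value inside each cell
    fmt=".2f",         # two decimal places
    cmap="coolwarm",   # diverging palette suits correlations
    vmin=-1,
    vmax=1,
)
ax.set_title("Correlation heatmap")
```

Fixing vmin and vmax to -1 and 1 keeps the color scale symmetric around zero, which makes positive and negative correlations directly comparable.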

1.14 Summary

In this tutorial, we reviewed two of the main data visualization packages in Python:

  • matplotlib
  • seaborn
