This tutorial is adapted from the Web Age course Data Engineering Bootcamp Training (Using Python and PySpark).

1.1 Data Aggregation and Grouping

The pandas module offers functions that closely emulate SQL functionality for data grouping, aggregation, and filtering. 

With pandas you can do the following tasks:

  • Compute group summary statistics (mean, std, count, etc.)
  • Compute pivot tables and cross-tabulation

Generally, these operations are referred to as the split-apply-combine cycle.

1.2 Sample Data Set

We will use the data set below to illustrate some of the common data aggregation operations in Python using pandas. The data set captures a point-in-time view of some fictitious key performance indicators (KPIs), which are characterized by their values, scale, and movement. Note that the categorical values in the Scale and Movement columns are very good candidates for “group by” operations.

KPI|Value|Scale|Movement
A|390|MEDIUM|DOWN
B|72|SMALL|DOWN
C|284|SMALL|UP
D|754|LARGE|DOWN
...
U|706|LARGE|DOWN
V|363|MEDIUM|DOWN
W|275|SMALL|UP
X|538|MEDIUM|DOWN
Y|974|LARGE|UP
Z|696|MEDIUM|DOWN
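If you want to follow along, you can recreate the visible rows of the sample data set as a pandas DataFrame. This is only a sketch covering the rows shown explicitly above (the full set runs from A to Z):

```python
import pandas as pd

# Recreate the rows of the sample KPI data set that are listed above;
# the elided rows (E through T) are omitted.
df = pd.DataFrame({
    'KPI':      ['A', 'B', 'C', 'D', 'U', 'V', 'W', 'X', 'Y', 'Z'],
    'Value':    [390, 72, 284, 754, 706, 363, 275, 538, 974, 696],
    'Scale':    ['MEDIUM', 'SMALL', 'SMALL', 'LARGE', 'LARGE',
                 'MEDIUM', 'SMALL', 'MEDIUM', 'LARGE', 'MEDIUM'],
    'Movement': ['DOWN', 'DOWN', 'UP', 'DOWN', 'DOWN',
                 'DOWN', 'UP', 'DOWN', 'UP', 'DOWN'],
})
print(df.head())
```

Because only a subset of the rows is reproduced, aggregate results you compute on this frame will differ slightly from the outputs shown later in the tutorial.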

1.3 The pandas.core.groupby.SeriesGroupBy Object

Before you can start applying aggregate operations in pandas, you need to create a pandas.core.groupby.SeriesGroupBy object. You do so by specifying the column you want to aggregate and the Group By column; with our data set, we can create one like so:

groupValueByScale = df['Value'].groupby([df['Scale']])

Now you can apply a wide range of aggregate operations as function calls on the pandas.core.groupby.SeriesGroupBy object, for example:

groupValueByScale.mean()
groupValueByScale.max()
groupValueByScale.count()
groupValueByScale.sum()

The first two commands from above are functionally equivalent to these SQL statements:

SELECT Scale, AVG(Value) FROM DF GROUP BY Scale
SELECT Scale, MAX(Value) FROM DF GROUP BY Scale
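A SeriesGroupBy object can also apply several aggregates in a single pass with .agg(), which is handy when you want a small summary report. A minimal, self-contained sketch using a few rows of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Value': [390, 72, 284, 754],
    'Scale': ['MEDIUM', 'SMALL', 'SMALL', 'LARGE'],
})

grouped = df['Value'].groupby(df['Scale'])
# Apply several aggregate functions in one call; the result is a
# DataFrame with one column per aggregate.
summary = grouped.agg(['mean', 'max', 'count', 'sum'])
print(summary)
```

This is roughly equivalent to running one SELECT with several aggregate expressions in SQL rather than one query per aggregate.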

With more recent pandas versions, you can also use a more compact syntax:

df['Value'].groupby(df['Scale']).mean()  # or, more compactly:
df.groupby('Scale')['Value'].mean()

Your output will look something like this:

 Scale
LARGE     803.285714
MEDIUM    512.714286
SMALL     168.200000
Name: Value, dtype: float64

1.4 Grouping by Two or More Columns

You can designate two or more columns as the Group By columns.

With our data set, we can do something like this:

groupValueByScaleAndMovement = df['Value'].groupby([df['Scale'], df['Movement']])

Note: You can also use a more compact syntax in more recent pandas versions:

df.groupby(['Scale', 'Movement'])['Value']

Now the groupValueByScaleAndMovement.mean() operation will produce this type of output:

 Scale   Movement
LARGE   DOWN        722.500000
        UP          911.000000
MEDIUM  DOWN        523.571429
        UP          501.857143
SMALL   DOWN         56.500000
        UP          242.666667
Name: Value, dtype: float64
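The two-level result above is a Series with a hierarchical index. If you prefer a table with one Group By column as rows and the other as columns, you can pivot the inner index level into columns with unstack(). A sketch using a few sample rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Value':    [390, 72, 284, 754],
    'Scale':    ['MEDIUM', 'SMALL', 'SMALL', 'LARGE'],
    'Movement': ['DOWN', 'DOWN', 'UP', 'DOWN'],
})

means = df['Value'].groupby([df['Scale'], df['Movement']]).mean()
# unstack() pivots the inner index level (Movement) into columns,
# producing a Scale-by-Movement table of mean values.
table = means.unstack()
print(table)
```

Combinations that have no rows (for example, a Scale with no UP movements in this subset) appear as NaN in the unstacked table.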

1.5 Emulating SQL’s WHERE Clause

To emulate SQL’s WHERE clause, you need to use this syntax:

dataFrame[dataFrame['Column'] <boolean operation> Some_Value], e.g.:

df[df['Value'] > 333]

Notes:

If you issue this command instead:

df['Value'] > 333

you will instead produce a logical vector (a boolean Series):

0      True
1     False
2     False
3      True
4      True
5      True
6      True
7      True
8      True
...
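The boolean Series is what does the actual row selection: passing it inside the square brackets keeps only the rows where it is True. A short sketch combining the WHERE-style filter with a group-by, using a few sample rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Value': [390, 72, 284, 754],
    'Scale': ['MEDIUM', 'SMALL', 'SMALL', 'LARGE'],
})

mask = df['Value'] > 333      # the "logical vector" (a boolean Series)
filtered = df[mask]           # keeps rows where mask is True, like SQL's WHERE
print(filtered)

# Filter first, then group -- similar to SELECT ... WHERE ... GROUP BY
print(filtered.groupby('Scale')['Value'].mean())
```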

1.6 Pivot Tables

The pivot table is a structure that allows you to re-shape the original data frame so that categorical values from select columns become distinct columns in the pivot table. Pivot tables are very well suited for summary reports. For example, this command:

dfPivot = df.pivot(index='KPI', columns='Movement', values='Value')

will produce this output (notice that the DOWN and UP values from the Movement column have become columns in the report):

     Movement  DOWN     UP
KPI
A         390.0    NaN
B          72.0    NaN
C           NaN  284.0
D         754.0    NaN
E         718.0    NaN
F           NaN  493.0
G           NaN  483.0

Notes:

You can replace the NaN values in the above output with dashes (-), making the report a bit less noisy:

dfPivot.fillna('-', inplace=True)
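The pivot-and-fill steps above can be sketched end to end on a small subset of the sample rows (note that df.pivot requires each index/column pair to be unique; for data with duplicates you would use pivot_table, which aggregates them):

```python
import pandas as pd

df = pd.DataFrame({
    'KPI':      ['A', 'B', 'C', 'D'],
    'Value':    [390, 72, 284, 754],
    'Movement': ['DOWN', 'DOWN', 'UP', 'DOWN'],
})

# Reshape: distinct Movement values become columns of the pivot table
dfPivot = df.pivot(index='KPI', columns='Movement', values='Value')
# Replace NaN (combinations with no data) with dashes for readability
dfPivot = dfPivot.fillna('-')
print(dfPivot)
```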


1.7 Cross-Tabulation

The pandas crosstab() function computes a frequency table.

For example, this command:

pd.crosstab(df['Scale'], df['Movement'])

will produce a report counting how many movements (DOWN or UP) there were within each Scale category (LARGE, MEDIUM, or SMALL):

       Movement  DOWN  UP
Scale
LARGE             4   3
MEDIUM            7   7
SMALL             2   3
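crosstab() also accepts a margins parameter that appends row and column totals, which is often useful in summary reports. A self-contained sketch on a few sample rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Scale':    ['LARGE', 'LARGE', 'MEDIUM', 'SMALL', 'SMALL'],
    'Movement': ['DOWN', 'UP', 'DOWN', 'UP', 'UP'],
})

# Frequency table of Movement within each Scale; margins=True adds
# an "All" row and column holding the totals.
ct = pd.crosstab(df['Scale'], df['Movement'], margins=True)
print(ct)
```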

Notes:

The above report could have been generated using this Group By command; the report’s output layout will be a bit different:

df['Movement'].groupby([df['Scale'], df['Movement']]).count()

Scale   Movement
LARGE   DOWN        4
        UP          3
MEDIUM  DOWN        7
        UP          7
SMALL   DOWN        2
        UP          3
Name: Movement, dtype: int64

1.8 Summary

In this tutorial, we reviewed pandas’ data aggregation and grouping capabilities, supported by:

  • The groupby function

  • Pivot tables

  • Cross-tabulation with the crosstab() function
