This tutorial is adapated from the Web Age course Data Engineering Bootcamp Training (Using Python and PySpark).
1.1 Data Aggregation and Grouping
The pandas module offers functions that closely emulate SQL functionality for data grouping, aggregation, and filtering.
With pandas you can do the following tasks:
- Compute group summary statistics (mean, std, count, etc.)
- Compute pivot tables and cross-tabulation
Generally, these operations are referred to as the split-apply-combine cycle.
1.2 Sample Data Set
We will use the data set below to illustrate some of the common data aggregation operations in Python using pandas. The data set captures a point in time’s view of some fictitious key performance indicators (KPI) which are characterized by their values, scale, and movement. Note that we have categorical values in the Scale and Movement columns that are very good candidates to perform “group by” operations on.
KPI|Value|Scale|Movement A|390|MEDIUM|DOWN B|72|SMALL|DOWN C|284|SMALL|UP D|754|LARGE|DOWN ... U|706|LARGE|DOWN V|363|MEDIUM|DOWN W|275|SMALL|UP X|538|MEDIUM|DOWN Y|974|LARGE|UP Z|696|MEDIUM|DOWN
1.3 The pandas.core.groupby.SeriesGroupBy Object
Before you can start applying aggregate operations in pandas, you need to create a pandas.core.groupby.SeriesGroupBy object; here is how you can do this:
You specify the Group By column and the column to which you want to apply an aggregate operation; in our data set, we may want to create it like so:
groupValueByScale = df['Value'].groupby([df['Scale']])
Now you can apply a wide range of aggregate operations as function calls on the pandas.core.groupby.SeriesGroupBy object, for example:
groupValueByScale.mean() groupValueByScale.max() groupValueByScale.count() groupValueByScale.sum()
The first two commands from above are functionally equivalent to these SQL statements:
SELECT Scale, AVG(Value) FROM DF GROUP BY Scale SELECT Scale, MAX(Value) FROM DF GROUP BY Scale …
With more recent pandas versions, you can also use a more compact syntax:
df.groupby().mean() #or simply like df.groupby('Scale').mean()
Your output would look something like that:
Scale LARGE 803.285714 MEDIUM 512.714286 SMALL 168.200000 Name: Value, dtype: float64
1.4 Grouping by Two or More Columns
You can designate two or more columns as the Group By columns.
With our data set, we can do something like that:
groupValueByScaleAndMovement = df.groupby(, df])
Note: You can also use a more compact syntax in more recent pandas versions:
Now the groupValueByScaleAndMovement.mean() operation will produce this type of output:
Scale Movement LARGE DOWN 722.500000 UP 911.000000 MEDIUM DOWN 523.571429 UP 501.857143 SMALL DOWN 56.500000 UP 242.666667 Name: Value, dtype: float64
1.5 Emulating SQL’s WHERE Clause
To emulate SQL’s WHERE clause, you need to use this syntax:
dataFrame <boolean operation> Some_Value], e.g.:
df > 333]
If you issue this command instead:
df > 333
You will produce a logical vector:
0 True 1 False 2 False 3 True 4 True 5 True 6 True 7 True 8 True ...
1.6 The Pivot Tables
The pivot table is a structure that allows you to re-shape the original data frame so that categorical values from select columns become distinct columns in the pivot table. Pivot tables are very well suited for summary reports. For example, this command:
df.pivot(index = 'KPI',columns='Movement',values='Value')
will produce this output (notice that the DOWN and UP values from the Movement column have become columns in the report):
Movement DOWN UP KPI A 390.0 NaN B 72.0 NaN C NaN 284.0 D 754.0 NaN E 718.0 NaN F NaN 493.0 G NaN 483.0
You can replace the NaN values in the above output with dashes (-) using this command:
dfPivo.fillna('-', inplace =True) making it a bit less noisy.
Here is a good illustration of how Pivot reshaping works (borrowed from DataCamp’s pandas cheat sheet):
The pandas’ crosstab() function computes a frequency table.
For example, this command:
pd.crosstab(df, df )
will produce a report counting how many movements (DOWN or UP) there were within each Scale category (LARGE, MEDIUM, or SMALL):
Movement DOWN UP Scale LARGE 4 3 MEDIUM 7 7 SMALL 2 3
The above report could have been generated using this Group By command; the report’s output layout will be a bit different:
df.groupby(, df]).count() Scale Movement LARGE DOWN 4 UP 3 MEDIUM DOWN 7 UP 7 SMALL DOWN 2 UP 3 Name: Movement, dtype: int64
In this tutorial, we reviewed pandas’ data aggregation and grouping capabilities supported by:
The groupby function
The pivot tables
Cross-tabulation with the crosstab() function