Functional Programming in Python

This tutorial is adapted from the Web Age course Introduction to Python Programming.

1.1 What is Functional Programming?

Functional programming reduces problems to a set of function calls.

The functions used, referred to as pure functions, follow these rules:

  • Only produce a result
  • Do not modify the parameters that were passed in
  • Do not produce any side effects
  • Always produce the same result when called with the same parameters

Another condition of functional programming is that all data be immutable.
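To make the distinction concrete, here is a minimal sketch contrasting a pure function with an impure one (the function names are our own, for illustration):

```python
# Pure: the result depends only on the inputs; nothing outside is modified
def add_tax(price, rate=0.08):
    """Return price with tax applied; the inputs are left untouched."""
    return price * (1 + rate)

# Impure: mutates the list passed in (a side effect)
def add_tax_in_place(prices, rate=0.08):
    for i in range(len(prices)):
        prices[i] = prices[i] * (1 + rate)

prices = [100, 200]
print(add_tax(prices[0]))  # returns a new value; prices is unchanged
add_tax_in_place(prices)   # prices itself is now modified
print(prices)
```

Calling add_tax() twice with the same arguments always yields the same result; calling add_tax_in_place() twice keeps compounding the tax, because the function depends on state it has already changed.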

1.2 Benefits of Functional Programming

Solutions implemented using functional programming have several advantages:

  • Pure functions are easier to test
  • Programs follow a predictable flow
  • Debugging is easier
  • Functional building blocks allow for high-level programming
  • Parallel execution is easier to implement

1.3 Functions as Data

In Python, functions can be assigned to variables and passed around like any other object. This allows for:

  • Passing functions as parameters to other functions
  • Returning functions as the result of a function

Python functions such as map(), filter() and sorted() take functions as parameters. Functions that accept other functions as parameters or return functions as their result are referred to as higher-order functions.
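A quick sketch of both directions (the helper names here are our own, for illustration):

```python
def apply_twice(func, value):
    """Higher-order: takes a function as a parameter."""
    return func(func(value))

def make_multiplier(n):
    """Higher-order: returns a new function as its result."""
    def multiply(x):
        return x * n
    return multiply

print(apply_twice(str.upper, 'abc'))  # prints: ABC
double = make_multiplier(2)
print(double(21))                     # prints: 42
```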

1.4 Using Map Function

The map() function applies a specified function to each item in an iterable data type (list, string, dictionary, set).

map( transform_function, iterable )

The transform_function passed to map takes an item from the iterable as a parameter and returns data related to the item but modified in some way.

def toUpper(item):
    return item.upper()

The following map function call converts all items to upper case characters:

states = ['Arizona', 'Georgia', 'New York', 'Texas']
map_result = map(toUpper, states)
list_result = list(map_result)
print(list_result)
# prints:['ARIZONA', 'GEORGIA', 'NEW YORK', 'TEXAS']

map() returns a map object, which is then converted to a list using the list() constructor.

1.5 Using Filter Function

The filter() function checks each item in an iterable (list, string, dictionary, set) against a condition and outputs a new iterable that only includes the items that pass the check.

filter( check_function, iterable )

The check_function passed to filter takes an item from the iterable as a parameter and returns True or False depending on whether the item conforms to the given condition.

def lengthCheck(item, maxlen=7):
    return len(item) <= maxlen

The following filter() call keeps only the states whose names are seven characters or fewer:

states = ['Arizona', 'Oklahoma', 'Utah', 'Texas']
filter_result = filter(lengthCheck, states)
list_result = list(filter_result)
print(list_result)
# prints: ['Arizona', 'Utah', 'Texas']

filter() returns a filter object, which is then converted to a list using the list() constructor.

1.6 Lambda expressions

Lambda expressions in Python:

  • Are a special syntax for creating anonymous functions
  • Are limited to a single expression
  • Automatically return the value of that expression
  • Are typically defined in-line where they will be used

Some example lambda expressions:
lambda x : x * x     # multiply parameter by itself
lambda y : y * 2     # multiply parameter by 2 
lambda z : z['name'] # get value from dict with key = 'name'

Normally a lambda expression is placed in code where it will be used:

list(map(lambda x:x*x, [1,2,3]))
# outputs [1, 4, 9]

Lambda expressions can be tested as shown here by supplying an argument in parentheses:

(lambda x : x * x)(5) # returns 25

1.7 List.sort() Using Lambda Expression

list.sort() takes two optional, keyword-only parameters:

list.sort( key, reverse )
    • key = function that returns the value to be sorted on
    • reverse = True/False to indicate the sort order

In the code below a lambda expression is used to provide the function for the ‘key’ parameter:

list1 = [{'name':'Jack', 'age':25},
{'name':'Cindy', 'age':17},
{'name':'Stuart', 'age':20}]

list1.sort(key=lambda item: item['name'],  
           reverse=False)
print(list1)
# outputs: [{'name': 'Cindy', 'age': 17}, 
# {'name': 'Jack', 'age': 25}, 
# {'name': 'Stuart', 'age': 20}]

The lambda expression returns the value of each object’s ‘name’ key to be sorted on.

Notes

Python’s sorted() function is similar to list.sort() except that you pass the list as the first parameter, and sorted() returns a new sorted list (leaving the original unchanged) which must then be assigned to a variable. Note that key and reverse must be passed as keyword arguments:

list_sorted = sorted(list1, key=key_function, reverse=False)
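For example, sorting the same list of dictionaries by age with sorted() produces a new list and leaves the original list intact:

```python
list1 = [{'name': 'Jack', 'age': 25},
         {'name': 'Cindy', 'age': 17},
         {'name': 'Stuart', 'age': 20}]

# sorted() returns a new list; list1 itself is unchanged
by_age = sorted(list1, key=lambda item: item['age'])
print([p['name'] for p in by_age])
# prints: ['Cindy', 'Stuart', 'Jack']
print(list1[0]['name'])
# prints: Jack  (original order preserved)
```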

1.8 Difference Between Simple Loops and map/filter Type Functions

Loops intended to modify an iterable’s items often do so by mutating the original iterable:

# loop over and mutate iterable
list1 = [ 'c', 'f', 'a', 'e', 'b' ]
for i in range(len(list1)):
  list1[i] = list1[i].upper()

The problems with this are:

  • The original list is no longer available as it has been mutated
  • This code would not work with immutable sequences like tuples

The equivalent code using map() fixes both of these issues:

# map creates a new list based on the original
list1 = [ 'c', 'f', 'a', 'e', 'b' ]
list2 = list(map(lambda x: x.upper(), list1))

Here is an example of the same map function being used with a tuple:

tup1 = ( 'c', 'f', 'a', 'e', 'b' )
tup2 = tuple(map(lambda x:x.upper(), tup1))

 

1.9 Additional Functions

Python includes many functions developers can use out of the box rather than have to hand-code them:

any(seq) # iterate boolean seq, returns True if at least one item is True

all(seq) # iterate boolean seq, returns True if all items are True

max(seq) # iterate sequence and return the max value

min(seq) # iterate sequence and return the min value

sum(seq) # iterate sequence and calculate the sum

len(seq) # return length of strings, # items in list, etc.

input() # get input from user

randint(0,10) # generate a random number between 0 and 10

Counter(seq) # create dictionary with freq of each seq member

Notes:

randint() is part of the random module which must be imported:
import random as r

r.randint(0,10)

Counter is part of the collections module and must be imported:

import collections as c

c.Counter('asdffg')
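The built-in sequence functions listed above can be tried out directly; for example:

```python
nums = [3, 1, 4, 1, 5]
flags = [n > 2 for n in nums]   # [True, False, True, False, True]

print(any(flags))  # True  - at least one item is True
print(all(flags))  # False - not every item is True
print(max(nums))   # 5
print(min(nums))   # 1
print(sum(nums))   # 14
print(len(nums))   # 5
```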

1.10 General Rules for Creating Functions

Since creating functions is a big part of functional programming, we will reiterate here some of the rules we learned earlier:

  • Name functions appropriately
  • Limit the function to a single responsibility
  • Include a docstring
  • Always return a value
  • Limit functions to 50 lines or less
  • Make the function ‘idempotent’ and ‘pure’ if possible
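A small example following these rules (the function is our own, for illustration):

```python
def mean_length(words):
    """Return the average length of the strings in words.

    Pure: does not modify words and has no side effects;
    single responsibility: computes one statistic.
    """
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

print(mean_length(['Texas', 'Utah']))  # prints: 4.5
```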

1.11 Summary

In this tutorial, we covered:

  • What is Functional Programming
  • Functional Programming Benefits
  • Using Map Function
  • Using Filter Function
  • Lambda Expressions
  • List.sort() using Lambda
  • Loops vs. map/filter
  • Additional Functions
  • General Rules for Creating Functions

How to do Data Grouping and Aggregation with Pandas?

This tutorial is adapted from the Web Age course Data Engineering Bootcamp Training (Using Python and PySpark).

1.1 Data Aggregation and Grouping

The pandas module offers functions that closely emulate SQL functionality for data grouping, aggregation, and filtering. 

With pandas you can do the following tasks:

  • Compute group summary statistics (mean, std, count, etc.)
  • Compute pivot tables and cross-tabulation

Generally, these operations are referred to as the split-apply-combine cycle.

1.2 Sample Data Set

We will use the data set below to illustrate some of the common data aggregation operations in Python using pandas. The data set captures a point-in-time view of some fictitious key performance indicators (KPIs) characterized by their values, scale, and movement. Note that the categorical values in the Scale and Movement columns are very good candidates to perform “group by” operations on.

KPI|Value|Scale|Movement
A|390|MEDIUM|DOWN
B|72|SMALL|DOWN
C|284|SMALL|UP
D|754|LARGE|DOWN
...
U|706|LARGE|DOWN
V|363|MEDIUM|DOWN
W|275|SMALL|UP
X|538|MEDIUM|DOWN
Y|974|LARGE|UP
Z|696|MEDIUM|DOWN
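For the examples that follow, a DataFrame can be built along these lines. Note that this sketch includes only the rows shown above, so the summary numbers later in the tutorial (which use the full A-Z set) will differ:

```python
import pandas as pd

# Subset of the sample KPI data set (middle rows elided in the original)
df = pd.DataFrame({
    'KPI':      ['A', 'B', 'C', 'D', 'U', 'V', 'W', 'X', 'Y', 'Z'],
    'Value':    [390, 72, 284, 754, 706, 363, 275, 538, 974, 696],
    'Scale':    ['MEDIUM', 'SMALL', 'SMALL', 'LARGE', 'LARGE',
                 'MEDIUM', 'SMALL', 'MEDIUM', 'LARGE', 'MEDIUM'],
    'Movement': ['DOWN', 'DOWN', 'UP', 'DOWN', 'DOWN',
                 'DOWN', 'UP', 'DOWN', 'UP', 'DOWN'],
})
print(df.head(3))
```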

1.3 The pandas.core.groupby.SeriesGroupBy Object

Before you can start applying aggregate operations in pandas, you need to create a pandas.core.groupby.SeriesGroupBy object. You do so by specifying the Group By column and the column to which you want to apply an aggregate operation; with our data set, we may create it like so:

groupValueByScale = df['Value'].groupby([df['Scale']])

Now you can apply a wide range of aggregate operations as function calls on the pandas.core.groupby.SeriesGroupBy object, for example:

groupValueByScale.mean()
groupValueByScale.max()
groupValueByScale.count()
groupValueByScale.sum()

The first two commands from above are functionally equivalent to these SQL statements:

SELECT Scale, AVG(Value) FROM DF GROUP BY Scale
SELECT Scale, MAX(Value) FROM DF GROUP BY Scale

With more recent pandas versions, you can also use a more compact syntax:

df.groupby(['Scale'])['Value'].mean() #or simply like 
df.groupby('Scale')['Value'].mean()

Your output would look something like this:

 Scale
LARGE     803.285714
MEDIUM    512.714286
SMALL     168.200000
Name: Value, dtype: float64
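pandas can also compute several aggregates in one call with .agg(); here is a sketch on a small made-up subset of the data:

```python
import pandas as pd

df = pd.DataFrame({
    'Scale': ['SMALL', 'SMALL', 'LARGE', 'LARGE'],
    'Value': [72, 284, 754, 974],
})

# One pass, three summary statistics per group
summary = df.groupby('Scale')['Value'].agg(['mean', 'max', 'count'])
print(summary)
```

This is roughly equivalent to SELECT Scale, AVG(Value), MAX(Value), COUNT(Value) FROM DF GROUP BY Scale.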

1.4 Grouping by Two or More Columns

You can designate two or more columns as the Group By columns.

With our data set, we can do something like this:

groupValueByScaleAndMovement = df['Value'].groupby([df['Scale'], df['Movement']])

Note: You can also use a more compact syntax in more recent pandas versions:

 	df.groupby(['Scale', 'Movement'])['Value']

Now the groupValueByScaleAndMovement.mean() operation will produce this type of output:

 Scale   Movement
LARGE   DOWN        722.500000
        UP          911.000000
MEDIUM  DOWN        523.571429
        UP          501.857143
SMALL   DOWN         56.500000
        UP          242.666667
Name: Value, dtype: float64
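If you prefer the two grouping levels laid out as a table rather than a hierarchical index, you can call .unstack() on the result; a sketch with a small made-up subset:

```python
import pandas as pd

df = pd.DataFrame({
    'Scale':    ['LARGE', 'LARGE', 'SMALL', 'SMALL'],
    'Movement': ['DOWN', 'UP', 'DOWN', 'UP'],
    'Value':    [722, 911, 56, 242],
})

# unstack() pivots the inner index level (Movement) into columns
table = df.groupby(['Scale', 'Movement'])['Value'].mean().unstack()
print(table)
```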

1.5 Emulating SQL’s WHERE Clause

To emulate SQL’s WHERE clause, you need to use this syntax:

dataFrame[ dataFrame['Column Name'] <boolean operation> Some_Value ], e.g.:

df[df['Value'] > 333]

Notes:

If you issue this command instead:

df['Value'] > 333

You will produce a logical vector:

0      True
1     False
2     False
3      True
4      True
5      True
6      True
7      True
8      True
...
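The logical vector (a boolean mask) can be combined with grouping to emulate SQL's WHERE ... GROUP BY; a sketch with a small made-up subset:

```python
import pandas as pd

df = pd.DataFrame({
    'Scale': ['SMALL', 'MEDIUM', 'LARGE', 'LARGE'],
    'Value': [72, 390, 754, 974],
})

# SELECT Scale, AVG(Value) FROM DF WHERE Value > 333 GROUP BY Scale
result = df[df['Value'] > 333].groupby('Scale')['Value'].mean()
print(result)
```

The SMALL row is filtered out before grouping, so it does not appear in the result at all.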

1.6 The Pivot Tables

The pivot table is a structure that allows you to re-shape the original data frame so that categorical values from select columns become distinct columns in the pivot table. Pivot tables are very well suited for summary reports. For example, this command:

df.pivot(index = 'KPI',columns='Movement',values='Value')

will produce this output (notice that the DOWN and UP values from the Movement column have become columns in the report):

Movement   DOWN     UP
KPI
A         390.0    NaN
B          72.0    NaN
C           NaN  284.0
D         754.0    NaN
E         718.0    NaN
F           NaN  493.0
G           NaN  483.0

Notes:

You can replace the NaN values in the above output with dashes (-), making the report a bit less noisy, by capturing the pivot result and calling fillna() on it:

dfPivot = df.pivot(index='KPI', columns='Movement', values='Value')
dfPivot.fillna('-', inplace=True)

 

Here is a good illustration of how Pivot reshaping works (borrowed from DataCamp’s pandas cheat sheet):
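Note also that df.pivot() requires each index/column pair to be unique; when duplicates exist, pandas' pivot_table() aggregates them instead (using the mean by default). A sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Scale':    ['LARGE', 'LARGE', 'SMALL'],
    'Movement': ['DOWN', 'DOWN', 'UP'],
    'Value':    [700, 800, 284],
})

# Two LARGE/DOWN rows: pivot() would raise, pivot_table() averages them
table = df.pivot_table(index='Scale', columns='Movement',
                       values='Value', aggfunc='mean')
print(table)
```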

1.7 Cross-Tabulation

The pandas crosstab() function computes a frequency table.

For example, this command:

pd.crosstab(df['Scale'], df['Movement'] )

will produce a report counting how many movements (DOWN or UP) there were within each Scale category (LARGE, MEDIUM, or SMALL):

Movement  DOWN  UP
Scale
LARGE        4   3
MEDIUM       7   7
SMALL        2   3

Notes:

The above report could have been generated using this Group By command; the report’s output layout will be a bit different:

df['Movement'].groupby([df['Scale'], df['Movement']]).count()
Scale   Movement
LARGE   DOWN        4
        UP          3
MEDIUM  DOWN        7
        UP          7
SMALL   DOWN        2
        UP          3
Name: Movement, dtype: int64
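crosstab() can also append row and column totals through its margins parameter; a sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Scale':    ['LARGE', 'LARGE', 'SMALL', 'SMALL', 'SMALL'],
    'Movement': ['DOWN', 'UP', 'DOWN', 'UP', 'UP'],
})

# margins=True adds an 'All' row and column with the totals
freq = pd.crosstab(df['Scale'], df['Movement'], margins=True)
print(freq)
```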

1.8 Summary

In this tutorial, we reviewed pandas’ data aggregation and grouping capabilities supported by:

  • The groupby() function
  • Pivot tables
  • Cross-tabulation with the crosstab() function