Posted on November 20, 2020

Functional Programming in Python

This tutorial is adapted from the Web Age course Introduction to Python Programming.

1.1 What is Functional Programming?

Functional programming reduces problems to a set of function calls. The functions used, referred to as pure functions, follow these rules:

- Only produce a result
- Do not modify the parameters that were passed in
- Do not produce any side effects
- Always produce the same result when called with the same parameters

Another condition of functional programming is that all data be immutable.

1.2 Benefits of Functional Programming

Solutions implemented using functional programming have several advantages:

- Pure functions are easier to test
- Programs follow a predictable flow
- Debugging is easier
- Functional building blocks allow for high-level programming
- Parallel execution is easier to implement

1.3 Functions as Data

In Python, functions can be assigned to variables and passed as arguments. This allows for:

- Passing functions as parameters to other functions
- Returning functions as the result of a function

Python functions such as map(), filter() and sort() take functions as parameters. Functions that accept other functions as parameters or return functions as their result are referred to as higher-order functions.

1.4 Using the Map Function

The map() function applies a specified function to each item in an iterable (list, string, dictionary, set):

map(transform_function, iterable)

The transform_function passed to map() takes an item from the iterable as a parameter and returns that item modified in some way:

def toUpper(item):
    return item.upper()

The following map() call converts all items to upper case:

states = ['Arizona', 'Georgia', 'New York', 'Texas']
map_result = map(toUpper, states)
list_result = list(map_result)
print(list_result)
# prints: ['ARIZONA', 'GEORGIA', 'NEW YORK', 'TEXAS']

map() returns a map object, which is then converted to a list using the list() constructor.

1.5 Using the Filter Function

The filter() function checks each item in an iterable (list, string, dictionary, set) against a condition and outputs a new iterable that only includes items that pass the check:

filter(check_function, iterable)

The check_function passed to filter() takes an item from the iterable as a parameter and returns True or False depending on whether the item meets the given condition:

def lengthCheck(item, maxlen=7):
    return len(item) <= maxlen

The following filter() call keeps only the items that are at most seven characters long:

states = ['Arizona', 'Oklahoma', 'Utah', 'Texas']
filter_result = filter(lengthCheck, states)
list_result = list(filter_result)
print(list_result)
# prints: ['Arizona', 'Utah', 'Texas']

filter() returns a filter object, which is then converted to a list using the list() constructor.
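Pulling the last three sections together, here is a minimal sketch (the make_length_check helper is our own invention, not from the course) showing a higher-order function that returns a check function, chained with filter() and map():

def make_length_check(maxlen):
    # Returns a new check function that closes over maxlen,
    # illustrating a function returned as the result of a function
    def check(item):
        return len(item) <= maxlen
    return check

states = ['Arizona', 'Oklahoma', 'Utah', 'Texas']

# Chain filter() and map(): keep the short names, then upper-case them
short_upper = list(map(str.upper, filter(make_length_check(5), states)))
print(short_upper)  # prints: ['UTAH', 'TEXAS']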
1.6 Lambda Expressions

Lambda expressions in Python:

- Are a special syntax for creating anonymous functions
- Are limited to a single line of code
- Return the result of their single code statement by default
- Are typically defined in-line where they will be used

Some lambda expressions:

lambda x : x * x      # multiply parameter by itself
lambda y : y * 2      # multiply parameter by 2
lambda z : z['name']  # get value from dict with key = 'name'

Normally a lambda expression is placed in the code where it will be used:

list(map(lambda x: x * x, [1, 2, 3]))  # outputs [1, 4, 9]

Lambda expressions can be tested as shown here by supplying a parameter in parentheses:

(lambda x : x * x)(5)  # returns 25

1.7 List.sort() Using a Lambda Expression

list.sort() takes two keyword parameters:

list.sort(key=..., reverse=...)

- key = function that returns the value to be sorted on
- reverse = True/False to indicate the sort order

In the code below a lambda expression is used to provide the function for the key parameter:

list1 = [{'name':'Jack', 'age':25},
         {'name':'Cindy', 'age':17},
         {'name':'Stuart', 'age':20}]
list1.sort(key=lambda item: item['name'], reverse=False)
print(list1)
# outputs: [{'name': 'Cindy', 'age': 17},
#           {'name': 'Jack', 'age': 25},
#           {'name': 'Stuart', 'age': 20}]

The lambda expression returns the value of each object's 'name' key to be sorted on.

Notes: Python's sorted() function is similar to list.sort() except that you pass in the list as the first parameter, and sorted() returns a new sorted list which must then be assigned to a variable:

list_sorted = sorted(list1, key=..., reverse=...)

1.8 Difference Between Simple Loops and map/filter Type Functions

Loops intended to modify an iterable's items often do so by mutating the original iterable:

# loop over and mutate iterable
list1 = [ 'c', 'f', 'a', 'e', 'b' ]
for i in range(len(list1)):
    list1[i] = list1[i].upper()

The problems with this are:

- The original list is no longer available as it has been mutated
- This code would not work with immutable sequences like tuples

The equivalent code using map() fixes both of these issues:

# map creates a new list based on the original
list1 = [ 'c', 'f', 'a', 'e', 'b' ]
list2 = list(map(lambda x: x.upper(), list1))

Here is the same map() call being used with a tuple:

tup1 = ( 'c', 'f', 'a', 'e', 'b' )
tup2 = tuple(map(lambda x: x.upper(), tup1))
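The immutability point is easy to verify with a quick sketch of our own: after the map() call, the original list is still intact:

list1 = ['c', 'f', 'a', 'e', 'b']
list2 = list(map(lambda x: x.upper(), list1))

print(list1)  # prints: ['c', 'f', 'a', 'e', 'b'] -- the original is unchanged
print(list2)  # prints: ['C', 'F', 'A', 'E', 'B']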
1.9 Additional Functions

Python includes many functions developers can use out of the box rather than having to hand-code them:

any(seq)       # iterate boolean seq, returns True if at least one item is true
all(seq)       # iterate boolean seq, returns True if all items are true
max(seq)       # iterate sequence and return the max value
min(seq)       # iterate sequence and return the min value
sum(seq)       # iterate sequence and calculate the sum
len(seq)       # return length of a string, number of items in a list, etc.
input()        # get input from the user
randint(0,10)  # generate a random number between 0 and 10
Counter(seq)   # create dictionary with the frequency of each seq member

Notes: randint() is part of the random module, which must be imported:

import random as r
r.randint(0,10)

Counter is part of the collections module and must be imported:

import collections as c
c.Counter('asdffg')

1.10 General Rules for Creating Functions

Since creating functions is a big part of functional programming, we will reiterate here some of the rules we learned earlier:

- Name functions appropriately
- Limit the function to a single responsibility
- Include a docstring
- Always return a value
- Limit functions to 50 lines or less
- Make the function 'idempotent' and 'pure' if possible

1.11 Summary

In this tutorial, we covered:

- What is Functional Programming
- Functional Programming Benefits
- Using the Map Function
- Using the Filter Function
- Lambda Expressions
- List.sort() Using a Lambda Expression
- Loops vs. map/filter
- Additional Functions
- General Rules for Creating Functions
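Before moving on, here is a short self-contained recap sketch (our own, not part of the original course material) that exercises several of the out-of-the-box functions from section 1.9 together with filter() and map():

import collections as c
import random as r

scores = [r.randint(0, 10) for _ in range(5)]      # five random scores, e.g. [3, 7, 0, 10, 7]

passing = list(filter(lambda s: s >= 5, scores))   # keep scores of 5 or more
doubled = list(map(lambda s: s * 2, passing))      # double each passing score

print(any(s == 10 for s in scores))                # True if at least one perfect score
print(all(s >= 0 for s in scores))                 # True, since randint(0, 10) is never negative
print(max(scores), min(scores), sum(scores), len(scores))
print(c.Counter(scores))                           # frequency of each score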
Posted on October 30, 2020

How to do Data Grouping and Aggregation with Pandas?

This tutorial is adapted from the Web Age course Data Engineering Bootcamp Training (Using Python and PySpark).

1.1 Data Aggregation and Grouping

The pandas module offers functions that closely emulate SQL functionality for data grouping, aggregation, and filtering. With pandas you can do the following tasks:

- Compute group summary statistics (mean, std, count, etc.)
- Compute pivot tables and cross-tabulation

Generally, these operations are referred to as the split-apply-combine cycle.

1.2 Sample Data Set

We will use the data set below to illustrate some of the common data aggregation operations in Python using pandas. The data set captures a point-in-time view of some fictitious key performance indicators (KPIs), which are characterized by their values, scale, and movement. Note that we have categorical values in the Scale and Movement columns that are very good candidates to perform "group by" operations on.

KPI | Value | Scale  | Movement
A   | 390   | MEDIUM | DOWN
B   | 72    | SMALL  | DOWN
C   | 284   | SMALL  | UP
D   | 754   | LARGE  | DOWN
...
U   | 706   | LARGE  | DOWN
V   | 363   | MEDIUM | DOWN
W   | 275   | SMALL  | UP
X   | 538   | MEDIUM | DOWN
Y   | 974   | LARGE  | UP
Z   | 696   | MEDIUM | DOWN

1.3 The pandas.core.groupby.SeriesGroupBy Object

Before you can start applying aggregate operations in pandas, you need to create a pandas.core.groupby.SeriesGroupBy object. You specify the Group By column and the column to which you want to apply an aggregate operation; with our data set, we may create it like so:

groupValueByScale = df['Value'].groupby([df['Scale']])

Now you can apply a wide range of aggregate operations as function calls on the pandas.core.groupby.SeriesGroupBy object, for example:

groupValueByScale.mean()
groupValueByScale.max()
groupValueByScale.count()
groupValueByScale.sum()

The first two commands above are functionally equivalent to these SQL statements:

SELECT Scale, AVG(Value) FROM DF GROUP BY Scale
SELECT Scale, MAX(Value) FROM DF GROUP BY Scale

With more recent pandas versions, you can also use a more compact syntax:

df.groupby(['Scale'])['Value'].mean()
# or simply
df.groupby('Scale')['Value'].mean()

Your output would look something like this:

Scale
LARGE     803.285714
MEDIUM    512.714286
SMALL     168.200000
Name: Value, dtype: float64

1.4 Grouping by Two or More Columns

You can designate two or more columns as the Group By columns. With our data set, we can do something like this:

groupValueByScaleAndMovement = df['Value'].groupby([df['Scale'], df['Movement']])

Note: You can also use a more compact syntax in more recent pandas versions:

df.groupby(['Scale', 'Movement'])['Value']

Now the groupValueByScaleAndMovement.mean() operation will produce this output:

Scale   Movement
LARGE   DOWN        722.500000
        UP          911.000000
MEDIUM  DOWN        523.571429
        UP          501.857143
SMALL   DOWN         56.500000
        UP          242.666667
Name: Value, dtype: float64

1.5 Emulating SQL's WHERE Clause

To emulate SQL's WHERE clause, you use this syntax:

dataFrame[dataFrame['Column Name'] <boolean operation> Some_Value]

for example:

df[df['Value'] > 333]

Notes: If you issue this command instead:

df['Value'] > 333

you will produce a logical vector:

0     True
1    False
2    False
3     True
4     True
5     True
6     True
7     True
8     True
...
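Putting the last two ideas together, here is a minimal sketch (our own, using rows A-D of the sample data as a stand-in for the full KPI set) that combines a WHERE-style filter with a Group By, just as SQL combines WHERE and GROUP BY:

import pandas as pd

# A small stand-in for the KPI data set (rows A-D from section 1.2)
df = pd.DataFrame({
    'KPI':      ['A', 'B', 'C', 'D'],
    'Value':    [390, 72, 284, 754],
    'Scale':    ['MEDIUM', 'SMALL', 'SMALL', 'LARGE'],
    'Movement': ['DOWN', 'DOWN', 'UP', 'DOWN'],
})

# Equivalent to: SELECT Scale, AVG(Value) FROM DF WHERE Value > 333 GROUP BY Scale
print(df[df['Value'] > 333].groupby('Scale')['Value'].mean())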
1.6 Pivot Tables

The pivot table is a structure that allows you to re-shape the original data frame so that categorical values from select columns become distinct columns in the pivot table. Pivot tables are very well suited for summary reports.

For example, this command:

dfPivot = df.pivot(index='KPI', columns='Movement', values='Value')

will produce this output (notice that the DOWN and UP values from the Movement column have become columns in the report):

Movement   DOWN     UP
KPI
A         390.0    NaN
B          72.0    NaN
C           NaN  284.0
D         754.0    NaN
E         718.0    NaN
F           NaN  493.0
G           NaN  483.0

Notes: You can replace the NaN values in the above output with dashes (-) using this command, making the report a bit less noisy:

dfPivot.fillna('-', inplace=True)

DataCamp's pandas cheat sheet has a good illustration of how pivot reshaping works.

1.7 Cross-Tabulation

pandas' crosstab() function computes a frequency table. For example, this command:

pd.crosstab(df['Scale'], df['Movement'])

will produce a report counting how many movements (DOWN or UP) there were within each Scale category (LARGE, MEDIUM, or SMALL):

Movement  DOWN  UP
Scale
LARGE        4   3
MEDIUM       7   7
SMALL        2   3

Notes: The same counts could have been generated using this Group By command; the report's output layout will be a bit different:

df['Movement'].groupby([df['Scale'], df['Movement']]).count()

Scale   Movement
LARGE   DOWN        4
        UP          3
MEDIUM  DOWN        7
        UP          7
SMALL   DOWN        2
        UP          3
Name: Movement, dtype: int64

1.8 Summary

In this tutorial, we reviewed pandas' data aggregation and grouping capabilities supported by:

- The groupby function
- Pivot tables
- Cross-tabulation with the crosstab() function
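To wrap up, here is a self-contained sketch (our own recap, reusing the small stand-in frame from the earlier sketch rather than the full KPI data set) that runs the tutorial's main commands end to end:

import pandas as pd

# Small stand-in for the KPI data set (rows A-D from section 1.2)
df = pd.DataFrame({
    'KPI':      ['A', 'B', 'C', 'D'],
    'Value':    [390, 72, 284, 754],
    'Scale':    ['MEDIUM', 'SMALL', 'SMALL', 'LARGE'],
    'Movement': ['DOWN', 'DOWN', 'UP', 'DOWN'],
})

# Group By with an aggregate (section 1.3)
print(df.groupby('Scale')['Value'].mean())

# Pivot: Movement values become columns (section 1.6)
dfPivot = df.pivot(index='KPI', columns='Movement', values='Value')
print(dfPivot.fillna('-'))

# Cross-tabulation: Movement frequency within each Scale (section 1.7)
print(pd.crosstab(df['Scale'], df['Movement']))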