How to Repair and Normalize Data with Pandas?

This tutorial is adapted from Web Age course Data Engineering Bootcamp Training Using Python and PySpark.

When you embark on a new data engineering/data science/machine learning project, right off the bat you may be faced with defects in your input dataset, including but not limited to these issues: 

  • Missing values-Some data point may not have been captured at source or lost/corrupt in transit
  • Not normalized data -This data property may impact the accuracy of subsequent data science and machine learning algorithms downstream the processing pipeline
  • Data in columns is in inconsistent formats
  • The presence of obvious outliers

In this tutorial, we will review some of the common techniques supported by pandas for dealing with most of the above issues; we will also review two techniques used in data normalization (scaling). 

Part 1 – Set up the Environment

1. Start a new Jupyter working session by running this command:

jupyter notebook

Wait for the Jupyter notebook page to open in the browser.

2. In the New drop-down box in the top right-hand corner of the Jupyter page, select Python 3 to create a new notebook.

The new Untitled notebook should initialize.

3. Rename the notebook as Repairing and Normalizing Your Data Lab

Part 2 – Review and Download the Data Set

As usual, we will start with importing the required modules.

1. In the currently active input cell, enter the following commands one after another pressing Enter after each line and Shift+Enter to submit the command batch for execution:

import pandas as pd
import numpy as np

2. The input file is in a GitHub repository and you can view its contents by pointing your browser to http://bit.ly/36gJKAM

You should see the following output:

11,12,13,14
NaN,22,,24
,NULL,33,
41,42,43,n/a
nan,52,null,54
61,62,63,NA
N/A,72,73,74
81,82,83,84

Looks like the file is riddled with missing data.

The file is quite small but sufficient for our purposes to illustrate most of the possible missing data scenarios.

Note: pandas interprets the following strings as “Not Available” dat:

”, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’,

‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’,

‘null’.

You can use any of those to represent the missing data in your data sets.

3. Enter the following commands to download the dataset and create a DataFrame off of it:

input_file = 'http://bit.ly/36gJKAM'
dataSet  = pd.read_csv(input_file, header=None, 
                    names = ['C1', 'C2', 'C3', 'C4']); dataSet

You should see the following output:

	C1	C2	C3	C4
0	11.0	12.0	13.0	14.0
1	NaN	22.0	NaN	24.0
2	NaN	NaN	33.0	NaN
3	41.0	42.0	43.0	NaN
4	NaN	52.0	NaN	54.0
5	61.0	62.0	63.0	NaN
6	NaN	72.0	73.0	74.0
7	81.0	82.0	83.0	84.0

Get a brief summary of the DataFrame with the info() method.

You should see the following output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
C1    4 non-null float64
C2    7 non-null float64
C3    6 non-null float64
C4    5 non-null float64
dtypes: float64(4)
memory usage: 336.0 bytes

Notice that the system reports the number of non-null values per column.

Part 3 – Repairing Data

In our work below, we are going to plug in the holes represented by the NaN values.

We have a rather small data set and you can visually identify NaN values in it. With larger data sets you will need some way of getting an idea of how many NaN (null) values you have; you are usually interested in knowing about NaN values per column.

First, let’s get the count of the missing (NaN) values per column in our DataFrame.

1. Enter the following command:

dataSet.isnull().sum(axis = 0)

Note: axis = 0 instructs pandas to go through all the rows column-wise.

You should see the following output:

C1    4
C2    1
C3    2
C4    3
dtype: int64

Columns (which will eventually represent features or variables in data science/machine learning projects) with large counts of NaN values may be possible candidates for being dropped altogether.

For large data sets, you can use this command to get the ratio of NaN values per column:

dataSet.isnull().sum(axis = 0) / dataSet.shape[0]

Note: The dataSet.shape command will return a tuple that lists the number of rows and number of columns in the dataSet object and the dataSet.shape[0] yields the number of rows.

For our dataset, we have the following output for that ratio:

C1    0.500
C2    0.125
C3    0.250
C4    0.375
dtype: float64

2. Enter the following command to drop this column (yes, you were correct – it is the first one):

dataSet2 = dataSet.drop('C1', axis=1) 

Note that we do not drop the column in the original data set; rather, we create a derivative DataFrame, dataSet2.

3. Enter the following command:

dataSet2.shape

You should see the (8, 3) output indicating that the column was, indeed, dropped (the original shape is (8, 4)).

Now, let’s see what we have in terms of missing data row-wise.

4. Enter the following command:

dataSet2.isnull().sum(axis = 1)

You should see the following output:

0    0
1    1
2    2
3    1
4    1
5    1
6    0
7    0
dtype: int64

The third row has the most missing values – two.

Let’s assume that it is not acceptable for our 3-column dataset and we want to drop that row.

5. Enter the following command:

dataSet3 = dataSet2.drop(2, axis=0)

As a result of our data manipulation, we have the following data in dataSet3:

	C2	C3	C4
0	12.0	13.0	14.0
1	22.0	NaN	24.0
3	42.0	43.0	NaN
4	52.0	NaN	54.0
5	62.0	63.0	NaN
6	72.0	73.0	74.0
7	82.0	83.0	84.0

Note: In some situations, you may want to drop rows/columns that contain all or any missing values. In such situations where you can afford to drop data like that, you can use pandas’ dropna() method. Here is an example of dropping all rows (axis = 0 (index) is the default) that has any missing values:

dataSet3.dropna()

You should see the following severely decimated output:

C2 C3 C4

0 12.0 13.0 14.0

6 72.0 73.0 74.0

7 82.0 83.0 84.0

Now, let’s see what we can do to plug in values for the missing values without deleting columns and rows.

First, make a copy of dataSet3 as we are going to do updates in-place and it makes sense to keep the original copy around intact (subject to available memory, of course).

6. Enter the following command:

dataSet4 = dataSet3.copy() 

First, we will use the default ‘linear‘ interpolation strategy to plug-in NaN’s in the C3 column.

7. Enter the following command:

dataSet4.C3.interpolate (inplace=True)

The values in the column C3 now will be as follows:

0    13.0
1    28.0
3    43.0
4    53.0
5    63.0
6    73.0
7    83.0

8. Enter the following command:

dataSet4.C4.interpolate (method='nearest', inplace=True); dataSet4.C4
0    14.0
1    24.0
3    54.0
4    54.0
5    54.0
6    74.0
7    84.0
Name: C4, dtype: float64

Note: You can always get help on the interpolate() method using this IPython magic command:

dataSet4.interpolate?

Currently, the interpolate() method supports the following interpolation methods (looks like the developer(s) had a field day working on this feature):

{‘linear’, ‘time’, ‘index’, ‘values’, ‘nearest’, ‘zero’,
 ‘slinear’, ‘quadratic’, ‘cubic’, ‘barycentric’, ‘krogh’, ‘polynomial’,
  ‘spline’, ‘piecewise_polynomial’, ‘from_derivatives’, ‘pchip’, ‘akima’}

which should give you enough options for replacing the missing values (the most common strategy is the default linear, though, which also makes perfect sense in most practical situations).

Now, let’s demonstrate the usage of the fillna() method.

9. Enter the following commands:

dataSet5 = dataSet3.copy()
dataSet5.C3.fillna(dataSet5.C3.mean(), inplace=True)
dataSet5

In this data repairing strategy, we decided to use the mean (average) value for the C3 column.

You should see the following output:

	C2	C3	C4
0	12.0	13.0	14.0
1	22.0	55.0	24.0
3	42.0	43.0	NaN
4	52.0	55.0	54.0
5	62.0	63.0	NaN
6	72.0	73.0	74.0
7	82.0	83.0	84.0

The value 55.0 was the average value (the mean) of the existing (not NaN) column values in the original C3 column.

Now, let’s apply the fillna() method across the rows.

First, we need to re-index our dataset to avoid confusion.

10. Enter the following command:

dataSet5.index = range(dataSet5.shape[0])

11. Run the following commands to understand the math at hand and prepare yourself for understanding of what will transpire.

print (dataSet5.iloc[2])
print ('...................')
print (dataSet5.iloc[2].mean())

You should see the following output:

C2    42.0
C3    43.0
C4     NaN
Name: 2, dtype: float64
...................
42.5

12. Enter the following command:

dataSet5.iloc[2].fillna(dataSet5.iloc[2].mean(), inplace=True)

dataSet5 now has this updated third row:

	C2	C3	C4
0	12.0	13.0	14.0
1	22.0	55.0	24.0
2	42.0	43.0	42.5
3	52.0	55.0	54.0
4	62.0	63.0	NaN
5	72.0	73.0	74.0
6	82.0	83.0	84.0  

We now have only one NaN value left (in column C4).

Use the knowledge you have acquired on the matter to fill that value with an average value in C4.

At the end of our data munging exercise, our DataFrame looks like that:

         C2	  C3	  C4
0	12.0	13.0	14.00
1	22.0	55.0	24.00
2	42.0	43.0	42.50
3	52.0	55.0	54.00
4	62.0	63.0	48.75
5	72.0	73.0	74.00
6	82.0	83.0	84.00

Fine print: When dealing with missing values in your input data sets, you need to be cognizant of the problem domain’s specifics in order to come up with a suitable substitute strategy.

Part 4 – Applying the map Function (for Review)

Sometimes you would like to apply a function to all the values in a column. One option here is to use the map function that would use the column’s values as input.

Here is an example. Let’s say you would like to create an executive summary of the values in column C2 which, instead of showing numbers, would, rather, show some labels that would represent data ranges, e.g. weak, medium, and strong (e.g. describing a tornado strength, a stock behavior, etc.). This technique can also be useful in Data Science experiments where you may want to use categorical (class) variables.

df_map = dataSet5.copy()
df_map.C2  = df_map.C2.map(
	lambda c: 'strong' if c > 60 else 'medium' if c > 40 else 'weak') 

The above operation is similar to SQL’s CASE-WHEN statement.

df_map would contain the following data:

         C2	C3	C4
0	weak	13.0	14.00
1	weak	55.0	24.00
2	medium	43.0	42.50
3	medium	55.0	54.00
4	strong	63.0	48.75
5	strong	73.0	74.00
6	strong	83.0	84.00

Now, if you want to convert the above string labels into numbers for ease of processing, you can use the sklearn.preprocessing.LabelEncoder object, which you can use as follows:

from sklearn.preprocessing import LabelEncoder
df_lbl_enc = df_map.copy()
df_lbl_enc.C2 = LabelEncoder().fit_transform ( df_lbl_enc.C2 )

Now df_lbl_enc has the following data in it:

	C2	C3	C4
0	2	13.0	14.00
1	2	55.0	24.00
2	0	43.0	42.50
3	0	55.0	54.00
4	1	63.0	48.75
5	1	73.0	74.00
6	1	83.0	84.00

You may want to change the way the mapping is performed, e.g. you want the ‘weak’ label to be represented by 0, rather than non-intuitive 2. In other words, we want to enforce our own ordinal sequence. For that you need to do the following:

map_dic = {'weak': 0, 'medium': 1, 'strong':2}
# This will created the "mapping"
# Use the above mapping in the map function:
df_map.copy()['C2'].map(map_dic)

You will get this pandas.core.series.Series:

0    0
1    0
2    1
3    1
4    2
5    2
6    2   

Part 5 – How to Normalize (Scale) Your Data

The range of values in your input data sets may vary widely which may negatively affect the accuracy and validity of some machine learning algorithms that depend on calculating the distance between data points, e.g. using the Euclidean distance.

Data normalization (a.k.a. feature scaling) is a method for standardizing the range of independent variables or features contained in your input data sets. This data preprocessing step may considerably improve the quality of your subsequent work with data science and machine learning algorithms.

In this part, we will demonstrate two techniques used in data normalization.

The first technique is called z-score normalization which is about centering data sets (column-wise) around the mean of zero and re-scaling data values in columns such that their standard deviation equals 1; such re-scaling makes most of your data fall in the range of [-3, +3]

The second technique we are going to demonstrate is the min-max transformation which re-scales your data to values in the range of [0,1], where 0 corresponds to the minimum value in the value range, and 1 corresponds to the maximum value.

Both techniques are supported by the sklearn.preprocessing module which also includes methods for data scaling, centering, normalization, binarization, and imputation.

Part 6 – DataFrame Scaling and Normalization

Before continuing make sure you ran the solution mentioned at the end of part 3.

Let’s apply the z-score normalization transformation to our dataSet5 DataFrame.

1. Import the needed module:

from sklearn import preprocessing

2. Enter the following command in a new code cell:

dataSet5_scaled = preprocessing.scale(dataSet5); dataSet5_scaled

You should get the following output:

array([[-1.56501609, -2.00227144, -1.49800487],
       [-1.1436656 ,  0.        , -1.06692433],
       [-0.30096463, -0.57207755, -0.26942534],
       [ 0.12038585,  0.        ,  0.22631728],
       [ 0.54173634,  0.38138504,  0.        ],
       [ 0.96308682,  0.85811633,  1.08847836],
       [ 1.38443731,  1.33484762,  1.51955889]])

dataSet5_scaled is of type numpy.ndarray, and should you wish to return back to the DataFrame world, you need to reconstruct it like so:

df = pd.DataFrame (dataSet5_scaled, columns=['C1', 'C2', 'C3'])

With this reconstructed DataFrame you can do the following operations:

df.std()
df.mean()

Which would report the near zero mean and unit variance as promised by preprocessing.scale().

df.describe() will report the following descriptive statistics:

                C1	   C2	          C3
count	7.000000e+00	7.000000e+00	7.000000e+00
mean	-2.220446e-16	6.344132e-17	-9.516197e-17
std	1.080123e+00	1.080123e+00	1.080123e+00
min	-1.565016e+00	-1.341641e+00	-1.517158e+00
25%	-7.223151e-01	-7.546729e-01	-7.883275e-01
50%	1.203859e-01	-1.677051e-01	1.487410e-01
75%	7.524116e-01	8.105746e-01	7.734534e-01
max	1.384437e+00	1.397542e+00	1.398166e+00

Shrug off the count 7.000000e+00 nuisance even though it might be philosophically disturbing for some.

Note: The scale() method treats NaNs, if any are present, as missing values and those are disregarded during computation retaining the original NaN values.

Applying the MinMaxScaler is a bit different, and here is the command:

preprocessing.MinMaxScaler().fit_transform(dataSet5)

You should see the following output:

array([[0.        , 0.        , 0.        ],
       [0.14285714, 0.6       , 0.14285714],
       [0.42857143, 0.42857143, 0.40714286],
       [0.57142857, 0.6       , 0.57142857],
       [0.71428571, 0.71428571, 0.49642857],
       [0.85714286, 0.85714286, 0.85714286],
       [1.        , 1.        , 1.        ]])

To learn about the MinMaxScaler object, and its fit, transform, and fit_transform methods, visit https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

Part 7 – Scaling and Normalization of NumPy Arrays (Optional)

1. Enter the following commands:

np.random.seed(1234)
toScaleDS = np.ceil(100 * np.random.rand (10,3))
toScaleDS

Note: We use the

np.random.seed(1234)

seeding command in order to make the results of generating random numbers reproducible.

You should get the following results:

array([[ 20.,  63.,  44.],
       [ 79.,  78.,  28.],
       [ 28.,  81.,  96.],
       [ 88.,  36.,  51.],
       [ 69.,  72.,  38.],
       [ 57.,  51.,   2.],
       [ 78.,  89.,  37.],
       [ 62.,   8.,  37.],
       [ 94.,  66.,  40.],
       [ 79.,  32.,  57.]])

You can access this NumPy array to get its descriptive stats as follows:

  • Get the average of the values in the second column:
toScaleDS[:,1].mean()
  • Get the standard deviation across the third row:
toScaleDS[2,:].std()

We will use the NumPy array toScaleDS created earlier in the tutorial for illustration of the techniques.

2. Enter the following command to apply z-score normalization transformation:

from sklearn import preprocessing
scaledDS = preprocessing.scale(toScaleDS)
scaledDS

You should see the following output:

array([[-1.95146565,  0.22264681,  0.04435909],
       [ 0.58458002,  0.84111018, -0.66538641],
       [-1.60759505,  0.96480286,  2.35103199],
       [ 0.97143444, -0.89058725,  0.35487275],
       [ 0.15474177,  0.59372484, -0.22179547],
       [-0.36106413, -0.27212388, -1.81872286],
       [ 0.54159619,  1.29464999, -0.26615457],
       [-0.146145  , -2.04505221, -0.26615457],
       [ 1.22933739,  0.34633949, -0.13307728],
       [ 0.58458002, -1.05551082,  0.62102732]])

The scaledDS structure is a NumPy ndarray as well.

The original (unscaled) version is shown below.

array([[ 20.,  63.,  44.],
       [ 79.,  78.,  28.],
       [ 28.,  81.,  96.],
       [ 88.,  36.,  51.],
       [ 69.,  72.,  38.],
       [ 57.,  51.,   2.],
       [ 78.,  89.,  37.],
       [ 62.,   8.,  37.],
       [ 94.,  66.,  40.],
       [ 79.,  32.,  57.]])

The scale() method can standardize the input data along any axis – column-wise (features-wise) scaling (axis = 0) is the default; use axis = 1 for scaling across each row.

Let’s see how we can get.

3. Enter the following command:

np.mean(scaledDS, axis = 0)

You should see the following output (we have three columns in our array, so we get three values for the average values – one per each column) which are near zero (with negligible deviations).

array([-2.22044605e-16, -6.66133815e-17,  1.11022302e-17])

4. Enter the following command:

np.std(scaledDS, axis = 0)

You should see the following output:

array([ 1.,  1.,  1.])

As you can see, this is the unit variance.

Now let’s apply the min-max transformation to the original toScaleDS data set.

5. Enter the following command:

mmScaler = preprocessing.MinMaxScaler()
mmScaler.fit_transform(toScaleDS)

You should see the following output:

array([[ 0.        ,  0.67901235,  0.44680851],
       [ 0.7972973 ,  0.86419753,  0.27659574],
       [ 0.10810811,  0.90123457,  1.        ],
       [ 0.91891892,  0.34567901,  0.5212766 ],
       [ 0.66216216,  0.79012346,  0.38297872],
       [ 0.5       ,  0.5308642 ,  0.        ],
       [ 0.78378378,  1.        ,  0.37234043],
       [ 0.56756757,  0.        ,  0.37234043],
       [ 1.        ,  0.71604938,  0.40425532],
       [ 0.7972973 ,  0.2962963 ,  0.58510638]])

To help you compare the produced result with the original data set, below is the toScaleDS’s contents:

array([[ 20.,  63.,  44.],
       [ 79.,  78.,  28.],
       [ 28.,  81.,  96.],
       [ 88.,  36.,  51.],
       [ 69.,  72.,  38.],
       [ 57.,  51.,   2.],
       [ 78.,  89.,  37.],
       [ 62.,   8.,  37.],
       [ 94.,  66.,  40.],
       [ 79.,  32.,  57.]])

Part 8 – Clean Up

1. Click the Save button in the toolbar.

In the menu bar, select File > Close and Halt

The command will shut down the notebook’s kernel (the sandboxed Python session) and close the interactive edit session.

Keep the browser window open as we are going to use it later.

This is the last step in this tutorial.

Part 9 – Solutions

S1
dataSet5.C4.fillna(dataSet5.C4.mean(), inplace=True); dataSet5

Part 10 – Review

In this tutorial, you learned about the ways Python can handle missing values in your input data sets as well as ways to normalize (scale) your data to make the transformed data more conducive for further data processing down your data science workflow pipeline.

 

Leave a Reply

Your email address will not be published. Required fields are marked *