This tutorial is adapted from the Web Age course *Data Engineering Bootcamp Training Using Python and PySpark*.

When you embark on a new data engineering/data science/machine learning project, right off the bat you may be faced with defects in your input dataset, including but not limited to these issues:

- Missing values – some data points may not have been captured at the source, or were lost or corrupted in transit
- Non-normalized data – this may impact the accuracy of data science and machine learning algorithms downstream in the processing pipeline
- Inconsistent formats across columns
- The presence of obvious outliers

In this tutorial, we will review some of the common techniques supported by **pandas** for dealing with most of the above issues; we will also review two techniques used in data normalization (scaling).

## Part 1 – Set up the Environment

1. Start a new Jupyter working session by running this command:

```
jupyter notebook
```

Wait for the Jupyter notebook page to open in the browser.

2. In the **New** drop-down box in the top right-hand corner of the Jupyter page, select **Python 3** to create a new notebook.

The new *Untitled* notebook should initialize.

3. Rename the notebook as **Repairing and Normalizing Your Data Lab**.

## Part 2 – Review and Download the Data Set

As usual, we will start with importing the required modules.

1. In the currently active input cell, enter the following commands one after another, pressing **Enter** after each line and **Shift+Enter** to submit the command batch for execution:

```python
import pandas as pd
import numpy as np
```

2. The input file is in a GitHub repository and you can view its contents by pointing your browser to http://bit.ly/36gJKAM

You should see the following output:

```
11,12,13,14
NaN,22,,24
,NULL,33,
41,42,43,n/a
nan,52,null,54
61,62,63,NA
N/A,72,73,74
81,82,83,84
```

Looks like the file is riddled with missing data.

The file is quite small but sufficient for our purposes to illustrate most of the possible missing data scenarios.

**Note**: pandas interprets the following strings as "Not Available" data:

```
'', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
'1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'
```

You can use any of those to represent the missing data in your data sets.
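To see this recognition in action, here is a minimal, self-contained sketch (the column names `a`, `b`, `c` are made up for illustration) that feeds a CSV string containing several of these markers to *pd.read_csv*:

```python
import io

import pandas as pd

# A tiny CSV using several of the marker strings pandas treats as missing.
csv_text = "a,b,c\n1,N/A,3\nNA,5,null\nn/a,8,9\n"

df = pd.read_csv(io.StringIO(csv_text))

# Every marker string was parsed as NaN:
print(df.isnull().sum().tolist())  # [2, 1, 1]
```

All the marker styles come back as NaN, so the `isnull()`-based bookkeeping used later in this tutorial works uniformly regardless of which spelling the data source used.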

3. Enter the following commands to download the dataset and create a DataFrame off of it:

```python
input_file = 'http://bit.ly/36gJKAM'
dataSet = pd.read_csv(input_file, header=None, names=['C1', 'C2', 'C3', 'C4'])
dataSet
```

You should see the following output:

```
     C1    C2    C3    C4
0  11.0  12.0  13.0  14.0
1   NaN  22.0   NaN  24.0
2   NaN   NaN  33.0   NaN
3  41.0  42.0  43.0   NaN
4   NaN  52.0   NaN  54.0
5  61.0  62.0  63.0   NaN
6   NaN  72.0  73.0  74.0
7  81.0  82.0  83.0  84.0
```

Get a brief summary of the DataFrame with the *info()* method by entering `dataSet.info()`.

You should see the following output:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
C1    4 non-null float64
C2    7 non-null float64
C3    6 non-null float64
C4    5 non-null float64
dtypes: float64(4)
memory usage: 336.0 bytes
```

Notice that the system reports the number of non-null values per column.

## Part 3 – Repairing Data

In our work below, we are going to plug in the holes represented by the *NaN *values.

We have a rather small data set and you can visually identify NaN values in it. With larger data sets you will need some way of getting an idea of how many NaN (null) values you have; you are usually interested in knowing about NaN values per column.

First, let’s get the count of the missing (NaN) values per column in our DataFrame.

1. Enter the following command:

```python
dataSet.isnull().sum(axis = 0)
```

**Note**: `axis = 0` instructs pandas to go through all the rows column-wise.

You should see the following output:

```
C1    4
C2    1
C3    2
C4    3
dtype: int64
```

Columns (which will eventually represent features or variables in data science/machine learning projects) with large counts of NaN values may be possible candidates for being dropped altogether.

For large data sets, you can use this command to get the ratio of NaN values per column:

```python
dataSet.isnull().sum(axis = 0) / dataSet.shape[0]
```

**Note**: The *dataSet.shape* attribute returns a tuple listing the number of rows and the number of columns in the *dataSet* object; *dataSet.shape[0]* yields the number of rows.

For our dataset, we have the following output for that ratio:

```
C1    0.500
C2    0.125
C3    0.250
C4    0.375
dtype: float64
```
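Building on that ratio, you can drop columns programmatically once their NaN share crosses a threshold. A sketch with made-up data and a hypothetical 40% cut-off (the right threshold is always problem-specific):

```python
import numpy as np
import pandas as pd

# Made-up frame: C1 is 75% NaN, C3 is 25% NaN, C2 is complete.
df = pd.DataFrame({'C1': [np.nan, np.nan, 3.0, np.nan],
                   'C2': [1.0, 2.0, 3.0, 4.0],
                   'C3': [1.0, np.nan, 3.0, 4.0]})

threshold = 0.40  # arbitrary cut-off chosen for this sketch
ratios = df.isnull().sum(axis=0) / df.shape[0]
df_reduced = df.drop(columns=ratios[ratios > threshold].index)
print(list(df_reduced.columns))  # ['C2', 'C3']
```

Only C1 exceeds the 40% cut-off, so it is the only column dropped.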

2. Enter the following command to drop the column with the highest ratio of missing values (yes, you were correct – it is the first one, C1):

```python
dataSet2 = dataSet.drop('C1', axis=1)
```

Note that we do not drop the column in the original data set; rather, we create a derivative DataFrame, *dataSet2*.

3. Enter the following command:

```python
dataSet2.shape
```

You should see the (8, 3) output indicating that the column was, indeed, dropped (the original shape is (8, 4)).

Now, let’s see what we have in terms of missing data row-wise.

4. Enter the following command:

```python
dataSet2.isnull().sum(axis = 1)
```

You should see the following output:

```
0    0
1    1
2    2
3    1
4    1
5    1
6    0
7    0
dtype: int64
```

The third row has the most missing values – two.

Let’s assume that it is not acceptable for our 3-column dataset and we want to drop that row.

5. Enter the following command:

```python
dataSet3 = dataSet2.drop(2, axis=0)
```

As a result of our data manipulation, we have the following data in *dataSet3*:

```
     C2    C3    C4
0  12.0  13.0  14.0
1  22.0   NaN  24.0
3  42.0  43.0   NaN
4  52.0   NaN  54.0
5  62.0  63.0   NaN
6  72.0  73.0  74.0
7  82.0  83.0  84.0
```

**Note**: In some situations, you may want to drop rows/columns that contain all or any missing values. Where you can afford to shed data like that, you can use pandas' *dropna()* method. Here is an example of dropping all rows (axis = 0, the index, is the default) that have any missing values:

```python
dataSet3.dropna()
```

You should see the following severely decimated output:

```
     C2    C3    C4
0  12.0  13.0  14.0
6  72.0  73.0  74.0
7  82.0  83.0  84.0
```
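Note that *dropna()* is more nuanced than the all-or-nothing call above: *how='all'* drops only rows that are entirely empty, and *thresh* keeps rows having at least a given number of non-null values. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Made-up frame: row 0 is complete, row 1 is all-NaN, row 2 has one value.
df = pd.DataFrame({'A': [1.0, np.nan, np.nan],
                   'B': [2.0, np.nan, 6.0],
                   'C': [3.0, np.nan, np.nan]})

# how='all' drops only rows where every value is missing.
print(len(df.dropna(how='all')))  # 2
# thresh=2 keeps rows that have at least two non-null values.
print(len(df.dropna(thresh=2)))   # 1
```

These options let you shed only the hopeless rows while keeping partially populated ones for repair.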

Now, let’s see what we can do to plug in values for the missing values without deleting columns and rows.

First, make a copy of *dataSet3* as we are going to do updates in-place and it makes sense to keep the original copy around intact (subject to available memory, of course).

6. Enter the following command:

```python
dataSet4 = dataSet3.copy()
```

First, we will use the default *linear* interpolation strategy to plug in the NaNs in the C3 column.

7. Enter the following command:

```python
dataSet4.C3.interpolate(inplace=True)
```

The values in the column C3 now will be as follows:

```
0    13.0
1    28.0
3    43.0
4    53.0
5    63.0
6    73.0
7    83.0
```

8. Enter the following command:

```python
dataSet4.C4.interpolate(method='nearest', inplace=True); dataSet4.C4
```

You should see the following output:

```
0    14.0
1    24.0
3    54.0
4    54.0
5    54.0
6    74.0
7    84.0
Name: C4, dtype: float64
```

**Note**: You can always get help on the *interpolate() *method using this IPython magic command:

```python
dataSet4.interpolate?
```

Currently, the *interpolate() *method supports the following interpolation methods (looks like the developer(s) had a field day working on this feature):

```
{'linear', 'time', 'index', 'values', 'nearest', 'zero', 'slinear',
 'quadratic', 'cubic', 'barycentric', 'krogh', 'polynomial', 'spline',
 'piecewise_polynomial', 'from_derivatives', 'pchip', 'akima'}
```

which should give you enough options for replacing the missing values (the most common strategy, though, is the default *linear*, which also makes perfect sense in most practical situations).
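As a quick sketch of interpolation behavior on a made-up Series: the default *linear* method fills a run of NaNs with evenly spaced values, and the *limit* parameter caps how many consecutive NaNs get filled (methods such as *nearest* additionally require SciPy):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 40.0])

# The default linear method fills the whole gap with evenly spaced values.
print(s.interpolate().tolist())         # [10.0, 20.0, 30.0, 40.0]

# limit=1 fills at most one NaN per consecutive run of missing values.
print(s.interpolate(limit=1).tolist())  # [10.0, 20.0, nan, 40.0]
```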

Now, let’s demonstrate the usage of the *fillna() *method.

9. Enter the following commands:

```python
dataSet5 = dataSet3.copy()
dataSet5.C3.fillna(dataSet5.C3.mean(), inplace=True)
dataSet5
```

In this data repairing strategy, we decided to use the *mean* (average) value for the C3 column.

You should see the following output:

```
     C2    C3    C4
0  12.0  13.0  14.0
1  22.0  55.0  24.0
3  42.0  43.0   NaN
4  52.0  55.0  54.0
5  62.0  63.0   NaN
6  72.0  73.0  74.0
7  82.0  83.0  84.0
```

The value **55.0** was the average value (the mean) of the existing (not NaN) column values in the original C3 column.

Now, let’s apply the *fillna()* method across the rows.

First, we need to re-index our dataset to avoid confusion.

10. Enter the following command:

```python
dataSet5.index = range(dataSet5.shape[0])
```

11. Run the following commands to understand the math at hand and prepare for what will transpire:

```python
print(dataSet5.iloc[2])
print('...................')
print(dataSet5.iloc[2].mean())
```

You should see the following output:

```
C2    42.0
C3    43.0
C4     NaN
Name: 2, dtype: float64
...................
42.5
```

12. Enter the following command:

```python
dataSet5.iloc[2].fillna(dataSet5.iloc[2].mean(), inplace=True)
```

*dataSet5* now has this updated third row:

```
     C2    C3    C4
0  12.0  13.0  14.0
1  22.0  55.0  24.0
2  42.0  43.0  42.5
3  52.0  55.0  54.0
4  62.0  63.0   NaN
5  72.0  73.0  74.0
6  82.0  83.0  84.0
```

We now have only one NaN value left (in column C4).

Use the knowledge you have acquired on the matter to fill that value with the average value of column C4 (see solution **S1** in Part 9 if you need help).

At the end of our data munging exercise, our DataFrame looks like that:

```
     C2    C3     C4
0  12.0  13.0  14.00
1  22.0  55.0  24.00
2  42.0  43.0  42.50
3  52.0  55.0  54.00
4  62.0  63.0  48.75
5  72.0  73.0  74.00
6  82.0  83.0  84.00
```

Fine print: *When dealing with missing values in your input data sets, you need to be cognizant of the problem domain’s specifics in order to come up with a suitable substitute strategy.*
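As one example of such a domain-aware strategy: when rows fall into natural groups, filling with the per-group mean is often more faithful than the global mean. A sketch with a made-up *group* column:

```python
import numpy as np
import pandas as pd

# Made-up data with a natural grouping column.
df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                   'value': [10.0, np.nan, 30.0, np.nan]})

# Fill each NaN with the mean of its own group rather than the global mean.
df['value'] = df.groupby('group')['value'].transform(lambda s: s.fillna(s.mean()))
print(df['value'].tolist())  # [10.0, 10.0, 30.0, 30.0]
```

The global mean here would be 20.0 for every gap; the group-wise fill respects the structure of the data instead.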

## Part 4 – Applying the map Function (for Review)

Sometimes you would like to apply a function to all the values in a column. One option here is to use the *map *function that would use the column’s values as input.

Here is an example. Let’s say you would like to create an executive summary of the values in column C2 which, instead of showing numbers, would, rather, show some labels that would represent data ranges, e.g. weak, medium, and strong (e.g. describing a tornado strength, a stock behavior, etc.). This technique can also be useful in Data Science experiments where you may want to use categorical (class) variables.

```python
df_map = dataSet5.copy()
df_map.C2 = df_map.C2.map(lambda c: 'strong' if c > 60 else 'medium' if c > 40 else 'weak')
```

The above operation is similar to SQL’s CASE-WHEN statement.

**df_map** would contain the following data:

```
       C2    C3     C4
0    weak  13.0  14.00
1    weak  55.0  24.00
2  medium  43.0  42.50
3  medium  55.0  54.00
4  strong  63.0  48.75
5  strong  73.0  74.00
6  strong  83.0  84.00
```
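The same binning can be written more declaratively with *pd.cut*, which maps numeric ranges to labels in one call; the bin edges below mirror the lambda above:

```python
import pandas as pd

c2 = pd.Series([12.0, 22.0, 42.0, 52.0, 62.0, 72.0, 82.0])

# Bin edges mirror the lambda: <= 40 -> weak, 40-60 -> medium, > 60 -> strong.
labels = pd.cut(c2,
                bins=[-float('inf'), 40, 60, float('inf')],
                labels=['weak', 'medium', 'strong'])
print(labels.tolist())
# ['weak', 'weak', 'medium', 'medium', 'strong', 'strong', 'strong']
```

For more than a couple of thresholds, *pd.cut* keeps the edges and labels in one place instead of a chain of conditional expressions.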

Now, if you want to convert the above string labels into numbers for ease of processing, you can use the *sklearn.preprocessing.LabelEncoder* class as follows:

```python
from sklearn.preprocessing import LabelEncoder
df_lbl_enc = df_map.copy()
df_lbl_enc.C2 = LabelEncoder().fit_transform(df_lbl_enc.C2)
```

Now df_lbl_enc has the following data in it:

```
   C2    C3     C4
0   2  13.0  14.00
1   2  55.0  24.00
2   0  43.0  42.50
3   0  55.0  54.00
4   1  63.0  48.75
5   1  73.0  74.00
6   1  83.0  84.00
```

You may want to change the way the mapping is performed, e.g. you may want the 'weak' label to be represented by 0 rather than the non-intuitive 2. In other words, we want to enforce our own ordinal sequence. For that, do the following:

```python
# Create the mapping:
map_dic = {'weak': 0, 'medium': 1, 'strong': 2}
# Use the above mapping in the map function:
df_map.copy()['C2'].map(map_dic)
```

You will get this *pandas.core.series.Series*:

```
0    0
1    0
2    1
3    1
4    2
5    2
6    2
```
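A pure-pandas alternative that avoids the explicit dictionary is an ordered *pd.Categorical*, whose integer *codes* follow whatever category order you declare:

```python
import pandas as pd

labels = pd.Series(['weak', 'weak', 'medium', 'medium',
                    'strong', 'strong', 'strong'])

# Declare the order explicitly; the integer codes then follow that order.
cat = pd.Categorical(labels, categories=['weak', 'medium', 'strong'],
                     ordered=True)
print(list(cat.codes))  # [0, 0, 1, 1, 2, 2, 2]
```

This keeps the ordinal semantics attached to the data itself, which can be handy for sorting and comparisons later on.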

## Part 5 – How to Normalize (Scale) Your Data

The range of values in your input data sets may vary widely which may negatively affect the accuracy and validity of some machine learning algorithms that depend on calculating the distance between data points, e.g. using the Euclidean distance.

Data normalization (a.k.a. feature scaling) is a method for standardizing the range of independent variables or features contained in your input data sets. This data preprocessing step may considerably improve the quality of your subsequent work with data science and machine learning algorithms.

In this part, we will demonstrate two techniques used in data normalization.

The first technique is called **z-score** normalization: it centers each data column around a mean of zero and re-scales the column's values so that their standard deviation equals 1; such re-scaling makes most of your data fall in the range of [-3, +3].
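In formula terms, the z-score of a value x is (x − μ) / σ, where μ and σ are the column's mean and (population) standard deviation. A quick numeric sketch on made-up values:

```python
import numpy as np

col = np.array([10.0, 20.0, 30.0])

# z = (x - mean) / population std, the convention preprocessing.scale() uses.
z = (col - col.mean()) / col.std()
print(z.round(4).tolist())  # [-1.2247, 0.0, 1.2247]
```

After the transformation, the values have mean 0 and standard deviation 1 by construction.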

The second technique we are going to demonstrate is the **min-max** transformation which re-scales your data to values in the range of [0,1], where 0 corresponds to the minimum value in the value range, and 1 corresponds to the maximum value.

Both techniques are supported by the *sklearn.preprocessing* module which also includes methods for data scaling, centering, normalization, binarization, and imputation.

## Part 6 – DataFrame Scaling and Normalization

Before continuing, make sure you have applied solution **S1** from Part 9, which fills the last NaN left at the end of Part 3.

Let’s apply the **z-score **normalization transformation to our *dataSet5* DataFrame.

1. Import the needed module:

```python
from sklearn import preprocessing
```

2. Enter the following command in a new code cell:

```python
dataSet5_scaled = preprocessing.scale(dataSet5); dataSet5_scaled
```

You should get the following output:

```
array([[-1.56501609, -2.00227144, -1.49800487],
       [-1.1436656 ,  0.        , -1.06692433],
       [-0.30096463, -0.57207755, -0.26942534],
       [ 0.12038585,  0.        ,  0.22631728],
       [ 0.54173634,  0.38138504,  0.        ],
       [ 0.96308682,  0.85811633,  1.08847836],
       [ 1.38443731,  1.33484762,  1.51955889]])
```

*dataSet5_scaled* is of type *numpy.ndarray*; should you wish to return to the DataFrame world, you need to reconstruct it like so:

```python
df = pd.DataFrame(dataSet5_scaled, columns=['C1', 'C2', 'C3'])
```

With this reconstructed DataFrame you can do the following operations:

```python
df.std()
df.mean()
```

These report the near-zero mean promised by *preprocessing.scale()*. Note that pandas' *std()* uses the sample formula (ddof=1) while *scale()* divides by the population standard deviation, so the reported standard deviation is about 1.08 for these 7 rows rather than exactly 1.

*df.describe() *will report the following descriptive statistics:

```
                 C1            C2            C3
count  7.000000e+00  7.000000e+00  7.000000e+00
mean  -2.220446e-16  6.344132e-17 -9.516197e-17
std    1.080123e+00  1.080123e+00  1.080123e+00
min   -1.565016e+00 -1.341641e+00 -1.517158e+00
25%   -7.223151e-01 -7.546729e-01 -7.883275e-01
50%    1.203859e-01 -1.677051e-01  1.487410e-01
75%    7.524116e-01  8.105746e-01  7.734534e-01
max    1.384437e+00  1.397542e+00  1.398166e+00
```

Shrug off the *count 7.000000e+00 *nuisance even though it might be philosophically disturbing for some.

**Note**: The *scale()* method treats NaNs, if any are present, as missing values: they are disregarded during computation, and the original NaN values are retained in the output.

Applying the *MinMaxScaler* is a bit different, and here is the command:

```python
preprocessing.MinMaxScaler().fit_transform(dataSet5)
```

You should see the following output:

```
array([[0.        , 0.        , 0.        ],
       [0.14285714, 0.6       , 0.14285714],
       [0.42857143, 0.42857143, 0.40714286],
       [0.57142857, 0.6       , 0.57142857],
       [0.71428571, 0.71428571, 0.49642857],
       [0.85714286, 0.85714286, 0.85714286],
       [1.        , 1.        , 1.        ]])
```

To learn about the MinMaxScaler object, and its *fit, transform, *and *fit_transform *methods, visit https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
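One practical subtlety worth noting: *fit* learns the min/max from one data set, and *transform* applies those same parameters to another, so values outside the fitted range land outside [0, 1]. A plain-NumPy sketch of that fit/transform split (the numbers are made up):

```python
import numpy as np

# Made-up "training" and "test" columns.
train = np.array([[0.0], [50.0], [100.0]])
test = np.array([[25.0], [150.0]])

# fit: learn min/max from the training data only ...
lo, hi = train.min(axis=0), train.max(axis=0)
# ... transform: apply those same parameters to the test data.
scaled_test = (test - lo) / (hi - lo)
print(scaled_test.ravel().tolist())  # [0.25, 1.5] -- 150 falls outside [0, 1]
```

This is why scalers are fitted on training data and then reused, via *transform*, on any later data, rather than re-fitted each time.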

## Part 7 – Scaling and Normalization of NumPy Arrays (Optional)

1. Enter the following commands:

```python
np.random.seed(1234)
toScaleDS = np.ceil(100 * np.random.rand(10, 3))
toScaleDS
```

**Note**: We seed the generator with *np.random.seed(1234)* to make the results of generating random numbers reproducible.

You should get the following results:

```
array([[ 20.,  63.,  44.],
       [ 79.,  78.,  28.],
       [ 28.,  81.,  96.],
       [ 88.,  36.,  51.],
       [ 69.,  72.,  38.],
       [ 57.,  51.,   2.],
       [ 78.,  89.,  37.],
       [ 62.,   8.,  37.],
       [ 94.,  66.,  40.],
       [ 79.,  32.,  57.]])
```

You can access this NumPy array to get its descriptive stats as follows:

- Get the average of the values in the **second** column:

  `toScaleDS[:, 1].mean()`

- Get the standard deviation across the **third** row:

  `toScaleDS[2, :].std()`

We will use the NumPy array *toScaleDS* created above to illustrate the techniques.

2. Enter the following command to apply **z-score **normalization transformation:

```python
from sklearn import preprocessing
scaledDS = preprocessing.scale(toScaleDS)
scaledDS
```

You should see the following output:

```
array([[-1.95146565,  0.22264681,  0.04435909],
       [ 0.58458002,  0.84111018, -0.66538641],
       [-1.60759505,  0.96480286,  2.35103199],
       [ 0.97143444, -0.89058725,  0.35487275],
       [ 0.15474177,  0.59372484, -0.22179547],
       [-0.36106413, -0.27212388, -1.81872286],
       [ 0.54159619,  1.29464999, -0.26615457],
       [-0.146145  , -2.04505221, -0.26615457],
       [ 1.22933739,  0.34633949, -0.13307728],
       [ 0.58458002, -1.05551082,  0.62102732]])
```

The *scaledDS* structure is a NumPy ndarray as well.

The original (unscaled) version is shown below.

```
array([[ 20.,  63.,  44.],
       [ 79.,  78.,  28.],
       [ 28.,  81.,  96.],
       [ 88.,  36.,  51.],
       [ 69.,  72.,  38.],
       [ 57.,  51.,   2.],
       [ 78.,  89.,  37.],
       [ 62.,   8.,  37.],
       [ 94.,  66.,  40.],
       [ 79.,  32.,  57.]])
```

The *scale()* method can standardize the input data along any axis – column-wise (feature-wise) scaling (**axis = 0**) is the default; use **axis = 1** for scaling across each row.
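To make the axis distinction concrete, here is the row-wise (axis = 1) z-score computation spelled out in plain NumPy on made-up data:

```python
import numpy as np

a = np.array([[ 1.0,  2.0,  3.0],
              [10.0, 20.0, 30.0]])

# Standardize each row independently: subtract the row mean, divide by the row std.
row_means = a.mean(axis=1, keepdims=True)
row_stds = a.std(axis=1, keepdims=True)
z = (a - row_means) / row_stds
print(z.round(4).tolist())  # both rows become [-1.2247, 0.0, 1.2247]
```

Both rows collapse to the same standardized pattern even though their raw magnitudes differ by a factor of ten – exactly what row-wise scaling is for.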

Let’s verify that the scaled data indeed has a near-zero mean and unit standard deviation.

3. Enter the following command:

```python
np.mean(scaledDS, axis = 0)
```

You should see the following output (we have three columns in our array, so we get three average values – one per column), all near zero with negligible deviations:

```
array([-2.22044605e-16, -6.66133815e-17,  1.11022302e-17])
```

4. Enter the following command:

```python
np.std(scaledDS, axis = 0)
```

You should see the following output:

`array([ 1., 1., 1.])`

As you can see, this is the unit variance.

Now let’s apply the *min-max* transformation to the original *toScaleDS* data set.

5. Enter the following command:

```python
mmScaler = preprocessing.MinMaxScaler()
mmScaler.fit_transform(toScaleDS)
```

You should see the following output:

```
array([[0.        , 0.67901235, 0.44680851],
       [0.7972973 , 0.86419753, 0.27659574],
       [0.10810811, 0.90123457, 1.        ],
       [0.91891892, 0.34567901, 0.5212766 ],
       [0.66216216, 0.79012346, 0.38297872],
       [0.5       , 0.5308642 , 0.        ],
       [0.78378378, 1.        , 0.37234043],
       [0.56756757, 0.        , 0.37234043],
       [1.        , 0.71604938, 0.40425532],
       [0.7972973 , 0.2962963 , 0.58510638]])
```

To help you compare the produced result with the original data set, below are the contents of *toScaleDS*:

```
array([[ 20.,  63.,  44.],
       [ 79.,  78.,  28.],
       [ 28.,  81.,  96.],
       [ 88.,  36.,  51.],
       [ 69.,  72.,  38.],
       [ 57.,  51.,   2.],
       [ 78.,  89.,  37.],
       [ 62.,   8.,  37.],
       [ 94.,  66.,  40.],
       [ 79.,  32.,  57.]])
```

## Part 8 – Clean Up

1. Click the **Save **button in the toolbar.

In the menu bar, select **File > Close and Halt**

The command will shut down the notebook’s kernel (the sandboxed Python session) and close the interactive edit session.

Keep the browser window open as we are going to use it later.

This is the last step in this tutorial.

## Part 9 – Solutions

**S1**

```python
dataSet5.C4.fillna(dataSet5.C4.mean(), inplace=True); dataSet5
```

## Part 10 – Review

In this tutorial, you learned about the ways Python can handle missing values in your input data sets as well as ways to normalize (scale) your data to make the transformed data more conducive for further data processing down your data science workflow pipeline.