
Data Science labs blog

Data Transformation - Normalisation Techniques


Normalisation Techniques


Almost always when we get raw data in a project, it is unfit for direct consumption for analysis or modelling. This is especially a concern when the data volume is huge, for example in a big data analytics project. In this blog post I cover a few of the most common transformations and their uses.

Box-Cox Transformation

It is not necessary for a data set to adhere to a normal distribution. However, many data analysis methods require the data distribution to be normal. Box-Cox is a transformation that can be used to make many skewed distributions approximately normal. Not every dataset will benefit from a Box-Cox transformation; for example, if there are significant outliers, Box-Cox may not help.

The Box-Cox transformation in mathematical form is denoted as

Y = ((X + δ)^λ − 1) / λ

where λ is the exponent (power) and δ is a shift amount that is added when X is zero or negative. When λ is zero, the above definition is replaced by
Y = ln(X + δ)
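As a minimal sketch of this definition in R (the function name box_cox and the example values are illustrative, not from the original post):

# A direct implementation of the definition above.
# x: data vector, lambda: the exponent, delta: shift applied when
# x contains zero or negative values so that (x + delta) > 0.
box_cox <- function(x, lambda, delta = 0) {
  shifted <- x + delta
  if (lambda == 0) {
    log(shifted)                     # the lambda = 0 case
  } else {
    (shifted^lambda - 1) / lambda    # the general case
  }
}

box_cox(c(1, 5, 10, 50), lambda = 0.5)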

As you can very well imagine, the trick is to find the right value of λ to get a normal distribution.

Usually, the standard λ values of -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5 and 2 are investigated to determine which, if any, is most suitable. However, maximum likelihood estimation can be used to determine the best possible value of λ for a more normal distribution.
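For example, with the MASS package in R we can profile the log-likelihood over a grid of λ values. This is a sketch on simulated right-skewed data; the variable names are illustrative:

library(MASS)                            # provides boxcox()

set.seed(42)
y <- rexp(500, rate = 0.2)               # positive, right-skewed sample

# Profile log-likelihood over a grid of lambda values
bc <- boxcox(y ~ 1, lambda = seq(-2, 2, 0.1), plotit = FALSE)
best_lambda <- bc$x[which.max(bc$y)]     # lambda with the highest likelihood
best_lambda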

To understand the Box-Cox transformation, let's look at a non-normal dataset and see the impact of the transformation on it.
(Figure: the example dataset before and after the Box-Cox transformation.)

Normalisation Techniques


In part 1 of the series we looked at various methods of normalising data, including min-max and Box-Cox transformations. In this part we look at the following:

  • Value Mapping
  • Discretization
  • Equal Width Discretization
  • Equal Frequency Discretization
  • Aggregation

Value Mapping

Sometimes the data set may have variables that are textual in nature but signify an order. For example, a data set may have a column with three distinct values: Low, Medium and High. These can be numerically mapped to 0, 1 and 2. However, extreme care must be exercised when choosing the values, as they must reflect the degree of change in mathematical terms. Who is to say that the right values are not 0, 5 and 6, for instance?
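A minimal sketch of this mapping in R, assuming a column of Low/Medium/High values and the 0/1/2 coding discussed above:

risk_level <- c("Low", "High", "Medium", "Low", "High")

mapping   <- c(Low = 0, Medium = 1, High = 2)   # chosen coding
risk_code <- unname(mapping[risk_level])        # named-vector lookup
risk_code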

Another very frequent example of value mapping arises when we need to map categorical values into separate columns. This is often required in deep learning data preparation. It is termed one-hot encoding, signifying that only one of the resulting boolean columns is "hot" (set to 1) for each row.

Consider the dataset below and its one-hot encoded form:

Category      Article        Quantity
Electronics   Mobile Phone   100
Electronics   Tablet         100
Electronics   Laptop         60
Furniture     Table          25
Furniture     Chair          100


Electronics   Furniture   Article        Quantity
1             0           Mobile Phone   100
1             0           Tablet         100
1             0           Laptop         60
0             1           Table          25
0             1           Chair          100


As can be observed, this makes the dataset quite sparse if there are many distinct values in the category column.
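One way to produce this encoding in base R is with model.matrix; the sketch below mirrors the example table:

sales <- data.frame(
  Category = c("Electronics", "Electronics", "Electronics",
               "Furniture", "Furniture"),
  Article  = c("Mobile Phone", "Tablet", "Laptop", "Table", "Chair"),
  Quantity = c(100, 100, 60, 25, 100)
)

# "- 1" drops the intercept so each category gets its own 0/1 column
one_hot <- model.matrix(~ Category - 1, data = sales)
cbind(as.data.frame(one_hot), sales[, c("Article", "Quantity")])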

Discretization

Discretization (also referred to as binning) is the process of converting a continuous (or nominal) variable into its discrete counterpart. Intuitively it may appear that discretization would lead to a loss of information; however, in certain circumstances the process is quite valuable. For example, the risk profile of a customer, instead of being represented as any value from 0 to 100, may be categorised as Very Low, Low, Medium, High or Very High. In particular, if there is suspicion about the accuracy of the continuous variable, discretization may be a desirable normalisation step.

The mathematical value of discretization arises because individual values in the original dataset may occur very infrequently, leading to poor modelling and weak correlations. A discretization of a different nature could be applied to export data, for instance. The export data may cover millions of companies, each exporting a handful of materials. It may be valuable to group the companies into industries; a summarised view at the industry level may lend itself to much better analysis.
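A sketch of that industry-level summary in base R (the exports data frame and its values are made up purely for illustration):

exports <- data.frame(
  Company     = c("A", "B", "C", "D"),
  Industry    = c("Textiles", "Textiles", "Steel", "Steel"),
  ExportValue = c(120, 80, 500, 350)
)

# Summarise exports at the industry level instead of per company
aggregate(ExportValue ~ Industry, data = exports, FUN = sum)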

While discretization may appear to be simply a process of grouping like values together, certain decisions require consideration, such as how many intervals to choose. Two different approaches are commonly used; they are explained with the dataset below.

Student    Math   Physics   Chemistry   English   Biology   Economics   History   Civics
John       55     45        56          87        21        52          89        65
Suresh     75     55        0           64        90        61          58        2
Ramesh     25     54        89          76        95        87          56        74
Jessica    78     55        86          63        54        89          75        45
Jennifer   58     96        78          46        96        77          83        53
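For reference, the marks above can be held in an R data frame like this:

marks <- data.frame(
  Student   = c("John", "Suresh", "Ramesh", "Jessica", "Jennifer"),
  Math      = c(55, 75, 25, 78, 58),
  Physics   = c(45, 55, 54, 55, 96),
  Chemistry = c(56, 0, 89, 86, 78),
  English   = c(87, 64, 76, 63, 46),
  Biology   = c(21, 90, 95, 54, 96),
  Economics = c(52, 61, 87, 89, 77),
  History   = c(89, 58, 56, 75, 83),
  Civics    = c(65, 2, 74, 45, 53)
)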


Equal Width Discretization

The algorithm first finds the min and max values and then splits that range into equal-width intervals based on the chosen number of bins.

So let's say we want 5 intervals and the marks range between 0 and 100. In this case the bins would be 0-20, 21-40, 41-60, 61-80 and 81-100. After equal-width discretization the table would look as below:

Marks    Math   Physics   Chemistry   English   Biology   Economics   History   Civics
0-20     0      0         1           0         0         0           0         1
21-40    1      0         0           0         1         0           0         0
41-60    2      4         1           1         1         1           2         2
61-80    2      0         1           3         0         2           1         2
81-100   0      1         2           1         3         2           2         0
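In R, the equal-width counts for a single subject can be reproduced with base R's cut(), as in this sketch for the Math marks:

math   <- c(55, 75, 25, 78, 58)            # Math marks from the table
breaks <- seq(0, 100, by = 20)             # 0-20, 21-40, 41-60, 61-80, 81-100

bins <- cut(math, breaks = breaks, include.lowest = TRUE)
table(bins)                                # frequency of marks in each bin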


Equal Frequency Discretization

The algorithm finds the minimum and maximum values and then divides the range into the given number of intervals, in such a way that every interval contains an equal number of the sorted values.

As we have five intervals and five observations, each interval would contain exactly one value. So listing the bins should suffice for each subject, as each bin would have a frequency of 1.

The results are generated by using the classInt package in R.

The code below is for the Math marks.

library(classInt)                  # provides classIntervals()
dataset <- c(55, 75, 25, 78, 58)   # Math marks from the table above
classIntervals(dataset, 5)         # request five class intervals

Maths - [18.375,40) [40,56.5) [56.5,66.5) [66.5,76.5) [76.5,84.625]
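A similar equal-frequency split can be sketched with base R alone, using quantile breaks; with five observations and five bins, each bin again holds exactly one value:

math   <- c(55, 75, 25, 78, 58)
breaks <- quantile(math, probs = seq(0, 1, length.out = 6))  # 6 cut points, 5 bins

bins <- cut(math, breaks = breaks, include.lowest = TRUE)
table(bins)                        # one observation per bin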

Aggregation

Sometimes the variable that you are trying to visualise may not be part of the original dataset but may be a derived variable computed as a function of one or more variables in the original dataset.

As an example, we may have a dataset that has the runs scored and the balls faced by each batsman in a cricket match. What we may be interested in, however, is the metric called strike rate, which is defined simply as the runs scored per 100 balls faced. Strike rate is then an aggregated (derived) variable.
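A small sketch of such a derived variable in R (the batting data frame and its values are made up for illustration):

batting <- data.frame(
  Batsman = c("A", "B", "C"),
  Runs    = c(45, 80, 12),
  Balls   = c(30, 64, 20)
)

# Strike rate: runs scored per 100 balls faced
batting$StrikeRate <- batting$Runs / batting$Balls * 100
batting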

That sums up the most common data normalisation techniques.


Content from www.bluepiit.com
