Data Transformation - Normalisation Techniques
Almost always when we get raw data in a project, it is unfit for direct consumption for analysis or modelling. This is especially a concern when the data volume is huge, for example in a big data analytics project. In this blog post I cover a few of the most common transformations and their uses.
Box-Cox Transformation
It is not necessary for a data set to adhere to a normal distribution; however, many data analysis methods require the data distribution to be normal. Box-Cox is a transformation that can be used to bring many non-normal distributions much closer to normality. Not every dataset benefits from a Box-Cox transformation; for example, if there are significant outliers, Box-Cox may not help.
The Box-Cox transformation in mathematical form is denoted as

Y = ((X + δ)^λ − 1) / λ

where λ is the exponent (power) and δ is a shift amount that is added when X is zero or negative. When λ is zero, the above definition is replaced by

Y = ln(X + δ)
As you can imagine, the trick is to find the right value of λ to get a distribution that is as close to normal as possible. Usually, the standard λ values of -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, and 2 are investigated to determine which, if any, is most suitable. Alternatively, maximum likelihood estimation can be used to determine the best possible value of λ for a more normal distribution.
To understand the Box-Cox transformation, let's look at a non-normal dataset and see the impact of the transformation on it.
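As a minimal sketch in R (not from the original post; it assumes an artificial right-skewed, log-normal sample), λ can be estimated by maximum likelihood with the MASS package and the transform applied as follows:

```r
# A sketch: estimate lambda by maximum likelihood and apply the Box-Cox transform
library(MASS)

set.seed(42)
x <- rlnorm(1000, meanlog = 0, sdlog = 1)   # artificial right-skewed sample

# boxcox() profiles the log-likelihood over a grid of lambda values
bc <- boxcox(x ~ 1, lambda = seq(-2, 2, 0.1), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]             # lambda with the highest likelihood

# Apply the transformation (log when lambda is effectively zero)
x_bc <- if (abs(lambda) < 1e-6) log(x) else (x^lambda - 1) / lambda

hist(x_bc)   # should look much closer to a bell curve than hist(x)
```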
In part 1 of the series we looked at various methods of normalising the data, including min-max and Box-Cox transformations. In this part we look at the following:
- Value Mapping
- Discretization
- Equal Width Discretization
- Equal Frequency Discretization
- Aggregation
Value Mapping
Sometimes the data set may have variables that are textual but signify an order. For example, a data set may have a column with three distinct values: Low, Medium and High. These can be numerically mapped to 0, 1 and 2. However, extreme care must be exercised when choosing the values, as they must reflect the degree of change in mathematical terms. Who is to say that the right values are not 0, 5 and 6, for instance?
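A minimal sketch in base R of such an ordinal mapping (the variable name and values here are illustrative):

```r
# Map an ordered textual variable (Low / Medium / High) to numeric codes
risk <- c("Low", "High", "Medium", "Low", "High")   # illustrative values

# Declare the ordering explicitly, then take the underlying integer codes;
# subtracting 1 gives the 0, 1, 2 mapping mentioned above.
risk_ordered <- factor(risk, levels = c("Low", "Medium", "High"), ordered = TRUE)
risk_numeric <- as.integer(risk_ordered) - 1
risk_numeric   # 0 2 1 0 2
```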
Another very frequent example of value mapping arises when we need to map categorical values into separate columns. This is often required in deep learning data preparation. It is termed one-hot encoding, signifying that only one of the resulting boolean columns is "hot" for any given row.
Consider the dataset below and its one-hot encoded form:
| Category | Article | Quantity |
|---|---|---|
| Electronics | Mobile Phone | 100 |
| Electronics | Tablet | 100 |
| Electronics | Laptop | 60 |
| Furniture | Table | 25 |
| Furniture | Chair | 100 |
| Electronics | Furniture | Article | Quantity |
|---|---|---|---|
| 1 | 0 | Mobile Phone | 100 |
| 1 | 0 | Tablet | 100 |
| 1 | 0 | Laptop | 60 |
| 0 | 1 | Table | 25 |
| 0 | 1 | Chair | 100 |
As can be observed, this makes the data set quite sparse if the category column has many distinct values.
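One way to produce such an encoding in base R is model.matrix(); a minimal sketch using the example data above (the generated columns are prefixed with the variable name, e.g. CategoryElectronics):

```r
# One-hot encode the Category column of the example dataset
sales <- data.frame(
  Category = c("Electronics", "Electronics", "Electronics", "Furniture", "Furniture"),
  Article  = c("Mobile Phone", "Tablet", "Laptop", "Table", "Chair"),
  Quantity = c(100, 100, 60, 25, 100)
)

# "- 1" drops the intercept so every category gets its own 0/1 indicator column
dummies <- model.matrix(~ Category - 1, data = sales)
encoded <- cbind(as.data.frame(dummies), sales[, c("Article", "Quantity")])
encoded
```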
Discretization
Discretization (also referred to as binning) is the process of converting a continuous (or numeric) variable into a discrete counterpart. Intuitively it may appear that discretization would lead to a loss of information; however, in certain circumstances the process is quite valuable. For example, the risk profile of a customer, instead of being represented as any value within 0 to 100, may be categorised into Very Low, Low, Medium, High and Very High. In particular, if there is suspicion about the accuracy of the continuous variable, discretization may be a desirable normalisation step.
The statistical value of discretization arises because individual values in the original dataset may occur very infrequently, leading to poor modelling and weak correlations. A discretization of a different nature could be applied to export data, for instance. The export data may cover millions of companies, each exporting a handful of materials. It may be valuable to group the companies into industries, and a summarised view at the industry level may lend itself to much better analysis.
While discretization may appear to be simply a process of grouping like values in a dataset, there are certain decisions that require consideration; how many intervals to choose is one such decision. Two different approaches are commonly used, which will be explained with the dataset below.
| Student | Math | Physics | Chemistry | English | Biology | Economics | History | Civics |
|---|---|---|---|---|---|---|---|---|
| John | 55 | 45 | 56 | 87 | 21 | 52 | 89 | 65 |
| Suresh | 75 | 55 | 0 | 64 | 90 | 61 | 58 | 2 |
| Ramesh | 25 | 54 | 89 | 76 | 95 | 87 | 56 | 74 |
| Jessica | 78 | 55 | 86 | 63 | 54 | 89 | 75 | 45 |
| Jennifer | 58 | 96 | 78 | 46 | 96 | 77 | 83 | 53 |
Equal Width Discretization
The algorithm first finds the min and max values and then splits the range into equally wide intervals based on the desired number of intervals.
So let's say we want 5 intervals and the marks range between 0 and 100. In this case the bins would be 0-20, 21-40, 41-60, 61-80 and 81-100. After equal width discretization the table would look as below:
| Marks | Math | Physics | Chemistry | English | Biology | Economics | History | Civics |
|---|---|---|---|---|---|---|---|---|
| 0-20 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 21-40 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 41-60 | 2 | 4 | 1 | 1 | 1 | 1 | 2 | 2 |
| 61-80 | 2 | 0 | 1 | 3 | 0 | 2 | 1 | 2 |
| 81-100 | 0 | 1 | 2 | 1 | 3 | 2 | 2 | 0 |
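A minimal sketch of equal width binning with base R's cut(), applied to the Math column (note that cut() labels the intervals as half-open ranges such as (20,40] rather than 21-40):

```r
# Equal-width binning of the Math marks into five 20-mark intervals
math <- c(55, 75, 25, 78, 58)

bins <- cut(math, breaks = seq(0, 100, by = 20), include.lowest = TRUE)
table(bins)   # frequency of marks falling in each interval
```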
Equal Frequency Discretization
The algorithm finds the minimum and maximum values and thereafter divides the range into the given number of intervals, in such a way that every interval contains the same number of (sorted) values.
As we have five intervals and five observations, each interval would contain exactly one value. So listing the bin boundaries suffices for each subject, as every bin would have a frequency of 1.
The results are generated using the classInt package in R. The code for the Math column is as below.
```r
library(classInt)

dataset <- c(55, 75, 25, 78, 58)                 # Math marks
classIntervals(dataset, 5, style = "quantile")   # equal-frequency (quantile) breaks
```
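Equivalently, the break points can be obtained with base R's quantile() function; a small sketch (not part of the original post):

```r
# Equal-frequency break points: six quantiles define five intervals,
# each holding roughly the same share of the sorted values
math <- c(55, 75, 25, 78, 58)
quantile(math, probs = seq(0, 1, length.out = 6))
```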
Aggregation
Sometimes the variable that you are trying to visualise may not be part of the original dataset but may be a derived variable, computed as a function of one or more variables in the original dataset.
As an example, we may have a dataset with the runs scored and balls faced by each batsman in a cricket match. What we may be interested in, however, is the metric called strike rate, which is defined simply as (runs scored / balls faced) × 100. Strike rate is then an aggregated (derived) variable.
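A minimal sketch in R, with made-up batsmen and scores, of deriving such a variable:

```r
# Derive strike rate from runs scored and balls faced
batting <- data.frame(
  batsman = c("A", "B", "C"),   # illustrative names
  runs    = c(45, 80, 12),
  balls   = c(30, 64, 20)
)

batting$strike_rate <- batting$runs / batting$balls * 100
batting
```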
That sums up the most common data normalisation techniques.