
Data Science labs blog

Data Transformation - Normalisation Techniques


Normalisation Techniques


Almost always when we get raw data in a project, it is unfit for direct consumption for analysis or modelling. This is especially a concern when the data volume is huge, for example in a big data analytics project. In this blog post I cover a few of the most common transformations and their uses.

Box-Cox Transformation

It is not necessary for a data set to adhere to a normal distribution. However, many data analysis methods require the data distribution to be normal. Box-Cox is a transformation that can be used to make many skewed distributions approximately normal. Not every dataset will benefit from a Box-Cox transformation; for example, if there are significant outliers, Box-Cox may not help.

The Box-Cox transformation in mathematical form is denoted as

Y = ((X + δ)^λ − 1) / λ

where λ is the exponent (power) and δ is a shift amount that is added when X is zero or negative. When λ is zero, the above definition is replaced by
Y = ln(X + δ)
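As a minimal sketch of this definition in R (the function name box_cox and the example values are illustrative, not from the original post):

# A direct implementation of the definition above.
# x: data vector, lambda: the exponent, delta: shift applied when
# x contains zero or negative values so that (x + delta) > 0.
box_cox <- function(x, lambda, delta = 0) {
  shifted <- x + delta
  if (lambda == 0) {
    log(shifted)                     # the lambda = 0 case
  } else {
    (shifted^lambda - 1) / lambda    # the general case
  }
}

box_cox(c(1, 5, 10, 50), lambda = 0.5)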

As you can very well imagine, the trick is to find the right value of λ to get a normal distribution.

Usually, the standard λ values of -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5 and 2 are investigated to determine which, if any, is most suitable. However, maximum likelihood estimation can be used to determine the best possible value of λ for a more normal distribution.
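For example, with the MASS package in R we can profile the log-likelihood over a grid of λ values. This is a sketch on simulated right-skewed data; the variable names are illustrative:

library(MASS)                            # provides boxcox()

set.seed(42)
y <- rexp(500, rate = 0.2)               # positive, right-skewed sample

# Profile log-likelihood over a grid of lambda values
bc <- boxcox(y ~ 1, lambda = seq(-2, 2, 0.1), plotit = FALSE)
best_lambda <- bc$x[which.max(bc$y)]     # lambda with the highest likelihood
best_lambda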

To understand the Box-Cox transformation, let's look at a non-normal dataset and see the impact of the transformation on it.
(Figure: the example dataset before and after the Box-Cox transformation.)

Normalisation Techniques


In part 1 of the series we looked at various methods of normalising data, including min-max and Box-Cox transformations. In this part we look at the following:

  • Value Mapping
  • Discretization
  • Equal Width Discretization
  • Equal Frequency Discretization
  • Aggregation

Value Mapping

Sometimes the data set may have variables that are textual in nature but signify an order. For example, a data set may have a column with three distinct values: Low, Medium and High. These can be numerically mapped to 0, 1 and 2. However, extreme care must be exercised when choosing the values, as they must reflect the degree of change in mathematical terms. Who is to say that the right values are not 0, 5 and 6, for instance?
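A minimal sketch of this mapping in R, assuming a column of Low/Medium/High values and the 0/1/2 coding discussed above:

risk_level <- c("Low", "High", "Medium", "Low", "High")

mapping   <- c(Low = 0, Medium = 1, High = 2)   # chosen coding
risk_code <- unname(mapping[risk_level])        # named-vector lookup
risk_code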

Another very frequent example of value mapping arises when we need to map categorical values into separate columns. This is often required in deep learning data preparation. It is termed one-hot encoding, signifying that only one of the resulting boolean columns is "hot" (set to 1) for each row.

Consider the dataset below and its one-hot encoded form:

Category      Article        Quantity
Electronics   Mobile Phone   100
Electronics   Tablet         100
Electronics   Laptop         60
Furniture     Table          25
Furniture     Chair          100


Electronics   Furniture   Article        Quantity
1             0           Mobile Phone   100
1             0           Tablet         100
1             0           Laptop         60
0             1           Table          25
0             1           Chair          100


As can be observed, this makes the dataset quite sparse if there are many distinct values in the category column.
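One way to produce this encoding in base R is with model.matrix; the sketch below mirrors the example table:

sales <- data.frame(
  Category = c("Electronics", "Electronics", "Electronics",
               "Furniture", "Furniture"),
  Article  = c("Mobile Phone", "Tablet", "Laptop", "Table", "Chair"),
  Quantity = c(100, 100, 60, 25, 100)
)

# "- 1" drops the intercept so each category gets its own 0/1 column
one_hot <- model.matrix(~ Category - 1, data = sales)
cbind(as.data.frame(one_hot), sales[, c("Article", "Quantity")])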

Discretization

Discretization (also referred to as binning) is the process of converting a continuous (or nominal) variable into its discrete counterpart. Intuitively it may appear that discretization would lead to a loss of information; however, in certain circumstances the process is quite valuable. For example, the risk profile of a customer, instead of being represented as any value from 0 to 100, may be categorised as Very Low, Low, Medium, High or Very High. In particular, if there is suspicion about the accuracy of the continuous variable, discretization may be a desirable normalisation step.

The mathematical value of discretization arises because individual values in the original dataset may occur very infrequently, leading to poor modelling and weak correlations. A discretization of a different nature could be applied to export data, for instance. The export data may cover millions of companies, each exporting a handful of materials. It may be valuable to group the companies into industries; a summarised view at the industry level may lend itself to much better analysis.
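A sketch of that industry-level summary in base R (the exports data frame and its values are made up purely for illustration):

exports <- data.frame(
  Company     = c("A", "B", "C", "D"),
  Industry    = c("Textiles", "Textiles", "Steel", "Steel"),
  ExportValue = c(120, 80, 500, 350)
)

# Summarise exports at the industry level instead of per company
aggregate(ExportValue ~ Industry, data = exports, FUN = sum)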

While discretization may appear to be simply a process of grouping like values together, certain decisions require consideration, such as how many intervals to choose. Two different approaches are commonly used; they are explained with the dataset below.

Student    Math   Physics   Chemistry   English   Biology   Economics   History   Civics
John       55     45        56          87        21        52          89        65
Suresh     75     55        0           64        90        61          58        2
Ramesh     25     54        89          76        95        87          56        74
Jessica    78     55        86          63        54        89          75        45
Jennifer   58     96        78          46        96        77          83        53
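For reference, the marks above can be held in an R data frame like this:

marks <- data.frame(
  Student   = c("John", "Suresh", "Ramesh", "Jessica", "Jennifer"),
  Math      = c(55, 75, 25, 78, 58),
  Physics   = c(45, 55, 54, 55, 96),
  Chemistry = c(56, 0, 89, 86, 78),
  English   = c(87, 64, 76, 63, 46),
  Biology   = c(21, 90, 95, 54, 96),
  Economics = c(52, 61, 87, 89, 77),
  History   = c(89, 58, 56, 75, 83),
  Civics    = c(65, 2, 74, 45, 53)
)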


Equal Width Discretization

The algorithm first finds the min and max values and then splits that range into equal-width intervals based on the chosen number of bins.

So let's say we want 5 intervals and the marks range between 0 and 100. In this case the bins would be 0-20, 21-40, 41-60, 61-80 and 81-100. After equal-width discretization the table would look as below:

Marks    Math   Physics   Chemistry   English   Biology   Economics   History   Civics
0-20     0      0         1           0         0         0           0         1
21-40    1      0         0           0         1         0           0         0
41-60    2      4         1           1         1         1           2         2
61-80    2      0         1           3         0         2           1         2
81-100   0      1         2           1         3         2           2         0
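In R, the equal-width counts for a single subject can be reproduced with base R's cut(), as in this sketch for the Math marks:

math   <- c(55, 75, 25, 78, 58)            # Math marks from the table
breaks <- seq(0, 100, by = 20)             # 0-20, 21-40, 41-60, 61-80, 81-100

bins <- cut(math, breaks = breaks, include.lowest = TRUE)
table(bins)                                # frequency of marks in each bin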


Equal Frequency Discretization

The algorithm finds the minimum and maximum values and then divides the range into the given number of intervals, in such a way that every interval contains an equal number of the sorted values.

As we have five intervals and five observations, each interval would contain exactly one value. So listing the bins should suffice for each subject, as each bin would have a frequency of 1.

The results are generated by using the classInt package in R.

The code below is for the Math marks.

library(classInt)                  # provides classIntervals()
dataset <- c(55, 75, 25, 78, 58)   # Math marks from the table above
classIntervals(dataset, 5)         # request five class intervals

Maths - [18.375,40) [40,56.5) [56.5,66.5) [66.5,76.5) [76.5,84.625]
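A similar equal-frequency split can be sketched with base R alone, using quantile breaks; with five observations and five bins, each bin again holds exactly one value:

math   <- c(55, 75, 25, 78, 58)
breaks <- quantile(math, probs = seq(0, 1, length.out = 6))  # 6 cut points, 5 bins

bins <- cut(math, breaks = breaks, include.lowest = TRUE)
table(bins)                        # one observation per bin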

Aggregation

Sometimes the variable that you are trying to visualise may not be part of the original dataset but may be a derived variable computed as a function of one or more variables in the original dataset.

As an example, we may have a dataset that has the runs scored and the balls faced by each batsman in a cricket match. What we may be interested in, however, is the metric called strike rate, which is defined simply as the runs scored per 100 balls faced. Strike rate is then an aggregated (derived) variable.
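A small sketch of such a derived variable in R (the batting data frame and its values are made up for illustration):

batting <- data.frame(
  Batsman = c("A", "B", "C"),
  Runs    = c(45, 80, 12),
  Balls   = c(30, 64, 20)
)

# Strike rate: runs scored per 100 balls faced
batting$StrikeRate <- batting$Runs / batting$Balls * 100
batting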

That sums up the most common data normalisation techniques.


Content from www.bluepiit.com
