Data transformation (statistics)

In statistics, data transformation is carried out to make data more closely follow a normal distribution, as a remedy for outliers and for failures of normality, linearity, and homoscedasticity. This is usually done to prepare data for regression analysis, which assumes the data are linear, normal, and homoscedastic; it is also known as transformation to linearity. A good indicator that data follow a normal distribution is skewness in the range of −0.8 to 0.8 and kurtosis in the range of −3.0 to 3.0.
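As a sketch of that indicator, one could check a sample like the following. This is a library-free illustration, not a standard routine; the helper names are my own, and kurtosis is computed as excess kurtosis (≈ 0 for normal data), which is one reading of the −3.0 to 3.0 range.

```python
# A minimal, hypothetical check of the normality indicators above.
import random

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / s) ** 3 for x in xs) / n

def excess_kurtosis(xs):
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / s) ** 4 for x in xs) / n - 3.0

random.seed(1)
sample = [random.gauss(50, 10) for _ in range(2000)]  # simulated normal data

s, k = skewness(sample), excess_kurtosis(sample)
looks_normal = -0.8 <= s <= 0.8 and -3.0 <= k <= 3.0
```

For genuinely normal data, both statistics land near zero, well inside the quoted ranges.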

Small samples from a skewed population are a problem because the confidence intervals they produce are often off center and too narrow: the stated confidence level is then larger than the actual capture rate of the intervals. If the sample size is too small or the data are skewed, we might try one of these transformations:

Common transformation techniques (note that their reflected variants are also used):
 * logarithm
 * square root
 * reciprocal
 * power transformations, which include all of the above as well as the cube root
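A sketch of these four transformations applied to simulated right-skewed data (the data and the skewness helper are my own illustration):

```python
# Apply each common transformation to right-skewed (log-normal) data
# and compare the sample skewness before and after.
import math
import random

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / s) ** 3 for x in xs) / n

random.seed(2)
data = [random.lognormvariate(0.0, 0.8) for _ in range(2000)]  # positive, right-skewed

transformed = {
    "logarithm":   [math.log(x) for x in data],
    "square root": [math.sqrt(x) for x in data],
    "reciprocal":  [1.0 / x for x in data],
    "cube root":   [x ** (1.0 / 3.0) for x in data],
}
before = skewness(data)
after = {name: skewness(xs) for name, xs in transformed.items()}
# For log-normal data, the logarithm removes the skew almost entirely.
```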

The first step must always be to plot the data; this is the best way to see major deviations from normality. You may find you do not need a transformation. The t-test is quite robust, meaning it is not too sensitive to departures from normality, even though normality is one of the conditions for this procedure.

The plot may look like the data comes from a normal distribution. If this happens, you do not need to worry about the shape. If your data are skewed, there is a good chance that one of the transformations will make them nearly symmetric. The transformations change the scale of the data; once you find a new scale that makes your data look normal, you no longer need to worry about shape. The change of scale may also take care of outliers and bring them closer to the center.

Don't bother with linear transformations, since they have no effect on the shape; this can be seen in computing z-scores. Power transformations, on the other hand, have a big impact on the shape. Data is often skewed to the right, and a power transformation may create a normal distribution.
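This contrast can be shown in a short sketch (simulated data; the helper name is my own):

```python
# Linear transformations (here, z-scoring) leave skewness unchanged;
# a power transformation (here, the square root) changes it.
import random

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / s) ** 3 for x in xs) / n

random.seed(3)
data = [random.expovariate(1.0) for _ in range(2000)]  # right-skewed

m = sum(data) / len(data)
sd = (sum((x - m) ** 2 for x in data) / len(data)) ** 0.5
z_scores = [(x - m) / sd for x in data]  # linear: shape is unchanged
roots = [x ** 0.5 for x in data]         # power: shape changes
```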

Power transformation
See also Box-Cox transformation.

A power transformation with p = 1 is the identity. If p = 0, we take the transformation to be the log transformation. We commonly use p = 1/2 and p = 1/3 to transform data. The square root transformation can only be applied if all the data are non-negative; we do not require this of the cube root transformation. The square root transformation works well with Poisson distributions.
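As a sketch of this convention (the function name is hypothetical):

```python
# The power family as described: p = 1 is the identity, p = 1/2 the
# square root, p = 1/3 the cube root, and p = 0 is taken to mean log.
import math

def power_transform(x, p):
    if p == 0:
        return math.log(x)  # the p = 0 convention
    return x ** p

identity = power_transform(9.0, 1)    # 9.0
root = power_transform(9.0, 0.5)      # 3.0
log_val = power_transform(math.e, 0)  # ≈ 1.0
```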

Logarithmic transformation
If the data are very skewed, we might consider the logarithmic transformation, since it has the most impact on skew (the square root has the least). If the logarithmic transformation is used, it may overcompensate for a right-skewed data set and create a left-skewed one. The important thing is to plot the data again after performing a transformation.
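A sketch of that overcorrection, using hypothetical gamma-distributed data chosen because they are only mildly right-skewed:

```python
# A mildly right-skewed sample: the log transformation can push it
# past symmetric into slight left skew, so plot and re-check after.
import math
import random

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / s) ** 3 for x in xs) / n

random.seed(5)
data = [random.gammavariate(20.0, 1.0) for _ in range(5000)]

skew_before = skewness(data)                        # modest right skew
skew_after = skewness([math.log(x) for x in data])  # lands left of zero
```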

Reciprocal transformation
A common example of the reciprocal transformation is miles per gallon: we can use gallons per mile instead and remove some skew and outliers. More generally, skewed data involving ratios may be improved by the reciprocal transformation. The arc-sine transformation may work well for some proportions.
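The miles-per-gallon example as a sketch (the figures are hypothetical):

```python
# Re-express fuel economy as gallons per mile via the reciprocal.
mpg = [12.0, 15.0, 18.0, 22.0, 25.0, 30.0, 34.0, 40.0, 48.0]
gpm = [1.0 / x for x in mpg]
# Note the reciprocal reverses order: the largest mpg value becomes
# the smallest gpm value.
```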

Perhaps changing the scale helps, but there are still outliers. The next step is to analyze the data both with and without the outliers. If these parallel analyses yield the same results, you can safely discard the outliers. If you get two different outcomes, the answer is to obtain more data; clearly this is not always possible.
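The parallel analysis might be sketched like this (the numbers and the cutoff are hypothetical):

```python
# Compare a summary statistic with and without a suspected outlier.
data = [4.1, 4.3, 4.4, 4.6, 4.8, 5.0, 5.1, 9.7]  # 9.7 is the suspect

def mean(xs):
    return sum(xs) / len(xs)

with_outlier = mean(data)                             # ≈ 5.25
without_outlier = mean([x for x in data if x < 9.0])  # noticeably lower
# If both analyses support the same conclusion, the outlier can be set
# aside; if they disagree, the remedy is to obtain more data.
```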

15/40 guidelines
As mentioned, the robustness of the t-procedure can be relied on in certain situations. The 15/40 guideline helps to decide what to do. The first step is to plot the data; a histogram or a box plot will show shape and outliers well.

If your sample size is 15 or fewer, you must use extreme caution: the original data must look like they came from a normal distribution, or the transformed data must look quite normal. In other words, there must be very little skewness and no outliers.

If your sample size is between 15 and 40, you should proceed with caution. You might be able to use the t-procedure by first transforming strongly skewed data. Also, if you have outliers, you will need to do a transformation. This might be the time to do the parallel analysis and compare your results with and without outliers.

If your sample size is more than 40, you probably don't need to worry about skewness. However, if the skew is large, you might still consider a transformation; it is just less important.

How to do it
Plot your data and decide which of the four situations you are dealing with:
 * 1) Your data look normal without outliers: use methods for normal distributions.
 * 2) Your data are not symmetric, but the sample size is big enough to rely on the robustness of the t-procedure and construct a confidence interval.
 * 3) Your sample is small and skewed; if it were larger you might not need a transformation. Try one and see.
 * 4) Your data contain outliers: analyze twice, with and without them. (Ann E. Watkins, Richard L. Scheaffer, and George W. Cobb, Statistics in Action: Understanding a World of Data. Emeryville, CA: Key Curriculum Press, 2008)

Then if you need to do a transformation follow these steps:

Apply the transformation to all of your data. Find the mean and standard deviation of the transformed data, and plot the transformed data. You can now use a z-score or t-score to find the desired probability.
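Put together, the steps read like this sketch (the simulated data, the cutoff of 30, and the use of the normal CDF are my own illustration):

```python
# Transform all of the data, summarize on the new scale, then use a
# z-score on that scale to find a probability.
import math
import random

random.seed(4)
data = [random.lognormvariate(3.0, 0.5) for _ in range(1000)]  # right-skewed

logged = [math.log(x) for x in data]  # apply the transformation to all data
m = sum(logged) / len(logged)
sd = (sum((x - m) ** 2 for x in logged) / len(logged)) ** 0.5

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# P(X < 30) on the original scale is P(log X < log 30) on the new one:
z = (math.log(30.0) - m) / sd
prob = normal_cdf(z)
```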