Squared deviations

The variance of a random variable is defined as the expected value (for a theoretical distribution) or the average (for actual experimental data) of squared deviations from the mean. Computations for analysis of variance involve the partitioning of a sum of squared deviations. An understanding of the computations involved is greatly enhanced by a detailed study of the statistical value:


 * $$\operatorname{E}( X ^ 2 ).$$

It is well known that for a random variable $$X$$ with mean $$\mu$$ and variance $$\sigma^2$$:


 * $$\sigma^2 = \operatorname{E}( X ^ 2 ) - \mu^2$$

Therefore


 * $$\operatorname{E}( X ^ 2 ) = \sigma^2 + \mu^2.$$

From the above, the following are readily derived:


 * $$\operatorname{E}\left( \sum\left( X ^ 2\right) \right) = n\sigma^2 + n\mu^2$$


 * $$\operatorname{E}\left( \left(\sum X \right)^ 2 \right) = n\sigma^2 + n^2\mu^2$$
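
These identities are easy to verify numerically. The following is a minimal Monte Carlo sketch in Python with NumPy; the normal distribution and the particular parameter values are assumptions chosen purely for illustration, since the identities hold for any distribution with finite mean and variance:

```python
import numpy as np

# Monte Carlo check of the two derived expectations above.
rng = np.random.default_rng(0)
mu, sigma, n, trials = 2.0, 3.0, 10, 200_000

samples = rng.normal(mu, sigma, size=(trials, n))
sum_of_squares = (samples ** 2).sum(axis=1)   # sum(X^2) for each trial
square_of_sum = samples.sum(axis=1) ** 2      # (sum X)^2 for each trial

print(sum_of_squares.mean(), n * sigma**2 + n * mu**2)    # both ~ 130
print(square_of_sum.mean(), n * sigma**2 + n**2 * mu**2)  # both ~ 490
```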

Sample variance
The sum of squared deviations needed to calculate variance (before deciding whether to divide by n or n − 1) is most easily calculated as


 * $$S = \sum x ^ 2 - \left(\sum x\right)^2/n$$
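
As a quick numerical check that this shortcut equals the direct sum of squared deviations $$\sum\left(x - \bar{x}\right)^2$$, the sketch below reuses the five observations from the worked example later in this section:

```python
import numpy as np

# The computational shortcut agrees with the direct sum of squared
# deviations from the sample mean.
x = np.array([1.0, 2.0, 3.0, 4.0, 6.0])
n = len(x)

direct = ((x - x.mean()) ** 2).sum()          # sum of squared deviations
shortcut = (x ** 2).sum() - x.sum() ** 2 / n  # S as defined above

print(direct, shortcut)  # both 14.8
```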

From the two derived expectations above, the expected value of this sum is


 * $$\operatorname{E}(S) = n\sigma^2 + n\mu^2 - (n\sigma^2 + n^2\mu^2)/n$$

which implies


 * $$\operatorname{E}(S) = (n - 1)\sigma^2. $$

This effectively proves the use of the divisor $$(n - 1)$$ in the calculation of an unbiased sample estimate of $$\sigma^2$$.
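
A short simulation illustrates this unbiasedness. The distribution and parameter values below are, again, assumptions made only for illustration:

```python
import numpy as np

# Monte Carlo illustration that E(S) = (n - 1) * sigma^2, so that
# S / (n - 1) is an unbiased estimate of sigma^2.
rng = np.random.default_rng(1)
mu, sigma, n, trials = 5.0, 2.0, 8, 200_000

samples = rng.normal(mu, sigma, size=(trials, n))
S = (samples ** 2).sum(axis=1) - samples.sum(axis=1) ** 2 / n

print(S.mean(), (n - 1) * sigma**2)    # both ~ 28
print((S / (n - 1)).mean(), sigma**2)  # both ~ 4
```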

Partition: analysis of variance
In the situation where data are available for $$k$$ different treatment groups having sizes $$n_i$$, where $$i$$ varies from 1 to $$k$$, it is assumed that the expected mean of each group is


 * $$\operatorname{E}(\mu_i) = \mu + T_i$$

and the variance of each treatment group is unchanged from the population variance $$\sigma^2$$.

Under the null hypothesis that the treatments have no effect, each of the $$T_i$$ will be zero.

It is now possible to calculate three sums of squares:


 * Individual


 * $$I = \sum x^2 $$


 * $$\operatorname{E}(I) = n\sigma^2 + n\mu^2$$


 * Treatments


 * $$T = \sum_{i=1}^k \left(\left(\sum x\right)^2/n_i\right)$$

where the inner sum is taken over the $$n_i$$ observations in treatment group $$i$$.


 * $$\operatorname{E}(T) = k\sigma^2 + \sum_{i=1}^k n_i(\mu + T_i)^2$$


 * $$\operatorname{E}(T) = k\sigma^2 + n\mu^2 + 2\mu \sum_{i=1}^k (n_iT_i) + \sum_{i=1}^k n_i(T_i)^2$$

Under the null hypothesis that the treatments cause no differences and all the $$T_i$$ are zero, the expectation simplifies to


 * $$\operatorname{E}(T) = k\sigma^2 + n\mu^2.$$


 * Combination


 * $$C = \left(\sum x\right)^2/n$$


 * $$\operatorname{E}(C) = \sigma^2 + n\mu^2$$

Sums of squared deviations
Under the null hypothesis, the expected value of the difference of any pair of I, T, and C depends only on $$\sigma^2$$, not on $$\mu$$.


 * $$\operatorname{E}(I - C) = (n - 1)\sigma^2$$ Total Squared Deviations


 * $$\operatorname{E}(T - C) = (k - 1)\sigma^2$$ Treatment Squared Deviations


 * $$\operatorname{E}(I - T) = (n - k)\sigma^2$$ Residual Squared Deviations

The constants (n − 1), (k − 1), and (n − k) are normally referred to as the number of degrees of freedom.
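
A simulation under the null hypothesis illustrates all three expectations at once. The group sizes and parameter values below are illustrative assumptions:

```python
import numpy as np

# Simulation of E(I - C), E(T - C) and E(I - T) under the null
# hypothesis: every group is drawn from the same distribution
# (all T_i = 0).
rng = np.random.default_rng(2)
mu, sigma, sizes, trials = 10.0, 3.0, [3, 2], 50_000
n, k = sum(sizes), len(sizes)

diffs = np.zeros(3)
for _ in range(trials):
    groups = [rng.normal(mu, sigma, size=m) for m in sizes]
    x = np.concatenate(groups)
    I = (x ** 2).sum()
    T = sum(g.sum() ** 2 / len(g) for g in groups)
    C = x.sum() ** 2 / n
    diffs += (I - C, T - C, I - T)

diffs /= trials
print(diffs[0], (n - 1) * sigma**2)  # total:     both ~ 36
print(diffs[1], (k - 1) * sigma**2)  # treatment: both ~ 9
print(diffs[2], (n - k) * sigma**2)  # residual:  both ~ 27
```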

Example
In a very simple example, five observations arise from two treatments. The first treatment gives three values 1, 2, and 3, and the second treatment gives two values, 4 and 6.


 * $$I = \frac{1^2}{1} + \frac{2^2}{1} + \frac{3^2}{1} + \frac{4^2}{1} + \frac{6^2}{1} = 66$$


 * $$T = \frac{(1 + 2 + 3)^2}{3} + \frac{(4 + 6)^2}{2} = 12 + 50 = 62$$


 * $$C = \frac{(1 + 2 + 3 + 4 + 6)^2}{5} = 256/5 = 51.2$$

Giving


 * Total squared deviations = 66 − 51.2 = 14.8 with 4 degrees of freedom.
 * Treatment squared deviations = 62 − 51.2 = 10.8 with 1 degree of freedom.
 * Residual squared deviations = 66 − 62 = 4 with 3 degrees of freedom.
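
These figures can be reproduced directly from the definitions of I, T, and C, as in the following minimal sketch:

```python
import numpy as np

# Reproduces the worked example: I, T and C for the two treatment
# groups, then the three squared-deviation partitions with their
# degrees of freedom.
groups = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0])]
x = np.concatenate(groups)
n, k = len(x), len(groups)

I = (x ** 2).sum()                              # 66.0
T = sum(g.sum() ** 2 / len(g) for g in groups)  # 62.0
C = x.sum() ** 2 / n                            # 51.2

print("total:    ", I - C, "dof:", n - 1)  # 14.8 with 4 dof
print("treatment:", T - C, "dof:", k - 1)  # 10.8 with 1 dof
print("residual: ", I - T, "dof:", n - k)  # 4.0 with 3 dof
```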

Two-way analysis of variance
The following hypothetical example gives the yields of 15 plants subject to two environmental variations and three fertilisers.

Five sums of squares are calculated:

Finally, the sums of squared deviations required for the analysis of variance can be calculated.