
# Motivation of Analysis of Variance

## Example - Response Time

### Testing the Tentative Conclusions

For example

1. A histogram of response times suggested that the typical response time is about 12 msec.
2. Calculate the average
3. Remember your statistics and calculate the sample variance and standard deviation.
4. Look back at your tentative conclusions. For example,
• You saw two modes in the histogram
• You saw different ranges in the box plot by reqtype
• A tentative conclusion was response time differs according to request type.
• Separate the data set by request type
```
grep reqtype1 >data1
grep reqtype2 >data2
...
```
• Analyse the files separately
• You see differences.
• Are they real?
• Analysis of variance is the tool for answering this question (Next lecture)

### When your data has passed the tests

Measurement is finished. How do you use the results?

Example. You saw a difference of 20 msec between response time for reqtype1 (browsing) and reqtype2 (searching).

• The difference is real (= statistically significant).
• You calculated mean(browsing) and mean(searching).
• You calculated the std dev. of the difference, roughly sqrt( (variance(browsing) + variance(searching)) / N ).
• The difference between the means was sufficiently larger than the std dev. of the difference.
• Does anybody care about the difference? Probably not.
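The rule of thumb above can be sketched in Python. The per-request response times here are hypothetical, invented only to mirror the browsing-versus-searching example; only the recipe (difference of means versus the standard deviation of the difference) comes from the text.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Sample variance (divisor N - 1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Hypothetical per-request response times in msec.
browsing = [95, 110, 102, 98, 105, 100, 97, 103]
searching = [118, 125, 122, 115, 130, 120, 119, 123]

n = len(browsing)  # assuming equal sample sizes
diff = mean(searching) - mean(browsing)
# Std dev of the difference, roughly
# sqrt( (variance(browsing) + variance(searching)) / N ) as above.
sd_diff = math.sqrt((variance(browsing) + variance(searching)) / n)
print(diff, sd_diff)
```

A difference several times larger than its standard deviation is statistically significant; whether anybody cares about it is, as the example notes, a separate question.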

Example. You saw that each increment of 1 in the load makes an increment of 500 msec in the response time for browsing.

• The difference is real (= statistically significant).
• Currently average load is 3.
• The load is expected to double over the next year.
• Yes! Somebody should upgrade the server during the next six months.
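The arithmetic behind this conclusion is simple enough to spell out: at 500 msec of response time per unit of load, a load that doubles from 3 to 6 adds 3 × 500 = 1500 msec. A minimal sketch, using only the figures given above:

```python
msec_per_unit_load = 500          # slope: 500 msec per increment of 1 in load
load_now = 3                      # current average load
load_next_year = 2 * load_now     # load expected to double over the next year
extra_delay = (load_next_year - load_now) * msec_per_unit_load
print(extra_delay)                # → 1500 msec added to browsing response time
```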

# Analysis of Variance (ANOVA) aka Linear Models

## Zero Factor Analysis of Variance

#### New Concepts

• Degrees of freedom
• Null hypothesis

#### General idea

1. Here is a list of ten genuinely random 2-digit numbers
• `53, 81, 98, 12, 59, 40, 40, 39, 43, 69`

| | Observed | Expected (uniform 0–99) |
| --- | --- | --- |
| Count | 10 | 10 |
| Mean | 53.4 | 49.5 |
| Difference | 3.9 | 0 |
| Variance | 541.4 | 833.3 |
| Standard deviation | 24.5 | 28.9 |
| Standard deviation of mean | 7.8 | 9.6 |

• Variance assuming that mean is 49.5 is 556.7
2. Here is a list of ten genuinely random 2-digit numbers with 15 subtracted from each
• `84, 60, -12, 28, 68, 66, -1, 11, 9, 5`

| | Observed | Expected (uniform 0–99) |
| --- | --- | --- |
| Count | 10 | 10 |
| Mean | 31.8 | 49.5 |
| Difference | -17.7 | 0 |
| Variance | 1068.0 | 833.3 |
| Standard deviation | 34.4 | 28.9 |
| Standard deviation of mean | 10.9 | 9.6 |

• Variance assuming that mean is 49.5 is 1381.3.
3. In each case we ask whether the reduction in the variance (15.3 in the first case, 313.3 in the second) is statistically significant.
4. To work it out from first principles
1. Assume underlying distribution from which data is drawn has finite variance.
2. N data points
3. Mean is normally distributed
4. The variance about the sample mean (the error variance) is distributed as a chi-square distribution with N-1 degrees of freedom.
5. The treatment variance, the difference between the two variances, is distributed as a chi-square distribution with N1 = 1 degree of freedom.
6. The ratio of the treatment variance to the remaining error variance, each divided by its degrees of freedom, is distributed as an F distribution with N1 and N-N1 degrees of freedom.
7. Check the ratio ( (15.3 × (N-N1)) / (541.4 × N1) in the first case, (313.3 × (N-N1)) / (1068.0 × N1) in the second ) against the percentage points of the F distribution.
5. Of course, for a test as simple as this there are other, and better, ways of doing the test.
• In fact we would likely just write
• Difference = 3.9 ± 7.8, or
• Difference = -17.7 ± 10.9
• Note. Some authors use twice the standard deviation rather than the standard deviation.
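The first-principles recipe above can be sketched in Python. The helper name `f_ratio` is an assumption for illustration, not code from the lecture; it carries out steps 4-7 for both lists of numbers.

```python
def f_ratio(data, mu0):
    """F ratio for testing whether the mean of data differs from mu0.

    The treatment has 1 degree of freedom; the error has N - 1.
    """
    n = len(data)
    m = sum(data) / n
    # Error sum of squares: residuals about the fitted (sample) mean.
    ss_error = sum((x - m) ** 2 for x in data)
    # Total sum of squares assuming the mean is mu0.
    ss_total = sum((x - mu0) ** 2 for x in data)
    # Treatment sum of squares: the reduction gained by fitting the mean.
    ss_treatment = ss_total - ss_error
    return (ss_treatment / 1) / (ss_error / (n - 1))

print(f_ratio([53, 81, 98, 12, 59, 40, 40, 39, 43, 69], 49.5))
print(f_ratio([84, 60, -12, 28, 68, 66, -1, 11, 9, 5], 49.5))
```

Both ratios (roughly 0.25 and 2.6) fall below the 5% point of the F distribution with 1 and 9 degrees of freedom (about 5.1), so in neither case is the departure from a mean of 49.5 statistically significant.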

## One Factor Analysis of Variance

#### New Concepts

• A linear model that isn't linear in the sense of regression
• Degrees of freedom

#### The Linear Model

• There are N data points.
• Assume that the factor has M levels i = 1..M
• Assume the data is well described by the model: overall_mean + a_i + error
• There is one degree of freedom for the overall mean
• There are M-1 degrees of freedom for the levels of the factor: sum_i a_i = 0
• There are N-M degrees of freedom for the error
• Assume that the error is normally distributed and uncorrelated.
• The null hypothesis is that all the a_i are zero.

#### The Calculation

• Remove the overall mean from the data and calculate the total variance.
• Separate the data into cells, one for each level of the factor.
• Find the a_i that best fit the data, which are just the means of the corresponding cells.
• Calculate the remaining variance, which is the error variance.
• The difference between the total variance and the error variance is the treatment variance.
• Form the ratio of the treatment variance and the error variance with degrees of freedom taken into account.
• Check against the percentage points of the F distribution.
• If the result is significant then at least one of the coefficients a_i of the model is different from zero.
• We say that there is a significant effect of a.
• Auxiliary tests are needed to determine which of the a_i are non-zero,
• and which of the a_i are different from one another.
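The calculation just described can be sketched in Python. The function and the cell data are assumptions for illustration; the steps (fit the overall mean, fit the cell means, compare the treatment variance with the error variance) are the ones listed above.

```python
def one_way_anova(cells):
    """One-factor ANOVA. cells maps each level of the factor to its data.

    Returns (F, treatment df, error df).
    """
    data = [x for xs in cells.values() for x in xs]
    n, m = len(data), len(cells)
    grand_mean = sum(data) / n
    # Total sum of squares about the overall mean.
    ss_total = sum((x - grand_mean) ** 2 for x in data)
    # Error sum of squares about the cell means (the best-fitting a_i).
    ss_error = sum((x - sum(xs) / len(xs)) ** 2
                   for xs in cells.values() for x in xs)
    # Treatment sum of squares: total minus error.
    ss_treatment = ss_total - ss_error
    df_treatment, df_error = m - 1, n - m
    f = (ss_treatment / df_treatment) / (ss_error / df_error)
    return f, df_treatment, df_error

# Hypothetical response times (msec) by request type.
cells = {
    "reqtype1": [100, 105, 98, 102],
    "reqtype2": [120, 118, 125, 121],
}
print(one_way_anova(cells))
```

If the resulting F exceeds the chosen percentage point of the F distribution with M-1 and N-M degrees of freedom, at least one a_i is non-zero.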

## Two Factor Analysis of Variance

#### New Concepts

• Main effects versus interactions

#### The Linear Model

• Same assumptions as above,
• Except, assume that the data is well described by the model: overall_mean + a_i + b_j + ab_ij + error
• There is one degree of freedom for the overall mean
• There are M_a - 1 degrees of freedom for the first factor
• There are M_b - 1 degrees of freedom for the second factor
• There are M_a * M_b - M_a - M_b + 1 = (M_a - 1)(M_b - 1) degrees of freedom for the interaction
• There are N - M_a * M_b degrees of freedom for the error
• Three ways to consider the model
1. Main effects only, no interactions
• ab_ij = 0
2. Main effects plus interactions
3. Interactions only, no main effects
• a_i = b_j = 0
• Null hypothesis: All of the terms a_i, b_j, ab_ij are zero.
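The degrees-of-freedom bookkeeping above can be sketched in Python; the cell counts (2 levels of a, 3 levels of b, 30 data points) are hypothetical:

```python
def two_factor_dof(n, m_a, m_b):
    """Degrees of freedom for the model overall_mean + a_i + b_j + ab_ij + error."""
    return {
        "overall mean": 1,
        "factor a": m_a - 1,
        "factor b": m_b - 1,
        # M_a * M_b - M_a - M_b + 1 == (M_a - 1) * (M_b - 1)
        "interaction": (m_a - 1) * (m_b - 1),
        "error": n - m_a * m_b,
    }

dof = two_factor_dof(n=30, m_a=2, m_b=3)
print(dof)
assert sum(dof.values()) == 30  # the degrees of freedom account for all N points
```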
