# CS457 - System Performance Evaluation - Winter 2010

## Public Service Announcements

1. Mid-term conflicts

# Analysis of Variance (ANOVA), aka Linear Models

## Zero Factor Analysis of Variance

#### New Concepts

• Degrees of freedom
• Null hypothesis

#### General idea

1. To work it out from first principles
   1. Assume the underlying distribution from which the data is drawn has finite variance.
   2. There are N data points.
   3. The mean is normally distributed.
   4. The variance follows a chi-square distribution with N degrees of freedom.
   5. The treatment variance, the difference between two variances, follows a chi-square distribution with N1 = 1 degrees of freedom.
   6. The ratio between the treatment variance and the remaining error variance follows an F distribution with N1 and N-1 degrees of freedom (numerator and denominator respectively).
   7. Check the ratio, (15.3 * (N - N1)) / (541.4 * N1) in the first case and (1381.3 * (N - N1)) / (1068.0 * N1) in the second, against the percentage points of the F distribution.
2. Of course, for a test as simple as this there are other, and better, ways of doing the test.
   • In fact we would be likely just to write
      • Difference = 3.8 ± 7.8 or
      • Difference = -17.7 ± 10.9
   • Note: some authors use twice the standard deviation rather than the standard deviation.

## One Factor Analysis of Variance

#### New Concepts

• A linear model that isn't linear in the sense of regression
• Degrees of freedom

#### The Linear Model

• There are N data points.
• Assume that the factor has M levels i = 1..M
• Assume the data is well described by the model: overall_mean + a_i + error
• There is one degree of freedom for the overall mean
• There are M-1 degrees of freedom for the levels of the factor: sum_i a_i = 0
• There are N - M degrees of freedom for the error
• Assume that the error is normally distributed and uncorrelated.
• The null hypothesis is that all the a_i are zero.
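Written out as a formula, the model and its degrees of freedom look as follows. The symbols are notation chosen here, not from the notes: $y_{ik}$ is the $k$-th observation at level $i$, $\mu$ is the overall mean, and $\varepsilon_{ik}$ is the error.

```latex
\begin{align*}
y_{ik} &= \mu + a_i + \varepsilon_{ik}, \qquad i = 1, \dots, M, \qquad \sum_{i=1}^{M} a_i = 0 \\
N &= \underbrace{1}_{\text{overall mean}} + \underbrace{(M - 1)}_{\text{factor levels}} + \underbrace{(N - M)}_{\text{error}}
\end{align*}
```

The constraint $\sum_i a_i = 0$ is what reduces the $M$ coefficients to $M - 1$ degrees of freedom.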

#### The Calculation

• Remove the overall mean from the data, calculate the total variance
• Separate the data into cells one for each level of the factor
• Find the a_i that best fit the data, which are just the means of the corresponding cells.
• Calculate the remaining variance, which is the error variance.
• The difference between the total variance and the error variance is the treatment variance.
• Form the ratio of the treatment variance and the error variance with degrees of freedom taken into account.
• Check against the percentage points of the F distribution.
• If the result is significant then at least one of the coefficients a_i of the model is different from zero.
• We say that there is a significant effect of a.
• Auxiliary tests are needed to determine which of the a_i are non-zero,
• and which of the a_i are different from one another.
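The steps above can be summarised in formulas. This is a sketch in notation assumed here rather than taken from the notes: $\bar{y}$ is the overall mean, $\bar{y}_i$ is the mean of the cell for level $i$, and "variance" in the list is read as a sum of squares until it is divided by its degrees of freedom.

```latex
\begin{align*}
SS_{\text{total}} &= \sum_{i} \sum_{k} \left( y_{ik} - \bar{y} \right)^2 \\
SS_{\text{error}} &= \sum_{i} \sum_{k} \left( y_{ik} - \bar{y}_i \right)^2 \\
SS_{\text{treatment}} &= SS_{\text{total}} - SS_{\text{error}} \\
f &= \frac{SS_{\text{treatment}} / (M - 1)}{SS_{\text{error}} / (N - M)}
\end{align*}
```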

#### Two Tables

One factor, cleaning requests, which has 3 levels.

Measure cleaning time

1. cache
`0, 0, 0, 0`
2. penalty: white
`20, 20, 19, 18`
3. penalty: black
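`401, 402, 400, 399`

Data with the overall mean (139.9) removed; the last column is the cell mean.

| Level | 1 | 2 | 3 | 4 | Cell mean |
|---|---|---|---|---|---|
| cache | -139.9 | -139.9 | -139.9 | -139.9 | -139.9 |
| penalty: white | -119.9 | -119.9 | -120.9 | -121.9 | -120.7 |
| penalty: black | 261.1 | 262.1 | 260.1 | 259.1 | 260.6 |

| Source | Sum of squares | Degrees of Freedom | Mean Square | Computed f |
|---|---|---|---|---|
| Treatments | 408163.2 | 2 | 204081.6 | 236998 |
| Error | 7.7 | 9 | 0.86 | |
| Total | 408170.9 | 11 | | |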

Significant at the 1% level, which requires f > 8.02 for 2 and 9 degrees of freedom.
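As a check, the sums of squares in the table can be recomputed from the raw cleaning times. The following is a minimal sketch in plain Python; the function name `one_factor_anova` and the dictionary layout are choices made here, not part of the notes.

```python
# Cleaning times from the notes, one list per level of the factor.
cells = {
    "cache": [0, 0, 0, 0],
    "penalty: white": [20, 20, 19, 18],
    "penalty: black": [401, 402, 400, 399],
}


def one_factor_anova(cells):
    """Return (ss_treatment, ss_error, ss_total, f) for a one-factor layout."""
    values = [y for ys in cells.values() for y in ys]
    n, m = len(values), len(cells)
    overall_mean = sum(values) / n

    # Total sum of squares after removing the overall mean.
    ss_total = sum((y - overall_mean) ** 2 for y in values)

    # Error sum of squares: what remains after removing each cell mean.
    ss_error = 0.0
    for ys in cells.values():
        cell_mean = sum(ys) / len(ys)
        ss_error += sum((y - cell_mean) ** 2 for y in ys)

    # Treatment sum of squares is the difference between the two.
    ss_treatment = ss_total - ss_error

    # Ratio of mean squares, with M-1 and N-M degrees of freedom.
    f = (ss_treatment / (m - 1)) / (ss_error / (n - m))
    return ss_treatment, ss_error, ss_total, f


ss_treatment, ss_error, ss_total, f = one_factor_anova(cells)
print(f"treatment={ss_treatment:.1f} error={ss_error:.2f} "
      f"total={ss_total:.1f} f={f:.0f}")
```

The unrounded error sum of squares is 7.75, which the table shows as 7.7.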

## Regression

| Cleaning time | Total IPs | IPs removed |
|---|---|---|
| 20 | 9128 | 2362 |
| 20 | 8352 | 1954 |
| 19 | 7849 | 1600 |
| 18 | 7404 | 1442 |

#### Assumption

I assume that you have seen linear regression and are able to do it for the assignment.
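As a reminder of the mechanics, here is a minimal least-squares sketch on the data above. Treating "IPs removed" as the predictor and cleaning time as the response is an assumption made for illustration; the notes do not say which variables to regress. The helper name `fit_line` is also hypothetical.

```python
# Data from the table in the notes.
# Assumption: "IPs removed" is the predictor, cleaning time the response.
ips_removed = [2362, 1954, 1600, 1442]
cleaning_time = [20, 20, 19, 18]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross products and squares about the means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ips_removed, cleaning_time)
print(f"cleaning_time ~ {intercept:.3f} + {slope:.6f} * ips_removed")
```

With only four points the fit is illustrative at best; the same function works unchanged with "Total IPs" as the predictor.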

## Two Factor Analysis of Variance

#### New Concepts

• Main effects versus interactions

#### The Linear Model

• Same assumptions as above,
• Except, assume that the data is well described by the model: overall_mean + a_i + b_j + ab_ij + error
• There is one degree of freedom for the overall mean
• There are M_a - 1 degrees of freedom for the levels of the first factor
• There are M_b - 1 degrees of freedom for the levels of the second factor
• There are M_a * M_b - M_a - M_b + 1 degrees of freedom for the interaction
• There are N - M_a * M_b degrees of freedom for the error
• Three ways to consider the model
   1. Main effects only, no interactions
      • ab_ij = 0
   2. Main effects plus interactions
   3. Interactions only, no main effects
      • a_i = b_j = 0
• Null hypothesis: All of the terms a_i, b_j, ab_ij are zero.
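In formula form, the two-factor model and its degrees of freedom can be written as below. The notation is assumed here: $y_{ijk}$ is the $k$-th observation in the cell for level $i$ of the first factor and level $j$ of the second, $\mu$ is the overall mean, and $\varepsilon_{ijk}$ is the error.

```latex
\begin{align*}
y_{ijk} &= \mu + a_i + b_j + (ab)_{ij} + \varepsilon_{ijk},
  \qquad i = 1, \dots, M_a, \quad j = 1, \dots, M_b \\
N &= \underbrace{1}_{\text{overall mean}}
   + \underbrace{(M_a - 1)}_{a}
   + \underbrace{(M_b - 1)}_{b}
   + \underbrace{(M_a M_b - M_a - M_b + 1)}_{ab}
   + \underbrace{(N - M_a M_b)}_{\text{error}}
\end{align*}
```

The interaction term factors as $(M_a - 1)(M_b - 1)$, and the five counts add up to $N$.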

#### The Calculation

• Remove the overall mean from the data, calculate the total variance
• Separate the data into cells one for each pair of levels of the two factors
• Find the a_i that best fit the data, which are just the means over all cells at level i of the first factor.
• Find the b_j that best fit the data, which are just the means over all cells at level j of the second factor.
• Calculate the remaining variance, which is the error variance.
• The difference between the total variance and the error variance is the treatment variance.
• Form the ratios of the treatment variance for a and for b to the error variance, with degrees of freedom taken into account.
• Check against the percentage points of the F distribution.
• If the a result is significant then at least one of the coefficients a_i of the model is different from zero.
• If the b result is significant then at least one of the coefficients b_j of the model is different from zero.
• Find the ab_ij that best fit the left-over data in each cell
• Form the ratio
• Check against the F distribution
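The calculation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method used for the assignment: the data values are made up, the design is assumed balanced (the same number of replicates in every cell), the error term is taken from the full model with interactions, and the name `two_factor_anova` is hypothetical.

```python
# Hypothetical balanced data: data[i][j] holds the replicates for
# level i of factor a and level j of factor b. The values are made up.
data = [
    [[10.0, 12.0], [20.0, 22.0]],
    [[14.0, 16.0], [30.0, 34.0]],
]


def mean(values):
    return sum(values) / len(values)


def two_factor_anova(data):
    """Return (ss, dof, f) dictionaries for a balanced two-factor layout."""
    m_a, m_b = len(data), len(data[0])
    reps = len(data[0][0])  # assumes every cell has this many replicates
    n = m_a * m_b * reps

    everything = [y for row in data for cell in row for y in cell]
    grand = mean(everything)
    cell_mean = [[mean(cell) for cell in row] for row in data]
    row_mean = [mean([y for cell in row for y in cell]) for row in data]
    col_mean = [mean([y for row in data for y in row[j]]) for j in range(m_b)]

    # Best-fit coefficients with the overall mean removed.
    a = [rm - grand for rm in row_mean]
    b = [cm - grand for cm in col_mean]
    ab = [[cell_mean[i][j] - row_mean[i] - col_mean[j] + grand
           for j in range(m_b)] for i in range(m_a)]

    ss = {
        "a": reps * m_b * sum(x ** 2 for x in a),
        "b": reps * m_a * sum(x ** 2 for x in b),
        "ab": reps * sum(x ** 2 for row in ab for x in row),
        "error": sum((y - cell_mean[i][j]) ** 2
                     for i in range(m_a) for j in range(m_b)
                     for y in data[i][j]),
        "total": sum((y - grand) ** 2 for y in everything),
    }
    dof = {
        "a": m_a - 1,
        "b": m_b - 1,
        "ab": m_a * m_b - m_a - m_b + 1,
        "error": n - m_a * m_b,
        "total": n - 1,
    }
    # Each f is a treatment mean square over the error mean square.
    ms_error = ss["error"] / dof["error"]
    f = {k: (ss[k] / dof[k]) / ms_error for k in ("a", "b", "ab")}
    return ss, dof, f


ss, dof, f = two_factor_anova(data)
for source in ("a", "b", "ab"):
    print(source, ss[source], dof[source], f[source])
print("error", ss["error"], dof["error"])
print("total", ss["total"], dof["total"])
```

Each computed f would then be checked against the F distribution with the matching treatment and error degrees of freedom.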

#### Two Tables

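| Source | Sum of Squares | Degrees of Freedom | Mean Square | Computed f |
|---|---|---|---|---|
| Treatment a | | | | |
| Treatment b | | | | |
| Error | | | | |
| Total | | | | |

| Source | Sum of Squares | Degrees of Freedom | Mean Square | Computed f |
|---|---|---|---|---|
| Treatment a | | | | |
| Treatment b | | | | |
| Treatment ab | | | | |
| Error | | | | |
| Total | | | | |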