CS457 - System Performance Evaluation - Winter 2010
Public Service Announcements
- Mid-term conflicts
Lecture 13 - Data Analysis III
Analysis of Variance (ANOVA) aka Linear Models (pdf)
Zero Factor Analysis of Variance
New Concepts
- Thinking about variance reduction
- Degrees of freedom
- Null hypothesis
General idea
- To work it out from first principles
- Assume the underlying distribution from which the data is drawn has
finite variance.
- N data points
- Mean is normally distributed
- Variance is distributed by a chi-square distribution with
N degrees of freedom
- The treatment variance, the difference between two variances, is
distributed by a chi-square distribution with N1=1 degrees
of freedom.
- Ratio between the treatment variance and the remaining error
variance is distributed by an F distribution with
N-1 and N1 degrees of freedom
- Check the ratio ((15.3 * (N-N1)) / (541.4 * N1) in the first case,
(1381.3 * (N-N1)) / (1068.0 * N1) in the second) against the
percentage points of the F distribution
- Of course, for a test as simple as this there are other, and better,
ways of doing the test.
- In fact we would be likely just to write
- Difference = 3.8 plus-or-minus 7.8 or
- Difference = -17.7 plus-or-minus 10.9
- Note. Some authors use twice the standard
deviation rather than the standard deviation.
One Factor Analysis of Variance
New Concepts
- A linear model that isn't linear in the sense of regression
- Degrees of freedom
The Linear Model
- There are N data points.
- Assume that the factor has M levels i =
1..M
- Assume the data is well described by the model: overall_mean +
a_i + error
- There is one degree of freedom for the overall mean
- There are M-1 degrees of freedom for the levels of the
factor: sum_i a_i = 0
- There are N-M degrees of freedom for the error
- Assume that the error is normally distributed and uncorrelated.
- The null hypothesis is that all the a_i are zero.
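Written out as a formula, the model and its degrees-of-freedom bookkeeping might look as follows. This is a sketch only: the symbols y_{ik} (the k-th observation at level i), \mu (the overall mean) and \varepsilon_{ik} (the error) are notation assumed here, not taken from the notes.

```latex
\[
  y_{ik} = \mu + a_i + \varepsilon_{ik},
  \qquad i = 1, \dots, M,
  \qquad \sum_{i=1}^{M} a_i = 0
\]
\[
  \underbrace{1}_{\text{overall mean}}
  + \underbrace{(M - 1)}_{\text{factor levels}}
  + \underbrace{(N - M)}_{\text{error}}
  = N
\]
```

The second line is just a check that the three counts listed above use up all N data points.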
The Calculation
- Remove the overall mean from the data, calculate the total variance
- Separate the data into cells one for each level of the factor
- Find the a_i that best fit the data, which are just the means
of the corresponding cells.
- Calculate the remaining variance, which is the error variance.
- The difference between the total variance and the error variance is the
treatment variance.
- Form the ratio of the treatment variance and the error variance with
degrees of freedom taken into account.
- Check against the percentage points of the F distribution.
- If the result is significant then at least one of the coefficients
a_i of the model is different from zero.
- We say that there is a significant effect of a.
- Auxiliary tests are needed to determine which of the a_i
are non-zero,
- and which of the a_i are different from one another.
Two Tables
One factor, cleaning requests, which has 3 levels.
Measure cleaning time
- cache
0, 0, 0, 0
- penalty: white
20, 20, 19, 18
- penalty: black
401, 402, 400, 399
Data with mean removed (the last column is the cell mean)
| Level | Obs 1 | Obs 2 | Obs 3 | Obs 4 | Cell mean |
| cache | -139.9 | -139.9 | -139.9 | -139.9 | -139.9 |
| penalty: white | -119.9 | -119.9 | -120.9 | -121.9 | -120.7 |
| penalty: black | 261.1 | 262.1 | 260.1 | 259.1 | 260.6 |
ANOVA Table
| | Sum of squares | Degrees of Freedom | Mean Square | Computed f |
| Treatments | 408163.2 | 2 | 204081.6 | 236998 |
| Error | 7.7 | 9 | 0.86 | |
| Total | 408170.9 | 11 | | |
Significant at the 1% level for f > 8.02 (F distribution with 2 and 9
degrees of freedom)
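The ANOVA table above can be reproduced from the raw cleaning times by following the steps of the calculation directly. The sketch below uses plain Python; the function name one_way_anova and the dictionary layout are illustrative assumptions, not part of the course material.

```python
# Sketch: one-factor ANOVA from first principles on the cleaning-time data.
# The function name and the dictionary layout are illustrative assumptions.

groups = {
    "cache": [0, 0, 0, 0],
    "penalty: white": [20, 20, 19, 18],
    "penalty: black": [401, 402, 400, 399],
}


def one_way_anova(groups):
    """Return the entries of a one-factor ANOVA table as a dict."""
    values = [y for ys in groups.values() for y in ys]
    n, m = len(values), len(groups)
    overall_mean = sum(values) / n

    # Total sum of squares after removing the overall mean.
    ss_total = sum((y - overall_mean) ** 2 for y in values)

    # Error sum of squares: what is left after fitting each cell mean.
    ss_error = 0.0
    for ys in groups.values():
        cell_mean = sum(ys) / len(ys)
        ss_error += sum((y - cell_mean) ** 2 for y in ys)

    # Treatment sum of squares is the difference between the two.
    ss_treat = ss_total - ss_error
    df_treat, df_error = m - 1, n - m
    ms_treat = ss_treat / df_treat
    ms_error = ss_error / df_error
    return {
        "ss_treat": ss_treat,
        "ss_error": ss_error,
        "ss_total": ss_total,
        "df_treat": df_treat,
        "df_error": df_error,
        "ms_treat": ms_treat,
        "ms_error": ms_error,
        "f": ms_treat / ms_error,
    }


table = one_way_anova(groups)
for name, value in table.items():
    print(f"{name:>9}: {value:.2f}")
```

The computed values agree with the table up to rounding (the error sum of squares comes out as 7.75). The last step, comparing f with the percentage points of the F distribution, still has to be done against a table or a statistics library.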
Regression
Data for IPs removed
| Cleaning time | 20 | 20 | 19 | 18 |
| Total IPs | 9128 | 8352 | 7849 | 7404 |
| IPs removed | 2362 | 1954 | 1600 | 1442 |
Assumption
I assume that you have seen linear regression and are able to do it for
the assignment.
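As a reminder of the mechanics, here is a minimal least-squares sketch on the data above. It assumes, purely for illustration, that cleaning time is the response and IPs removed is the predictor; the notes do not say which pairing the assignment intends, and the function name fit_line is hypothetical.

```python
# Sketch: simple least-squares line through the regression data above.
# Assumption: cleaning time is the response, IPs removed the predictor.

ips_removed = [2362, 1954, 1600, 1442]
cleaning_time = [20, 20, 19, 18]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ips_removed, cleaning_time)

# Residuals of a least-squares fit with an intercept sum to zero.
residuals = [y - (slope * x + intercept)
             for x, y in zip(ips_removed, cleaning_time)]

print(f"slope = {slope:.6f}, intercept = {intercept:.3f}")
```

With only four points the fit is illustrative at best; the same function can be pointed at Total IPs instead by swapping the first argument.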
Two Factor Analysis of Variance
New Concepts
- Main effects versus interactions
The Linear Model
- Same assumptions as above,
- Except, assume that the data is well described by the model:
overall_mean + a_i + b_j + ab_ij + error
- There is one degree of freedom for the overall mean
- There are M_a - 1 degrees of freedom for the levels of the
first factor
- There are M_b - 1 degrees of freedom for the levels of the
second factor
- There are M_a * M_b - M_a - M_b
+ 1 degrees of freedom for the interaction
- There are N - M_a * M_b degrees of
freedom for the error
- Three ways to consider the model
- Main effects only, no interactions
- Main effects plus interactions
- Interactions only, no main effects
- Null hypothesis: All of the terms a_i, b_j,
ab_ij are zero.
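In formula form, the two-factor model and its degrees of freedom might be written as below. The symbols y_{ijk} (the k-th observation in cell i, j), \mu and \varepsilon_{ijk} are notation assumed here for illustration.

```latex
\[
  y_{ijk} = \mu + a_i + b_j + (ab)_{ij} + \varepsilon_{ijk},
  \qquad i = 1, \dots, M_a, \quad j = 1, \dots, M_b
\]
\[
  1 + (M_a - 1) + (M_b - 1) + (M_a - 1)(M_b - 1) + (N - M_a M_b) = N
\]
```

The interaction count is the same as the one listed above, since (M_a - 1)(M_b - 1) = M_a M_b - M_a - M_b + 1.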
The Calculation
- Remove the overall mean from the data, calculate the total variance
- Separate the data into cells one for each pair of levels of the two
factors
- Find the a_i that best fit the data, which are just the means
of the data at each level of the first factor.
- Find the b_j that best fit the data, which are just the means
of the data at each level of the second factor.
- Calculate the remaining variance, which is the error variance.
- The difference between the total variance and the error variance is the
treatment variance.
- Form the ratio of the treatment variance for a and b
and the error variance with degrees of freedom taken into account.
- Check against the percentage points of the F distribution.
- If the a result is significant then at least one of the
coefficients a_i of the model is different from zero.
- If the b result is significant then at least one of the
coefficients b_j of the model is different from zero.
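- Find the ab_ij that best fit the left-over data in each
cell
- Form the ratio of the interaction variance and the error variance with
degrees of freedom taken into account.
- Check against the percentage points of the F distribution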
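The steps above can be sketched in plain Python for a balanced design, that is, one with the same number of observations in every cell. The data below is hypothetical, made up only to exercise the arithmetic, and the function name two_way_anova is likewise an assumption.

```python
# Sketch: two-factor ANOVA with interaction for a balanced design.
# HYPOTHETICAL data: cells[(i, j)] holds the replicates for level i of
# factor a and level j of factor b. None of these numbers come from the notes.

cells = {
    (0, 0): [10.0, 12.0],
    (0, 1): [20.0, 22.0],
    (1, 0): [14.0, 16.0],
    (1, 1): [30.0, 32.0],
}


def mean(ys):
    return sum(ys) / len(ys)


def two_way_anova(cells):
    """Return sums of squares, degrees of freedom and f ratios as a dict."""
    levels_a = sorted({i for i, _ in cells})
    levels_b = sorted({j for _, j in cells})
    values = [y for ys in cells.values() for y in ys]
    n = len(values)
    reps = n // len(cells)  # replicates per cell (balanced design assumed)
    overall = mean(values)

    # Main effects: level means with the overall mean removed.
    a = {i: mean([y for (ci, _), ys in cells.items() if ci == i for y in ys])
         - overall for i in levels_a}
    b = {j: mean([y for (_, cj), ys in cells.items() if cj == j for y in ys])
         - overall for j in levels_b}
    # Interaction: what is left of each cell mean after the main effects.
    ab = {(i, j): mean(ys) - overall - a[i] - b[j]
          for (i, j), ys in cells.items()}

    ss = {
        "a": reps * len(levels_b) * sum(v ** 2 for v in a.values()),
        "b": reps * len(levels_a) * sum(v ** 2 for v in b.values()),
        "ab": reps * sum(v ** 2 for v in ab.values()),
        "error": sum((y - mean(ys)) ** 2 for ys in cells.values() for y in ys),
        "total": sum((y - overall) ** 2 for y in values),
    }
    df = {
        "a": len(levels_a) - 1,
        "b": len(levels_b) - 1,
        "ab": (len(levels_a) - 1) * (len(levels_b) - 1),
        "error": n - len(cells),
    }
    ms_error = ss["error"] / df["error"]
    f = {k: (ss[k] / df[k]) / ms_error for k in ("a", "b", "ab")}
    return ss, df, f


ss, df, f = two_way_anova(cells)
print(ss)
print(df)
print(f)
```

For the table without interaction, the interaction sum of squares and its degrees of freedom would be pooled into the error row before forming the ratios for a and b. Each f value is then checked against the F distribution with the matching degrees of freedom.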
Two Tables
ANOVA Table without interaction
| | Sum of Squares | Degrees of Freedom | Mean Square | Computed f |
| Treatment a | | | | |
| Treatment b | | | | |
| Error | | | | |
| Total | | | | |
ANOVA Table with interaction
| | Sum of Squares | Degrees of Freedom | Mean Square | Computed f |
| Treatment a | | | | |
| Treatment b | | | | |
| Treatment ab | | | | |
| Error | | | | |
| Total | | | | |