CS457 - System Performance Evaluation - Winter 2010
Public Service Announcements
- Mid-term
Lecture 12 - Data Analysis II
Motivation of Analysis of Variance (pdf)
Example - Response Time
Preparing the data
Looking at the Data
Testing the Tentative Conclusions
For example
- The histogram of response times suggested that the typical response
time is about 12 msec.
- Calculate the average
- Remember your statistics and calculate the sample variance and standard
deviation.
- Look back at your tentative conclusions. For example,
- You saw two modes in the histogram
- You saw different ranges in the box plot by reqtype
- A tentative conclusion was response time differs according to
request type.
- Separate the data set by request type
- Analyse the files separately
- You see differences.
- Are they real?
- Analysis of variance is the tool for answering this question (Next
lecture)
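The preparation steps above — summary statistics, then splitting the data set by request type — can be sketched in Python. The request types and response times below are invented for illustration:

```python
import statistics

# Hypothetical response-time measurements (msec), tagged by request type.
observations = [
    ("browse", 11.2), ("search", 31.8), ("browse", 12.9),
    ("browse", 12.1), ("search", 29.4), ("search", 30.6),
]

# Overall summary statistics.
times = [t for _, t in observations]
mean = statistics.mean(times)
var = statistics.variance(times)   # sample variance (divisor N-1)
sd = statistics.stdev(times)

# Separate the data set by request type and analyse each part.
by_type = {}
for reqtype, t in observations:
    by_type.setdefault(reqtype, []).append(t)

for reqtype, ts in sorted(by_type.items()):
    print(reqtype, statistics.mean(ts), statistics.stdev(ts))
```

Splitting first and then summarizing each cell is exactly the preparation that the analysis of variance below builds on.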
When your data has passed the tests
Measurement is finished. How do you use the results?
Example. You saw a difference of 20 msec between response time for
reqtype1 (browsing) and reqtype2 (searching).
- The difference is real (= statistically significant).
- You calculated mean(browsing) and mean(searching)
- You calculated the std dev. of the difference, roughly the square
root of ( (variance(browsing) + variance(searching)) / N )
- The difference between the means was sufficiently larger than the std
dev. of the difference
- Does anybody care about the difference? Probably not.
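A sketch of this check, using the rough formula from the notes for the standard deviation of the difference; the per-request samples below are invented, chosen so the means differ by about 20 msec as in the example:

```python
import statistics

# Hypothetical per-request response times (msec); assumed, not measured.
browsing  = [12.0, 11.5, 13.1, 12.4, 11.8, 12.6]
searching = [31.9, 32.4, 30.8, 33.0, 31.5, 32.2]

n = len(browsing)   # equal group sizes assumed for this rough formula
diff = statistics.mean(searching) - statistics.mean(browsing)

# Rough standard deviation of the difference, as in the notes:
# sqrt((variance(browsing) + variance(searching)) / N)
sd_diff = ((statistics.variance(browsing)
            + statistics.variance(searching)) / n) ** 0.5

# The difference is "real" when it is several times sd_diff.
print(f"difference = {diff:.1f} +/- {sd_diff:.1f} msec")
```

Here the difference is dozens of times its standard deviation, so it is statistically significant; whether anybody cares about it is a separate question.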
Example. You saw that each increment of 1 in the load makes an increment
of 500 msec in the response time for browsing.
- The difference is real (= statistically significant).
- Does anybody care about the difference? Maybe. You need more
information.
- Currently average load is 3.
- The load is expected to double over the next year.
- Yes! Somebody should upgrade the server during the next six months.
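The back-of-the-envelope projection behind that conclusion, assuming the load/response-time relationship stays linear over the projected range:

```python
# Numbers from the example above.
slope_msec_per_load = 500      # each +1 in load adds ~500 msec (browsing)
current_load = 3
projected_load = 2 * current_load   # load expected to double over the year

# Extra response time once the load doubles, assuming linearity holds.
extra_delay = (projected_load - current_load) * slope_msec_per_load
print(f"projected extra response time: {extra_delay} msec")
```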
Analysis of Variance (ANOVA) aka Linear Models
Zero Factor Analysis of Variance
New Concepts
- Thinking about variance reduction
- Degrees of freedom
- Null hypothesis
General idea
- Here is a list of ten genuinely random 2-digit numbers
53, 81, 98, 12, 59, 40, 40, 39, 43, 69
- Summary Statistics of the Random Numbers

                                  Observed    Theoretical
    Count                            10           10
    Mean                             53.4         49.5
    Difference                        3.8          0.0
    Variance                        541.4        833.3
    Standard deviation               24.5         28.9
    Standard deviation of mean        7.8          9.6
- The variance assuming that the mean is 49.5 is 556.7
- Here is a list of ten genuinely random 2-digit numbers with 15
subtracted from each
- 84, 60, -12, 28, 68, 66, -1, 11, 9, 5
- Summary Statistics

                                  Observed    Theoretical
    Count                            10           10
    Mean                             31.8         49.5
    Difference                      -17.7          0.0
    Variance                       1068.0        833.3
    Standard deviation               34.4         28.9
    Standard deviation of mean       10.9          9.6
- The variance assuming that the mean is 49.5 is 1381.3.
- In each case we ask whether the reduction in the variance (15.3 in the
first case, 313.3 in the second) is statistically significant.
- To work it out from first principles
- Assume underlying distribution from which data is drawn has finite
variance.
- N data points
- Mean is normally distributed
- Variance is distributed by a chi-square distribution with
N degrees of freedom
- The treatment variance, the difference between the two variances, is
distributed by a chi-square distribution with N1 = 1 degree of
freedom.
- The ratio between the treatment variance and the remaining error
variance is distributed by an F distribution with
N1 and N-N1 degrees of freedom
- Check the ratio ( (15.3 * (N-N1)) / (541.4 * N1) in the first case,
(313.3 * (N-N1)) / (1068.0 * N1) in the second ) against the
percentage points of the F distribution
- Of course, for a test as simple as this there are other, and better,
ways of doing the test.
- In fact we would be likely just to write
- Difference = 3.8 plus-or-minus 7.8 or
- Difference = -17.7 plus-or-minus 10.9
- Note. Some authors use twice the standard
deviation rather than the standard deviation.
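The zero-factor calculation on the first list of numbers can be sketched directly. The data and the null mean come from the notes; the code follows the notes' convention of dividing sums of squares by N:

```python
# Zero-factor ANOVA from first principles, on the first list above.
data = [53, 81, 98, 12, 59, 40, 40, 39, 43, 69]
null_mean = 49.5    # mean of the 2-digit uniform distribution
n = len(data)
n1 = 1              # one degree of freedom for the fitted mean

mean = sum(data) / n

# Variance about the assumed mean, and about the fitted mean
# (the notes' ~556.7 and ~541.4).
var_null = sum((x - null_mean) ** 2 for x in data) / n
var_fit  = sum((x - mean) ** 2 for x in data) / n

# The reduction in variance is the treatment variance.
treatment = var_null - var_fit

# Ratio of treatment variance to error variance, degrees of
# freedom taken into account; compare against F(n1, n - n1).
f_ratio = (treatment * (n - n1)) / (var_fit * n1)
print(f"F = {f_ratio:.3f} on ({n1}, {n - n1}) degrees of freedom")
```

The ratio is far below any usual F critical value, which agrees with the quick check "3.8 plus-or-minus 7.8": the reduction is not significant.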
One Factor Analysis of Variance
New Concepts
- A linear model that isn't linear in the sense of regression
- Degrees of freedom
The Linear Model
- There are N data points.
- Assume that the factor has M levels i =
1..M
- Assume the data is well described by the model: overall_mean +
a_i + error
- There is one degree of freedom for the overall mean
- There are M-1 degrees of freedom for the levels of the
factor: sum_i a_i = 0
- There are N-M degrees of freedom for the
error
- Assume that the error is normally distributed and uncorrelated.
- The null hypothesis is that all the a_i are zero.
The Calculation
- Remove the overall mean from the data, calculate the total variance
- Separate the data into cells one for each level of the factor
- Find the a_i that best fit the data, which are just the means
of the corresponding cells.
- Calculate the remaining variance, which is the error variance.
- The difference between the total variance and the error variance is the
treatment variance.
- Form the ratio of the treatment variance and the error variance with
degrees of freedom taken into account.
- Check against the percentage points of the F distribution.
- If the result is significant then at least one of the coefficients
a_i of the model is different from zero.
- We say that there is a significant effect of a.
- Auxiliary tests are needed to determine which of the a_i
are non-zero,
- and which of the a_i are different from one another.
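The calculation above can be sketched end to end. The two cells and their values below are invented for illustration; everything else follows the steps in the notes:

```python
# One-factor ANOVA sketch with M = 2 levels of the factor.
cells = {
    "level_1": [10.0, 12.0, 14.0],   # hypothetical example values
    "level_2": [20.0, 22.0, 24.0],
}
all_data = [x for xs in cells.values() for x in xs]
n, m = len(all_data), len(cells)

grand_mean = sum(all_data) / n

# Total sum of squares after removing the overall mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_data)

# Error sum of squares: residuals about each cell's own mean
# (the best-fitting a_i are exactly the cell means).
ss_error = 0.0
for xs in cells.values():
    cell_mean = sum(xs) / len(xs)
    ss_error += sum((x - cell_mean) ** 2 for x in xs)

# The difference is the treatment sum of squares.
ss_treatment = ss_total - ss_error

# Ratio with degrees of freedom taken into account: M-1 for the
# treatment, N-M for the error; compare against F(M-1, N-M).
f_ratio = (ss_treatment / (m - 1)) / (ss_error / (n - m))
print(f"F = {f_ratio:.2f} on ({m - 1}, {n - m}) degrees of freedom")
```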
Two Factor Analysis of Variance
New Concepts
- Main effects versus interactions
The Linear Model
- Same assumptions as above,
- Except, assume that the data is well described by the model:
overall_mean + a_i + b_j + ab_ij + error
- There is one degree of freedom for the overall mean
- There are M_a - 1 degrees of freedom for the first
factor
- There are M_b - 1 degrees of freedom for the second
factor
- There are M_a * M_b - M_a - M_b
+ 1 = (M_a - 1) * (M_b - 1) degrees of freedom for the interaction
- There are N - M_a * M_b degrees of
freedom for the error
- Three ways to consider the model
- Main effects only, no interactions
- Main effects plus interactions
- Interactions only, no main effects
- Null hypothesis: All of the terms a_i, b_j,
ab_ij are zero.
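The degree-of-freedom bookkeeping above can be checked mechanically; the factor sizes and N below are invented for illustration:

```python
# Degrees-of-freedom accounting for the two-factor model.
m_a, m_b = 3, 4    # hypothetical numbers of levels of factors a and b
n = 36             # hypothetical total observations (3 per cell)

df_mean = 1
df_a = m_a - 1
df_b = m_b - 1
df_ab = m_a * m_b - m_a - m_b + 1    # equals (m_a - 1) * (m_b - 1)
df_error = n - m_a * m_b

# The degrees of freedom partition the N data points exactly.
assert df_mean + df_a + df_b + df_ab + df_error == n
print(df_mean, df_a, df_b, df_ab, df_error)
```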
The Calculation