CS457  System Performance Evaluation  Winter 2010
Public Service Announcements
 Midterm
Lecture 12  Data Analysis II
Motivation of Analysis of Variance (pdf)
Example  Response Time
Preparing the data
Looking at the Data
Testing the Tentative Conclusions
For example
 The histogram of response times suggested a typical response time of about 12
msec.
 Calculate the average
 Remember your statistics and calculate the sample variance and standard
deviation.
 Look back at your tentative conclusions. For example,
 You saw two modes in the histogram
 You saw different ranges in the box plot by reqtype
 A tentative conclusion was response time differs according to
request type.
 Separate the data set by request type
 Analyse the files separately
 You see differences.
 Are they real?
 Analysis of variance is the tool for answering this question (Next
lecture)
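A minimal sketch of these steps in Python (the data, request-type names, and values here are hypothetical):

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical measurements: (request type, response time in msec)
samples = [("browse", 11.8), ("browse", 12.4), ("browse", 12.1),
           ("search", 31.5), ("search", 33.0), ("search", 32.2)]

# Separate the data set by request type
by_type = defaultdict(list)
for reqtype, resp in samples:
    by_type[reqtype].append(resp)

# Analyse the groups separately: mean and sample standard deviation
for reqtype, values in sorted(by_type.items()):
    print(reqtype, round(mean(values), 2), round(stdev(values), 2))
```

Whether the differences this reveals are real is exactly the question analysis of variance answers.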
When your data has passed the tests
Measurement is finished. How do you use the results?
Example. You saw a difference of 20 msec between response time for
reqtype1 (browsing) and reqtype2 (searching).
 The difference is real (= statistically significant).
 You calculated mean(browsing) and mean(searching)
 You calculated the std dev. of the difference, roughly the square
root of ( (variance(browsing) + variance(searching)) / N )
 The difference between the means was sufficiently larger than the std
dev. of the difference
 Does anybody care about the difference? Probably not.
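The significance check in this example can be sketched as follows (the response-time samples are hypothetical; the rough formula above for the std dev. of the difference is used directly):

```python
from math import sqrt
from statistics import mean, pvariance

# Hypothetical response times (msec) for the two request types
browsing  = [102, 95, 110, 98, 105, 99, 101, 97, 104, 100]
searching = [121, 118, 125, 119, 122, 117, 124, 120, 123, 118]

n = len(browsing)                        # assume equal sample sizes
diff = mean(searching) - mean(browsing)  # about 20 msec, as in the example

# Std dev of the difference: sqrt((variance(browsing) + variance(searching)) / N)
sd_diff = sqrt((pvariance(browsing) + pvariance(searching)) / n)

# "Sufficiently larger": here, a common rule of thumb of two standard deviations
significant = abs(diff) > 2 * sd_diff
```

Whether anybody cares about a real 20 msec difference is a separate, non-statistical question.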
Example. You saw that each increment of 1 in the load makes an increment
of 500 msec in the response time for browsing.
 The difference is real (= statistically significant).
 Does anybody care about the difference? Maybe. You need more
information.
 Currently average load is 3.
 The load is expected to double over the next year.
 Yes! Somebody should upgrade the server during the next six months.
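The reasoning behind the upgrade recommendation is simple arithmetic; a sketch, using the numbers from the example above:

```python
# Each +1 in load adds about 500 msec to browsing response time
slope_msec_per_load = 500

current_load = 3.0
expected_load = 2 * current_load      # load expected to double over the year

# Projected extra delay if nothing is done
extra_delay_msec = slope_msec_per_load * (expected_load - current_load)
# 500 * 3 = 1500 msec of added response time: hence the upgrade
```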
Analysis of Variance (ANOVA) aka Linear Models
Zero Factor Analysis of Variance
New Concepts
 Thinking about variance reduction
 Degrees of freedom
 Null hypothesis
General idea
 Here is a list of ten genuinely random 2-digit numbers
53, 81, 98, 12, 59, 40, 40, 39, 43, 69

Summary Statistics of the Random Numbers

                                 Sample    Theoretical (uniform 0..99)
  Count                              10        10
  Mean                             53.4      49.5
  Difference from 49.5              3.9       0.0
  Variance                        541.4     833.3
  Standard deviation               24.5      28.9
  Standard deviation of mean        7.8       9.6
 Variance assuming that mean is 49.5 is 556.7
 Here is a list of ten genuinely random 2-digit numbers with 15
subtracted from each
 84, 60, -12, 28, 68, 66, -1, 11, 9, 5

Summary Statistics

                                 Sample    Theoretical (uniform 0..99)
  Count                              10        10
  Mean                             31.8      49.5
  Difference from 49.5             17.7       0.0
  Variance                       1068.0     833.3
  Standard deviation               34.4      28.9
  Standard deviation of mean       10.9       9.6
 Variance assuming that mean is 49.5 is 1381.3.
 In each case we ask whether the reduction in the variance (15.3 in the
first case, 313.3 in the second) is statistically significant.
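The numbers in both tables can be reproduced with a short script. Two caveats: the notes divide by N (not N - 1) for the variance, and the second list is taken here as 84, 60, -12, 28, 68, 66, -1, 11, 9, 5, since the two negative entries are needed to reproduce the tabulated mean of 31.8 and variance of 1068.0.

```python
from statistics import mean, pvariance

first  = [53, 81, 98, 12, 59, 40, 40, 39, 43, 69]
second = [84, 60, -12, 28, 68, 66, -1, 11, 9, 5]

def variance_about(data, m):
    # Variance about an assumed mean m, divisor N as in the notes
    return sum((x - m) ** 2 for x in data) / len(data)

for data in (first, second):
    print(mean(data),                            # sample mean
          round(pvariance(data), 1),             # variance about the sample mean
          round(variance_about(data, 49.5), 1))  # variance about 49.5
```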
 To work it out from first principles
 Assume the underlying distribution from which the data are drawn has finite
variance.
 N data points
 Mean is normally distributed
 Variance about the assumed mean (49.5) is distributed as a chi-square
distribution with N degrees of freedom; the variance about the
sample mean has N - 1 degrees of freedom
 The treatment variance, the difference between the two variances, is
distributed as a chi-square distribution with N - (N - 1) = 1
degree of freedom
 The ratio of the treatment variance to the remaining error variance,
each divided by its degrees of freedom, is distributed as an
F distribution with 1 and N - 1 degrees of freedom
 Check the ratio (15.3 * (N - 1) / 541.4, about 0.25, in the first case;
313.3 * (N - 1) / 1068.0, about 2.64, in the second) against the
percentage points of the F distribution
 Of course, for a test as simple as this there are other, and better,
ways of doing the test.
 In fact we would likely just write
 Difference = 3.9 ± 7.8 or
 Difference = 17.7 ± 10.9
 Note. Some authors use twice the standard
deviation rather than the standard deviation.
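The F-ratio check from first principles can be sketched in a few lines (pure Python; the 5% point of F(1, 9), about 5.12, is quoted from standard tables):

```python
def f_ratio(total_var, error_var, n):
    # total_var: variance about the assumed mean (n degrees of freedom)
    # error_var: variance about the sample mean (n - 1 degrees of freedom)
    # Both are sums of squares divided by n, so the n's cancel and only
    # the degrees-of-freedom correction remains.
    treatment_var = total_var - error_var       # 1 degree of freedom
    return (treatment_var / 1) / (error_var / (n - 1))

f1 = f_ratio(556.7, 541.4, 10)    # first list: about 0.25
f2 = f_ratio(1381.3, 1068.0, 10)  # second list: about 2.64

# Neither ratio exceeds the 5% point of F(1, 9), about 5.12, so neither
# reduction in variance is statistically significant at that level.
```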
One Factor Analysis of Variance
New Concepts
 A linear model that isn't linear in the sense of regression
 Degrees of freedom
The Linear Model
 There are N data points.
 Assume that the factor has M levels i =
1..M
 Assume the data is well described by the model: overall_mean +
a_i + error
 There is one degree of freedom for the overall mean
 There are M - 1 degrees of freedom for the levels of the
factor: sum_i a_i = 0
 There are N - M degrees of freedom for the
error
 Assume that the error is normally distributed and uncorrelated.
 The null hypothesis is that all the a_i are zero.
The Calculation
 Remove the overall mean from the data, calculate the total variance
 Separate the data into cells one for each level of the factor
 Find the a_i that best fit the data, which are just the means
of the corresponding cells.
 Calculate the remaining variance, which is the error variance.
 The difference between the total variance and the error variance is the
treatment variance.
 Form the ratio of the treatment variance and the error variance with
degrees of freedom taken into account.
 Check against the percentage points of the F distribution.
 If the result is significant then at least one of the coefficients
a_i of the model is different from zero.
 We say that there is a significant effect of a.
 Auxiliary tests are needed to determine which of the a_i
are nonzero,
 and which of the a_i are different from one another.
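The calculation above can be sketched end to end (pure Python; the function name and the three-level example data are mine):

```python
from statistics import mean

def one_way_anova(cells):
    # cells: one list of observations per level of the factor
    data = [x for cell in cells for x in cell]
    n, m = len(data), len(cells)
    grand_mean = mean(data)

    # Remove the overall mean: total sum of squares
    ss_total = sum((x - grand_mean) ** 2 for x in data)

    # The best-fitting a_i are the cell means; what remains is the error
    ss_error = sum(sum((x - mean(cell)) ** 2 for x in cell)
                   for cell in cells)

    # The difference is the treatment sum of squares
    ss_treatment = ss_total - ss_error

    # Ratio with degrees of freedom taken into account: F(m - 1, n - m)
    return (ss_treatment / (m - 1)) / (ss_error / (n - m))

# Hypothetical response times (msec) for three request types
f = one_way_anova([[12, 11, 13, 12], [31, 33, 32, 30], [21, 20, 22, 21]])
```

A large F, checked against the percentage points of the F distribution with M - 1 and N - M degrees of freedom, means at least one a_i is nonzero.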
Two Factor Analysis of Variance
New Concepts
 Main effects versus interactions
The Linear Model
 Same assumptions as above,
 Except, assume that the data is well described by the model:
overall_mean + a_i + b_j + ab_ij + error
 There is one degree of freedom for the overall mean
 There are M_a - 1 degrees of freedom for the first
factor
 There are M_b - 1 degrees of freedom for the second
factor
 There are (M_a - 1) * (M_b - 1) = M_a * M_b - M_a - M_b
+ 1 degrees of freedom for the interaction
 There are N - M_a * M_b degrees of
freedom for the error
 Three ways to consider the model
 Main effects only, no interactions
 Main effects plus interactions
 Interactions only, no main effects
 Null hypothesis: All of the terms a_i, b_j,
ab_ij are zero.
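The degrees-of-freedom bookkeeping is easy to get wrong; a small sketch (the function name and level counts are mine):

```python
def two_factor_df(n, m_a, m_b):
    # Degrees of freedom for the two-factor linear model
    df = {
        "overall_mean": 1,
        "factor_a":     m_a - 1,
        "factor_b":     m_b - 1,
        # (M_a - 1)(M_b - 1) = M_a * M_b - M_a - M_b + 1
        "interaction":  (m_a - 1) * (m_b - 1),
        "error":        n - m_a * m_b,
    }
    assert sum(df.values()) == n   # the degrees of freedom always sum to N
    return df

# e.g. 2 levels of a, 3 levels of b, 4 replicates per cell: N = 24
df = two_factor_df(n=24, m_a=2, m_b=3)
```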
The Calculation