CS457 - System Performance Evaluation - Winter 2010
Public Service Announcements
- Mid-term
Lecture 11 - Data Analysis I
Exploratory Data Analysis (EDA)
Rule 1
Use a subset of the data if there will not be more available.
- Computers produce enough log data to overwhelm most EDA programs.
Rule 2
Use your eyes
- Your visual system is a better pattern finder than any computer
Rule 3
Split the data
- You see an indistinct pattern.
- Select the data in which the pattern exists: the pattern is now
distinct
- Ask how the selected data differs from the unselected data
Potentially Useful Displays of Data
- Bar charts & box plots
- discrete independent variables
- Box plots are usually more informative
- Histograms & scatter plots
- Histograms better for very dense data
- Scatter plots better with highly selected data
- Three-dimensional plots
- Summaries of partitioned data
- Many statistics to select:
- mean, median, percentile points,
- measures of variability, such as
- standard deviation, range
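A minimal sketch of summarizing partitioned data, using Python's standard statistics module. The observations and the reqtype labels are invented for illustration and do not come from any real log.

```python
import statistics
from collections import defaultdict

# Hypothetical (reqtype, response time in msec) pairs; not from a real log.
observations = [
    ("browse", 9), ("browse", 11), ("browse", 10), ("browse", 14),
    ("search", 30), ("search", 34), ("search", 29), ("search", 35),
]

def summarize(values):
    """Return a few location and variability statistics for one partition."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),   # sample standard deviation
        "range": max(values) - min(values),
    }

def summarize_by_key(pairs):
    """Partition (key, value) pairs by key and summarize each partition."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: summarize(values) for key, values in groups.items()}

summaries = summarize_by_key(observations)
for key, stats in summaries.items():
    print(key, stats)
```

Comparing the rows side by side is usually the point: a partition whose mean or range stands apart is a candidate for a closer look.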
Important: Look for data that doesn't belong, then look
back at it in the log, paying attention to its environment.
Motivation of Analysis of Variance (pdf)
Example - Response Time
What I am doing here is typical and schematic. When you are processing a
log you will execute different commands than the exact ones I give.
Preparing the data
Do not use the commands below verbatim on any log file. Every log file has a
different format. You can never avoid step 1 below.
- Study the log to find out how the information is formatted.
- Put a sequence number on the data in case you want to get the order
back
- Get rid of all but arrivals and departures
grep -E "arrival|departure"
- Now we have only arrivals and departures
- Sort the data on reqid.
sort -t: -n -k4,4
- Now all requests with the same id are together and arrival precedes
departure
- Could add on the system load at this point
- Now join the lines
sed '/arrival/{N;s/\n/ /;}'
- Now we have arrival and departure on one line.
- You probably want to clean up duplicate and redundant fields
- Now calculate the response time
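The same preparation can be sketched in Python. The log layout used here (seq:time_msec:event:reqid:reqtype) is an assumption made up for illustration, and the sketch matches arrivals to departures with a dictionary instead of the sort-and-join steps above. A real log needs its own parsing.

```python
# Hypothetical log lines: seq:time_msec:event:reqid:reqtype
# This layout is an assumption for illustration; a real log will differ.
LOG = """\
1:1000:arrival:17:browse
2:1003:arrival:18:search
3:1004:debug:0:none
4:1012:departure:17:browse
5:1040:departure:18:search
"""

def response_times(log_text):
    """Pair each arrival with its departure and return per-request records."""
    arrivals = {}
    records = []
    for line in log_text.splitlines():
        seq, time_msec, event, reqid, reqtype = line.split(":")
        if event == "arrival":
            arrivals[reqid] = int(time_msec)
        elif event == "departure" and reqid in arrivals:
            arrived = arrivals.pop(reqid)
            records.append({
                "reqid": reqid,
                "reqtype": reqtype,
                "arrival": arrived,
                "response": int(time_msec) - arrived,
            })
        # Every other event type is dropped, as the grep step would do.
    return records

records = response_times(LOG)
for r in records:
    print(r)
```

Keeping the arrival time and reqtype in each record makes the later displays (by arrival time, by reqtype) straightforward.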
Look at the Data
For example
- Histogram of response times
- Bar chart of data separated by
reqtype
- Box plot of response time separated by
reqtype
- Scatterplot of response time by arrival time
- 3D plot of response time by arrival time and load
- Lots of things to do when you start to look at arguments
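As a rough illustration of the first display, here is a text histogram of response times. The values and the 5 msec bin width are hypothetical, chosen only to show what two modes look like.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per bin; each bin is keyed by its lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical response times in msec, invented for illustration.
times = [9, 11, 12, 12, 13, 14, 31, 33, 34, 12, 10, 32]
width = 5

bins = histogram(times, bin_width=width)
for lower, count in bins.items():
    print(f"{lower:3d}-{lower + width - 1:3d} msec | {'#' * count}")
```

A plotting package would normally do this job; the point is only that binning and counting is all a histogram is.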
Test the Tentative Conclusions
For example
- Histogram of response times made the response time seem to be about 12
msec.
- Calculate the average
Suppose you get 11.7853609 msec. What do you think?
- It agrees with the histogram
- Times are provided to 1 msec.
- That is, when the data says 9 msec the true value could be as low as 8 or as
high as 10.
- Calculate the variance (1/3 = 0.33) and standard deviation (1/1.73 =
0.6) of this measurement error
- But we averaged over, say N=10000, observations. Therefore, the s.d. of
the estimate of the mean is 0.6/sqrt(N) = 0.6/100 = 0.006.
- We can, at best, say 11.785.
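The arithmetic above can be checked with a short sketch. It assumes the measurement error is uniformly distributed over plus or minus 1 msec (which is what a variance of 1/3 implies) and that the observations are independent.

```python
import math

half_width = 1.0   # msec; a reading of 9 could be anywhere in [8, 10]
n = 10_000         # number of observations, as in the example above

# Variance of a uniform distribution on [-a, a] is (2a)^2 / 12 = a^2 / 3.
variance = (2 * half_width) ** 2 / 12
std_dev = math.sqrt(variance)

# Averaging n independent readings divides the standard deviation by sqrt(n).
std_error = std_dev / math.sqrt(n)

print(f"variance  = {variance:.3f}")
print(f"std dev   = {std_dev:.3f}")
print(f"std error = {std_error:.4f}")
```

The standard error sits in the third decimal place, which is why digits beyond 11.785 carry no information.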
- Remember your statistics and calculate the sample variance and standard
deviation.
- Essentially E[(RT)^2] - (E[RT])^2
- This is a typical poorly conditioned expression!
- You need a two pass calculation: first compute E[RT], then E[(RT - E[RT])^2]
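The conditioning problem can be seen in a small sketch. The data are hypothetical: a small spread on top of a very large offset, chosen to make the cancellation obvious. Both functions divide by n to match the E[...] notation; the sample variance would divide by n - 1 instead.

```python
def variance_one_pass(xs):
    """Textbook formula E[x^2] - (E[x])^2; loses digits when mean >> spread."""
    n = len(xs)
    mean = sum(xs) / n
    mean_sq = sum(x * x for x in xs) / n
    return mean_sq - mean * mean

def variance_two_pass(xs):
    """First pass finds the mean, second pass averages squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

# Hypothetical data: deviations of -6, -3, 3, 6 around a mean of 1e9 + 10.
data = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]

naive = variance_one_pass(data)
stable = variance_two_pass(data)
print("one pass:", naive)
print("two pass:", stable)   # (36 + 9 + 9 + 36) / 4 = 22.5
```

With response times of a few msec the one-pass formula may well be adequate; the trouble appears when the mean is many orders of magnitude larger than the spread, for example with raw timestamps.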
- Suppose you calculate 4 msec for the standard deviation
- First, 4/sqrt(10000) = 4/100 = 0.04 >> 0.006, so there is too much
variance to be explained by measurement error.
- Where does the extra variance come from?
- Look back at your tentative conclusions. For example,
- You saw two modes in the histogram
- You saw different ranges in the box plot by reqtype
- A tentative conclusion was response time differs according to
request type.
- Separate the data set by request type
- Analyse the files separately
- You see differences.
- Are they real?
- Analysis of variance is the tool for answering this question (Next
lecture)
When your data has passed the tests
Measurement is finished. How do you use the results?
Example. You saw a difference of 20 msec between response time for
reqtype1 (browsing) and reqtype2 (searching).
- The difference is real (= statistically significant).
- Does anybody care about the difference? Probably not.
Example. You saw that each increment of 1 in the load makes an increment
of 500 msec in the response time for browsing.
- The difference is real (= statistically significant).
- Does anybody care about the difference? Maybe. You need more
information.
- Currently average load is 3.
- The load is expected to double over the next year.
- Yes! Somebody should upgrade the server during the next six months.
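The reasoning behind that conclusion is simple arithmetic, sketched below. It assumes the measured linear relation between load and response time keeps holding at loads higher than those observed, which the measurements alone cannot guarantee.

```python
def projected_increase(current_load, growth_factor, msec_per_load_unit):
    """Extra response time if the linear trend holds at the new load."""
    new_load = current_load * growth_factor
    return (new_load - current_load) * msec_per_load_unit

# Numbers from the example: load 3, doubling in a year, 500 msec per load unit.
extra_msec = projected_increase(
    current_load=3, growth_factor=2, msec_per_load_unit=500
)
print(f"Expected extra browsing response time: {extra_msec} msec")
```

An extra 1500 msec on every browsing request is a difference users will notice, which is what turns a statistically significant result into one that matters.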