- Mid-term

Use a subset of the data, even if there will not be more available.

- Computers produce enough log data to overwhelm most EDA programs.

Use your eyes

- Your visual system is a better pattern finder than any computer.

Split the data

- You see an indistinct pattern.
- Select the data in which the pattern exists: the pattern is now distinct.
- Ask how the selected data differs from the unselected data.
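In shell terms, the split is just a pair of greps. A minimal sketch, on a made-up log fragment (the pattern `browse` and the data are hypothetical):

```shell
# Hypothetical log fragment in the format time:reqtype:reqid:responsetype:arguments
LOG='100:browse:1:arrival:x
101:search:2:arrival:y
112:browse:1:departure:x'

# Keep the lines where the pattern appears...
selected=$(printf '%s\n' "$LOG" | grep 'browse')
# ...and the rest, to ask how the two groups differ.
unselected=$(printf '%s\n' "$LOG" | grep -v 'browse')

printf 'selected:\n%s\n' "$selected"
printf 'unselected:\n%s\n' "$unselected"
```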

- Bar charts & box plots
- Use with discrete independent variables.
- Box plots are usually more informative.

- Histograms & scatter plots
- Histograms better for very dense data
- Scatter plots better with highly selected data

- Three-dimensional plots
- Try rotating them.

- Summaries of partitioned data
- Many statistics to select from:
- mean, median, percentile points
- measures of variability, such as standard deviation and range

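As a sketch, several of these summaries fall out of awk and sort directly (assuming one value per line; the values below are made up):

```shell
DATA=$(mktemp)
printf '5\n1\n4\n2\n3\n' > "$DATA"   # stand-in values, one per line

# Mean: accumulate a sum and a count.
mean=$(awk '{s += $1; n += 1} END {print s/n}' "$DATA")

# Median: sort numerically and take the middle row.
median=$(sort -n "$DATA" | awk '{v[NR] = $1} END {print v[int((NR+1)/2)]}')

# Range: max minus min of the sorted values.
range=$(sort -n "$DATA" | awk 'NR==1 {min=$1} {max=$1} END {print max-min}')

echo "mean=$mean median=$median range=$range"
rm -f "$DATA"
```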
**Important**: Look for data that doesn't belong, then look
back at it in the log, paying attention to its environment.

What I am doing here is typical and schematic. When you process a log you will run commands that differ from the exact ones I give.

Do not use the commands below on just any log file. Every log file has a different format. You can never skip step 1 below.

- Study the log to find out how the information is formatted.
- We assume that it is formatted as follows:
`time:reqtype:reqid:responsetype:arguments`

- Put a sequence number on the data in case you want to get the order
back
awk 'BEGIN {n=0} {print n ":" $0; n+=1}'

- Now the data has sequence numbers.
- This allows you to look back in the log using `head` and `tail`.

- Get rid of all but arrivals and departures
grep -E "arrival|departure"

- Now we have only arrivals and departures.

- Sort the data on reqid (now field 4, after the prepended sequence number), breaking ties on the sequence number.
sort -t: -k4,4n -k1,1n

- Now all requests with the same id are together and arrival precedes departure

- Could add on the system load at this point (note: a running arrival/departure count only makes sense while the data is still in time order, so do this before sorting by reqid)
awk 'BEGIN {n=0}
     /arrival/   {n+=1; print $0 ":" n}
     /departure/ {n-=1; print $0 ":" n}'

- Now you have added the load to the end of each line.

- Now join each arrival line with the departure line that follows it
sed '/arrival/{N;s/\n/:/;}'

- Now we have arrival and departure on one line.
- You probably want to clean up duplicate and redundant fields

- Now calculate the response time: departure time minus arrival time ($8 and $2 if the load step was skipped)
awk -F: '{print $0 ":" $8-$2}'

- Now the response time is written at the end of each record
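The steps above can be sketched as one pipeline, run here on a made-up two-request log in the assumed format (the load step is omitted, so the departure time stays in field 8):

```shell
# Made-up log in the assumed format time:reqtype:reqid:responsetype:arguments
LOG='100:browse:1:arrival:x
101:search:2:arrival:y
112:browse:1:departure:x
115:search:2:departure:y'

# 1. add sequence numbers   2. keep only arrivals and departures
# 3. sort by reqid (field 4), sequence number as tie-breaker
# 4. join each arrival/departure pair   5. append the response time
result=$(printf '%s\n' "$LOG" \
  | awk 'BEGIN {n=0} {print n ":" $0; n+=1}' \
  | grep -E "arrival|departure" \
  | sort -t: -k4,4n -k1,1n \
  | sed '/arrival/{N;s/\n/:/;}' \
  | awk -F: '{print $0 ":" $8-$2}')

printf '%s\n' "$result"
```

Each output line now holds the paired arrival and departure records with the response time (12 and 14 msec for the two sample requests) at the end.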

For example

- Histogram of response times
- Bar chart of data separated by `reqtype`
- Box plot of response time separated by `reqtype`
- Scatterplot of response time by arrival time
- 3D plot of response time by arrival time and load
- Lots of things to do when you start to look at arguments

For example

- Histogram of response times suggests the response time is about 12
msec.
- What is it, really?

- Calculate the average

Suppose you get 11.7853609 msec. What do you think?

- It agrees with the histogram.
- Times are provided to 1 msec.
- That is, when the data says 9 msec it could be as low as 8 and as high as 10.
- Calculate the variance of this rounding error (uniform over a width-2 interval: 2^2/12 = 1/3 ≈ 0.33) and its standard deviation (sqrt(1/3) ≈ 0.58, call it 0.6).
- But we averaged over, say, N=10000 observations. Therefore the s.d. of the estimate of the mean is 0.6/sqrt(10000) = 0.6/100 = 0.006.
- We can, at best, say 11.785.

- Remember your statistics and calculate the sample variance and standard
deviation.
- Essentially E[(RT)^2] - (E[RT])^2
- This is a typical poorly conditioned expression!
- You need a two-pass calculation: E[(RT - E[RT])^2]
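A two-pass version in awk might look like this (a sketch: it assumes one response time per line and reads the file twice; the sample values are made up):

```shell
DATA=$(mktemp)
printf '10\n12\n11\n13\n14\n' > "$DATA"   # stand-in response times

# Pass 1 (NR==FNR, i.e. still in the first copy of the file): accumulate the mean.
# Pass 2: sum squared deviations from that mean, avoiding the ill-conditioned
# E[X^2] - (E[X])^2 form.
stats=$(awk 'NR==FNR {sum += $1; n += 1; next}
             FNR==1  {mean = sum/n}
             {d = $1 - mean; ss += d*d}
             END {printf "mean=%g sd=%g", mean, sqrt(ss/(n-1))}' "$DATA" "$DATA")

echo "$stats"
rm -f "$DATA"
```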

- Suppose you calculate 4 msec for the standard deviation
- First, 4/sqrt(10000) = 4/100 = 0.04 >> 0.006, so there is too much variance to be explained by measurement error.
- Where does the extra variance come from?

- Look back at your tentative conclusions. For example,
- You saw two modes in the histogram
- You saw different ranges in the box plot by reqtype
- A tentative conclusion was response time differs according to request type.
- Separate the data set by request type
grep reqtype1 >data1
grep reqtype2 >data2
...

- Analyse the files separately
- You see differences.
- Are they real?

- Analysis of variance is the tool for answering this question (Next lecture)

Measurement is finished. How do you use the results?

Example. You saw a difference of 20 msec between response time for reqtype1 (browsing) and reqtype2 (searching).

- The difference is real (= statistically significant).
- Does anybody care about the difference? Probably not.

Example. You saw that each increment of 1 in the load makes an increment of 500 msec in the response time for browsing.

- The difference is real (= statistically significant).
- Does anybody care about the difference? Maybe. You need more
information.
- Currently average load is 3.
- The load is expected to double over the next year.

- Yes! Somebody should upgrade the server during the next six months.

