
# Lecture 11 - Data Analysis I

## Exploratory Data Analysis (EDA)

#### Rule 1

Use a subset of the data, unless there will not be more available.

• Computers produce enough log data to overwhelm most EDA programs.
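
For instance, a random sample can be drawn with awk (a sketch; the file name `big.log`, the 1% rate, and the seed are all arbitrary choices):

```shell
# Keep roughly 1% of the lines, chosen pseudo-randomly.
# srand(42) fixes the seed so the sample is reproducible.
awk 'BEGIN { srand(42) } rand() < 0.01' big.log > sample.log
```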

#### Rule 2

• Your visual system is a better pattern finder than any computer.

#### Rule 3

Split the data

1. You see an indistinct pattern.
2. Select the data in which the pattern exists: the pattern is now distinct.
3. Ask how the selected data differs from the unselected data.

### Potentially Useful Displays of Data

1. Bar charts & box plots
• discrete independent variables
2. Histograms & scatter plots
• Histograms better for very dense data
• Scatter plots better with highly selected data
3. Three-dimensional plots
• Try rotating them.
4. Summaries of partitioned data
• Many statistics to select from: mean, median, percentile points, and measures of variability such as standard deviation and range.
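
Such per-partition summaries can be computed directly with awk (a sketch, assuming colon-separated records of the form `reqtype:responsetime` in a hypothetical file `data.log`):

```shell
# Mean, min, and max of field 2, grouped by the key in field 1.
awk -F: '
  { sum[$1] += $2; n[$1] += 1
    if (!($1 in min) || $2 < min[$1]) min[$1] = $2
    if (!($1 in max) || $2 > max[$1]) max[$1] = $2 }
  END { for (k in n)
          printf "%s: mean=%.2f min=%d max=%d n=%d\n",
                 k, sum[k] / n[k], min[k], max[k], n[k] }' data.log
```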

Important: Look for data that doesn't belong, then look back at it in the log, paying attention to its environment.

# Motivation of Analysis of Variance (pdf)

## Example - Response Time

What I do here is typical but schematic. When you process a log of your own, the exact commands will differ from the ones given here.

### Preparing the data

Do not use the commands below verbatim on your own log file. Every log file has a different format. You can never avoid step 1 below.

1. Study the log to find out how the information is formatted.
• We assume that it is formatted as follows
• `time:reqtype:reqid:responsetype:arguments`
2. Put a sequence number on the data in case you want to get the order back
• `awk 'BEGIN { n = 0 } { print n ":" $0; n += 1 }'`
• Now the data has sequence numbers.
• This allows you to look back in the log using `head` and `tail`.
3. Get rid of all but arrivals and departures
• `grep -E "arrival|departure"`
• Now we have only arrivals and departures.
4. Sort the data on reqid.
• `sort -t: -s -k4,4n`
• Now all requests with the same id are together, and the stable sort (`-s`) keeps each arrival before its departure. (After step 2 prefixed the sequence number, reqid is the fourth colon-separated field.)
5. Compute the load
• ```awk 'BEGIN { n = 0 }
/arrival/ { n += 1; print $0 ":" n }
/departure/ { n -= 1; print $0 ":" n }'```
• Now you have added the load to the end of each line. (Note that this running count is only meaningful in time order, so you may prefer to run it before the sort in step 4.)
6. Now join the lines
• `sed '/arrival/{N; s/\n/:/;}'`
• Now we have arrival and departure on one line.
• You probably want to clean up duplicate and redundant fields
7. Now calculate the response time
• `awk -F: '{print $0 ":" ($8 - $2)}'`
• Now the response time is written at the end of each record
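
Assembled into one pipeline, the steps above look roughly like this (a sketch with hypothetical file names, omitting the load field of step 5 so that, after joining, field 2 is the arrival time and field 8 the departure time; check the field positions against your own log):

```shell
# raw.log lines:  time:reqtype:reqid:responsetype:arguments
awk 'BEGIN { n = 0 } { print n ":" $0; n += 1 }' raw.log |  # step 2: sequence numbers
grep -E 'arrival|departure' |                               # step 3: arrivals/departures only
sort -t: -s -k4,4n |                                        # step 4: stable sort on reqid
sed '/arrival/{N; s/\n/:/;}' |                              # step 6: join arrival with departure
awk -F: '{ print $0 ":" ($8 - $2) }' > responses.log        # step 7: response time
```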

### Look at the Data

For example

• Histogram of response times
• Bar chart of data separated by `reqtype`
• Box plot of response time separated by `reqtype`
• Scatterplot of response time by arrival time
• 3D plot of response time by arrival time and load
• Lots of things to do when you start to look at arguments
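
As a quick stand-in for a plotting tool, a text histogram can be drawn with awk alone (a sketch, assuming the response time is the last colon-separated field of a hypothetical `responses.log`, with 2-msec buckets):

```shell
awk -F: '{ count[int($NF / 2) * 2]++ }   # 2-msec buckets
         END { for (b in count) {
                 printf "%4d msec | ", b
                 for (i = 0; i < count[b]; i++) printf "#"
                 printf "\n" } }' responses.log | sort -n
```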

### Test the Tentative Conclusions

For example

1. Histogram of response times made the response time seem to be about 12 msec.
• What is it, really?
2. Calculate the average.
• Suppose you get 11.7853609 msec. What do you think?
• It agrees with the histogram.
• But times are recorded to a precision of 1 msec.
• That is, when the data says 9 msec the true value could be anywhere from 8 to 10.
• The variance of this rounding error is 2^2/12 = 1/3 ≈ 0.33, so its standard deviation is sqrt(1/3) ≈ 0.58.
• But we averaged over, say, N = 10000 observations. Therefore the standard deviation of the estimate of the mean is 0.58/sqrt(10000) = 0.58/100 ≈ 0.006.
• We can, at best, quote 11.785.
3. Remember your statistics and calculate the sample variance and standard deviation.
• Essentially E[(RT)^2] - (E[RT])^2
• This is a typical poorly conditioned expression!
• You need a two-pass calculation: first compute E[RT], then E[(RT - E[RT])^2].
• Suppose you calculate 4 msec for the standard deviation
• First, 4/sqrt(10000) = 4/100 = 0.04 >> 0.006, so there is too much variance to be explained by measurement error.
• Where does the extra variance come from?
4. Look back at your tentative conclusions. For example,
• You saw two modes in the histogram
• You saw different ranges in the box plot by reqtype
• A tentative conclusion was response time differs according to request type.
• Separate the data set by request type
• ```grep reqtype1 >data1
grep reqtype2 >data2
...```
• Analyse the files separately
• You see differences.
• Are they real?
• Analysis of variance is the tool for answering this question (Next lecture)
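
The two-pass calculation from step 3 can be sketched in awk (assuming, hypothetically, that the response time is the last colon-separated field of `responses.log`): the first pass computes the mean, the second averages the squared deviations, avoiding the cancellation in E[(RT)^2] - (E[RT])^2.

```shell
# Pass 1: the mean of the last field.
mean=$(awk -F: '{ s += $NF } END { print s / NR }' responses.log)
# Pass 2: average squared deviation from that mean (sample variance).
awk -F: -v m="$mean" '
  { d = $NF - m; ss += d * d }
  END { printf "mean=%s sd=%.3f\n", m, sqrt(ss / (NR - 1)) }' responses.log
```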

### When your data has passed the tests

Measurement is finished. How do you use the results?

Example. You saw a difference of 20 msec between response time for reqtype1 (browsing) and reqtype2 (searching).

• The difference is real (= statistically significant).
• Does anybody care about the difference? Probably not.

Example. You saw that each increment of 1 in the load makes an increment of 500 msec in the response time for browsing.

• The difference is real (= statistically significant).