CS251 - Computer Organization and Design - Spring 2008
Lecture 32 - Cache Misses
Practical Details
- Assignment 7
- Finished pipelined execution
Cache Concepts
How the big picture is implemented
Break the address into pieces
- High-order bits determine which block
- Low-order bits determine where in the block
- Where the break is made determines the number of blocks and the size of
  the blocks
- Here the block is the size of a cache line
The effect is that the same low-order bits apply to many locations in memory.
The cache normally contains an integral number of lines, usually a power
of 2.
- Break the block number into two pieces
- low-order bits: adequate to address each line in the cache
- high-order bits: indicate which block in memory is in that line
- Store the high-order bits with the line (the tag)
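As a concrete illustration of this split (not from the notes), here is a
minimal C sketch assuming 32-bit byte addresses; the names OFFSET_BITS and
INDEX_BITS and the sample address are chosen only for the example.

    /* Minimal sketch of splitting an address, assuming 32-bit byte addresses. */
    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 2   /* byte address within the line (one word here)  */
    #define INDEX_BITS  4   /* enough bits to address each line in the cache */

    int main(void) {
        uint32_t addr   = 0x0000004Cu;                         /* arbitrary example    */
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);    /* where in the line    */
        uint32_t line   = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);  /* stored with the line */
        printf("offset=%u  line=%u  tag=0x%x\n", offset, line, tag);
        return 0;
    }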
When a memory reference occurs
- Break the address into three pieces
  - address in the line
  - address of the line
  - high-order bits of the block number
- Using the line address, retrieve the high-order bits of the block
  number stored with that line.
- Compare them to the corresponding bits of the address.
- If they match (a hit)
  - read from or write to the address
  - on a write the hardware usually also writes main memory
    asynchronously using a write buffer. This is called
    write-through.
  - it is also possible to re-write memory only when the cache line is
    replaced. This is called write-back.
- Otherwise (a miss)
  - Stall the processor
  - Use the block number to get the relevant block from memory
  - When it is installed, rerun the instruction
This way of doing things is called direct mapping.
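A minimal sketch of such a direct-mapped lookup in C, assuming one word per
line and 16 lines; the CacheLine structure and function name are
illustrative, not a description of actual hardware.

    /* Direct-mapped lookup sketch: one word per line, 16 lines. */
    #include <stdbool.h>
    #include <stdint.h>

    #define INDEX_BITS  4
    #define OFFSET_BITS 2
    #define NUM_LINES   (1u << INDEX_BITS)

    typedef struct {
        bool     valid;
        uint32_t tag;    /* high-order bits of the block number */
        uint32_t data;   /* one word per line in this sketch    */
    } CacheLine;

    static CacheLine cache[NUM_LINES];

    /* Returns true on a hit and puts the word in *word.  On a miss the
       processor would stall, fetch the block from memory, install it in
       the line, and rerun the instruction. */
    bool cache_lookup(uint32_t addr, uint32_t *word) {
        uint32_t line_no = (addr >> OFFSET_BITS) & (NUM_LINES - 1);
        uint32_t tag     = addr >> (OFFSET_BITS + INDEX_BITS);
        CacheLine *line  = &cache[line_no];
        if (line->valid && line->tag == tag) {   /* tag match ANDed with valid */
            *word = line->data;
            return true;                         /* hit */
        }
        return false;                            /* miss */
    }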
Examples of Cache
1. One word per cache line, 16 Cache lines
Cache

| Line number | Valid | Tag     | Data   |
| 0000        | 1     | 26 bits | 1 word |
| 0001        | 0     |         |        |
| 0010        | 1     |         |        |
| 0011        | 1     |         |        |
| 0100        | 1     |         |        |
| 0101        | 1     |         |        |
| 0110        | 1     |         |        |
| 0111        | 1     |         |        |
| 1000        | 0     |         |        |
| 1001        | 1     |         |        |
| 1010        | 1     |         |        |
| 1011        | 0     |         |        |
| 1100        | 0     |         |        |
| 1101        | 1     |         |        |
| 1110        | 1     |         |        |
| 1111        | 1     |         |        |

Address

| Line number in memory | Line number in cache | Access size |
| 31-6                  | 5-2                  | 1-0         |
Circuit (p. 476)
- Ignore the access-size bits (1-0)
- Use bits 5-2 to choose the line in the cache
- Test bits 31-6 against the tag in the line
- AND the result with Valid to get HIT
- Return the data in the line plus HIT
Comments
A very basic cache.
Easy to see how to extend to a bigger cache.
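As a check on the field boundaries, take an arbitrary example address such as
0x0000004C: bits 1-0 are 00 (a word-aligned access), bits 5-2 are 0011, so
cache line 0011 is consulted, and bits 31-6 equal 1, which is compared
against the 26-bit tag stored in that line.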
2. Longer cache lines: 16 words per line, 256 lines
Cache

| Line number | Valid | Tag     | Word 0  | Word 1  | ... | Word 15 |
| 00000000    | 1     | 18 bits | 32 bits | 32 bits | ... | 32 bits |
| 00000001    | 1     |         |         |         |     |         |
| 00000010    | 0     |         |         |         |     |         |
| ...         |       |         |         |         |     |         |
| 11111101    | 0     |         |         |         |     |         |
| 11111110    | 1     |         |         |         |     |         |
| 11111111    | 0     |         |         |         |     |         |

Address

| Line number in memory | Line number in cache | Word number in line | Access size |
| 31-14                 | 13-6                 | 5-2                 | 1-0         |
Circuit (p. 486)
- Ignore bits 0-1
- Use bits 13-6 to choose the line in the cache
- Compare bits 31-14 to the tag
- AND the result with Valid to get HIT
- Multiplexer selects word to return based on bits 5-2
Comments
Typical early cache technology
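To make the sizes concrete: the data store is 256 lines × 16 words × 4 bytes
= 16 KB, and each line also carries an 18-bit tag and a valid bit, so the
overhead is 256 × 19 bits = 4864 bits, about 608 bytes.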
Cache Misses
On Read
Send the memory line number to main memory (SDRAM).
SDRAM returns the whole line in one transaction
- Address to DRAM, Data 0 to cache, Data 1 to cache, etc.
- The wait until word 0 arrives is typically 40 nsec
- After that, the time per word is much faster, 2-4 nsec
- The reason is slow row addressing, fast column addressing
The load instruction or instruction fetch is then rerun.
- The pipeline is frozen while this takes place
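With these rough figures, refilling a 16-word line such as the one in
example 2 costs about 40 nsec for word 0 plus 15 × (2-4) nsec for the rest,
roughly 70-100 nsec of frozen pipeline per read miss.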
On Write
Send the memory line number to main memory (SDRAM).
SDRAM returns the whole line in one transaction
- Address to DRAM, Data 0 to cache, Data 1 to cache, etc.
- The wait until word 0 arrives is typically 40 nsec
- After that, the time per word is much faster, 2-4 nsec
The store instruction is then rerun.
- The pipeline is frozen while this takes place.
Main memory must be kept synchronized
- Write-through: two writes happen at once
  - but what if another write occurs immediately after?
    - possibly stall
    - assisted by a hardware write queue
- Write-back: don't synchronize immediately, but only when the cache line
  leaves the cache (a sketch follows this list)
  - some misses are even more expensive, because a dirty line must be
    written back before it can be replaced
  - multi-processing is difficult
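A minimal sketch of the write-back policy in C, extending the illustrative
CacheLine above with a dirty bit; the memory_read_word/memory_write_word
helpers are hypothetical stand-ins for the SDRAM transactions, and allocating
the line on a write miss is an assumption.

    /* Write-back sketch: a store updates only the cache and marks the line
       dirty; main memory is written when a dirty line is evicted. */
    #include <stdbool.h>
    #include <stdint.h>

    #define INDEX_BITS  4
    #define OFFSET_BITS 2
    #define NUM_LINES   (1u << INDEX_BITS)

    typedef struct {
        bool     valid;
        bool     dirty;   /* line has been written since it was fetched */
        uint32_t tag;
        uint32_t data;    /* one word per line keeps the sketch small   */
    } CacheLine;

    static CacheLine cache[NUM_LINES];

    /* Hypothetical stand-ins for the SDRAM transactions. */
    static uint32_t memory_read_word(uint32_t addr)              { (void)addr; return 0; }
    static void     memory_write_word(uint32_t addr, uint32_t w) { (void)addr; (void)w;  }

    void cache_store(uint32_t addr, uint32_t word) {
        uint32_t line_no = (addr >> OFFSET_BITS) & (NUM_LINES - 1);
        uint32_t tag     = addr >> (OFFSET_BITS + INDEX_BITS);
        CacheLine *line  = &cache[line_no];

        if (!(line->valid && line->tag == tag)) {           /* write miss            */
            if (line->valid && line->dirty) {               /* the extra cost:       */
                uint32_t old = (line->tag << (OFFSET_BITS + INDEX_BITS))
                             | (line_no << OFFSET_BITS);
                memory_write_word(old, line->data);         /* write dirty line back */
            }
            line->data  = memory_read_word(addr);           /* fetch the new line    */
            line->tag   = tag;
            line->valid = true;
            line->dirty = false;
        }
        line->data  = word;   /* write only the cache...                   */
        line->dirty = true;   /* ...memory is updated when the line leaves */
    }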
Possible Improvements
Obvious trade-off: the longer the lines, the more nearby data each miss
brings in, but the longer a miss takes and the fewer lines fit in the cache.
The best performance depends on locality in the code.
Typically, cache misses slow the processor by a factor of 2.
Widen and speed up the bus between main memory and the cache.
Send the desired word first, then fill the rest of the line.
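A rough calculation behind the factor of 2 (the numbers are illustrative, not
from the notes): if every instruction needs at least one memory access for
its fetch, a miss stalls for about 40 cycles, and one access in 40 misses,
then the average instruction pays an extra 40/40 = 1 cycle, roughly doubling
its cost.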