## Problem 1. Integrating a processor with a data cache

In this problem, we will consider a program executing on a processor connected to a data cache. The processor is a 5-stage pipelined TinyRV1 processor with **full bypassing** to resolve data hazards. The data cache is a **two-line direct-mapped** cache with 8B cache lines.

The program is calculating accumulated sum of an array.

```
int accu_sum( int src[], int size ) {
  int sum = 0;
  for ( int i = 0; i < size; i++ ) {
    sum += src[i]
    src[i] = sum;
}
return sum;
}</pre>
```

The simplified assembly instructions which correspond to the loop above are as shown below. Study the assembly instructions and make sure that you understand how it corresponds to the C++ program above. For simplicity, assume the following:

- x1: holds the accumulated sum of the src[] array
- x2: holds the base address of the src[] array
- x3: holds the size or loop count

```
1 loop:
2 lw x4, 0(x2)
3 add x1, x1, x4
4 sw x1, 0(x2)
5 addi x2, x2, 4
6 addi x3, x3, -1
7 bne x3, x0, loop
8 opA
9 opB
10 ...
```

## Part 1.A Cache Behavior

Assume the data cache is empty initially. We also assume the base address of the src array stored in x2 is 0x2000. The table below shows the first few dynamic memory transations. Fill in the table. What's the cache miss rate of this program? Can you think of any way to reduce cache miss rate? Will larger cache help?



|           |        |     |          | Set    |                |
|-----------|--------|-----|----------|--------|----------------|
|           | tag    | idx | hit/miss | 0      | 1              |
| rd 0x2000 | 0× 200 | 0   | M        |        |                |
| wr 0x2000 | 0×200  | 0   | h        | OxSno  |                |
| rd 0x2004 | 0×200  | 0   | h        | Orso   |                |
| wr 0x2004 | 0×500  | 0   | h        | 02500  | - 1 5 KM 85 KM |
| rd 0x2008 | 0×500  | . 1 | M        | Oxxio  |                |
| wr 0x2008 | 0x20   | 1   | h        | DXZNO  | OKYNO          |
| rd 0x200c | Drsno  | 1   | h        | Oxzu   | ()×ZvO         |
| wr 0x200c | Ox 200 | 1   | h        | (J×Jus | Oxen           |
| rd 0x2010 | Ox 25  | 0   | W        | DXSVD  | OXXD           |
| wr 0x2010 | 0×20   | O   | h        | 105201 | Oxsvo          |
| rd 0x2014 | 0×20   | 0   | h        | احتمل  | OKXO           |
| wr 0x2014 | 0×2×0  | 0   | h        | 105×0  | 0×200          |
| rd 0x2018 | 0×201  |     | M        | lac×0  | (الايمال       |

miss rate is 25%.

larger cache cannot help reduce cache miss rate of this program.

Solutions: (1) large cache lines.

(2) software/hardware prefetching.

Part 1.B Pipeline Diagram Illustrating Cache Behavior

Now let's try to draw the pipeline diagram of the first two iterations of the loop. We also include the first instruction of the third iteration as shown in the instruction sequence.

Assume the miss penalty is two cycles. So on a cache miss, the processor stalls for two extra cycles in the M stage. Note that the cache is not pipelined so please be aware of potential structual hazards in the M stage.

Draw arrows in your pipeline diagram to indicate any data or control hazards. Ignoring the startup overhead, what's the CPI of the first iteration? What's the CPI of the second iteration? What's the overall CPI if loop size is 64? If cache lines are 16B instead of 8B, what's the overall CPI?

