1 Processor Microarchitectural Design Patterns 3
  1.1. Transactions and Steps .......................................................... 3
  1.2. Microarchitecture: Control/Datapath Split .......................... 4

2 TinyRV1 Single-Cycle Processor 5
  2.2. Single-Cycle Processor Datapath ........................................... 7
  2.3. Single-Cycle Processor Control Unit ..................................... 13
  2.4. Analyzing Performance ....................................................... 13

3 TinyRV1 FSM Processor 16
  3.1. High-Level Idea for FSM Processors .................................... 17
  3.2. FSM Processor Datapath ..................................................... 17
  3.3. FSM Processor Control Unit ................................................. 24
  3.4. Analyzing Performance ........................................................ 28

4 TinyRV1 Pipelined Processor 30
  4.1. High-Level Idea for Pipelined Processors ............................ 31
4.2. Pipelined Processor Datapath and Control Unit .................................. 33

5 Pipeline Hazards: RAW Data Hazards .................................................... 38
  5.1. Expose in Instruction Set Architecture ............................................... 40
  5.2. Hardware Stalling ............................................................................. 41
  5.3. Hardware Bypassing/Forwarding ....................................................... 42
  5.4. RAW Data Hazards Through Memory ............................................... 46

6 Pipeline Hazards: Control Hazards ......................................................... 47
  6.1. Expose in Instruction Set Architecture ............................................... 49
  6.2. Hardware Speculation ..................................................................... 50
  6.3. Interrupts and Exceptions ................................................................. 53

7 Pipeline Hazards: Structural Hazards ...................................................... 58
  7.1. Expose in Instruction Set Architecture ............................................... 59
  7.2. Hardware Stalling ............................................................................. 60
  7.3. Hardware Duplication ...................................................................... 61

8 Pipeline Hazards: WAW and WAR Name Hazards .................................... 62
  8.1. Software Renaming ......................................................................... 63
  8.2. Hardware Stalling ............................................................................ 64

9 Summary of Processor Performance ......................................................... 64

10 Case Study: Transition from CISC to RISC ........................................... 68
  10.1. Example CISC: IBM 360/M30 ......................................................... 69
  10.2. Example RISC: MIPS R2K ............................................................. 72

Topic 2: Fundamental Processor Microarchitecture 2
1. Processor Microarchitectural Design Patterns

\[
\text{Time/Program} = \frac{\text{Instructions/Program}}{\text{Avg Cycles/Instruction}} \times \frac{\text{Avg Cycles/Instruction}}{\text{Time/Cycle}}
\]

- Instructions / program depends on source code, compiler, ISA
- Avg cycles / instruction (CPI) depends on ISA, microarchitecture
- Time / cycle depends upon microarchitecture and implementation

<table>
<thead>
<tr>
<th>Microarchitecture</th>
<th>CPI</th>
<th>Cycle Time</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single-Cycle Processor</td>
<td>1</td>
<td>long</td>
</tr>
<tr>
<td>FSM Processor</td>
<td>&gt;1</td>
<td>short</td>
</tr>
<tr>
<td>Pipelined Processor</td>
<td>≈1</td>
<td>short</td>
</tr>
</tbody>
</table>

1.1. Transactions and Steps

- We can think of each instruction as a transaction
- Executing a transaction involves a sequence of steps

<table>
<thead>
<tr>
<th></th>
<th>add</th>
<th>addi</th>
<th>mul</th>
<th>lw</th>
<th>sw</th>
<th>jal</th>
<th>jr</th>
<th>bne</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fetch</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Decode</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Read Registers</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Register Arithmetic</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Read Memory</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Write Memory</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Write Registers</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Update PC</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
</tbody>
</table>
1.2. Microarchitecture: Control/Datapath Split

![Diagram of Control Unit, Datapath, and Memory with control signals and status signals]
2. TinyRV1 Single-Cycle Processor

\[
\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Avg Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}
\]

- Instructions / program depends on source code, compiler, ISA
- Avg cycles / instruction (CPI) depends on ISA, microarchitecture
- Time / cycle depends upon microarchitecture and implementation

<table>
<thead>
<tr>
<th>Microarchitecture</th>
<th>CPI</th>
<th>Cycle Time</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single-Cycle Processor</td>
<td>1</td>
<td>long</td>
</tr>
<tr>
<td>FSM Processor</td>
<td>&gt;1</td>
<td>short</td>
</tr>
<tr>
<td>Pipelined Processor</td>
<td>≈1</td>
<td>short</td>
</tr>
</tbody>
</table>

Technology Constraints

- Assume technology where logic is not too expensive, so we do not need to overly minimize the number of registers and combinational logic
- Assume multi-ported register file with a reasonable number of ports is feasible
- Assume a dual-ported combinational memory
2.1. High-Level Idea for Single-Cycle Processors

<table>
<thead>
<tr>
<th></th>
<th>add</th>
<th>addi</th>
<th>mul</th>
<th>lw</th>
<th>sw</th>
<th>jal</th>
<th>jr</th>
<th>bne</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fetch Instruction</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Decode Instruction</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Read Registers</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Register Arithmetic</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Read Memory</td>
<td></td>
<td></td>
<td></td>
<td>✓</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Write Memory</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>✓</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Write Registers</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td></td>
<td>✓</td>
</tr>
<tr>
<td>Update PC</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
</tbody>
</table>

![Single-Cycle Diagram](image)
2.2. Single-Cycle Processor Datapath

**ADD**

\[
\text{add } rd, rs1, rs2 \\
R[rd] \leftarrow R[rs1] + R[rs2] \\
PC \leftarrow PC + 4
\]

**ADDI**

\[
\text{addi } rd, rs1, \text{imm} \\
R[rd] \leftarrow R[rs1] + \text{sext(imm)} \\
PC \leftarrow PC + 4
\]
Implementing ADD and ADDI Instructions

MUL
mul rd, rs1, rs2
R[rd] ← R[rs1] × R[rs2]
PC ← PC + 4
**LW**

\[
\text{lw } rd, \text{ imm}(rs1) \\
R[rd] \leftarrow M[R[rs1] + \text{sext}(\text{imm})] \\
\text{PC} \leftarrow \text{PC} + 4
\]

**SW**

\[
\text{sw } rs2, \text{ imm}(rs1) \\
M[R[rs1] + \text{sext}(\text{imm})] \leftarrow R[rs2] \\
\text{PC} \leftarrow \text{PC} + 4
\]
### JAL

**jal rd, imm**

\[ R[rd] \leftarrow PC + 4 \]

\[ PC \leftarrow PC + \text{sext}(\text{imm}) \]

\[ \text{imm} = \{ \text{inst}[31], \text{inst}[19:12], \text{inst}[20], \text{inst}[30:21], 0 \} \]

---

### JR

**jr rs1**

\[ PC \leftarrow R[rs1] \]

---

**Topic 2: Fundamental Processor Microarchitecture**
### BNE

\[ \text{bne rs1, rs2, imm} \]

\[
\text{if (R[rs1] \neq R[rs2]) PC ← PC + sext(imm)} \\
\text{else} \quad \text{PC ← PC + 4}
\]

\[
\begin{array}{c|c|c|c|c|c|c|c|c|c}
\text{imm} & \text{rs2} & \text{rs1} & 001 & \text{imm} & 1100011 \\
\end{array}
\]

\[
\text{imm} = \{ \text{inst}[31], \text{inst}[7], \text{inst}[30:25], \text{inst}[11:8], 0 \}
\]
Adding a New Auto-Incrementing Load Instruction

Draw on the datapath diagram what paths we need to use as well as any new paths we will need to add in order to implement the following auto-incrementing load instruction.

\[
\text{LW.AI} \\
\text{lw.ai rd, imm(rsl)} \\
R[rd] \leftarrow M[R[rsl] + \text{sext(imm)}] \\
R[rsl] \leftarrow R[rsl] + 4 \\
PC \leftarrow PC + 4
\]
### 2.3. Single-Cycle Processor Control Unit

<table>
<thead>
<tr>
<th>inst</th>
<th>pc sel</th>
<th>imm type</th>
<th>op2 sel</th>
<th>alu func</th>
<th>wb sel</th>
<th>rf wen</th>
<th>imem req val</th>
<th>dmem req val</th>
</tr>
</thead>
<tbody>
<tr>
<td>add</td>
<td>pc+4</td>
<td>-</td>
<td>rf</td>
<td>+</td>
<td>alu</td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>addi</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>mul</td>
<td>pc+4</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>mul</td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>lw</td>
<td>pc+4</td>
<td>i</td>
<td>imm</td>
<td>+</td>
<td>mem</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>sw</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>jal</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>jr</td>
<td>jr</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>bne</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Need to factor eq status signal into pc_sel signal for BNE!

### 2.4. Analyzing Performance

\[
\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycles}}
\]

- Instructions / program depends on source code, compiler, ISA
- Cycles / instruction (CPI) depends on ISA, microarchitecture
- Time / cycle depends upon microarchitecture and implementation
Estimating cycle time

There are many paths through the design that start at a state element and end at a state element. The “critical path” is the longest path across all of these paths. We can usually use a simple first-order static timing estimate to estimate the cycle time (i.e., the clock period and thus also the clock frequency).

- register read $= 1\tau$
- register write $= 1\tau$
- regfile read $= 10\tau$
- regfile write $= 10\tau$
- memory read $= 20\tau$
- memory write $= 20\tau$
- +4 unit $= 4\tau$
- immgen $= 2\tau$
- mux $= 3\tau$
- multiplier $= 20\tau$
- alu $= 10\tau$
- adder $= 8\tau$
Estimating execution time

Using our first-order equation for processor performance, how long in units of $\tau$ will it take to execute the vector-vector add example assuming $n$ is 64?

```
loop:
  lw  x5, 0(x13)
  lw  x6, 0(x14)
  add x7, x5, x6
  sw  x7, 0(x12)
  addi x13, x12, 4
  addi x14, x14, 4
  addi x12, x12, 4
  addi x15, x15, -1
  bne x15, x0, loop
jr  x1
```

Using our first-order equation for processor performance, how long in units of $\tau$ will it take to execute the mystery program assuming $n$ is 64 and that we find a match on the last element.

```
addi x5, x0, 0
loop:
  lw  x6, 0(x12)
  bne x6, x14, foo
  addi x10, x5, 0
  jr  x1
foo:
  addi x12, x12, 4
  addi x5, x5, 1
  bne x5, x13, loop
  addi x10, x0, -1
  jr  x1
```
3. TinyRV1 FSM Processor

\[
\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Avg Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}
\]

- Instructions / program depends on source code, compiler, ISA
- Avg cycles / instruction (CPI) depends on ISA, microarchitecture
- Time / cycle depends upon microarchitecture and implementation

<table>
<thead>
<tr>
<th>Microarchitecture</th>
<th>CPI</th>
<th>Cycle Time</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single-Cycle Processor</td>
<td>1</td>
<td>long</td>
</tr>
<tr>
<td>FSM Processor</td>
<td>&gt;1</td>
<td>short</td>
</tr>
<tr>
<td>Pipelined Processor</td>
<td>≈1</td>
<td>short</td>
</tr>
</tbody>
</table>

Technology Constraints

- Assume legacy technology where logic is expensive, so we want to minimize the number of registers and combinational logic
- Assume an (unrealistic) combinational memory
- Assume multi-ported register files and memories are too expensive, these structures can only have a single read/write port
3.1. High-Level Idea for FSM Processors

<table>
<thead>
<tr>
<th></th>
<th>add</th>
<th>addi</th>
<th>mul</th>
<th>lw</th>
<th>sw</th>
<th>jal</th>
<th>jr</th>
<th>bne</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fetch Instruction</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Decode Instruction</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Read Registers</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Register Arithmetic</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Read Memory</td>
<td>✓</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Write Memory</td>
<td>✓</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Write Registers</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Update PC</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
</tbody>
</table>

3.2. FSM Processor Datapath

Implementing an FSM datapath requires thinking about the required FSM states, but we will defer discussion of how to implement the control logic to the next section.

Topic 2: Fundamental Processor Microarchitecture
Implementing Fetch Sequence

(pseudo-control-signal syntax)
Implementing ADD Instruction

```
(pseudo-control-signal syntax)
add rd, rs1, rs2
```
Full Datapath for TinyRV1 FSM Processor

ADDI Pseudo-Control-Signal Fragment

addi rd, rs1, imm
**MUL Instruction**

mul rd, rs1, rs2

M0: \(A \leftarrow RF[x0]\)
M1: \(B \leftarrow RF[rs1]\)
M2: \(C \leftarrow RF[rs2]\)
M3: \(A \leftarrow A +? B;\)
    \(B \leftarrow B << 1; C \leftarrow C >> 1\)
M4: \(A \leftarrow A +? B;\)
    \(B \leftarrow B << 1; C \leftarrow C >> 1\)

...  
M35: \(RF[rd] \leftarrow A +? B; \text{goto F0}\)

**LW Instruction**

lw rd, imm(rs1)

L0: \(A \leftarrow RF[rs1]\)
L1: \(B \leftarrow \text{sext(imm_i)}\)
L2: \(\text{memreq.addr} \leftarrow A + B\)
L3: \(RF[rd] \leftarrow \text{RD}; \text{goto F0}\)

**SW Instruction**

sw rs2, imm(rs1)

S0: \(WD \leftarrow RF[rs2]\)
S1: \(A \leftarrow RF[rs1]\)
S2: \(B \leftarrow \text{sext(imm_s)}\)
S3: \(\text{memreq.addr} \leftarrow A + B; \text{goto F0}\)

**JAL Instruction**

jal rd, imm

JA0: \(RF[rd] \leftarrow \text{PC}\)
JA1: \(B \leftarrow \text{sext(imm_j)}\)
JA2: \(\text{PC} \leftarrow A + B; \text{goto F0}\)

**JR Instruction**

jr rs1

JR0: \(\text{PC} \leftarrow RF[rs1]; \text{goto F0}\)

**BNE Instruction**

bne rs1, rs2, imm

B0: \(A \leftarrow RF[rs1]\)
B1: \(B \leftarrow RF[rs2]\)
B2: \(B \leftarrow \text{sext(imm_b)};\)
    if \(A == B\) goto F0
B3: \(A \leftarrow \text{PC}\)
B4: \(A \leftarrow A - 4\)
B5: \(\text{PC} \leftarrow A + B; \text{goto F0}\)
Adding a Complex Instruction

FSM processors simplify adding complex instructions. New instructions usually do not require datapath modifications, only additional states.

\[
\text{add.mm } rd, \text{ rs1, rs2}
\]

\[
M[R[rd]] \leftarrow M[R[rs1]] + M[R[rs2]]
\]

---

**Table of ALU Functions and Imm Gen**

<table>
<thead>
<tr>
<th>ALU Func</th>
<th>IMM Gen</th>
</tr>
</thead>
<tbody>
<tr>
<td>+4: A + 4</td>
<td>i: sext(\text{IR}[31:20])</td>
</tr>
<tr>
<td>+: A + B</td>
<td>s: sext(\text{IR}[31:25],\text{IR}[11:7])</td>
</tr>
<tr>
<td>+?: A +? B</td>
<td>b: sext(\text{IR}[31],\text{IR}[7],\text{IR}[30:25],\text{IR}[11:8],0)</td>
</tr>
<tr>
<td>cmp: A == B</td>
<td>j: sext(\text{IR}[31],\text{IR}[19:12],\text{IR}[20],\text{IR}[30:21],0)</td>
</tr>
<tr>
<td>−4: A − 4</td>
<td></td>
</tr>
</tbody>
</table>

---

**Diagram**

The diagram shows the datapath and control unit interactions, including registers, ALU, and memory access for the add.mm instruction.
Adding a New Auto-Incrementing Load Instruction

Implement the following auto-incrementing load instruction using pseudo-control-signal syntax. Modify the datapath if necessary.

\[
\text{lw.ai } rd, \text{ imm}(rs1)
\]

\[
R[rd] \leftarrow M[R[rs1] + \text{sext(imm_i)}]; R[rs1] \leftarrow R[rs1] + 4
\]
We will study three techniques for implementing FSM control units:

- **Hardwired control units** are high-performance, but inflexible

- **Horizontal μcoding** increases flexibility, requires large control store

- **Vertical μcoding** is an intermediate design point

### Hardwired FSM

![Diagram of Hardwired FSM Processor Control Unit]

- **State**
  - **Control Signal Logic**
    - Control Signals (22)
  - **State Transition Logic**
    - Status Signals (1)
Control signal output table for hardwired control unit

F0: memreq.addr ← PC; A ← PC
F1: IR ← RD
F2: PC ← A + 4; goto inst

A0: A ← RF[rs1]
A1: B ← RF[rs2]
A2: RF[rd] ← A + B; goto F0

---

<table>
<thead>
<tr>
<th>state</th>
<th>Bus Enables</th>
<th>Register Enables</th>
<th>Mux</th>
<th>Func</th>
<th>RF</th>
<th>MReq</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>pc ig alu rf rd</td>
<td>pc ir a b c wd b c ig alu sel wen val op</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>F0</td>
<td>1 0 0 0 0 0 0 0 1 0 0 0</td>
<td>- - - - - - 0 1 r</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>F1</td>
<td>0 0 0 0 0 1 0 1 0 0 0 0</td>
<td>- - - - - - 0 0 0</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>F2</td>
<td>0 0 1 0 0 1 0 0 0 0 0 0</td>
<td>- - - +4 - 0 0 0</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>A0</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>A1</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>A2</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Vertically Microcoded FSM

- Use memory array (called the control store) instead of random logic to encode both the control signal logic and the state transition logic.
- Enables a more systematic approach to implementing complex multi-cycle instructions.
- Microcoding can produce good performance if accessing the control store is much faster than accessing main memory.
- Read-only control stores might be replaceable enabling in-field updates, while read-write control stores can simplify diagnostics and microcode patches.
Control signal store for microcoded control unit

### ALU Function

- **+4**: `A + 4`
- **+**: `A + B`
- **+?**: `A + B`
- **cmp**: `A == B`
- **−4**: `A − 4`

### Immediate Generation

- **i**: \( \text{sext}(\text{IR}[31:20]) \)
- **s**: \( \text{sext}((\text{IR}[31:25],\text{IR}[11:7])) \)
- **b**: \( \text{sext}((\text{IR}[31],\text{IR}[7:0],\text{IR}[30:25],\text{IR}[11:8],0)) \)
- **j**: \( \text{sext}((\text{IR}[31],\text{IR}[19:12],\text{IR}[20],\text{IR}[30:21],0)) \)

---

**Example Codes**

- **B0**: `A ← RF[rs1]`
- **B1**: `B ← RF[rs2]`
- **B2**: `B ← \text{sext}(\text{imm}_b); \text{if } A == B \text{ goto F0}`
- **B3**: `A ← PC`
- **B4**: `A ← A − 4`
- **B5**: `PC ← A + B; \text{goto F0}`

---

### Table of Bus Enables, Register Enables, Mux, Func, RF, and MReq

<table>
<thead>
<tr>
<th>state</th>
<th>Bus Enables</th>
<th>Register Enables</th>
<th>Mux</th>
<th>Func</th>
<th>RF</th>
<th>MReq</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>pc</td>
<td>ig</td>
<td>alu</td>
<td>rf</td>
<td>rd</td>
<td>pc</td>
</tr>
<tr>
<td>B0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>B1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>B2</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>B3</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>B4</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>B5</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
</tbody>
</table>

---

**Notes**

- This table outlines the states (B0, B1, B2, B3, B4, B5) and their corresponding bus enables, register enables, mux, function, RF, and MReq values.
- The states and actions are part of the implementation details for controlling the microcoded control unit in a TinyRV1 FSM Processor.
3.4. Analyzing Performance

\[
\text{Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycles}}
\]

Estimating cycle time

- register read/write = 1\(\tau\)
- registerfile read/write = 10\(\tau\)
- mem read/write = 20\(\tau\)
- immgen = 2\(\tau\)
- mux = 3\(\tau\)
- alu = 10\(\tau\)
- lb shifter = 1\(\tau\)
- tri-state buf = 1\(\tau\)

Topic 2: Fundamental Processor Microarchitecture
Estimating execution time

Using our first-order equation for processor performance, how long in units of $\tau$ will it take to execute the vector-vector add example assuming $n$ is 64?

```
loop:
    lw  x5, 0(x13)
    lw  x6, 0(x14)
    add x7, x5, x6
    sw  x7, 0(x12)
    addi x13, x12, 4
    addi x14, x14, 4
    addi x12, x12, 4
    addi x15, x15, -1
    bne x15, x0, loop
j  x1
```

Using our first-order equation for processor performance, how long in units of $\tau$ will it take to execute the mystery program assuming $n$ is 64 and that we find a match on the last element.

```
addi x5, x0, 0
loop:
    lw  x6, 0(x12)
    bne x6, x14, foo
    addi x10, x5, 0
j  x1
foo:
    addi x12, x12, 4
    addi x5, x5, 1
    bne x5, x13, loop
    addi x10, x0, -1
j  x1
```
4. TinyRV1 Pipelined Processor

\[
\text{Time Program} = \frac{\text{Instructions Program}}{\text{Instructions Program}} \times \frac{\text{Avg Cycles Instruction}}{\text{Avg Cycles Instruction}} \times \frac{\text{Time Cycle}}{\text{Time Cycle}}
\]

- Instructions / program depends on source code, compiler, ISA
- Avg cycles / instruction (CPI) depends on ISA, microarchitecture
- Time / cycle depends upon microarchitecture and implementation

<table>
<thead>
<tr>
<th>Microarchitecture</th>
<th>CPI</th>
<th>Cycle Time</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single-Cycle Processor</td>
<td>1</td>
<td>long</td>
</tr>
<tr>
<td>FSM Processor</td>
<td>&gt;1</td>
<td>short</td>
</tr>
<tr>
<td>Pipelined Processor</td>
<td>(\approx1)</td>
<td>short</td>
</tr>
</tbody>
</table>

Technology Constraints

- Assume modern technology where logic is cheap and fast (e.g., fast integer ALU)
- Assume multi-ported register files with a reasonable number of ports are feasible
- Assume small amount of very fast memory (caches) backed by large, slower memory
4.1. High-Level Idea for Pipelined Processors

- Anne, Brian, Cathy, and Dave each have one load of clothes
- Washing, drying, folding, and storing each take 30 minutes

### Fixed Time-Slot Laundry

<table>
<thead>
<tr>
<th>Time</th>
<th>Anne's Load</th>
<th>Ben's Load</th>
<th>Cathy's Load</th>
<th>Dave's Load</th>
</tr>
</thead>
<tbody>
<tr>
<td>7pm</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>8pm</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>9pm</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>10pm</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>11pm</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>12am</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1am</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>2am</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3am</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

### Pipelined Laundry

<table>
<thead>
<tr>
<th>Time</th>
<th>Anne's Load</th>
<th>Ben's Load</th>
<th>Cathy's Load</th>
<th>Dave's Load</th>
</tr>
</thead>
<tbody>
<tr>
<td>7pm</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>8pm</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>9pm</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>10pm</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

### Pipelined Laundry with Slow Dryers

<table>
<thead>
<tr>
<th>Time</th>
<th>Anne's Load</th>
<th>Ben's Load</th>
<th>Cathy's Load</th>
<th>Dave's Load</th>
</tr>
</thead>
<tbody>
<tr>
<td>7pm</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>8pm</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>9pm</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>10pm</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

### Pipelining lessons

- Multiple transactions operate simultaneously using different resources
- Pipelining does not help the transaction latency
- Pipelining does help the transaction throughput
- Potential speedup is proportional to the number of pipeline stages
- Potential speedup is limited by the slowest pipeline stage
- Potential speedup is reduced by time to fill the pipeline
Applying pipelining to processors

<table>
<thead>
<tr>
<th></th>
<th>add</th>
<th>addi</th>
<th>mul</th>
<th>lw</th>
<th>sw</th>
<th>jal</th>
<th>jr</th>
<th>bne</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fetch Instruction</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Decode Instruction</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Read Registers</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Register Arithmetic</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Read Memory</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Write Memory</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>✓</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Write Registers</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Update PC</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
</tbody>
</table>
4.2. Pipelined Processor Datapath and Control Unit

- Incrementally develop an unpipelined datapath
- Keep data flowing from left to right
- Position control signal table early in the diagram
- Divide datapath/control into stages by inserting pipeline registers
- Keep the pipeline stages roughly balanced
- Forward arrows should avoid “skipping” pipeline registers
- Backward arrows will need careful consideration
Adding a new auto-incrementing load instruction

Draw on the above datapath diagram what paths we need to use as well as any new paths we will need to add in order to implement the following auto-incrementing load instruction.

\[
lw.ai \quad rd, \quad imm(rs1)\\
R[rd] \leftarrow M[R[rs1] + \text{sext}(imm)]; R[rs1] \leftarrow R[rs1] + 4
\]
4. TinyRV1 Pipelined Processor 4.2. Pipelined Processor Datapath and Control Unit

Pipeline diagrams

<table>
<thead>
<tr>
<th>Instruction</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>addi x1, x2, 1</td>
<td></td>
</tr>
<tr>
<td>addi x3, x4, 1</td>
<td></td>
</tr>
<tr>
<td>addi x5, x6, 1</td>
<td></td>
</tr>
</tbody>
</table>

What would be the total execution time if these three instructions were repeated 10 times?

**Hazards occur when instructions interact with each other in pipeline**

- **RAW Data Hazards:** An instruction depends on a data value produced by an earlier instruction
- **Control Hazards:** Whether or not an instruction should be executed depends on a control decision made by an earlier instruction
- **Structural Hazards:** An instruction in the pipeline needs a resource being used by another instruction in the pipeline
- **WAW and WAR Name Hazards:** An instruction in the pipeline is writing a register that an earlier instruction in the pipeline is either writing or reading

**Stalling and squashing instructions**

- **Stalling:** An instruction originates a stall due to a hazard, causing all instructions earlier in the pipeline to also stall. When the hazard is resolved, the instruction no longer needs to stall and the pipeline starts flowing again.
- **Squashing:** An instruction originates a squash due to a hazard, and squashes all previous instructions in the pipeline (but not itself). We restart the pipeline to begin executing a new instruction sequence.
Control logic with no stalling and no squashing

always_ff @(posedge clk)
  if (reset)
    val_B <= 0
  else
    val_B <= next_val_A

next_val_B = val_B

greg_en_B = !stall_B

always_ff @(posedge clk)
  if (reset)
    val_B <= 0
  else if (reg_en_B)
    val_B <= next_val_A

ostall_B = val_B && (ostall_hazard1_B || ostall_hazard2_B)

stall_B = val_B && (ostall_B || ostall_C || ...)

next_val_B = val_B && !stall_B

---

<table>
<thead>
<tr>
<th>ostall_B</th>
<th>Originating stall due to hazards detected in B stage.</th>
</tr>
</thead>
<tbody>
<tr>
<td>stall_B</td>
<td>Should we actually stall B stage? Factors in ostalls due to hazards and ostalls from later pipeline stages.</td>
</tr>
<tr>
<td>next_val_B</td>
<td>Only send transaction to next stage if transaction in B stage is valid and we are not stalling B stage.</td>
</tr>
</tbody>
</table>

Topic 2: Fundamental Processor Microarchitecture 36
Control logic with stalling and squashing

\[
\text{reg}_\text{en}_B = \neg \text{stall}_B \quad \text{||} \quad \text{squash}_B_B
\]

\[
\text{always} \quad \text{ff} \quad (@( \text{posedge} \ \text{clk} ) \quad \text{if} \quad ( \text{reset} ) \quad \text{val}_B \ <= \ 0 \quad \text{else} \quad \text{if} \quad ( \text{reg}_\text{en}_B ) \quad \text{val}_B \ <= \ \text{next}_\text{val}_A
\]

\[
\text{ostall}_B = \text{val}_B \ & \ & ( \text{ostall}_\text{hazard}_1_B \quad \text{||} \quad \text{ostall}_\text{hazard}_2_B )
\]

\[
\text{stall}_B = \text{val}_B \ & \ & ( \text{ostall}_B \quad \text{||} \quad \text{ostall}_C \quad \text{||} \quad \ldots )
\]

\[
\text{osquash}_B = \text{val}_B \ & \ & \neg \text{stall}_B \ & \ & ( \text{osquash}_\text{hazard}_1_B \quad \text{||} \quad \ldots )
\]

\[
\text{squash}_B = \text{val}_B \ & \ & ( \text{osquash}_C \quad \text{||} \quad \ldots )
\]

\[
\text{next}_\text{val}_B = \text{val}_B \ & \ & \neg \text{stall}_B \ & \ & \neg \text{squash}_B
\]

<table>
<thead>
<tr>
<th>Variable</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>ostall_B</td>
<td>Originating stall due to hazards detected in B stage.</td>
</tr>
<tr>
<td>stall_B</td>
<td>Should we actually stall B stage? Factors in ostalls due to hazards and ostalls from later pipeline stages.</td>
</tr>
<tr>
<td>osquash_B</td>
<td>Originating squash due to hazards detected in B stage. If this stage is stalling, do not originate a squash.</td>
</tr>
<tr>
<td>squash_B</td>
<td>Should we squash B stage? Factors in the originating squashes from later pipeline stages. An originating squash from B stage means to squash all stages earlier than B, so osquash_B is not factored into squash_B.</td>
</tr>
<tr>
<td>next_val_B</td>
<td>Only send transaction to next stage if transaction in B stage is valid and we are not stalling or squashing B stage.</td>
</tr>
</tbody>
</table>
5. Pipeline Hazards: RAW Data Hazards

RAW data hazards occur when one instruction depends on a data value produced by a preceding instruction still in the pipeline. We use architectural dependency arrows to illustrate RAW dependencies in assembly code sequences.

\[
\text{addi } x_1, x_2, 1 \\
\text{addi } x_3, x_1, 1 \\
\text{addi } x_4, x_3, 1
\]

**Using pipeline diagrams to illustrate RAW hazards**

We use microarchitectural dependency arrows to illustrate RAW hazards on pipeline diagrams.

\[
\text{addi } x_1, x_2, 1 \\
\text{addi } x_3, x_1, 1 \\
\text{addi } x_4, x_3, 1
\]
Approaches to resolving data hazards

- **Exposé in Instruction Set Architecture:** Expose data hazards in ISA forcing compiler to explicitly avoid scheduling instructions that would create hazards (i.e., software scheduling for correctness)

- **Hardware Scheduling:** Hardware dynamically schedules instructions to avoid RAW hazards, potentially allowing instructions to execute out of order

- **Hardware Stalling:** Hardware includes control logic that freezes later instructions until earlier instruction has finished producing data value; software scheduling can still be used to avoid stalling (i.e., software scheduling for performance)

- **Hardware Bypassing/Forwarding:** Hardware allows values to be sent from an earlier instruction to a later instruction before the earlier instruction has left the pipeline (sometimes called *forwarding*).

- **Hardware Speculation:** Hardware guesses that there is no hazard and allows later instructions to potentially read invalid data; detects when there is a problem, squashes and then re-executes instructions that operated on invalid data.
5.1. Expose in Instruction Set Architecture

Insert nops to delay read of earlier write. These nops count as real instructions increasing instructions per program.

```
addi x1, x2, 1  
nop             
nop             
nop             
addi x3, x1, 1  
nop             
nop             
nop             
addi x4, x3, 1  
```

Insert independent instructions to delay read of earlier write, and only use nops if there is not enough useful work.

```
addi x1, x2, 1  
addi x6, x7, 1  
addi x8, x9, 1  
nop             
addi x3, x1, 1  
nop             
nop             
nop             
addi x4, x3, 1  
```

Pipeline diagram showing exposing RAW data hazards in the ISA

Note: If hazard is exposed in ISA, software scheduling is required for correctness! A scheduling mistake can cause undefined behavior.
5.2. Hardware Stalling

Hardware includes control logic that freezes later instructions (in front of pipeline) until earlier instruction (in back of pipeline) has finished producing data value.

Pipeline diagram showing hardware stalling for RAW data hazards

addi x1, x2, 1
addi x3, x1, 1
addi x4, x3, 1

Note: Software scheduling is not required for correctness, but can improve performance! Programmer or compiler schedules independent instructions to reduce the number of cycles spent stalling.

Modifications to datapath/control to support hardware stalling
5. Pipeline Hazards: RAW Data Hazards

5.3. Hardware Bypassing/Forwarding

Deriving the stall signal

<table>
<thead>
<tr>
<th>add</th>
<th>addi</th>
<th>mul</th>
<th>lw</th>
<th>sw</th>
<th>jal</th>
<th>jr</th>
<th>bne</th>
</tr>
</thead>
<tbody>
<tr>
<td>rs1_en</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>rs2_en</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>rf_wen</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

\[
\text{ostall}_{\text{waddr}}_{\text{X}}_{\text{rs1}}_{\text{D}} = \\
\quad \text{val}_D \&\& \text{rs1}_{\text{en}}_D \&\& \text{val}_X \&\& \text{rf}_{\text{wen}}_X \\
\quad \&\& (\text{inst}_{\text{rs1}}_D == \text{rf}_{\text{waddr}}_X) \&\& (\text{rf}_{\text{waddr}}_X != 0)
\]

\[
\text{ostall}_{\text{waddr}}_{\text{M}}_{\text{rs1}}_{\text{D}} = \\
\quad \text{val}_D \&\& \text{rs1}_{\text{en}}_D \&\& \text{val}_M \&\& \text{rf}_{\text{wen}}_M \\
\quad \&\& (\text{inst}_{\text{rs1}}_D == \text{rf}_{\text{waddr}}_M) \&\& (\text{rf}_{\text{waddr}}_M != 0)
\]

\[
\text{ostall}_{\text{waddr}}_{\text{W}}_{\text{rs1}}_{\text{D}} = \\
\quad \text{val}_D \&\& \text{rs1}_{\text{en}}_D \&\& \text{val}_W \&\& \text{rf}_{\text{wen}}_W \\
\quad \&\& (\text{inst}_{\text{rs1}}_D == \text{rf}_{\text{waddr}}_W) \&\& (\text{rf}_{\text{waddr}}_W != 0)
\]

... similar for ostall signals for rs2 source register ...

\[
\text{ostall}_D = \text{val}_D \\
\quad \&\& ( \text{ostall}_{\text{waddr}}_{\text{X}}_{\text{rs1}}_{\text{D}} || \text{ostall}_{\text{waddr}}_{\text{X}}_{\text{rs2}}_{\text{D}} \\
\quad \quad || \text{ostall}_{\text{waddr}}_{\text{M}}_{\text{rs1}}_{\text{D}} || \text{ostall}_{\text{waddr}}_{\text{M}}_{\text{rs2}}_{\text{D}} \\
\quad \quad || \text{ostall}_{\text{waddr}}_{\text{W}}_{\text{rs1}}_{\text{D}} || \text{ostall}_{\text{waddr}}_{\text{W}}_{\text{rs2}}_{\text{D}} )
\]

5.3. Hardware Bypassing/Forwarding

Hardware allows values to be sent from an earlier instruction (in back of pipeline) to a later instruction (in front of pipeline) before the earlier instruction has left the pipeline. Sometimes called “forwarding”.
5. Pipeline Hazards: RAW Data Hazards

5.3. Hardware Bypassing/Forwarding

Pipeline diagram showing hardware bypassing for RAW data hazards

```
addi x1, x2, 1
addi x3, x1, 1
addi x4, x3, 1
```

Adding single bypass path to support limited hardware bypassing

Deriving the bypass and stall signals

```
ostall_waddr_X_rs1_D = 0
bypass_waddr_X_rs1_D =
    val_D && rs1_en_D && val_X && rf_wen_X
    && (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0)
```
5. Pipeline Hazards: RAW Data Hazards

5.3. Hardware Bypassing/Forwarding

Pipeline diagram showing multiple hardware bypass paths

```
addi x2, x10, 1
addi x2, x11, 1
addi x1, x2, 1
addi x3, x4, 1
addi x5, x3, 1
add x6, x1, x3
sw x5, 0(x1)
jr x6
```

Adding all bypass path to support full hardware bypassing

---

Topic 2: Fundamental Processor Microarchitecture
Handling load-use RAW dependencies

ALU-use latency is only one cycle, but load-use latency is two cycles.

```
lw x1, 0(x2)
addi x3, x1, 1

lw x1, 0(x2)
addi x3, x1, 1
```

```
ostall_load_use_X_rs1_D =
val_D && rs1_en_D && val_X && rf_wen_X
&& (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0)
&& (op_X == lw)

ostall_load_use_X_rs2_D =
val_D && rs2_en_D && val_X && rf_wen_X
&& (inst_rs2_D == rf_waddr_X) && (rf_waddr_X != 0)
&& (op_X == lw)

ostall_D =
val_D && ( ostall_load_use_X_rs1_D || ostall_load_use_X_rs2_D )

bypass_waddr_X_rs1_D =
val_D && rs1_en_D && val_X && rf_wen_X
&& (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0)
&& (op_X != lw)

bypass_waddr_X_rs2_D =
val_D && rs2_en_D && val_X && rf_wen_X
&& (inst_rs2_D == rf_waddr_X) && (rf_waddr_X != 0)
&& (op_X != lw)
```
Pipeline diagram for simple assembly sequence

Draw a pipeline diagram illustrating how the following assembly sequence would execute on a fully bypassed pipelined TinyRV1 processor. Include microarchitectural dependency arrows to illustrate how data is transferred along various bypass paths.

- lw x1, 0(x2)
- lw x3, 0(x4)
- add x5, x1, x3
- sw x5, 0(x6)
- addi x2, x2, 4
- addi x4, x4, 4
- addi x6, x6, 4
- addi x7, x7, -1
- bne x7, x0, loop

5.4. RAW Data Hazards Through Memory

So far we have only studied RAW data hazards through registers, but we must also carefully consider RAW data hazards through memory.

- sw x1, 0(x2)
- lw x3, 0(x4) # RAW dependency occurs if R[x2] == R[x4]

- sw x1, 0(x2)
- lw x3, 0(x4)
6. **Pipeline Hazards: Control Hazards**

Control hazards occur when whether or not an instruction should be executed depends on a control decision made by an earlier instruction. We use architectural dependency arrows to illustrate control dependencies in assembly code sequences.

### Static Instr Sequence

- `addi x1, x0, 1`
- `jal x0, foo`
- `opA`
- `opB`

### Dynamic Instr Sequence

- `addi x1, x0, 1`
- `jal x0, foo`
- `addi x2, x3, 1`
- `bne x0, x1, bar`
- `addi x4, x5, 1`

### Using pipeline diagrams to illustrate control hazards

We use microarchitectural dependency arrows to illustrate control hazards on pipeline diagrams.
The jump resolution latency and branch resolution latency are the number of cycles we need to delay the fetch of the next instruction in order to avoid any kind of control hazard. Jump resolution latency is two cycles, and branch resolution latency is three cycles.

```
addi x1, x0, 1
jal x0, foo
addi x2, x3, 1
bne x0, x1, bar
addi x4, x5, 1
```

**Approaches to resolving control hazards**

- **Expose in Instruction Set Architecture**: Expose control hazards in ISA forcing compiler to explicitly avoid scheduling instructions that would create hazards (i.e., software scheduling for correctness)

- **Software Predication**: Programmer or compiler converts control flow into data flow by using instructions that conditionally execute based on a data value

- **Hardware Speculation**: Hardware guesses which way the control flow will go and potentially fetches incorrect instructions; detects when there is a problem and re-executes instructions that are along the correct control flow

- **Software Hints**: Programmer or compiler provides hints about whether a conditional branch will be taken or not taken, and hardware can use these hints for more efficient hardware speculation
6.1. Expose in Instruction Set Architecture

Exposé branch delay slots as part of the instruction set. Branch delay slots are instructions that follow a jump or branch and are always executed regardless of whether a jump or branch is taken or not taken. Compiler tries to insert useful instructions, otherwise inserts nops.

```
addi x1, x0, 1
jal x0, foo
nop

opA
opB

foo: addi x2, x3, 1
bne x0, x1, bar
nop

nop

opC
opD
opE

bar: addi x4, x5, 1
```

Assume we modify the TinyRV1 instruction set to specify that JAL, and JR instructions have a single-instruction branch delay slot (i.e., one instruction after a JAL and JR is always executed) and the BNE instruction has a two-instruction branch delay slot (i.e., two instructions after a BNE are always executed).

---

Pipeline diagram showing using branch delay slots for control hazards

```
<table>
<thead>
<tr>
<th>addi x1, x0, 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>jal x0, foo</td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td>nop</td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td>addi x2, x3, 1</td>
</tr>
<tr>
<td>bne x0, x1, bar</td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td>nop</td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td>nop</td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td>addi x4, x5, 1</td>
</tr>
</tbody>
</table>
```
6.2. Hardware Speculation

Hardware guesses which way the control flow will go and potentially fetches incorrect instructions; detects when there is a problem and re-executes instructions the instructions that are along the correct control flow. For now, we will only consider a simple branch prediction scheme where the hardware always predicts not taken.

**Pipeline diagram when branch is not taken**

<table>
<thead>
<tr>
<th>addi</th>
<th>x1, x0, 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>jal</td>
<td>x0, foo</td>
</tr>
<tr>
<td>opA</td>
<td></td>
</tr>
<tr>
<td>addi</td>
<td>x2, x3, 1</td>
</tr>
<tr>
<td>bne</td>
<td>x0, x1, bar</td>
</tr>
<tr>
<td>opC</td>
<td></td>
</tr>
<tr>
<td>opD</td>
<td></td>
</tr>
</tbody>
</table>

**Pipeline diagram when branch is taken**

<table>
<thead>
<tr>
<th>addi</th>
<th>x1, x0, 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>jal</td>
<td>x0, foo</td>
</tr>
<tr>
<td>opA</td>
<td></td>
</tr>
<tr>
<td>addi</td>
<td>x2, x3, 1</td>
</tr>
<tr>
<td>bne</td>
<td>x0, x1, bar</td>
</tr>
<tr>
<td>opC</td>
<td></td>
</tr>
<tr>
<td>opD</td>
<td></td>
</tr>
<tr>
<td>addi</td>
<td>x4, x5, 1</td>
</tr>
</tbody>
</table>
6. Pipeline Hazards: Control Hazards

6.2. Hardware Speculation

Modifications to datapath/control to support hardware speculation

Deriving the squash signals

\[
osquash_{j,D} = (\text{op}_D == \text{jal}) || (\text{op}_D == \text{jr})
\]
\[
osquash_{br,X} = (\text{op}_X == \text{bne}) && \neg \text{eq}_X
\]

Our generic stall/squash scheme gives priority to squashes over stalls.

\[
osquash_D = \text{val}_D && \neg \text{stall}_D && \osquash_{j,D}
\]
\[
squash_D = \text{val}_D && \osquash_X
\]

\[
osquash_X = \text{val}_D && \neg \text{stall}_X && \osquash_{br,X}
\]
\[
squash_X = 0
\]

Important: PC select logic must give priority to older instructions (i.e., prioritize branches over jumps)!  
Good quiz question?
Pipeline diagram for simple assembly sequence

Draw a pipeline diagram illustrating how the following assembly sequence would execute on a fully bypassed pipelined TinyRV1 processor that uses hardware speculation which always predicts not-taken. **Unlike the “standard” TinyRV1 processor, you should also assume that we add a single-instruction branch delay slot to the instruction set.** So this processor will partially expose the control hazard in the instruction, but also use hardware speculation. Include microarchitectural dependency arrows to illustrate both data and control flow.

```
addi x1, x2, 1
bne x0, x3, foo  # assume R[rs] != 0
addi x4, x5, 1   # instruction is in branch delay slot
addi x6, x7, 1
...
foo:             
    add x8, x1, x4
    addi x9, x1, 1
```
6.3. Interrupts and Exceptions

Interrupts and exceptions alter the normal control flow of the program. They are caused by an external or internal event that needs to be processed by the system, and these events are usually unexpected or rare from the program’s point of view.

- **Asynchronous Interrupts**
  - Input/output device needs to be serviced
  - Timer has expired
  - Power disruption or hardware failure

- **Synchronous Exceptions**
  - Undefined opcode, privileged instruction
  - Arithmetic overflow, floating-point exception
  - Misaligned memory access for instruction fetch or data access
  - Memory protection violation
  - Virtual memory page faults
  - System calls (traps) to jump into the operating system kernel

**Interrupts and Exception Semantics**

- Interrupts are asynchronous with respect to the program, so the microarchitecture can decide when to service the interrupt
- Exceptions are synchronous with respect to the program, so they must be handled immediately
To handle an interrupt or exception the hardware/software must:

- Stop program at current instruction ($I$), ensure previous insts finished
- Save cause of interrupt or exception in privileged arch state
- Save the PC of the instruction $I$ in a special register (EPC)
- Switch to privileged mode
- Set the PC to the address of either the interrupt or the exception handler
- Disable interrupts
- Save the user architectural state
- Check the type of interrupt or exception
  - **Handle the interrupt or exception**
- Enable interrupts
- Switch to user mode
- Set the PC to EPC if $I$ should be restarted
- Potentially set PC to EPC+4 if we should skip $I$

**Handling a misaligned data address and syscall exceptions**

**Static Instr Sequence**

```
addi x1, x0, 0x2001
lw x2, 0(x1)
syscall
opB
opC
...
```

**Dynamic Instr Sequence**

```
addi x1, x0, 0x2001
lw x2, 0(x1) (excep)
opD
opE
opF
opG
```

```
addi EPC, EPC, 4
eret
```

**exception_handler:**

```
opD  # disable interrupts
opE  # save user registers
opF  # check exception type
opG  # handle exception
opH  # enable interrupts
addi EPC, EPC, 4
eret
```

```
syscall (excep)
opD
opE
```

```
oF
```
Interrupts and Exceptions in a RISC-V Pipelined Processor

- How should we handle a single instruction which generates multiple exceptions in different stages as it goes down the pipeline?
  - Exceptions in earlier pipeline stages override later exceptions for a given instruction

- How should we handle multiple instructions generating exceptions in different stages at the same or different times?
  - We always want the execution to appear as if we have completely executed one instruction before going onto the next instruction
  - So we want to process the exception corresponding to the earliest instruction in program order first
  - Hold exception flags in pipeline until commit point
  - Commit point is after all exceptions could be generated but before any architectural state has been updated
  - To handle an exception at the commit point: update cause and EPC, squash all stages before the commit point, and set PC to exception handler

- How and where to handle external asynchronous interrupts?
  - Inject asynchronous interrupts at the commit point
  - Asynchronous interrupts will then naturally override exceptions caused by instructions earlier in the pipeline
Modifications to datapath/control to support exceptions

Deriving the squash signals

\[
\text{osquash}_j_D = (\text{op}_D == \text{jal}) \ OR \ (\text{op}_D == \text{jr}) \\
\text{osquash}_\text{br}_X = (\text{op}_X == \text{bne}) \ AND \ \neg \text{eq}_X \\
\text{osquash}_\text{xcept}_M = \text{exception}_M
\]

Control logic needs to redirect the front end of the pipeline just like for a jump or branch. Again, squashes take priority over stalls, and PC select logic must give priority to older instructions (i.e., prioritize exceptions, over branches, over jumps)!
Pipeline diagram of exception handling

```
addi x1, x0, 0x2001
lw x2, 0(x1)  # assume causes misaligned address exception
syscall       # causes a syscall exception
...           
exception_hander:
opD  # disable interrupts
opE  # save user registers
opF  # check exception type
opG  # handle exception
opH  # enable interrupts
addi EPC, EPC, 4
eret
```
7. Pipeline Hazards: Structural Hazards

Structural hazards occur when an instruction in the pipeline needs a resource being used by another instruction in the pipeline. The TinyRV1 processor pipeline is specifically designed to avoid structural hazards.

Let’s introduce a structural hazard by allowing instructions that do not do any real work in the M stage (i.e., non-memory instructions) to effectively skip that stage. This would require adding an extra path which “skips over” the pipeline register between the X and M stages and connects directly to the writeback mux at the end of the M stage. For non-memory instructions we set \( wb\_sel\_M \) to choose the value from the end of the X stage, while for memory instructions we set \( wb\_sel\_M \) to choose the value coming back from memory.
Using pipeline diagrams to illustrate structural hazards

We use structural dependency arrows to illustrate structural hazards.

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>addi x1, x2, 1</td>
<td></td>
</tr>
<tr>
<td>addi x3, x4, 1</td>
<td></td>
</tr>
<tr>
<td>lw x5, 0(x6)</td>
<td></td>
</tr>
<tr>
<td>addi x7, x8, 1</td>
<td></td>
</tr>
</tbody>
</table>

Note that the key shared resources that are causing the structural hazard are the pipeline registers at the end of the M stage. We cannot write these pipeline registers with the transaction that is in the X stage and also the transaction that is in the M stage at the same time.

Approaches to resolving structural hazards

- **Expose in Instruction Set Architecture**: Expose structural hazards in ISA forcing compiler to explicitly avoid scheduling instructions that would create hazards (i.e., software scheduling for correctness)

- **Hardware Stalling**: Hardware includes control logic that freezes later instructions until earlier instruction has finished using the shared resource; software scheduling can still be used to avoid stalling (i.e., software scheduling for performance)

- **Hardware Duplication**: Add more hardware so that each instruction can access separate resources at the same time

**7.1. Expose in Instruction Set Architecture**

Insert independent instructions or nops to delay non-memory instructions if they follow a LW or SW instruction.
Pipeline diagram showing exposing structural hazards in the ISA

addi x1, x2, 1
addi x3, x4, 1
lw x5, 0(x6)
nop
addi x7, x8, 1

7.2. Hardware Stalling

Hardware includes control logic that stalls a non-memory instruction if it follows a LW or SW instruction.

Pipeline diagram showing hardware stalling for structural hazards

addi x1, x2, 1
addi x3, x4, 1
lw x5, 0(x6)
addi x7, x8, 1

Deriving the stall signal

ostall_wport_hazard_D = val_D && !mem_inst_D && val_X && mem_inst_X

where mem_inst is true for a LW or SW instruction and false otherwise. Stall far before the structural hazard actually occurs, because we know exactly how instructions move down the pipeline. Also possible to use dynamic arbitration in the back-end of the pipeline.
7.3. **Hardware Duplication**

Add more pipeline registers at the end of M stage and a second write port so that non-memory and memory instructions can writeback to the register file at the same time.

Does allowing early writeback help performance in the first place?

```plaintext
addi x1, x2, 1
addi x3, x1, 1
addi x4, x3, 1
addi x5, x4, 1
addi x6, x5, 1
addi x7, x6, 1
```
8. Pipeline Hazards: WAW and WAR Name Hazards

WAW dependencies occur when an instruction overwrites a register than an earlier instruction has already written. WAR dependencies occur when an instruction writes a register than an earlier instruction needs to read. We use architectural dependency arrows to illustrate WAW and WAR dependencies in assembly code sequences.

```plaintext
mul x1, x2, x3
addi x4, x1, 1
addi x1, x5, 1
```

WAW name hazards occur when an instruction in the pipeline writes a register before an earlier instruction (in back of the pipeline) has had a chance to write that same register.

WAR name hazards occur when an instruction in the pipeline writes a register before an earlier instruction (in back of pipeline) has had a chance to read that same register.

The TinyRV1 processor pipeline is specifically designed to avoid any WAW or WAR name hazards. Instructions always write the registerfile in-order in the same stage, and instructions always read registers in the front of the pipeline and write registers in the back of the pipeline.

Let’s introduce a WAW name hazard by using an iterative variable latency multiplier, and allowing other instructions to continue executing while the multiplier is working.
Using pipeline diagrams to illustrate WAW name hazards

We use microarchitectural dependency arrows to illustrate WAW hazards on pipeline diagrams.

\[
\begin{align*}
mul & \, x1, \, x2, \, x3 \\
addi & \, x4, \, x6, \, 1 \\
addi & \, x1, \, x5, \, 1
\end{align*}
\]

Approaches to resolving WAW and WAR hazards

- **Software Renaming**: Programmer or compiler changes the register names to avoid creating name hazards
- **Hardware Renaming**: Hardware dynamically changes the register names to avoid creating name hazards
- **Hardware Stalling**: Hardware includes control logic that freezes later instructions until earlier instruction has finished either writing or reading the problematic register name

8.1. Software Renaming

As long as we have enough architectural registers, renaming registers in software is easy. WAW and WAR dependencies occur because we have a finite number of architectural registers.

\[
\begin{align*}
mul & \, x1, \, x2, \, x3 \\
addi & \, x4, \, x6, \, 1 \\
addi & \, x7, \, x5, \, 1
\end{align*}
\]
### 8.2. Hardware Stalling

Simplest approach is to add stall logic in the decode stage similar to what the approach used to resolve other hazards.

\[
\begin{align*}
\text{mul} & \quad x1, x2, x3 \\
\text{addi} & \quad x4, x6, 1 \\
\text{addi} & \quad x1, x5, 1
\end{align*}
\]

Deriving the stall signal

\[
\text{ostall}\_\text{struct\_hazard}_D = \text{val}_D \land \text{(op}_D = \text{MUL}) \land \lnot \text{imul}\_\text{rdy}_D
\]

\[
\text{ostall}\_\text{waw\_hazard}_D = \\
\text{val}_D \land \text{rf}\_\text{wen}_D \land \text{val}_Z \land \text{rf}\_\text{wen}_Z \\
\land \text{(rf}\_\text{waddr}_D = \text{rf}\_\text{waddr}_Z) \land \text{(rf}\_\text{waddr}_Z \neq 0)
\]

### 9. Summary of Processor Performance

\[
\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycles}}
\]

Results for vector-vector add example

<table>
<thead>
<tr>
<th>Microarchitecture</th>
<th>Inst</th>
<th>CPI</th>
<th>Cycle Time</th>
<th>Exec Time</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single-Cycle Processor</td>
<td>576</td>
<td>1.0</td>
<td>$74 \tau$</td>
<td>$43 , k\tau$</td>
</tr>
<tr>
<td>FSM Processor</td>
<td>576</td>
<td>6.7</td>
<td>$36 \tau$</td>
<td>$138 , k\tau$</td>
</tr>
<tr>
<td>Pipelined Processor</td>
<td>576</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Estimating cycle time for pipelined processor

- register read \( = 1\tau \)
- register write \( = 1\tau \)
- regfile read \( = 10\tau \)
- regfile write \( = 10\tau \)
- memory read \( = 20\tau \)
- memory write \( = 20\tau \)
- +4 unit \( = 4\tau \)
- immgen \( = 2\tau \)
- mux \( = 3\tau \)
- multiplier \( = 20\tau \)
- alu \( = 10\tau \)
- adder \( = 8\tau \)
9. Summary of Processor Performance

Estimating execution time

Using our first-order equation for processor performance, how long in $\tau$ will it take to execute the vvadd example assuming $n$ is 64?

```
loop:
  lw x5, 0(x13)
  lw x6, 0(x14)
  add x7, x5, x6
  sw x7, 0(x12)
  addi x13, x12, 4
  addi x14, x14, 4
  addi x12, x12, 4
  addi x15, x15, -1
  bne x15, x0, loop
  jr x1
```

lw
lw
add
sw
addi
addi
addi
addi
addi
bne
bne
opA
opB
lw
Using our first-order equation for processor performance, how long in $\tau$ will it take to execute the mystery program assuming $n$ is 64 and that we find a match on the last element.

```
addi x5, x0, 0
loop:
  lw x6, 0(x12)
  bne x6, x14, foo
  addi x10, x5, 0
  jr x1
foo:
  addi x12, x12, 4
  addi x5, x5, 1
  bne x5, x13, loop
  addi x10, x0, -1
  jr x1
```
10. Case Study: Transition from CISC to RISC

- Microcoding thrived in the 1970’s
  - ROMs significantly faster than DRAMs
  - For complex instruction sets, microcode was cheaper and simpler
  - New instructions supported without modifying datapath
  - Fixing bugs in controller is easier
  - ISA compatibility across models relatively straightforward

— Maurice Wilkes, 1954
### 10.1. Example CISC: IBM 360/M30

<table>
<thead>
<tr>
<th></th>
<th>M30</th>
<th>M40</th>
<th>M50</th>
<th>M65</th>
</tr>
</thead>
<tbody>
<tr>
<td>Datapath width (bits)</td>
<td>8</td>
<td>16</td>
<td>32</td>
<td>64</td>
</tr>
<tr>
<td>µinst width (bits)</td>
<td>50</td>
<td>52</td>
<td>85</td>
<td>87</td>
</tr>
<tr>
<td>µcode size (1K µinsts)</td>
<td>4</td>
<td>4</td>
<td>2.75</td>
<td>2.75</td>
</tr>
<tr>
<td>µstore technology</td>
<td>CCROS</td>
<td>TROS</td>
<td>BCROS</td>
<td>BCROS</td>
</tr>
<tr>
<td>µstore cycle (ns)</td>
<td>750</td>
<td>625</td>
<td>500</td>
<td>200</td>
</tr>
<tr>
<td>Memory cycle (ns)</td>
<td>1500</td>
<td>2500</td>
<td>2000</td>
<td>750</td>
</tr>
<tr>
<td>Rental fee ($K/month)</td>
<td>4</td>
<td>7</td>
<td>15</td>
<td>35</td>
</tr>
</tbody>
</table>

TROS = transformer read-only storage (magnetic storage)
BCROS = balanced capacitor read-only storage (capacitive storage)
CCROS = card capacitor read-only storage (metal punch cards, replace in field)

Only the fastest models (75,95) were hardwired

**IBM 360/M30 microprogram for register-register logical OR**

![Diagram of IBM 360/M30 microprogram for register-register logical OR]
IBM 360/M30 microprogram for register-register binary ADD

Analyzing Microcoded Machines

- John Cocke and group at IBM
  - Working on a simple pipelined processor, 801, and advanced compilers
  - Ported experimental PL8 compiler to IBM 370, and only used simple register-register and load/store instructions similar to 801
  - Code ran faster than other existing compilers that used all 370 instructions! (up to 6 MIPS, whereas 2 MIPS considered good before)

- Joel Emer and Douglas Clark at DEC
  - Measured VAX-11/780 using external hardware
  - Found it was actually a 0.5 MIPS machine, not a 1 MIPS machine
  - 20% of VAX instrs = 60% of µcode, but only 0.2% of the dynamic execution

- VAX 8800, high-end VAX in 1984
  - Control store: 16K×147b RAM, Unified Cache: 64K×8b RAM
  - 4.5× more microstore RAM than cache RAM!
From CISC to RISC

- Key changes in technology constraints
  - Logic, RAM, ROM all implemented with MOS transistors
  - RAM ≈ same speed as ROM

- Use fast RAM to build fast instruction cache of user-visible instructions, not fixed hardware microfragments
  - Change contents of fast instruction memory to fit what app needs

- Use simple ISA to enable hardwired pipelined implementation
  - Most compiled code only used a few of CISC instructions
  - Simpler encoding allowed pipelined implementations
  - Load/Store Reg-Reg ISA as opposed to Mem-Mem ISA

- Further benefit with integration
  - Early 1980’s → fit 32-bit datapath, small caches on single chip
  - No chip crossing in common case allows faster operation
10.2. Example RISC: MIPS R2K

- MIPS R2K is one of the first popular pipelined RISC processors
- MIPS R2K implements the MIPS I instruction set
- MIPS = Microprocessor without Interlocked Pipeline Stages
- MIPS I used software scheduling to avoid some RAW hazards by including a single-instruction load-use delay slot
- MIPS I used software scheduling to avoid some control hazards by including a single-instruction branch delay slot

### One-Instr Branch Delay Slot

```assembly
  addiu r1, r2, 1
  j   foo
  addiu r3, r4, 1  # BDS
  ...

foo:
  addiu r5, r6, 1
  bne  r7, r8, bar
  addiu r9, r10, 1  # BDS
  ...

bar:
```

### One-Instr Load-Use Delay Slot

```assembly
  lw   r1, 0(r2)
  lw   r3, 0(r4)
  addiu r2, r2, 4  # LDS
  addu  r5, r1, r3

Depreciated in MIPS II instruction set; legacy code can still execute on new microarchitectures, but code using the MIPS II instruction set can rely in hardware stalling

Present in all MIPS instruction sets; not possible to deprecate and still enable legacy code to execute on new microarchitectures
```
MIPS R2K Microarchitecture

The pipelined datapath and control were located on a single die. Cache control and memory management unit were also integrated on-die, but the actual tag and data storage for the cache was located off-chip.

Used two-phase clocking to enable five pipeline stages to fit into four clock cycles. This avoided the need for explicit bypassing from the W stage to the end of the D stage.

Two-phase clocking enabled a single-cycle branch resolution latency since register read, branch address generation, and branch comparison can fit in a single cycle.
MIPS R2K VLSI Design

Process: 2 µm, two metal layers
Clock Frequency: 8–15 MHz
Size: 110K transistors, 80 mm$^2$
CISC/RISC Convergence

**MIPS R10K** uses sophisticated out-of-order engine; branch delay slot not useful

– Gwennap, MPR, 1994

**Intel Nehalem** frontend breaks x86 CISC into smaller RISC-like µops; µcode engine handles rarely used complex instr

– Kanter, Real World Technologies, 2009
Microprogramming Today

- Microprogramming is far from extinct

- Played a crucial role in microprocessors of the 1980s (DEC VAX, Motorola 68K series, Intel 386/486)

- Microprogramming plays assisting role in many modern processors (AMD Phenom, Intel Nehalem, Intel Atom, IBM Z196)
  - 761 Z196 instructions executed with hardwired control
  - 219 Z196 “complex” instructions always executed with microcode
  - 24 Z196 instructions conditionally executed with microcode

- Patchable microcode common for post-fabrication bug fixes (Intel processors load µcode patches at bootup)