1 Branch Prediction Overview

2 Software-Based Branch Prediction
   2.1. Static Software Hints
   2.2. Branch Delay Slots
   2.3. Predication

3 Hardware-Based Branch Prediction
   3.1. Fixed Branch Predictor
   3.2. Branch History Table (BHT) Predictor
   3.3. Two-Level Predictor For Temporal Correlation
   3.4. Two-Level Predictor For Spatial Correlation
   3.5. Generalized Two-Level Predictors
   3.6. Tournament Predictors
   3.7. Branch Target Buffers (BTBs) Predictor
1. Branch Prediction Overview

Assume incorrect branch prediction in dual-issue I2OL processor.

```
bne
opA
opB
opC
opD
opE
opF
opG
opTARG
```

Assume correct branch prediction in dual-issue I2OL processor.

```
bne
opA
opTARG
opX
opY
opZ
```

Three critical pieces of information we need to predict control flow:

- (1) Is this instruction a control flow instruction?
- (2) What is the target of this control flow instruction?
- (3) Do we redirect control flow to the target or next instr?
2. Software-Based Branch Prediction

When do we know these critical pieces of information?

<table>
<thead>
<tr>
<th>jal</th>
<th>jr</th>
<th>bne</th>
</tr>
</thead>
<tbody>
<tr>
<td>(1) Is this instruction a control flow instruction?</td>
<td>D</td>
<td>D</td>
</tr>
<tr>
<td>(2) What is the target of this control flow instruction?</td>
<td>D</td>
<td>X</td>
</tr>
<tr>
<td>(3) Do we redirect ctrl flow to the target or next instr?</td>
<td>D</td>
<td>D</td>
</tr>
</tbody>
</table>

What do we need to predict in F stage vs. D stage?

<table>
<thead>
<tr>
<th>jal</th>
<th>jr</th>
<th>bne</th>
</tr>
</thead>
<tbody>
<tr>
<td>F stage</td>
<td>predict 1,2,3</td>
<td>predict 1,2,3</td>
</tr>
<tr>
<td>D stage</td>
<td>no prediction</td>
<td>predict 2</td>
</tr>
</tbody>
</table>

2. Software-Based Branch Prediction

- Static software hints
- Branch delay slots
- Predication
2.1. Static Software Hints

Software provides hints about whether a control flow instruction is likely to be taken or not taken. These hints are part of the instruction and thus are available earlier in the pipeline (e.g., in the D stage).

bne.t

\text{opA}

\text{opTARG}

bne.nt

\text{opY}

\text{opZ}

What if the hint is wrong?

bne.t

\text{opA}

\text{opTARG}

bne.nt

\text{opA}

\text{opB}
2. Software-Based Branch Prediction

2.2. Branch Delay Slots

Without branch delay slots must squash fall through instructions if branch is taken.

Without branch delay slots:

```
bne
opA
opB
targ
```

With branch delay slots compiler can put useful work in the slots. Instructions in the delay slots are always executed regardless of branch condition.

With branch delay slots:

```
bne
opA
opB
targ
```
2.3. Predication

Not really “prediction”. Idea is to turn control flow into dataflow completely eliminating the control hazard.

**Conditional move instructions** conditionally move a source register to a destination register.

\[
\text{movn rd, rs1, rs2} \quad \text{if (R[rs2] \neq 0) } R[rd] \leftarrow R[rs1]
\]

\[
\text{movz rd, rs1, rs2} \quad \text{if (R[rs2] == 0) } R[rd] \leftarrow R[rs1]
\]

**Pseudocode**

<table>
<thead>
<tr>
<th>Pseudocode</th>
<th>w/o Predication</th>
<th>w/ Predication</th>
</tr>
</thead>
<tbody>
<tr>
<td>if ( a &lt; b )</td>
<td>slt x1, x2, x3</td>
<td>slt x1, x2, x3</td>
</tr>
<tr>
<td></td>
<td>beq x1, x0, L1</td>
<td>movz x4, x2, x1</td>
</tr>
<tr>
<td></td>
<td>addi x4, x2, x0</td>
<td>movn x4, x3, x1</td>
</tr>
<tr>
<td>else</td>
<td>jal x0, L2</td>
<td></td>
</tr>
<tr>
<td></td>
<td>addi x4, x3, x0</td>
<td></td>
</tr>
<tr>
<td></td>
<td>L1:</td>
<td></td>
</tr>
<tr>
<td></td>
<td>addi x4, x3, x0</td>
<td></td>
</tr>
<tr>
<td></td>
<td>L2:</td>
<td></td>
</tr>
</tbody>
</table>

**Full predication** enables almost all instructions to be executed under a predicate. If predicate is false, instruction should turn into a NOP.

<table>
<thead>
<tr>
<th>Pseudocode</th>
<th>w/ Predication</th>
</tr>
</thead>
<tbody>
<tr>
<td>if ( a &lt; b )</td>
<td>slt.p p1, x2, x3</td>
</tr>
<tr>
<td></td>
<td>( p1) opA</td>
</tr>
<tr>
<td></td>
<td>( p1) opB</td>
</tr>
<tr>
<td>else</td>
<td>(!p1) opC</td>
</tr>
<tr>
<td></td>
<td>(!p1) opD</td>
</tr>
</tbody>
</table>

- What if both sides of branch have many instructions?
- What if one side of branch has many more than the other side?
3. Hardware-Based Branch Prediction

• Fixed branch predictor
• Branch history table (BHT) predictor
• Two-level predictor for temporal correlation
• Two-level predictor for temporal correlation
• Generalized two-level predictors
• Tournament predictor
• Branch target buffer (BTB) predictor

3.1. Fixed Branch Predictor

• Always predict not taken
  – What we have been assuming so far
  – Simple to implement and can perform prediction in F
  – Poor accuracy, especially on very important backwards branch in loops

• Always predict taken
  – Difficult to implement: we don’t know if this is a branch until D
  – Difficult to implement: we don’t know target until at least D
  – Could predict not taken in F, and then adjust in D
  – Poor accuracy, especially on if/then/else

• Predict taken for backward branches
  and predict not taken for forward branches
  – Difficult to implement: we don’t know if this is a branch until D
  – Difficult to implement: we don’t know target until at least D
  – Could predict not taken in F, and then adjust in D
  – Better accuracy
3. Hardware-Based Branch Prediction

3.1. Fixed Branch Predictor

```assembly
loop:    <------------------.
w x1, 0(x2) | backward
lw x3, 0(x4) | branches
slt x5, x1, x3 | taken on avg
beq x5, x0, L1 --. forward | 90%
addi x6, x1, x0 | branches |
jal x0, L2 | taken on avg |
L1: <-’ 50%
addi x6, x3, x0 |
L2: |
sw x6, 0(x7) |
addi x2, x2, 4 |
addi x4, x4, 4 |
addi x7, x7, 4 |
addi x8, x8, -1 |
```

- For now let’s focus on conditional branches as opposed to unconditional jumps
- Let’s assume we always predict not-taken in the F stage
- In the D stage, we know if the instruction is a branch and we know the target of the branch
- So key goal is to predict whether or not we need to redirect the control flow, i.e., to predict the branch outcome in the D stage instead of waiting until the X stage
- By doing this prediction in the D stage we can reduce the branch misprediction penalty by several cycles although it is still not zero if we predict the branch is taken
3.2. Branch History Table (BHT) Predictor

How can we do better? Exploit structure in the program, namely temporal correlation: the outcomes of specific static branch in the past may be a good indicator of the outcomes of future dynamic instances of the same static branch.

One-Bit Saturating Counter

Remember the previous outcome of a specific static branch and predict the outcome will be the same for the next dynamic instance of the same branch.

Consider how this saturating counter would be have for a backwards branch in a loop with four iterations. Assume the entire loop is executed several times.

<table>
<thead>
<tr>
<th>Iteration</th>
<th>Prediction</th>
<th>Actual</th>
<th>Mispredict?</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>2</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>4</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>2</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>4</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Exploiting temporal correlation works well, but a one-bit saturating counter will always mispredicts the backwards branch in a loop twice. Loops are very common!

**Two-Bit Saturating Counter**

Remember the last two outcomes of a specific static branch. Require two consecutive “counter examples” before changing the prediction.

Consider how this saturating counter would be have for a backwards branch in a loop with four iterations. Assume the entire loop is executed several times.

<table>
<thead>
<tr>
<th>Iteration</th>
<th>Prediction</th>
<th>Actual</th>
<th>Mispredict?</th>
<th>ST</th>
<th>WT</th>
<th>WNT</th>
<th>SNT</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>2</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>4</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>2</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>4</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

What if start state is strongly taken?
Other Two-Bit FSM Branch Predictors

See Fig 5.8 in Swu + Lipasti for other alternatives
Branch History Table

- So far we have focused on a simple FSM that exploits temporal correlation to make a prediction for a specific static branch.

- To make predictions for many different static branches, we need to keep track of a dedicated FSM per static branch.

- A branch history table (BHT) is a table where each entry is the state of the FSM for a different static branch.

- Two PC’s can “alias” to the same entry in BHT.

- Aliasing is similar to a cache conflict.

- We could store the PC as a tag along with the FSM state to make sure we don’t mix up the FSM state across two static branches.

- Storing the PC is too expensive though, so we can just let branches alias and this just reduces the branch prediction accuracy.

- Can reduce aliasing with larger BHT.

BHT with 4k entries and 2 bits/entry = 80–90% accuracy

How do we continue to improve prediction accuracy? Exploit even more complicated temporal correlation.
3. Hardware-Based Branch Prediction

3.2. Branch History Table (BHT) Predictor

Often a branch exhibits more complicated patterns than just “always taken” or “always not taken.” Could develop a more complicated FSM, but then patterns vary per branch. We want per branch customized FSMs.

```c
void convolve(int B[], int A[], int size) {
    for (int i = 2; i < size-2; i++)
        for (int j = 0; j < 5; j++)
            B[i-(2-j)] = A[i] * COEFF[j];
}
```

Can we predict that every fifth dynamic instance of the backwards loop branch will be not taken?
3.3. Two-Level Predictor For Temporal Correlation

When a branch is taken or not taken we shift in either a one (taken) or a zero (not taken) into the least significant bit of the corresponding BHSR.

<table>
<thead>
<tr>
<th>Index</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>...</td>
<td></td>
</tr>
<tr>
<td>0111</td>
<td>ST</td>
</tr>
<tr>
<td>1000</td>
<td>WT</td>
</tr>
<tr>
<td>1001</td>
<td>WT</td>
</tr>
<tr>
<td>1010</td>
<td>WT</td>
</tr>
<tr>
<td>1011</td>
<td>ST</td>
</tr>
<tr>
<td>1100</td>
<td>WT</td>
</tr>
<tr>
<td>1101</td>
<td>ST</td>
</tr>
<tr>
<td>1110</td>
<td>ST</td>
</tr>
<tr>
<td>1111</td>
<td>SNT</td>
</tr>
</tbody>
</table>

- BHSR captures temporal pattern for that branch
- We use the BHSR to index into the PHT. A BHT has an entry per branch, but a PHT has an entry per branch pattern.
- The PHT says for a given pattern over the past n executions of a branch, should I take or not take the next execution of this branch?
- Once the two-level predictor is warmed up for previous nested loop example, the state of the PHT would be what is shown on the left
- Need at least four bits of “history” to learn this pattern and perfectly predict this branch
Problem: Multiple branches with same history might need different predictions. In other words, aliasing in one bit can lose accuracy.

Solution: Add multiple bits, use bits from PC to choose which bit to use.
The way one branch is resolved may be a good indicator of the way a later (different) branch will resolve.

If \( x < 7 \)
\[
\begin{align*}
\text{y} &+ 1 \\
\text{if} (x < 5) &+ 2 \\
2 &+ 2
\end{align*}
\]
branch \( \phi \)
\[
\begin{align*}
&\text{be } r_2, r_1, 7 \\
&\text{add one } r_3, r_1, 1 \\
&\text{L1:} \\
&\text{be } r_2, r_1, 5 \\
&\text{add one } r_4, r_1, 1 \\
\end{align*}
\]
branch 1
\[
\begin{align*}
&\text{L2:}
\end{align*}
\]

If branch \( \phi \) is taken (ie \( x \geq 7 \))
then branch 1 is always taken (ie \( x \) must be \( \geq 5 \))

So whether branch \( \phi \) is taken or not taken can be used to predict if we should take branch 1.

For above example, \textbf{Bush} will capture history - so we will know even first branch is taken, and that value(s) of one Bush will point to an entry in one \textbf{Put} that predicts taken.
As before, multiple jumps can help avoid aliasing in PHT
GENERALIZED TWO-LEVEL TBITS

Combined approach to exploit both complex temporal correlation and spatial correlation.

Difference from discussion on complex temporal correlation is that we purposely choose a smaller \( m \) to cause aliasing in the \( T \) bits, since this aliasing allows us to capture spatial correlation. (Choose higher order \( m \) bits)

<table>
<thead>
<tr>
<th>( k = 0 )</th>
<th>( 0 &lt; k &lt; 30 )</th>
<th>( k = 30 )</th>
</tr>
</thead>
<tbody>
<tr>
<td>( m = 0 )</td>
<td>( G_A^g )</td>
<td>( G_A S )</td>
</tr>
<tr>
<td>( 0 &lt; m &lt; 30 )</td>
<td>( P_A^g )</td>
<td>( P_A S )</td>
</tr>
<tr>
<td>( m = 30 )</td>
<td>( S_A^g )</td>
<td>( S_A S )</td>
</tr>
</tbody>
</table>

97% Accuracy
3. Hardware-Based Branch Prediction

3.5. Generalized Two-Level Predictors

Instead of concatenating four bits from PC with BUS, helps avoid aliasing in the output more effectively.
3.6. Tournament Predictors

Different predictors are better at predicting different types of branches:
- One-level 2-bit sampling counter - loops
- Two-level gshare - irregular code

Tournament Predictors Selection Table
- Predicts which branch predictor we should use
3. Hardware-Based Branch Prediction  

3.7. Branch Target Buffers (BTBs) Predictor

**HW BRANCH PREDICTING TARGET ADDRESS**

Even with best possible prediction of branch outcome, still need to wait for target address to be determined.

**Branch Target Buffer**

![Diagram of Branch Target Buffer]

- **PC**
- **Predicted FSM**
- **Target State**
- **Hit?**
- **New PC = Predicted Target PC If Hit and Predicted Taken**

**Can put BTB in fetch stage**

- Predicting if PC points to a branch
- Predicting target of branch
- Predicting if branch is taken.

*Sometimes 62 = 0, if hit we assume predict taken*
3. Hardware-Based Branch Prediction

3.7. Branch Target Buffers (BTBs) Predictor

```
COMBINE BTB AND BUT

BTB is much more expensive than BUT, but BTB earlier in pipeline + can accelerate JR

Combine BTB w/ few entries with BUT w/ many entries

RETURN ADDRESS STACK PREDICTOR

BTB only works for JR function call returns if always call function from same place (not realistic)

STACK PREDICTOR

- Push target address on stack for JAL/JALR
- Pop off target address for JR to predict target

Move stack predictor into fetch and predict which PC's are JR.

Use tournament prediction to choose between BTB and Stack predictor.
```