# ECE 4750 Computer Architecture, Fall 2025

# Topic 10: Advanced Processors Branch Prediction

## School of Electrical and Computer Engineering Cornell University

revision: 2025-12-03-14-43

| 1 | Branch Prediction Overview                        | 2  |
|---|---------------------------------------------------|----|
| 2 | Software-Based Branch Prediction                  | 3  |
|   | 2.1. Static Software Hints                        | 4  |
|   | 2.2. Branch Delay Slots                           | 5  |
|   | 2.3. Predication                                  | 6  |
| 3 | Hardware-Based Branch Prediction                  | 7  |
|   | 3.1. Fixed Branch Predictor                       | 7  |
|   | 3.2. Branch History Table (BHT) Predictor         | 9  |
|   | 3.3. Two-Level Predictor For Temporal Correlation | 14 |
|   | 3.4. Two-Level Predictor For Spatial Correlation  | 16 |
|   | 3.5. Generalized Two-Level Predictors             | 18 |
|   | 3.6. Tournament Predictors                        | 20 |
|   | 3.7. Branch Target Buffers (BTBs) Predictor       | 21 |

Copyright © 2025 Anne Bracy. All rights reserved. This handout was prepared by Prof. Anne Bracy at Cornell University for ECE 4750 Computer Architecture (derived from previous handouts prepared and copyrighted by Prof. Christopher Batten). Download and use of this handout is permitted for individual educational non-commercial purposes only. Redistribution either in part or in whole via both commercial or non-commercial means requires written permission.

#### 1. Branch Prediction Overview

Assume incorrect branch prediction in dual-issue IO2L processor.



Assume correct branch prediction in dual-issue IO2L processor.



Three critical pieces of information we need to predict control flow:

- (1) Is this instruction a control flow instruction?
- (2) What is the target of this control flow instruction?
- (3) Do we redirect control flow to the target or next instr?

#### When do we know these critical pieces of information?



|                                                           | jal | jr | bne |
|-----------------------------------------------------------|-----|----|-----|
| (1) Is this instruction a control flow instruction?       | D   | D  | D   |
| (2) What is the target of this control flow instruction?  | D   | X  | D   |
| (3) Do we redirect ctrl flow to the target or next instr? | D   | D  | X   |

#### What do we need to predict in F stage vs. D stage?

|         | jal           | jr            | bne           |
|---------|---------------|---------------|---------------|
| F stage | predict 1,2,3 | predict 1,2,3 | predict 1,2,3 |
| D stage | no prediction | predict 2     | predict 3     |

#### 2. Software-Based Branch Prediction

- Static software hints
- Branch delay slots
- Predication

#### 2.1. Static Software Hints

Software provides hints about whether a control flow instruction is likely to be taken or not taken. These hints are part of the instruction and thus are available earlier in the pipeline (e.g., in the D stage).



What if the hint is wrong?

| bne.t  |  |  |  |  |  |  |  |
|--------|--|--|--|--|--|--|--|
| орА    |  |  |  |  |  |  |  |
| opTARG |  |  |  |  |  |  |  |
| bne.nt |  |  |  |  |  |  |  |
| орА    |  |  |  |  |  |  |  |
| орВ    |  |  |  |  |  |  |  |

# 2.2. Branch Delay Slots

Without branch delay slots must squash fall through instructions if branch is taken.



With branch delay slots compiler can put useful work in the slots. Instructions in the delay slots are always executed regardless of branch condition.



#### 2.3. Predication

Not really "prediction". Idea is to turn control flow into dataflow completely eliminating the control hazard.

**Conditional move instructions** conditionally move a source register to a destination register.

movn rd, rs1, rs2 if (R[rs2]!=0) R[rd] 
$$\leftarrow$$
 R[rs1] movz rd, rs1, rs2 if (R[rs2] == 0) R[rd]  $\leftarrow$  R[rs1]

| Pseudocode | w/o Predication | w/ Predication  |  |  |
|------------|-----------------|-----------------|--|--|
| if (a < b) | slt x1, x2, x3  | slt x1, x2, x3  |  |  |
| x = a      | beq x1, x0, L1  | movz x4, x3, x1 |  |  |
| else       | addi x4, x2, x0 | movn x4, x2, x1 |  |  |
| x = b      | jal x0, L2      |                 |  |  |
|            | L1:             |                 |  |  |
|            | addi x4, x3, x0 |                 |  |  |
|            | L2:             |                 |  |  |

**Full predication** enables almost all instructions to be executed under a predicate. If predicate is false, instruction should turn into a NOP.

| Pseudocode | w/ Predication   |
|------------|------------------|
| if (a < b) | slt.p p1, x2, x3 |
| opA        | ( p1) opA        |
| орВ        | ( p1) opB        |
| else       | (!p1) opC        |
| opC        | (!p1) opD        |
| opD        |                  |

- What if both sides of branch have many instructions?
- What if one side of branch has many more than the other side?

#### 3. Hardware-Based Branch Prediction

- Fixed branch predictor
- Branch history table (BHT) predictor
- Two-level predictor for temporal correlation
- Two-level predictor for temporal correlation
- Generalized two-level predictors
- Tournament predictor
- Branch target buffer (BTB) predictor

#### 3.1. Fixed Branch Predictor

- · Always predict not taken
  - What we have been assuming so far
  - Simple to implement and can perform prediction in F
  - Poor accuracy, especially on very important backwards branch in loops
- Always predict taken
  - Difficult to implement: we don't know if this is a branch until D
  - Difficult to implement: we don't know target until at least D
  - Could predict not taken in F, and then adjust in D
  - Poor accuracy, especially on if/then/else
- Predict taken for backward branches and predict not taken for forward branches
  - Difficult to implement: we don't know if this is a branch until D
  - Difficult to implement: we don't know target until at least D
  - Could predict not taken in F, and then adjust in D
  - Better accuracy

```
loop:
lw x1, 0(x2)
                                   l backward
1w x3, 0(x4)
                                   branches
slt x5, x1, x3
                                   | taken on avg
beg x5, x0, L1 --. forward
                                   1 90%
addi x6, x1, x0 | branches
jal x0, L2
                  taken on avg
L1:
                <-, 50%
addi x6, x3, x0
1.2:
x6, 0(x7)
addi x2, x2, 4
addi x4, x4, 4
addi x7, x7, 4
addi x8, x8, -1
bne x8, x0, loop -----
```

- For now let's focus on conditional branches as opposed to unconditional jumps
- Let's assume we always predict not-taken in the F stage
- In the D stage, we know if the instruction is a branch and we know the target of the branch
- So key goal is to predict whether or not we need to redirect the control flow, i.e., to predict the branch outcome in the D stage instead of waiting until the X stage
- By doing this prediction in the D stage we can reduce the branch misprediction penalty by several cycles although it is still not zero if we predict the branch is taken

## 3.2. Branch History Table (BHT) Predictor

How can we do better? Exploit structure in the program, namely temporal correlation: the outcomes of specific static branch in the past may be a good indicator of the outcomes of future dynamic instances of the same static branch.

#### **One-Bit Saturating Counter**

Remember the previous outcome of a specific static branch and predict the outcome will be the same for the next dynamic instance of the same branch.

predict NT

Consider how this saturating counter would behave for a backwards branch in a loop with four iterations. Assume the entire loop is executed several times.

| Iteration | Prediction | Actual | Mispredict? |
|-----------|------------|--------|-------------|
| 1         |            |        |             |
| 2         |            |        |             |
| 3         |            |        |             |
| 4         |            |        |             |
| 1         |            |        |             |
| 2         |            |        |             |
| 3         |            |        |             |
| 4         |            |        |             |

Exploiting temporal correlation works well, but a one-bit saturating counter will always mispredicts the backwards branch in a loop twice. Loops are *very* common!

#### **Two-Bit Saturating Counter**

Remember the last two outcomes of a specific static branch. Require two consecutive "counter examples" before changing the prediction.



Consider how this saturating counter would be have for a backwards branch in a loop with four iterations. Assume the entire loop is executed several times.

| Iteration | Prediction | Actual | Mispredict? | ST | WT | WNT | SNT |
|-----------|------------|--------|-------------|----|----|-----|-----|
| 1         |            |        |             | 0  | 0  | 0   | 0   |
| 2         |            |        |             | 0  | 0  | 0   | 0   |
| 3         |            |        |             | 0  | 0  | 0   | 0   |
| 4         |            |        |             | 0  | 0  | 0   | 0   |
| 1         |            |        |             | 0  | 0  | 0   | 0   |
| 2         |            |        |             | 0  | 0  | 0   | 0   |
| 3         |            |        |             | 0  | 0  | 0   | 0   |
| 4         |            |        |             | 0  | 0  | 0   | 0   |

What if start state is strongly taken?

#### Other Two-Bit FSM Branch Predictors





#### **Branch History Table**

- So far we have focused on a simple FSM that exploits temporal correlation to make a prediction for a *specific static branch*
- To make predictions for many different static branches, we need to keep track of a *dedicated* FSM per static branch
- A branch history table (BHT) is a table where each entry is the state
  of the FSM for a different static branch.



- Two PC's can "alias" to the same entry in BHT
- Aliasing is similar to a cache conflict
- We could store the PC as a tag along with the FSM state to make sure we don't mix up the FSM state across two static branches
- Storing the PC is too expensive though, so we can just let branches alias and this just reduces the branch prediction accuracy
- Can reduce aliasing with larger BHT

BHT with 4k entries and 2bits/entry = 80–90% accuracy

How do we continue to improve prediction accuracy? Exploit even more complicated temporal correlation.

#### More Complicated Temporal Correlation

Often a branch exhibits more complicated pattern than just "always taken" or "always not taken." Could develop a more complicated FSM, but these patterns vary per branch. We want per branch customized FSMs.

```
void convolve( int B[], int A[], int size){
  for (int i = 2; i < size-2; i ++)
     for (int j = 0; j < 5; j++)
          B[i-(2-j)] = A[i] * COEFF[j];
}</pre>
```

Can we predict that every 5th dynamic instance of the backwards loop branch will be Not Taken?

# 3.3. Two-Level Predictor For Temporal Correlation



When a branch is taken or not taken we shift in either a one (taken) or a zero (not taken) into the least significant bit of the corresponding BHSR.

| Index | Value |
|-------|-------|
|       |       |
| 0111  | ST    |
| 1000  | WT    |
| 1001  | WT    |
| 1010  | WT    |
| 1011  | ST    |
| 1100  | WT    |
| 1101  | ST    |
| 1110  | ST    |
| 1111  | SNT   |

- BHSR captures temporal pattern for that branch
- We use the BHSR to index into the PHT. A BHT has an entry per branch, but a PHT has an entry per branch pattern.
- The PHT says for a given pattern over the past n executions of a branch, should I take or not take the next execution of this branch?
- Once the two-level predictor is warmed up for previous nested loop example, the state of the PHT would be what is shown on the left
- Need at least four bits of "history" to learn this pattern and perfectly predict this branch

**Problem:** Multiple branches with same history might need different predictions. In other words, aliasing in the PHT can reduce accuracy.

**Solution:** Add multiple PHTs, use bits from PC to choose which PHT to use.



# HW BPNED: EXPIDITING SPATIAL COMELATION

The way one Branch is resolved may the A good motivation OF me way A later (DIFFERENT) Branch will resolve

If brance \$ 15 TAKEN (LE X 37) then braden I is always taken (ie x must be >5)

So where branch of is taken or NOT TAKEN CAN BE USES TO presict it we shows take Branch !



For Above example, Bush will capture HURRY - so we will me Bush will point to an every in me Put mar projects TAKEN.



# GENERALIZED TWO-LEVEL BUTS

COMBINED APPROAGE TO EXPLOIT DOM COMPLEX TEMPORAL CONCENTROL



DIFFERENCE from
DISCUSSIONS ON
COMPLEX TEMPORAL
CORRECTIONS IS
THAT WE PURPOSELY
CHOOSE A SMALLER
M TO CAUSE ALLASING
IN THE BURNT
SILLE THIS ALLASING
Allows US TO
CAPPINE SPATIAL
CORRECTION
(CHOOSE MYME ORDER
M 6175)

|                      |          | ore put $k=0$ | 0 < k < 30 | Pur for each<br>PC<br>k=30 |
|----------------------|----------|---------------|------------|----------------------------|
| OHE BHIN             | m=0      | GAg           | GAS        | GAP                        |
|                      | 0.CM(30  | PAg           | PAS        | PAP                        |
| ODE BHEN<br>For each | m=80     | SAg           | SAs .      | SAe                        |
|                      | J. Cu. j | q             | 7% ACC     | URACY                      |

Topic 10: Advanced Processors - Branch Prediction



Topic 10: Advanced Processors - Branch Prediction

# 3.6. Tournament Predictors

#### **Tournament Predictors**

Different predictors are better at predicting different types of branches.

One-level 2-bit saturating counter: loops

Two-level gshare: irregular code



HW BPND: PREDICTING TATGET ADDRESS

EVEN WIM TERST POSSIBLE PRESICTION OF TITATION OUTCOME STILL NEED TO WAIT FOR TATGET ADDRESS TO The DETERMINES.



CAN PUT BTB IN FETCH STAGE - Presidenty IF PC POINTS TO A GRANCH - PROSIDENT TARGET OF BRANCH - PROSIDENT IF BRANCH IS TAKEN.

SOMETIMES 6=0, It HIT THE ASSUME PRESICT TAKEN

#### COMBINE BTB AND BUT

- But D UPDATE BTB for J/Bravenes

BTB is Much More expersive man But BUT BTB earlier IN PIPELINE + CAD Accelerate JR

I UPDATE BUT for Brawers

COMBINE DIS if few entries wim But I many entries

#### RETURN APPRISS STACK PREDICTOR

BTB only wonis for In toward call nerums it Always Call function from same place (NOT realiste)

STACK presiction

- PUSH TARGET ADDRESS ON STACK FOR JAL JALR POP OFF TARGET ADDRESS FOR JR TO PRESICT TARGET

Move Stack Presictor who Feran AND presict which PC'S AM JR.

USE TOURAMENT PRESIDENT TO CHOOSE BETWEEN BTB AND STACK PREDICTOR.