# ECE 4750 Computer Architecture Fall 2024

# **Topic 1: Processor Concepts**

# School of Electrical and Computer Engineering Cornell University

revision: 2024-09-03-14-29

| 1 | Instruction Set Architecture                      |
|---|---------------------------------------------------|
|   | 1.1. IBM 360 Instruction Set Architecture         |
|   | 1.2. MIPS32 Instruction Set Architecture          |
|   | 1.3. Tiny RISC-V Instruction Set Architecture     |
| 2 | Processor Functional-Level Model                  |
|   | 2.1. Transactions and Steps                       |
|   | 2.2. TinyRV1 Simple Assembly Example              |
|   | 2.3. TinyRV1 VVAdd Asm and C Program              |
|   | 2.4. TinyRV1 Mystery Asm and C Program            |
| 3 | Processor/Laundry Analogy                         |
|   | 3.1. Arch vs. µArch vs. VLSI Impl                 |
|   | 3.2. Processor Microarchitectural Design Patterns |
|   | 3.3. Transaction Diagrams                         |
| 4 | Analyzing Processor Performance                   |



### 1. Instruction Set Architecture

By the early 1960s, IBM had several incompatible lines of computers!

- Defense: 701

- Scientific: 704, 709, 7090, 7094

- Business: 702, 705, 7080

- Mid-Sized Business: 1400

- Decimal Architectures: 7070, 7072, 7074

- Each system had its own:
  - Implementation and potentially even technology
  - Instruction set
  - I/O system and secondary storage (tapes, drums, disks)
  - Assemblers, compilers, libraries, etc.
  - Application niche

- IBM 360 was the first line of machines to separate ISA from microarchitecture
  - Enabled same software to run on different current and future microarchitectures
  - Reduced the impact of modifying the microarchitecture, enabling rapid innovation in hardware

| Application                  |
|------------------------------|
| Algorithm                    |
| Programming Language         |
| Operating System             |
| Compiler                     |
| Instruction Set Architecture |
| Microarchitecture            |
| Register-Transfer Level      |
| Gate Level                   |
| Circuits                     |
| Devices                      |
| Technology                   |

... the structure of a computer that a machine language programmer must understand to write a correct (timing independent) program for that machine.

— Amdahl, Blaauw, Brooks, 1964

## ISA is the contract between software and hardware

- 1.
  - Representations for characters, integers, floating-point
  - Integer formats can be signed or unsigned
  - Floating-point formats can be single- or double-precision
  - Byte addresses can be ordered within a word as either little- or big-endian
- 2.
  - Registers: general-purpose, floating-point, control/status
  - Memory: different address spaces for heap, stack, I/O
- 3.
  - Register: operand stored in registers
  - Immediate: operand is an immediate in the instruction
  - Direct: address of operand in memory is stored in instruction
  - Register Indirect: address of operand in memory is stored in register
  - Displacement: register indirect, addr is added to immediate
  - Autoincrement/decrement: register indirect, addr is automatically adjusted
  - PC-Relative: displacement is added to the program counter
- 4.
  - Integer and floating-point arithmetic instructions
  - Register and memory data movement instructions
  - Control transfer instructions
  - System control instructions
- 5.
  - Opcode, addresses of operands and destination, next instruction
  - Variable length vs. fixed length
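To make the byte-ordering point concrete, here is a minimal sketch using Python's built-in `int.to_bytes`. The value `0x11223344` is an arbitrary example chosen for illustration, not something from the lecture.

```python
word = 0x11223344  # arbitrary 32-bit example value (assumption)

# In each result, index 0 is the byte stored at the lowest address.
little = word.to_bytes(4, byteorder="little")
big = word.to_bytes(4, byteorder="big")

print(little.hex())  # 44332211: least-significant byte at the lowest address
print(big.hex())     # 11223344: most-significant byte at the lowest address
```

The ISA has to pick one convention (or expose a mode bit) so that a word written by one instruction reads back the same way through byte loads.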

## 1.1. IBM 360 Instruction Set Architecture

- How is data represented?
  - 8-bit bytes, 16-bit half-words, 32-bit words, 64-bit double-words
  - IBM 360 is why bytes are 8 bits long today!
- Where can data be stored?
  - 2<sup>24</sup> 8-bit memory locations
  - 16 general-purpose 32-bit registers and 4 floating-point 64-bit registers
  - Condition codes, control flags, program counter
- What operations can be performed on data?
  - Large number of arithmetic, data movement, and control instructions



M[B1 + D1] ← M[B1 + D1] op M[B2 + D2]

|               | Model 30      | Model 70              |
|---------------|---------------|-----------------------|
| Storage       | 8–64 KB          | 256–512 KB            |
| Datapath      | 8-bit            | 64-bit                |
| Circuit Delay | 30 ns/level      | 5 ns/level            |
| Local Store   | Main store       | Transistor registers  |
| Control Store | Read-only, 1 µs  | Conventional circuits |

- IBM 360 instruction set architecture completely hid the underlying technological differences between various models
- Significant Milestone: The first true ISA designed as a portable hardware-software interface
- IBM 360: 60 years later ... The zSeries z15 Microprocessor
  - 5+GHz in IBM 14 nm SOI
  - 9.2B transistors in 696 mm<sup>2</sup>
  - 17 metal layers
  - 12 cores per chip
  - Aggressive out-of-order execution
  - Four-level cache hierarchy
  - On-chip 256MB eDRAM L3 cache
  - Off-chip 960MB eDRAM L4 cache
  - Can still run IBM 360 code!



C. Berry, et al., "IBM z15: A 12-Core 5.2GHz Microprocessor," Int'l Solid-State Circuits Conference, Feb. 2020.

## 1.2. MIPS32 Instruction Set Architecture

- How is data represented?
  - 8-bit bytes, 16-bit half-words, 32-bit words
  - 32-bit single-precision, 64-bit double-precision floating point
- Where can data be stored?
  - 2<sup>32</sup> 8-bit memory locations
  - 32 general-purpose 32-bit registers, 32 SP (16 DP) floating-point registers
  - FP status register, Program counter
- How can data be accessed?
  - Register, immediate, displacement
- What operations can be performed on data?
  - Large number of arithmetic, data movement, and control instructions
- How are instructions encoded?
  - Fixed-length 32-bit instructions



MIPS R2K: 1986, single-issue, in-order, off-chip caches, 2 μm, 8–15 MHz, 110K transistors, 80 mm<sup>2</sup>



MIPS R10K: 1996, quad-issue, out-of-order, on-chip caches, 0.35 μm, 200 MHz, 6.8M transistors, 300 mm<sup>2</sup>

| 31:26           | 25:21 | 20:16 | 15:0      |
|-----------------|-------|-------|-----------|
| ADDIU<br>001001 | rs    | rt    | immediate |
| 6               | 5     | 5     | 16        |
Format: ADDIU rt, rs, immediate MIPS32

Purpose: Add Immediate Unsigned Word

To add a constant to a 32-bit integer

Description: GPR[rt] ← GPR[rs] + immediate

The 16-bit signed *immediate* is added to the 32-bit value in GPR rs and the 32-bit arithmetic result is placed into GPR rt.

No Integer Overflow exception occurs under any circumstances.

#### Restrictions:

None

#### Operation:

```
temp ← GPR[rs] + sign_extend(immediate)
GPR[rt] ← temp
```

#### Exceptions:

None

#### Programming Notes:

The term "unsigned" in the instruction name is a misnomer; this operation is 32-bit modulo arithmetic that does not trap on overflow. This instruction is appropriate for unsigned arithmetic, such as address arithmetic, or integer arithmetic environments that ignore overflow, such as C language arithmetic.
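The ADDIU semantics above can be sketched in a few lines of Python. This is only an illustrative model: the helper names `sign_extend` and `addiu` are made up for this example, and registers are modeled as plain integers masked to 32 bits.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def addiu(gpr_rs, immediate):
    """GPR[rt] <- GPR[rs] + sign_extend(immediate), wrapping modulo 2**32."""
    return (gpr_rs + sign_extend(immediate, 16)) & MASK32

print(hex(addiu(0x00000010, 0xFFFF)))  # immediate is -1, result 0xf
print(hex(addiu(0xFFFFFFFF, 0x0001)))  # wraps around to 0x0, no exception
```

Note that the immediate is sign-extended even though the mnemonic says "unsigned"; the only thing that is "unsigned" is the absence of an overflow trap.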

Load Word LW

| 31:26        | 25:21 | 20:16 | 15:0   |
|--------------|-------|-------|--------|
| LW<br>100011 | base  | rt    | offset |
| 6            | 5     | 5     | 16     |

Format: LW rt, offset(base) MIPS32

Purpose: Load Word

To load a word from memory as a signed value

**Description:** GPR[rt] ← memory[GPR[base] + offset]

The contents of the 32-bit word at the memory location specified by the aligned effective address are fetched, sign-extended to the GPR register length if necessary, and placed in GPR rt. The 16-bit signed offset is added to the contents of GPR base to form the effective address.

#### Restrictions:

The effective address must be naturally-aligned. If either of the 2 least-significant bits of the address is non-zero, an Address Error exception occurs.

#### Operation:

```
vAddr ← sign_extend(offset) + GPR[base]
if vAddr[1..0] ≠ 0^2 then
    SignalException(AddressError)
endif
(pAddr, CCA) ← AddressTranslation(vAddr, DATA, LOAD)
memword ← LoadMemory(CCA, WORD, pAddr, vAddr, DATA)
GPR[rt] ← memword
```

#### **Exceptions:**

TLB Refill, TLB Invalid, Bus Error, Address Error, Watch
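A rough functional sketch of LW follows. It assumes a flat byte array for memory, big-endian byte order, and no address translation, so `AddressTranslation` and the TLB-related exceptions are left out. The names `AddressError`, `sign_extend16`, and `lw` are illustrative only.

```python
class AddressError(Exception):
    """Stands in for the MIPS32 Address Error exception."""

def sign_extend16(value):
    value &= 0xFFFF
    return (value ^ 0x8000) - 0x8000

def lw(memory, gpr_base, offset):
    """Return the big-endian 32-bit word at GPR[base] + sign_extend(offset)."""
    vaddr = (gpr_base + sign_extend16(offset)) & 0xFFFFFFFF
    if vaddr & 0b11:  # either of the two low address bits is non-zero
        raise AddressError(hex(vaddr))
    return int.from_bytes(memory[vaddr:vaddr + 4], byteorder="big")

memory = bytes(range(16))           # byte at address i holds the value i
print(hex(lw(memory, 8, 0xFFFC)))   # offset is -4, loads address 4: 0x4050607
```

Calling `lw(memory, 2, 0)` raises `AddressError` because address 2 is not a multiple of 4.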

Load Word Left LWL

| 31:26         | 25:21 | 20:16 | 15:0   |
|---------------|-------|-------|--------|
| LWL<br>100010 | base  | rt    | offset |
| 6             | 5     | 5     | 16     |

Format: LWL rt, offset(base) MIPS32

Purpose: Load Word Left

To load the most-significant part of a word as a signed value from an unaligned memory address

**Description:** GPR[rt] ← GPR[rt] MERGE memory[GPR[base] + offset]

The 16-bit signed offset is added to the contents of GPR base to form an effective address (EffAddr). EffAddr is the address of the most-significant of 4 consecutive bytes forming a word (W) in memory starting at an arbitrary byte boundary.

The most-significant 1 to 4 bytes of W is in the aligned word containing the EffAddr. This part of W is loaded into the most-significant (left) part of the word in GPR rt. The remaining least-significant part of the word in GPR rt is unchanged.

The figure below illustrates this operation using big-endian byte ordering for 32-bit and 64-bit registers. The 4 consecutive bytes in 2..5 form an unaligned word starting at location 2. A part of W, 2 bytes, is in the aligned word containing the most-significant byte at 2. First, LWL loads these 2 bytes into the left part of the destination register word and leaves the right part of the destination word unchanged. Next, the complementary LWR loads the remainder of the unaligned word.

Figure 3.4 Unaligned Word Load Using LWL and LWR

The bytes loaded from memory to the destination register depend on both the offset of the effective address within an aligned word, that is, the low 2 bits of the address ( $vAddr_{1..0}$ ), and the current byte-ordering mode of the processor (big- or little-endian). The figure below shows the bytes loaded for every combination of offset and byte ordering.
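The LWL/LWR pairing can be sketched for the big-endian, 32-bit-register case described above. This is a simplified model under those assumptions; the function names and the initial register value `0xAAAAAAAA` are illustrative, and little-endian mode is not handled.

```python
MASK32 = 0xFFFFFFFF

def aligned_word(memory, addr):
    """Big-endian aligned word that contains byte address addr."""
    base = addr & ~0b11
    return int.from_bytes(memory[base:base + 4], byteorder="big")

def lwl(memory, reg, eff_addr):
    shift = 8 * (eff_addr & 0b11)
    keep = (1 << shift) - 1                 # low bytes of reg left unchanged
    return ((aligned_word(memory, eff_addr) << shift) & MASK32) | (reg & keep)

def lwr(memory, reg, eff_addr):
    nbytes = (eff_addr & 0b11) + 1          # bytes taken from the aligned word
    shift = 8 * (4 - nbytes)
    load_mask = (1 << (8 * nbytes)) - 1     # low bytes of reg that are replaced
    return (reg & ~load_mask & MASK32) | (aligned_word(memory, eff_addr) >> shift)

memory = bytes(range(8))   # byte at address i holds the value i
rt = 0xAAAAAAAA            # arbitrary prior register contents
rt = lwl(memory, rt, 2)    # most-significant byte of W is at address 2
print(hex(rt))             # 0x203aaaa
rt = lwr(memory, rt, 5)    # least-significant byte of W is at address 5
print(hex(rt))             # 0x2030405
```

After the pair executes, the register holds bytes 2..5, which is the unaligned word from the example in the text.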

| 31:26         | 25:21 | 20:16 | 15:0   |
|---------------|-------|-------|--------|
| BNE<br>000101 | rs    | rt    | offset |
| 6             | 5     | 5     | 16     |

Format: BNE rs, rt, offset MIPS32

Purpose: Branch on Not Equal

To compare GPRs then do a PC-relative conditional branch

**Description:** if GPR[rs] ≠ GPR[rt] then branch

An 18-bit signed offset (the 16-bit offset field shifted left 2 bits) is added to the address of the instruction following the branch (not the branch itself), in the branch delay slot, to form a PC-relative effective target address.

If the contents of GPR rs and GPR rt are not equal, branch to the effective target address after the instruction in the delay slot is executed.

#### Restrictions:

Processor operation is UNPREDICTABLE if a branch, jump, ERET, DERET, or WAIT instruction is placed in the delay slot of a branch or jump.

#### Operation:

#### **Exceptions:**

None

#### Programming Notes:

With the 18-bit signed instruction offset, the conditional branch range is ± 128 KBytes. Use jump (J) or jump register (JR) instructions to branch to addresses outside this range.
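The target-address calculation and the ±128 KB range can be checked with a short sketch. The helper names and the example PC value `0x1000` are assumptions made for illustration.

```python
def sign_extend(value, bits):
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def bne_target(branch_pc, offset_field):
    """Target = address of the delay-slot instruction + sign_extend(offset << 2)."""
    delay_slot_pc = branch_pc + 4
    return (delay_slot_pc + sign_extend(offset_field << 2, 18)) & 0xFFFFFFFF

print(hex(bne_target(0x1000, 0x0003)))  # 0x1004 + 12 gives 0x1010
print(hex(bne_target(0x1000, 0xFFFF)))  # 0x1004 - 4 gives 0x1000

# Extremes of the 18-bit signed byte offset, in bytes
print(sign_extend(0x7FFF << 2, 18), sign_extend(0x8000 << 2, 18))  # 131068 -131072
```

The extremes work out to just under +128 KB and exactly -128 KB, measured from the instruction in the delay slot rather than from the branch itself.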

## 1.3. Tiny RISC-V Instruction Set Architecture

- RISC-V instruction set architecture
  - Brand new free, open instruction set architecture
  - Significant excitement around RISC-V hardware/software ecosystem
  - Helping to energize "open-source hardware"
  - Specifically designed to encourage subsetting and extension
  - Link to official ISA manual on course webpage
- Tiny RISC-V instruction set architecture
  - Subset we use in this course
  - Small enough for teaching, powerful enough for running real C programs
  - How is data represented?
  - Where can data be stored?
  - How can data be accessed?
  - What ops can be performed on data?
  - How are inst encoded?
  - http://www.csl.cornell.edu/courses/ece4750/handouts.shtml
- TinyRV1: Small subset suitable for lecture, problems, exams

- TinyRV2: Subset suitable for lab assignments and capable of executing simple C programs without an operating system
  - add, addi, sub, mul, and, andi, or, ori, xor, xori
  - slt, slti, sltu, sltiu
  - sra, srai, srl, srli, sll, slli
  - lui, auipc, lw, sw
  - jal, jalr, beq, bne, blt, bge, bltu, bgeu
  - csrr, csrw

## TinyRV1 instruction assembly, semantics, and encoding

#### ADD

| add rd, rs1, rs2        |
|-------------------------|
| R[rd] ← R[rs1] + R[rs2] |
| PC ← PC + 4             |

| 31:25   | 24:20 | 19:15 | 14:12 | 11:7 | 6:0     |
|---------|-------|-------|-------|------|---------|
| 0000000 | rs2   | rs1   | 000   | rd   | 0110011 |
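As a quick check on the ADD encoding, the fields can be packed into a 32-bit word with shifts and ORs. The function names below are made up for this sketch, and register specifiers are passed as plain integers.

```python
def encode_r_type(funct7, rs2, rs1, funct3, rd, opcode):
    """Pack the six R-type fields into a 32-bit instruction word."""
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
            (funct3 << 12) | (rd << 7) | opcode)

def encode_add(rd, rs1, rs2):
    return encode_r_type(0b0000000, rs2, rs1, 0b000, rd, 0b0110011)

print(f"{encode_add(rd=3, rs1=1, rs2=2):08x}")  # add x3, x1, x2 is 002081b3
```

MUL uses the same field layout, so the same helper would apply with its own funct7 value.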

#### ADDI

| addi rd, rs1, imm          |
|----------------------------|
| R[rd] ← R[rs1] + sext(imm) |
| PC ← PC + 4                |

| 31:20 | 19:15 | 14:12 | 11:7 | 6:0     |
|-------|-------|-------|------|---------|
| imm   | rs1   | 000   | rd   | 0010011 |

#### MUL

| mul rd, rs1, rs2        |
|-------------------------|
| R[rd] ← R[rs1] × R[rs2] |
| PC ← PC + 4             |

| 31:25   | 24:20 | 19:15 | 14:12 | 11:7 | 6:0     |
|---------|-------|-------|-------|------|---------|
| 0000001 | rs2   | rs1   | 000   | rd   | 0110011 |

#### LW

| lw rd, imm(rs1)                |
|--------------------------------|
| R[rd] ← M[R[rs1] + sext(imm)]  |
| PC ← PC + 4                    |

| 31:20 | 19:15 | 14:12 | 11:7 | 6:0     |
|-------|-------|-------|------|---------|
| imm   | rs1   | 010   | rd   | 0000011 |

#### SW

| sw rs2, imm(rs1)                |
|---------------------------------|
| M[R[rs1] + sext(imm)] ← R[rs2]  |
| PC ← PC + 4                     |

| 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:0     |
|-------|-------|-------|-------|------|---------|
| imm   | rs2   | rs1   | 010   | imm  | 0100011 |

imm = { inst[31:25], inst[11:7] }
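The split store immediate is easy to get wrong, so here is a small sketch that packs and unpacks it. `encode_sw` and `decode_sw_imm` are hypothetical helper names, not part of any course tool.

```python
def encode_sw(rs2, rs1, imm):
    imm &= 0xFFF                 # 12-bit two's-complement immediate
    imm_hi = imm >> 5            # goes in inst[31:25]
    imm_lo = imm & 0x1F          # goes in inst[11:7]
    return ((imm_hi << 25) | (rs2 << 20) | (rs1 << 15) |
            (0b010 << 12) | (imm_lo << 7) | 0b0100011)

def decode_sw_imm(inst):
    imm = (((inst >> 25) & 0x7F) << 5) | ((inst >> 7) & 0x1F)
    return (imm ^ 0x800) - 0x800  # sign-extend from 12 bits

print(f"{encode_sw(rs2=2, rs1=1, imm=8):08x}")           # sw x2, 8(x1) is 0020a423
print(decode_sw_imm(encode_sw(rs2=2, rs1=1, imm=-4)))    # -4
```

Splitting the immediate this way keeps rs1 and rs2 in the same bit positions as in the R-type format.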

#### JAL

| jal rd, imm         |
|---------------------|
| R[rd] ← PC + 4      |
| PC ← PC + sext(imm) |

| 31:12 | 11:7 | 6:0     |
|-------|------|---------|
| imm   | rd   | 1101111 |

imm = { inst[31], inst[19:12], inst[20], inst[30:21], 0 }
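The JAL immediate is scrambled across inst[31:12], so a short sketch of the bit shuffling may help. The helper names are illustrative, and the jump offset is assumed to be an even byte offset.

```python
def encode_jal(rd, imm):
    imm &= 0x1FFFFF  # 21-bit two's-complement byte offset, bit 0 is always 0
    return ((((imm >> 20) & 0x1) << 31) |    # imm[20]    -> inst[31]
            (((imm >> 1) & 0x3FF) << 21) |   # imm[10:1]  -> inst[30:21]
            (((imm >> 11) & 0x1) << 20) |    # imm[11]    -> inst[20]
            (((imm >> 12) & 0xFF) << 12) |   # imm[19:12] -> inst[19:12]
            (rd << 7) | 0b1101111)

def decode_jal_imm(inst):
    imm = ((((inst >> 31) & 0x1) << 20) |
           (((inst >> 12) & 0xFF) << 12) |
           (((inst >> 20) & 0x1) << 11) |
           (((inst >> 21) & 0x3FF) << 1))
    return (imm ^ (1 << 20)) - (1 << 20)     # sign-extend from 21 bits

print(f"{encode_jal(rd=1, imm=8):08x}")        # jal x1, 8 is 008000ef
print(decode_jal_imm(encode_jal(rd=1, imm=-8)))  # -8
```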

#### JR

| jr rs1      |
|-------------|
| PC ← R[rs1] |

| 31:20        | 19:15 | 14:12 | 11:7  | 6:0     |
|--------------|-------|-------|-------|---------|
| 000000000000 | rs1   | 000   | 00000 | 1100111 |

#### BNE

| bne rs1, rs2, imm                            |
|----------------------------------------------|
| if ( R[rs1] != R[rs2] ) PC ← PC + sext(imm)  |
| else PC ← PC + 4                             |

| 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:0     |
|-------|-------|-------|-------|------|---------|
| imm   | rs2   | rs1   | 001   | imm  | 1100011 |

imm = { inst[31], inst[7], inst[30:25], inst[11:8], 0 }
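The branch immediate follows the same idea as the store immediate, with two bits moved. A minimal sketch, again with hypothetical helper names and assuming an even byte offset:

```python
def encode_bne(rs1, rs2, imm):
    imm &= 0x1FFF  # 13-bit two's-complement byte offset, bit 0 is always 0
    return ((((imm >> 12) & 0x1) << 31) |   # imm[12]   -> inst[31]
            (((imm >> 5) & 0x3F) << 25) |   # imm[10:5] -> inst[30:25]
            (rs2 << 20) | (rs1 << 15) | (0b001 << 12) |
            (((imm >> 1) & 0xF) << 8) |     # imm[4:1]  -> inst[11:8]
            (((imm >> 11) & 0x1) << 7) |    # imm[11]   -> inst[7]
            0b1100011)

def decode_bne_imm(inst):
    imm = ((((inst >> 31) & 0x1) << 12) |
           (((inst >> 7) & 0x1) << 11) |
           (((inst >> 25) & 0x3F) << 5) |
           (((inst >> 8) & 0xF) << 1))
    return (imm ^ (1 << 12)) - (1 << 12)    # sign-extend from 13 bits

print(f"{encode_bne(rs1=1, rs2=2, imm=8):08x}")          # bne x1, x2, 8 is 00209463
print(decode_bne_imm(encode_bne(rs1=1, rs2=2, imm=-8)))  # -8
```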




Base Integer Instructions: RV32I, RV64I, and RV128I

| Category    | Name                   | Fmt | RV32I Base        | +RV{64,128}              |
|-------------|------------------------|-----|-------------------|--------------------------|
| Loads       | Load Byte              | I   | LB rd,rs1,imm     |                          |
|             | Load Halfword          | I   | LH rd,rs1,imm     |                          |
|             | Load Word              | I   | LW rd,rs1,imm     | L{D\|Q} rd,rs1,imm       |
|             | Load Byte Unsigned     | I   | LBU rd,rs1,imm    |                          |
|             | Load Half Unsigned     | I   | LHU rd,rs1,imm    | L{W\|D}U rd,rs1,imm      |
| Stores      | Store Byte             | S   | SB rs1,rs2,imm    |                          |
|             | Store Halfword         | S   | SH rs1,rs2,imm    |                          |
|             | Store Word             | S   | SW rs1,rs2,imm    | S{D\|Q} rs1,rs2,imm      |
| Shifts      | Shift Left             | R   | SLL rd,rs1,rs2    | SLL{W\|D} rd,rs1,rs2     |
|             | Shift Left Immediate   | I   | SLLI rd,rs1,shamt | SLLI{W\|D} rd,rs1,shamt  |
|             | Shift Right            | R   | SRL rd,rs1,rs2    | SRL{W\|D} rd,rs1,rs2     |
|             | Shift Right Immediate  | I   | SRLI rd,rs1,shamt | SRLI{W\|D} rd,rs1,shamt  |
|             | Shift Right Arithmetic | R   | SRA rd,rs1,rs2    | SRA{W\|D} rd,rs1,rs2     |
|             | Shift Right Arith Imm  | I   | SRAI rd,rs1,shamt | SRAI{W\|D} rd,rs1,shamt  |
| Arithmetic  | ADD                    | R   | ADD rd,rs1,rs2    | ADD{W\|D} rd,rs1,rs2     |
|             | ADD Immediate          | I   | ADDI rd,rs1,imm   | ADDI{W\|D} rd,rs1,imm    |
|             | SUBtract               | R   | SUB rd,rs1,rs2    | SUB{W\|D} rd,rs1,rs2     |
|             | Load Upper Imm         | U   | LUI rd,imm        |                          |
|             | Add Upper Imm to PC    | U   | AUIPC rd,imm      |                          |
| Logical     | XOR                    | R   | XOR rd,rs1,rs2    |                          |
|             | XOR Immediate          | I   | XORI rd,rs1,imm   |                          |
|             | OR                     | R   | OR rd,rs1,rs2     |                          |
|             | OR Immediate           | I   | ORI rd,rs1,imm    |                          |
|             | AND                    | R   | AND rd,rs1,rs2    |                          |
|             | AND Immediate          | I   | ANDI rd,rs1,imm   |                          |
| Compare     | Set <                  | R   | SLT rd,rs1,rs2    |                          |
|             | Set < Immediate        | I   | SLTI rd,rs1,imm   |                          |
|             | Set < Unsigned         | R   | SLTU rd,rs1,rs2   |                          |
|             | Set < Imm Unsigned     | I   | SLTIU rd,rs1,imm  |                          |
| Branches    | Branch =               | SB  | BEQ rs1,rs2,imm   |                          |
|             | Branch ≠               | SB  | BNE rs1,rs2,imm   |                          |
|             | Branch <               | SB  | BLT rs1,rs2,imm   |                          |
|             | Branch ≥               | SB  | BGE rs1,rs2,imm   |                          |
|             | Branch < Unsigned      | SB  | BLTU rs1,rs2,imm  |                          |
|             | Branch ≥ Unsigned      | SB  | BGEU rs1,rs2,imm  |                          |
| Jump & Link | J&L                    | UJ  | JAL rd,imm        |                          |
|             | Jump & Link Register   | UJ  | JALR rd,rs1,imm   |                          |
| Synch       | Synch thread           | I   | FENCE             |                          |
|             | Synch Instr & Data     | I   | FENCE.I           |                          |
| System      | System CALL            | I   | SCALL             |                          |
|             | System BREAK           | I   | SBREAK            |                          |
| Counters    | ReaD CYCLE             | I   | RDCYCLE rd        |                          |
|             | ReaD CYCLE upper Half  | I   | RDCYCLEH rd       |                          |
|             | ReaD TIME              | I   | RDTIME rd         |                          |
|             | ReaD TIME upper Half   | I   | RDTIMEH rd        |                          |
|             | ReaD INSTR RETired     | I   | RDINSTRET rd      |                          |
|             | ReaD INSTR upper Half  | I   | RDINSTRETH rd     |                          |

RV Privileged Instructions

| Category      | Name                          | RV mnemonic        |
|---------------|-------------------------------|--------------------|
| CSR Access    | Atomic R/W                    | CSRRW rd,csr,rs1   |
|               | Atomic Read & Set Bit         | CSRRS rd,csr,rs1   |
|               | Atomic Read & Clear Bit       | CSRRC rd,csr,rs1   |
|               | Atomic R/W Imm                | CSRRWI rd,csr,imm  |
|               | Atomic Read & Set Bit Imm     | CSRRSI rd,csr,imm  |
|               | Atomic Read & Clear Bit Imm   | CSRRCI rd,csr,imm  |
| Change Level  | Env. Call                     | ECALL              |
|               | Environment Breakpoint        | EBREAK             |
|               | Environment Return            | ERET               |
| Trap Redirect | to Supervisor                 | MRTS               |
|               | Redirect Trap to Hypervisor   | MRTH               |
|               | Hypervisor Trap to Supervisor | HRTS               |
| Interrupt     | Wait for Interrupt            | WFI                |
| MMU           | Supervisor FENCE              | SFENCE.VM rs1      |

Optional Compressed (16-bit) Instruction Extension: RVC

| Category    | Name                 | Fmt | RVC                    | RVI equivalent       |
|-------------|----------------------|-----|------------------------|----------------------|
| Loads       | Load Word            |     | C.LW rd',rs1',imm      | LW rd',rs1',imm*4    |
|             | Load Word SP         |     | C.LWSP rd,imm          | LW rd,sp,imm*4       |
|             | Load Double          |     | C.LD rd',rs1',imm      | LD rd',rs1',imm*8    |
|             | Load Double SP       |     | C.LDSP rd,imm          | LD rd,sp,imm*8       |
|             | Load Quad            |     | C.LQ rd',rs1',imm      | LQ rd',rs1',imm*16   |
|             | Load Quad SP         |     | C.LQSP rd,imm          | LQ rd,sp,imm*16      |
| Stores      | Store Word           |     | C.SW rs1',rs2',imm     | SW rs1',rs2',imm*4   |
|             | Store Word SP        |     | C.SWSP rs2,imm         | SW rs2,sp,imm*4      |
|             | Store Double         |     | C.SD rs1',rs2',imm     | SD rs1',rs2',imm*8   |
|             | Store Double SP      |     | C.SDSP rs2,imm         | SD rs2,sp,imm*8      |
|             | Store Quad           |     | C.SQ rs1',rs2',imm     | SQ rs1',rs2',imm*16  |
|             | Store Quad SP        |     | C.SQSP rs2,imm         | SQ rs2,sp,imm*16     |
| Arithmetic  | ADD                  | CR  | C.ADD rd,rs1           | ADD rd,rd,rs1        |
|             | ADD Word             | CR  | C.ADDW rd,rs1          | ADDW rd,rd,imm       |
|             | ADD Immediate        | CI  | C.ADDI rd,imm          | ADDI rd,rd,imm       |
|             | ADD Word Imm         |     | C.ADDIW rd,imm         | ADDIW rd,rd,imm      |
|             | ADD SP Imm * 16      |     | C.ADDI16SP x0,imm      | ADDI sp,sp,imm*16    |
|             | ADD SP Imm * 4       |     | C.ADDI4SPN rd',imm     | ADDI rd',sp,imm*4    |
|             | Load Immediate       |     | C.LI rd,imm            | ADDI rd,x0,imm       |
|             | Load Upper Imm       |     | C.LUI rd,imm           | LUI rd,imm           |
|             | MoVe                 |     | C.MV rd,rs1            | ADD rd,rs1,x0        |
|             | SUB                  |     | C.SUB rd,rs1           | SUB rd,rd,rs1        |
| Shifts      | Shift Left Imm       |     | C.SLLI rd,imm          | SLLI rd,rd,imm       |
| Branches    | Branch=0             |     | C.BEQZ rs1',imm        | BEQ rs1',x0,imm      |
|             | Branch≠0             |     | C.BNEZ rs1',imm        | BNE rs1',x0,imm      |
| Jump        | Jump                 |     | C.J imm                | JAL x0,imm           |
|             | Jump Register        | CR  | C.JR rd,rs1            | JALR x0,rs1,0        |
| Jump & Link | J&L                  |     | C.JAL imm              | JAL ra,imm           |
|             | Jump & Link Register | CR  | C.JALR rs1             | JALR ra,rs1,0        |
| System      | Env. BREAK           | CI  | C.EBREAK               | EBREAK               |

32-bit Instruction Formats

| Fmt | Fields, most-significant to least-significant                   |
|-----|-----------------------------------------------------------------|
| R   | funct7, rs2, rs1, funct3, rd, opcode                            |
| I   | imm[11:0], rs1, funct3, rd, opcode                              |
| S   | imm[11:5], rs2, rs1, funct3, imm[4:0], opcode                   |
| SB  | imm[12], imm[10:5], rs2, rs1, funct3, imm[4:1], imm[11], opcode |
| U   | imm[31:12], rd, opcode                                          |
| UJ  | imm[20], imm[10:1], imm[11], imm[19:12], rd, opcode             |

16-bit (RVC) Instruction Formats

| Fmt | Fields, most-significant to least-significant |
|-----|-----------------------------------------------|
| CR  | funct4, rd/rs1, rs2, op                       |
| CI  | funct3, imm, rd/rs1, imm, op                  |
| CSS | funct3, imm, rs2, op                          |
| CIW | funct3, imm, rd', op                          |
| CL  | funct3, imm, rs1', imm, rd', op               |
| CS  | funct3, imm, rs1', imm, rs2', op              |
| CB  | funct3, offset, rs1', offset, op              |
| CJ  | funct3, jump target, op                       |

RISC-V Integer Base (RV32I/64I/128I), privileged, and optional compressed extension (RVC). Registers x1-x31 and the pc are 32 bits wide in RV32I, 64 in RV64I, and 128 in RV128I (x0=0). RV64I/128I add 10 instructions for the wider formats. The RVI base of <50 classic integer RISC instructions is required. Every 16-bit RVC instruction matches an existing 32-bit RVI instruction. See risc.org.




**Optional Multiply-Divide Instruction Extension: RVM**

| Category  | Name                    | Fmt | RV32M (Multiply-Divide) | +RV{64,128}            |
|-----------|-------------------------|-----|-------------------------|------------------------|
| Multiply  | MULtiply                | R   | MUL rd,rs1,rs2          | MUL{W\|D} rd,rs1,rs2   |
|           | MULtiply upper Half     | R   | MULH rd,rs1,rs2         |                        |
|           | MULtiply Half Sign/Uns  | R   | MULHSU rd,rs1,rs2       |                        |
|           | MULtiply upper Half Uns | R   | MULHU rd,rs1,rs2        |                        |
| Divide    | DIVide                  | R   | DIV rd,rs1,rs2          | DIV{W\|D} rd,rs1,rs2   |
|           | DIVide Unsigned         | R   | DIVU rd,rs1,rs2         |                        |
| Remainder | REMainder               | R   | REM rd,rs1,rs2          | REM{W\|D} rd,rs1,rs2   |
|           | REMainder Unsigned      | R   | REMU rd,rs1,rs2         | REMU{W\|D} rd,rs1,rs2  |

**Optional Atomic Instruction Extension: RVA**

| Category | Name              | Fmt | RV32A (Atomic)       | +RV{64,128}                |
|----------|-------------------|-----|----------------------|----------------------------|
| Load     | Load Reserved     | R   | LR.W rd,rs1          | LR.{D\|Q} rd,rs1           |
| Store    | Store Conditional | R   | SC.W rd,rs1,rs2      | SC.{D\|Q} rd,rs1,rs2       |
| Swap     | SWAP              | R   | AMOSWAP.W rd,rs1,rs2 | AMOSWAP.{D\|Q} rd,rs1,rs2  |
| Add      | ADD               | R   | AMOADD.W rd,rs1,rs2  | AMOADD.{D\|Q} rd,rs1,rs2   |
| Logical  | XOR               | R   | AMOXOR.W rd,rs1,rs2  | AMOXOR.{D\|Q} rd,rs1,rs2   |
|          | AND               | R   | AMOAND.W rd,rs1,rs2  | AMOAND.{D\|Q} rd,rs1,rs2   |
|          | OR                | R   | AMOOR.W rd,rs1,rs2   | AMOOR.{D\|Q} rd,rs1,rs2    |
| Min/Max  | MINimum           | R   | AMOMIN.W rd,rs1,rs2  | AMOMIN.{D\|Q} rd,rs1,rs2   |
|          | MAXimum           | R   | AMOMAX.W rd,rs1,rs2  | AMOMAX.{D\|Q} rd,rs1,rs2   |
|          | MINimum Unsigned  | R   | AMOMINU.W rd,rs1,rs2 | AMOMINU.{D\|Q} rd,rs1,rs2  |
|          | MAXimum Unsigned  | R   | AMOMAXU.W rd,rs1,rs2 | AMOMAXU.{D\|Q} rd,rs1,rs2  |

**Optional Floating-Point Instruction Extensions: RVF, RVD, RVQ**

| Category       | Name                       | Fmt | RV32{F\|D\|Q} (HP/SP,DP,QP Fl Pt) | +RV{64,128}                          |
|----------------|----------------------------|-----|-----------------------------------|--------------------------------------|
| Move           | Move from Integer          | R   | FMV.{H\|S}.X rd,rs1               | FMV.{D\|Q}.X rd,rs1                  |
|                | Move to Integer            | R   | FMV.X.{H\|S} rd,rs1               | FMV.X.{D\|Q} rd,rs1                  |
| Convert        | Convert from Int           | R   | FCVT.{H\|S\|D\|Q}.W rd,rs1        | FCVT.{H\|S\|D\|Q}.{L\|T} rd,rs1      |
|                | Convert from Int Unsigned  | R   | FCVT.{H\|S\|D\|Q}.WU rd,rs1       | FCVT.{H\|S\|D\|Q}.{L\|T}U rd,rs1     |
|                | Convert to Int             | R   | FCVT.W.{H\|S\|D\|Q} rd,rs1        | FCVT.{L\|T}.{H\|S\|D\|Q} rd,rs1      |
|                | Convert to Int Unsigned    | R   | FCVT.WU.{H\|S\|D\|Q} rd,rs1       | FCVT.{L\|T}U.{H\|S\|D\|Q} rd,rs1     |
| Load           | Load                       | I   | FL{W,D,Q} rd,rs1,imm              |                                      |
| Store          | Store                      | S   | FS{W,D,Q} rs1,rs2,imm             |                                      |
| Arithmetic     | ADD                        | R   | FADD.{S\|D\|Q} rd,rs1,rs2         |                                      |
|                | SUBtract                   | R   | FSUB.{S\|D\|Q} rd,rs1,rs2         |                                      |
|                | MULtiply                   | R   | FMUL.{S\|D\|Q} rd,rs1,rs2         |                                      |
|                | DIVide                     | R   | FDIV.{S\|D\|Q} rd,rs1,rs2         |                                      |
|                | SQuare RooT                | R   | FSQRT.{S\|D\|Q} rd,rs1            |                                      |
| Mul-Add        | Multiply-ADD               | R   | FMADD.{S\|D\|Q} rd,rs1,rs2,rs3    |                                      |
|                | Multiply-SUBtract          | R   | FMSUB.{S\|D\|Q} rd,rs1,rs2,rs3    |                                      |
|                | Negative Multiply-SUBtract | R   | FNMSUB.{S\|D\|Q} rd,rs1,rs2,rs3   |                                      |
|                | Negative Multiply-ADD      | R   | FNMADD.{S\|D\|Q} rd,rs1,rs2,rs3   |                                      |
| Sign Inject    | SiGN source                | R   | FSGNJ.{S\|D\|Q} rd,rs1,rs2        |                                      |
|                | Negative SiGN source       | R   | FSGNJN.{S\|D\|Q} rd,rs1,rs2       |                                      |
|                | Xor SiGN source            | R   | FSGNJX.{S\|D\|Q} rd,rs1,rs2       |                                      |
| Min/Max        | MINimum                    | R   | FMIN.{S\|D\|Q} rd,rs1,rs2         |                                      |
|                | MAXimum                    | R   | FMAX.{S\|D\|Q} rd,rs1,rs2         |                                      |
| Compare        | Compare Float =            | R   | FEQ.{S\|D\|Q} rd,rs1,rs2          |                                      |
|                | Compare Float <            | R   | FLT.{S\|D\|Q} rd,rs1,rs2          |                                      |
|                | Compare Float ≤            | R   | FLE.{S\|D\|Q} rd,rs1,rs2          |                                      |
| Categorization | Classify Type              | R   | FCLASS.{S\|D\|Q} rd,rs1           |                                      |
| Configuration  | Read Status                | R   | FRCSR rd                          |                                      |
|                | Read Rounding Mode         | R   | FRRM rd                           |                                      |
|                | Read Flags                 | R   | FRFLAGS rd                        |                                      |
|                | Swap Status Reg            | R   | FSCSR rd,rs1                      |                                      |
|                | Swap Rounding Mode         | R   | FSRM rd,rs1                       |                                      |
|                | Swap Flags                 | R   | FSFLAGS rd,rs1                    |                                      |
|                | Swap Rounding Mode Imm     | I   | FSRMI rd,imm                      |                                      |
|                | Swap Flags Imm             | I   | FSFLAGSI rd,imm                   |                                      |

**RISC-V Calling Convention**

| Register | ABI Name | Saver  | Description                      |
|----------|----------|--------|----------------------------------|
| x0       | zero     |        | Hard-wired zero                  |
| x1       | ra       | Caller | Return address                   |
| x2       | sp       | Callee | Stack pointer                    |
| x3       | gp       |        | Global pointer                   |
| x4       | tp       |        | Thread pointer                   |
| x5-7     | t0-2     | Caller | Temporaries                      |
| x8       | s0/fp    | Callee | Saved register/frame pointer     |
| x9       | s1       | Callee | Saved register                   |
| x10-11   | a0-1     | Caller | Function arguments/return values |
| x12-17   | a2-7     | Caller | Function arguments               |
| x18-27   | s2-11    | Callee | Saved registers                  |
| x28-31   | t3-t6    | Caller | Temporaries                      |
| f0-7     | ft0-7    | Caller | FP temporaries                   |
| f8-9     | fs0-1    | Callee | FP saved registers               |
| f10-11   | fa0-1    | Caller | FP arguments/return values       |
| f12-17   | fa2-7    | Caller | FP arguments                     |
| f18-27   | fs2-11   | Callee | FP saved registers               |
| f28-31   | ft8-11   | Caller | FP temporaries                   |

RISC-V calling convention and five optional extensions: 10 multiply-divide instructions (RV32M); 11 optional atomic instructions (RV32A); and 25 floating-point instructions each for single-, double-, and quadruple-precision (RV32F, RV32D, RV32Q). The latter add registers f0-f31, whose width matches the widest precision, and a floating-point control and status register fcsr. Each larger address adds some instructions: 4 for RVM, 11 for RVA, and 6 each for RVF/D/Q. Using regex notation, {} means set, so L{D|Q} is both LD and LQ. See riscv.org. (8/21/15 revision)

## 2. Processor Functional-Level Model



## 2.1. Transactions and Steps

- We can think of each instruction as a transaction
- Executing a transaction involves a sequence of steps

|                     | add | addi | mul | lw | sw | jal | jr | bne |
|---------------------|-----|------|-----|----|----|-----|----|-----|
| Fetch Instruction   |     |      |     |    |    |     |    |     |
| Decode Instruction  |     |      |     |    |    |     |    |     |
| Read Register File  |     |      |     |    |    |     |    |     |
| Register Arithmetic |     |      |     |    |    |     |    |     |
| Read Memory         |     |      |     |    |    |     |    |     |
| Write Memory        |     |      |     |    |    |     |    |     |
| Write Register File |     |      |     |    |    |     |    |     |
| Update PC           |     |      |     |    |    |     |    |     |
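The steps in the table above can be sketched as a small functional-level model in C. This is only an illustrative sketch: the pre-decoded `inst_t` struct, the 64-word data memory, and the names `state_t`, `step`, and `run` are assumptions made for this example and are not part of the TinyRV1 specification or the course infrastructure. Each comment names the step from the table that the line corresponds to.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical pre-decoded instruction; real TinyRV1 uses 32-bit encodings. */
typedef enum { ADD, ADDI, MUL, LW, SW, JAL, JR, BNE } op_t;

typedef struct {
  op_t    op;
  int     rd, rs1, rs2;   /* register specifiers, 0..31 */
  int32_t imm;            /* immediate or byte offset */
} inst_t;

typedef struct {
  uint32_t pc;        /* byte address; instruction k sits at address 4*k */
  uint32_t x[32];     /* general-purpose registers; x[0] always reads zero */
  uint32_t mem[64];   /* small word-indexed data memory (assumed size) */
} state_t;

/* Execute one transaction. No bounds checks: addresses are assumed valid. */
void step(state_t* s, const inst_t* prog) {
  inst_t   i    = prog[s->pc / 4];   /* fetch + decode instruction */
  uint32_t a    = s->x[i.rs1];       /* read register file */
  uint32_t b    = s->x[i.rs2];
  uint32_t imm  = (uint32_t)i.imm;   /* two's complement, wraps correctly */
  uint32_t next = s->pc + 4;         /* default PC update */
  uint32_t res  = 0;
  int      wr   = 0;                 /* does this instruction write rd? */

  switch (i.op) {
    case ADD:  res = a + b;                 wr = 1; break;  /* register arithmetic */
    case ADDI: res = a + imm;               wr = 1; break;
    case MUL:  res = a * b;                 wr = 1; break;
    case LW:   res = s->mem[(a + imm) / 4]; wr = 1; break;  /* read memory */
    case SW:   s->mem[(a + imm) / 4] = b;           break;  /* write memory */
    case JAL:  res = s->pc + 4; next = s->pc + imm; wr = 1; break;
    case JR:   next = a;                            break;
    case BNE:  if (a != b) next = s->pc + imm;      break;
  }
  if (wr && i.rd != 0) s->x[i.rd] = res;   /* write register file */
  s->pc = next;                            /* update PC */
}

/* Step until the PC leaves the program; returns the dynamic instruction count. */
int run(state_t* s, const inst_t* prog, int n_insts) {
  int count = 0;
  while (s->pc / 4 < (uint32_t)n_insts) {
    step(s, prog);
    count++;
  }
  return count;
}
```

Not every instruction performs every step: in this sketch only `lw` reads memory, only `sw` writes memory, and `sw`, `jr`, and `bne` never write the register file.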

## 2.2. TinyRV1 Simple Assembly Example

| Static Asm Sequence | Instruction Semantics |
|---------------------|-----------------------|
| loop: lw x1, 0(x2)  |                       |
| add x3, x3, x1      |                       |
| addi x2, x2, 4      |                       |
| bne x1, x0, loop    |                       |

## Worksheet illustrating processor functional-level model



## Table illustrating processor functional-level model

| Dynamic Asm Sequence |   |
|----------------------|---|
| lw x1, 0(x2)         |   |
| add x3, x3, x1       |   |
| addi x2, x2, 4       |   |
| bne x1, x0, loop     |   |
| lw x1, 0(x2)         |   |
| add x3, x3, x1       |   |

## 2.3. TinyRV1 Vector-Vector Add Assembly and C Program

C code for doing element-wise vector addition.

    void vvadd( int* dest, int* src0, int* src1, int n ) {

    }

Equivalent TinyRV1 assembly code. Arguments are passed in x10-x17, return value is stored to x10, return address is stored in x1, and temporaries are stored in x5-x7.

Note that we are ignoring the fact that our assembly code will not function correctly if n <= 0. Our assembly code would need an additional check before entering the loop to ensure that n > 0. Unless otherwise stated, we will assume in this course that array bounds are greater than zero to simplify our analysis.
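The missing check can be illustrated in C. The sketch below assumes the assembly loop tests its exit condition at the bottom (decrement `n`, then branch back while it is nonzero), as the vvadd loop used later in this topic does; the function name `vvadd_guarded` is hypothetical. A bottom-tested loop always runs its body at least once, which is why an explicit guard is needed when `n <= 0`.

```c
/* Bottom-tested vector-vector add with the extra n > 0 check.
 * The do/while mirrors a loop whose branch sits at the end of the body. */
void vvadd_guarded(int* dest, int* src0, int* src1, int n) {
  if (n <= 0) {
    return;                        /* the additional check described above */
  }
  do {
    *dest = *src0 + *src1;         /* load both sources, add, store result */
    dest++;                        /* bump the three pointers */
    src0++;
    src1++;
    n--;                           /* decrement the remaining element count */
  } while (n != 0);                /* branch back while elements remain */
}
```

Without the guard, a call with `n == 0` would still write one element and then decrement `n` to -1, so the `n != 0` test would not stop the loop where intended.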

## 2.4. TinyRV1 Mystery Assembly and C Program

What is the C code corresponding to the TinyRV1 assembly shown below? Assume assembly implements a function.

```
addi x5, x0, 0
loop:
lw x6, 0(x10)
bne x6, x12, foo
addi x10, x5, 0
jr x1
foo:
addi x10, x10, 4
addi x5, x5, 1
bne x5, x11, loop
addi x10, x0, -1
jr x1
```

# 3. Processor/Laundry Analogy

#### Processor

- Instructions are "transactions" that execute on a processor
- Architecture: defines the hardware/software interface
- Microarchitecture: how hardware executes sequence of instructions

#### Laundry

- Cleaning a load of laundry is a "transaction"
- Architecture: high-level specification, dirty clothes in, clean clothes out
- Microarchitecture: how laundry room actually processes multiple loads

## 3.1. Arch vs. µArch vs. VLSI Impl





## **ARM VLSI Implementation**



Samsung Exynos Octa



NVIDIA Tegra 2

## 3.2. Processor Microarchitectural Design Patterns



## Fixed Time Slot Laundry (Single-Cycle Processors)



## Variable Time Slot Laundry (FSM Processors)

[Figure: laundry timeline from 7pm to 1am with rows for Anne's, Ben's, Cathy's, and Dave's loads]

## Pipelined Laundry



## 3.3. Transaction Diagrams



- **W**: Washing
- **D**: Drying
- **F**: Folding
- **S**: Storing





## **Key Concepts**

- Transaction latency is the time to complete a single transaction
- Execution time or total latency is the time to complete a sequence of transactions
- Throughput is the number of transactions executed per unit time
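As a rough worked example of these three terms, assume (hypothetically, since the notes do not give stage durations) that each of the four laundry steps W, D, F, and S takes the same time $T$, and that $N = 4$ loads are processed. If one load must finish completely before the next begins:

$$\text{Transaction latency} = 4T, \qquad \text{Execution time} = N \times 4T = 16T, \qquad \text{Throughput} = \frac{N}{16T} = \frac{1}{4T}$$

With ideal pipelining, where a new load starts every $T$ and no step ever waits on another load:

$$\text{Transaction latency} = 4T, \qquad \text{Execution time} = (4 + N - 1)\,T = 7T, \qquad \text{Throughput} = \frac{N}{7T} = \frac{4}{7T}$$

Under these assumptions pipelining leaves the latency of a single load unchanged; it shortens the execution time of the sequence and raises throughput, which approaches $1/T$ as $N$ grows.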

# 4. Analyzing Processor Performance

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Avg Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$

- Instructions / program depends on source code, compiler, ISA
- Avg cycles / instruction (CPI) depends on ISA, microarchitecture
- Time / cycle depends upon microarchitecture and implementation

A functional-level model has no notion of cycles, so the CPI and cycle-time terms of our first-order performance equation drop out: with a functional-level model, execution time is measured simply as the number of dynamic instructions.

| Microarchitecture      | CPI         | Cycle Time |
|------------------------|-------------|------------|
| Single-Cycle Processor | 1           | long       |
| FSM Processor          | >1          | short      |
| Pipelined Processor    | $\approx 1$ | short      |



Students often confuse "Cycle Time" with the execution time of a sequence of transactions measured in cycles. "Cycle Time" is the clock period or the inverse of the clock frequency.
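The distinction can be made concrete with a minimal C sketch of the first-order performance equation. The function names and the choice of nanoseconds and MHz as units are assumptions for illustration only.

```c
/* Time/program = (instructions/program) x (avg cycles/instruction) x (time/cycle).
 * Units here are nanoseconds; any consistent time unit would do. */
double exec_time_ns(long dyn_insts, double cpi, double cycle_time_ns) {
  return (double)dyn_insts * cpi * cycle_time_ns;
}

/* Cycle time is the clock period, i.e. the inverse of the clock frequency.
 * A clock of f MHz has a period of 1000/f nanoseconds. */
double cycle_time_ns_from_mhz(double freq_mhz) {
  return 1000.0 / freq_mhz;
}

/* Execution time measured in cycles is a different quantity:
 * it is the instruction count times CPI, with no cycle time involved. */
double exec_time_cycles(long dyn_insts, double cpi) {
  return (double)dyn_insts * cpi;
}
```

For example, with hypothetical numbers, 1000 dynamic instructions at a CPI of 1 on a 100 MHz clock take 1000 cycles, and with a cycle time of 10 ns that is 10,000 ns of execution time.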

## Estimating dynamic instruction count

Estimate the dynamic instruction count for the vector-vector add example assuming n is 64.

```
loop:
lw x5, 0(x11)
lw x6, 0(x12)
add x7, x5, x6
sw x7, 0(x10)
addi x11, x11, 4
addi x12, x12, 4
addi x10, x10, 4
addi x13, x13, -1
bne x13, x0, loop
jr x1
```
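One way to arrive at the estimate, assuming x13 holds n = 64 on entry and the count includes the final `jr`: the loop body contains 9 instructions (two `lw`, one `add`, one `sw`, four `addi`, and the `bne`) and executes once per element, after which `jr x1` executes once.

$$\text{Dynamic instructions} = 9 \times n + 1 = 9 \times 64 + 1 = 577$$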

Estimate the dynamic instruction count for the mystery program assuming n is 64 and that we find a match on the final element.

```
addi x5, x0, 0
loop:
lw x6, 0(x10)
bne x6, x12, foo
addi x10, x5, 0
jr x1
foo:
addi x10, x10, 4
addi x5, x5, 1
bne x5, x11, loop
addi x10, x0, -1
jr x1
```
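The count for this program depends on which path each iteration takes, so it helps to tally the paths separately. The C sketch below does that tally; the function name `mystery_dyn_insts` and its parameters are hypothetical, and it assumes x11 holds n > 0 and that `match_idx` is the index of the first element equal to x12 (or -1 if there is none).

```c
/* Dynamic instruction count for the assembly above.
 * n         : number of elements (value in x11), assumed > 0
 * match_idx : index of the first matching element, or -1 if none */
int mystery_dyn_insts(int n, int match_idx) {
  int count = 1;                 /* addi x5, x0, 0 */
  for (int i = 0; i < n; i++) {
    count += 2;                  /* lw x6, 0(x10) ; bne x6, x12, foo */
    if (i == match_idx) {
      return count + 2;          /* addi x10, x5, 0 ; jr x1 */
    }
    count += 3;                  /* addi x10 ; addi x5 ; bne x5, x11, loop */
  }
  return count + 2;              /* addi x10, x0, -1 ; jr x1 */
}
```

With n = 64 and the match on the final element (index 63), the first 63 iterations each execute 5 instructions and the last executes 4, giving 1 + 63 × 5 + 4 = 320 dynamic instructions.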