# ECE 3400 Guest Lecture Design Principles and Methodologies in Computer Architecture

**Christopher Batten** 

Computer Systems Laboratory
School of Electrical and Computer Engineering
Cornell University

Spring 2015

#### **Application**

Gap too large to bridge in one step (but there are exceptions, e.g., a magnetic compass)

**Technology** 

**Application** Algorithm Computer Engineering Programming Language **Operating System** Instruction Set Architecture Microarchitecture Register-Transfer Level Gate Level Circuits Devices Technology

#### Sort an array of numbers

2.6.3.8.4.5 -> 2.3.4.5.6.8

#### **Insertion sort algorithm**

- 1. Find minimum number in input array
- 2. Move minimum number into output array
- 3. Repeat steps 1 and 2 until finished

#### C implementation of insertion sort

```
void isort( int b[], int a[], int n ) {
  for ( int idx, k = 0; k < n; k++ ) {
    int min = 100;
    for ( int i = 0; i < n; i++ ) {</pre>
      if ( a[i] < min ) {</pre>
        min = a[i];
        idx = i;
    b[k]
           = min;
    a[idx] = 100;
```

Computer Engineering

**Application** Algorithm Programming Language **Operating System** Instruction Set Architecture Microarchitecture Register-Transfer Level Gate Level Circuits Devices Technology

#### Mac OS X, Windows, Linux

**Design Principles** 

Handles low-level hardware management







#### MIPS32 Instruction Set

Instructions that machine executes

```
blez
      $a2, done
      $a7, $zero
move
li
      $t4, 99
      $a4, $a1
move
      $v1, $zero
move
li
      $a3, 99
     $a5, 0($a4)
lw
addiu $a4, $a4, 4
slt
      $a6, $a5, $a3
      $v0, $v1, $a6
movn
addiu $v1, $v1, 1
      $a3, $a5, $a6
movn
```

Application
Algorithm
Programming Language
Operating System
Instruction Set Architecture
Microarchitecture
Register-Transfer Level
Gate Level
Circuits
Devices
Technology

How data flows through system



Boolean logic gates and functions



Combining devices to do useful work



Transistors and wires



Silicon process technology



#### What is Computer Architecture?



#### **Application Requirements**

- Suggest how to improve architecture
- Provide revenue to fund development

Computer engineers provide feedback to guide application and technology research directions

#### **Technology Constraints**

- Restrict what can be done efficiently
- New technologies make new arch possible

In its broadest definition, computer engineering is the development of the abstraction/implementation layers that allow us to execute information processing applications efficiently using available manufacturing technologies

4 / 33

## Computer Architecture in the ECE/CS Curriculum

Application
Algorithm
Programming Language
Operating System
Instruction Set Architecture
Microarchitecture
Register-Transfer Level
Gate Level
Circuits
Devices
Technology

CS 4410 Operating Systems CS 4420 Compilers

ECE 3140 Embedded Systems

➤ ECE 4750 Computer Architecture

ECE 2300 Digital Logic & Computer Org ECE 4740 Digital VLSI Design

#### **Related Graduate Courses**

- ECE 5760 Advanced Microcontroller Design
- ECE 5750 Advanced Computer Architecture
- ECE 5730 Memory Systems
- ECE 5770 Resilient Computer Systems
- ECE 5745 Complex Digital ASIC Design
- ECE 5775 High-Level Design Automation

#### Logic, State, and Interconnect



Digital systems are implemented with three basic building blocks

- Logic to process data
- State to store data
- Interconnect to move data

#### **Processors, Memories, and Networks**



In computer engineering we more generally think of using

- Processors for computation
- Memories for storage
- Networks for communication

Application

**Agenda** 

Algorithm

PL

What is Computer Architecture?

OS

Design Example

ISA

Design Principles

RTL

μArch

Design Methodologies

Gates

Circuits

**Devices** 

Technology

# What do computer architects actually do?

Design Example

#### **General Science**

Discover truths about nature



Analyze results and draw conclusions

experiment

#### **Computer Engineering**

Explore design space for a new system

Design and model baseline system

> Ask question about system

Test with experiment

Analyze results and draw conclusions

Build prototype or real system

Design and model alternative system





# **Example Design Problem: On-Chip** Interconnection Network

- Intel Sandybridge E Server-Class Processor
- 435 mm<sup>2</sup> in 32 nm technology with 2.27B transistors running at several GHz
- Eight cores and eight memory banks with an on-chip ring network









## **Modeling in Computer Architecture**

#### **Computer Engineering**

Explore design space for a new system

Design and model baseline system

> Ask question about system

Test with experiment

Analyze results and draw conclusions

or real system

Build prototype Design and model alternative system

```
// rdy is OR of the AND of regs and grants
assign in rdy = | (reqs & grants);
reg [2:0] reqs;
always @(*) begin
  if ( in val ) begin
    // eject packet if it is for this tile
    if ( dest == p router id )
      regs = 3'b01\overline{0};
    // otherwise, just pass it along ring
    else
      regs = 3'b001;
  end else begin
    // if !val, don't request any ports
    regs = 3'b000;
  end
end
```

Design Principles

Verilog · SystemVerilog · VHDL C++ · SystemC Bluespec · Chisel · Python

# Student-Based "Model" of On-Chip Network

- Processor/Memory Student hands packets to appropriate router based on routing algorithm; waits to receive two packets from some other tile, and then raises hand.
- CCW/CW Network Routers Student holds onto packets and hands them to the correct channel; if packet for that router's tile arrives, then hand packet to processor/memory.
- CCW/CW Network Channels Student walks one packet at a time from upstream router to downstream router.



**Design Principles** 

## Results from Verilog Model of On-Chip Network

Design Example

#### Greedy Routing Algorithm

|       | Counter   | From      |                   |
|-------|-----------|-----------|-------------------|
| cycle | Clockwise | Processor | ${\tt Clockwise}$ |
| 1:    |           |           |                   |
| 2:    |           | SSSS      |                   |
| 3:    |           | .S $$ SS  |                   |
| 4:    | SSSS.     | S.S#.S    |                   |
| 5:    | S.SS      | SSS.#     |                   |
| 6:    | SS.SSS##  | .#S#SS    |                   |
| 7:    | ##.SS#SS  | SSSS.#    |                   |
| 8:    | SSS##S##  | ##.SS#SS  |                   |
| 9:    | ##SSSSSS  | SSS.#     |                   |
| 10:   | SSS#SSSS  | ##SS.#.S  |                   |
| 11:   | SSSS##SS  | ##SS##    |                   |
| 12:   | SSSSSS#   | ##S.#.#S  |                   |
| 13:   | S#SSSSS   | #S.##S##  |                   |
| 14:   | SS.S#SS#  | ##S#S.#S  |                   |
| 15:   | SSSSS#SS  | ####\$#.  |                   |
| 16:   | SS#SSSSS  | ##S##.#S  |                   |
| 17:   | SSSSSSS   | #####.#.  |                   |
| 18:   | SSSSSS#   | ##S####S  |                   |
| 19:   | SSSSS##S  | #SS#SSS#  |                   |

**Experiment Execution Time** 138 cycles

#### Adaptive Routing Algorithm

|       | Counter   | From      |           |
|-------|-----------|-----------|-----------|
| cycle | Clockwise | Processor | Clockwise |
| 1:    |           |           |           |
| 2:    |           | SSSS      |           |
| 3:    |           | .S $$ SS  |           |
| 4:    | SSSS.     | S.S#.S    |           |
| 5:    | S.SS      | SSS.#     |           |
| 6:    | .S.SS##S  | .#SSSS    | S.        |
| 7:    | #S.S.SS.  | SSSS      | SSS       |
| 8:    | S.S.#SSS  | SS.SS#S.  | .SS#.     |
| 9:    | SSSSS.SS  | S.S       | SSSSS.    |
| 10:   | #S#SSS.#  | SSSS.S.S  | SS#.S     |
| 11:   | S.S.#S#S  | .SS.S.    | S#SSSS    |
| 12:   | #S#SS.S.  | S.S.S.#.  | .S.S#SSS  |
| 13:   | S#S#####  | .S.SSSSS  | SSSSSS    |
| 14:   | #SSSSSS   | S.S#.     | S.SSSSS.  |
| 15:   | SS##S##S  | .SSS.SS.  | SSSSSS    |
| 16:   | SSSSSS#   | S.###.SS  | .SSSSS.S  |
| 17:   | SSSSSS#S  | #.###.S.  | S.SSS.S.  |
| 18:   | SSSS#SSS  | #SS##S#S  | S#SSS#.#  |
| 19:   | S#SSSSS   | #S###.S#  | .SSS.S.S  |
|       |           |           |           |

**Experiment Execution Time** 93 cycles (1.48 $\times$  speedup)

1 column/channel: . = idle, S = send packet, # = channel busy

# But how do we design/model something so complex?

Design Example

#### **Computer Engineering**

Explore design space for a new system

Design and model baseline system

> Ask question about system

Test with experiment

Analyze results and draw conclusions

Build prototype or real system

Design and model alternative system



Fighter Airplane: ~100,000 parts

**Intel Sandy Bridge E:** 

2.27 Billion transistors

Application

**Agenda** 

Algorithm

PL

What is Computer Architecture?

OS

Design Example

ISA

μArch

Design Principles

**RTL** 

Gates

Circuits

**Devices** 

Technology

Design Methodologies

- Decompose into components with well-defined interfaces
- A modular router design can be decomposed into a control unit and a datapath interconnected with control/status signals



# **Design Principle: Hierarchy**

- Recursively apply modularity principle
- A hierarchical network design can be decomposed into routers, which is in turn decomposed into a control unit and datapath, which is in turn decomposed into queues, muxes, and registers



#### **Design Principle: Encapsulation**

- Hide implementation details from interfaces
- An encapsulated router design can hide the latency of the router microarchitecture along with any details related to stalls due to full queues or arbitration



#### **Design Principle: Regularity**

Design Principles

- Leverage structure at various levels of abstraction
- A regular router design can exploit logical structure to enable a single router to be instantiated eight times
- A regular network and router design can exploit physical structure to simplify the chip floorplan and layout



## **Design Principle: Extensibility**

- Include mechanisms/hooks to simplify future changes
- An extensible network and router design can enable easily implementing ring networks with various numbers of routers
- An extensible network and router design can enable easily changing the routing algorithm



# **Design Principles in Computer Architecture**

- Modularity Decompose into components with well-defined interfaces
- Hierarchy Recursively apply modularity principle
- Encapsulation Hide implementation details from interfaces
- Regularity Leverage structure at various levels of abstraction
- Extensibility Include mechanisms/hooks to simplify future changes

Application

**Agenda** 

Algorithm

PL

What is Computer Architecture?

OS

ISA

Design Example

μArch

Design Principles

**RTL** 

Design Methodologies

Gates

Circuits

**Devices** 

Technology

## Waterfall Development Methodology

- Traditional sequential development methodology; each stage is completed before beginning the next stage
- For example, most of the design is completed before beginning the development, and most of the development is completed before beginning testing



# **Agile Development Methodology**

- Emerging iterative development methodology; move rapidly through all stages and then iterate back again through the stages
- For example, a component is designed, developed, and tested before moving onto another component; or a minimal yet complete system is designed, developed, and tested before incrementally adding features



# **Agile Argues for Incremental Development**



# **Agile Argues for Test-Driven Development**

- Test Types
  - Unit tests vs. integration tests
  - Directed vs. random tests
  - Whitebox vs. blackbox tests
- Goal is to write tests first then implement design to pass these tests



- Write tests for higher level of abstraction, refine implementation until passes tests, add new tests
- Capture design bugs with new tests
- Use continuous integration to ensure that design changes do not result in regressions

## Waterfall vs. Agile Development Methodologies

Design Example

| Waterfall Methodology        | <b>Agile Methodology</b>   |
|------------------------------|----------------------------|
| Highly critical requirements | Less critical requirements |
| Less experienced engineers   | More experienced engineers |
| Requirements change rarely   | Requirements change often  |
| Large engineering team       | Small engineering team     |

A hybrid approach that includes aspects of both the traditional waterfall methodology and the emerging agile methodology is an attractive option for future hardware design projects.

#### Tools for a Productive Development Methodology

- Command line (Linux, Max OS X)
- Powerful text editor (Emacs, VI)
- Scripting language (Python, Ruby, Perl)
- Unit testing framework (py.test, UVM/OVM)
- Distributed version control (Git, Mercurial)
- Cloud-based collaboration (GitHub, Bitbucket)
- Continuous integration (TravisCI)

```
. c/git-hub/cornell-ece5745/ece5745-labs/pymtl/build
cbatten-mac % py.test ..
                    ====== test session starts ==
platform darwin -- Python 2.7.5 -- py-1.4.26 -- pytest-2.6.4
plugins: xdist
collected 1120 items / 4 errors
../examples/gcd/GcdUnitCL_test.py .....
../examples/gcd/GcdUnitFL test.py .....
../examples/gcd/GcdUnitMsg test.py ...
../examples/gcd/GcdUnitRTL test.py FFFFF
../examples/gcd/gcd sim test.py .....FF.
../examples/regincr/RegIncr2stage_test.py ....
../examples/regincr/RegIncrNstage test.py ....
../examples/regincr/RegIncr_extra_test.py ....
../examples/regincr/RegIncr test.py .
../examples/sort/MinMaxUnit test.py ..
../examples/sort/SortUnitCL test.py .....
../examples/sort/SortUnitFL test.py ....
../examples/sort/SortUnitFlatRTL_test.py .....
../examples/sort/SortUnitFlatRTL v test.py .
../examples/sort/SortUnitStructRTL test.py .....
../examples/sort/sort_sim_test.py .....
../lab1-imul/IntMulFL test.py ......
../lab1-imul/IntMulFixedLatCL_test.py .....
../lab1-imul/IntMulFixedLatRTL test.py ......
../lab1-imul/IntMulMsq test.py ...
../lab1-imul/IntMulNstageCL_test.py .....
../lab1-imul/IntMulNstageRTL test.py ......
../lab1-imul/IntMulNstageStepRTL test.py ..
../lab1-imul/IntMulScycleRTL test.py ......
../lab1-imul/IntMulVarLatCL test.py ......
../lab1-imul/IntMulVarLatCalcShamtRTL_test.py ...
../lab1-imul/IntMulVarLatRTL test.py ......
../lab1-imul/imul_sim_test.py .....
../lab2-sort/SortXcelCL test.py ......
../lab2-sort/SortXcelFL_test.py ......
../lab2-sort/SortXcelRTL test.py ......
```

# **Tools for a Productive Development Methodology**

- Command line (Linux, Max OS X)
- Powerful text editor (Emacs, VI)
- Scripting language (Python, Ruby, Perl)
- Unit testing framework (py.test, UVM/OVM)
- Distributed version control (Git, Mercurial)
- Cloud-based collaboration (GitHub, Bitbucket)
- Continuous integration (TravisCI)



## **Tools for a Productive Development Methodology**

- Command line (Linux, Max OS X)
- Powerful text editor (Emacs, VI)
- Scripting language (Python, Ruby, Perl)
- Unit testing framework (py.test, UVM/OVM)
- Distributed version control (Git, Mercurial)
- Cloud-based collaboration (GitHub, Bitbucket)
- Continuous integration (TravisCI)



#### Application

# **Take-Away Points**

Design Example

Algorithm

PL

OS

ISA

μArch

**RTL** 

Gates

Circuits

**Devices** 

Technology

- Computer engineering is an iterative process involving designing and modeling systems, evaluating trade-offs between various design alternatives, which in turn motivates designing and modeling new systems
- Design principles such as modularity, hierarchy, modularity, encapsulation, regularity, and extensibility and design methodologies such as waterfall and agile hardware development can help manage the significant complexity inherent in building modern computing systems

#### **ECE 4750 Computer Architecture**

Design Example

http://www.csl.cornell.edu/courses/ece4750

- Part 1: Fundamental Processors FSM processors; pipelined processors; structural, data, and control hazards
- Part 2: Fundamental Memories memory technology; cache hierarchies; pipelined cache microarchitecture
- Part 3: Fundamental Networks torus and butterfly topologies; routing algorithms; flow control; pipelined router microarchitecture



- Part 4: Advanced Processors superscalar execution; branch prediction; out-of-order execution; register renaming; memory disambiguation; VLIW, vector, and multithreaded processors
- Part 5: Advanced Memories non-blocking caches; memory coherence, synchronization, consistency; memory translation, protection, virtualization