# Celerity: An Open Source RISC-V Tiered Accelerator Fabric

Tutu Ajayi<sup>‡</sup>, Khalid Al-Hawaj<sup>†</sup>, Aporva Amarnath<sup>‡</sup>, Steve Dai<sup>†</sup>, Scott Davidson<sup>\*</sup>, Paul Gao<sup>\*</sup>, Gai Liu<sup>†</sup>, Atieh Lotfi<sup>\*</sup>, Julian Puscar<sup>\*</sup>, Anuj Rao<sup>\*</sup>, Austin Rovinski<sup>‡</sup>, Loai Salem<sup>\*</sup>, Ningxiao Sun<sup>\*</sup>, Christopher Torng<sup>†</sup>, Luis Vega<sup>\*</sup>, Bandhav Veluri<sup>\*</sup>, Xiaoyang Wang<sup>\*</sup>, Shaolin Xie<sup>\*</sup>, Chun Zhao<sup>\*</sup>, Ritchie Zhao<sup>†</sup>,

Christopher Batten<sup>†</sup>, Ronald G. Dreslinski<sup>‡</sup>, Ian Galton<sup>\*</sup>, Rajesh K. Gupta<sup>\*</sup>, Patrick P. Mercier<sup>\*</sup>, Mani Srivastava<sup>§</sup>, Michael B. Taylor<sup>\*</sup>, Zhiru Zhang<sup>†</sup>

\* University of California, San Diego

† Cornell University

‡ University of Michigan

§ University of California, Los Angeles

Hot Chips 29, August 21, 2017

## High-Performance Embedded Computing

- Embedded workloads are abundant and evolving
  - Video decoding on mobile devices
    - Increasing bitrates, new emerging codecs
  - Machine learning (speech recognition, text prediction, ...)
    - Algorithm changes for better accuracy and energy performance
  - Wearable and mobile augmented reality
    - Still new, rapidly changing models and algorithms
  - Real-time computer vision for autonomous vehicles
    - Faster decision making, better image recognition
- We are in the post-Dennard scaling era
  - Cost of energy > Cost of area
- How do we attain extreme energy-efficiency while also maintaining flexibility for evolving workloads?



### Celerity: Chip Overview

- TSMC 16nm FFC
- 25mm<sup>2</sup> die area (5mm x 5mm)
- ~385 million transistors
- 511 RISC-V cores
  - 5 Linux-capable "Rocket Cores"
  - 496-core mesh tiled array "Manycore"
  - 10-core mesh tiled array "Manycore" (low voltage)
- 1 Binarized Neural Network Specialized Accelerator
- On-chip synthesizable PLLs and DC/DC LDO
  - Developed in-house
- 3 Clock domains
  - 400 MHz DDR I/O
  - 625 MHz Rocket core + Specialized accelerator
  - 1.05 GHz Manycore array
- 672-pin flip chip BGA package
- 9 months from PDK access to tape-out









## **Celerity Overview**

**Tiered Accelerator Fabric** 

Case Study: Mapping Flexible Image Recognition to a Tiered Accelerator Fabric

Meeting Aggressive Time Schedule

Conclusion









### Decomposition of Embedded Workloads



- General-purpose computation
  - Operating systems, I/O, etc.
- Flexible and energy-efficient computation
  - Exploits coarse- and fine-grain parallelism
- Fixed-function computation
  - Extremely strict energy efficiency requirements

Tiered accelerator fabric: an architectural template that maps embedded workloads onto distinct tiers to maximize energy efficiency while maintaining flexibility.

- **General-Purpose Tier**: general-purpose computation, control flow, and memory management
- **Massively Parallel Tier**: flexible exploitation of coarse- and fine-grain parallelism
- **Specialization Tier**: fixed-function specialized accelerators for energy efficiency requirements



### Mapping Workloads onto Tiers



### Celerity: General-Purpose Tier



#### General-Purpose Tier: RISC-V Rocket Cores

- Role of the General-Purpose Tier
  - General-purpose SPEC-style compute
  - Exception handling
  - Operating system (e.g. TCP/IP Stack)
  - Cached memory hierarchy for all tiers
- In Celerity
  - 5 Rocket Cores, generated from Chisel (<a href="https://github.com/freechipsproject/rocket-chip">https://github.com/freechipsproject/rocket-chip</a>)
    - 5-stage, in-order, scalar processor
    - Double-precision floating point
    - I-Cache: 16KB 4-way assoc.
    - D-Cache: 16KB 4-way assoc.
    - RV64G ISA
  - 0.97 mm<sup>2</sup> per Rocket core @ 625 MHz



## Celerity: Massively Parallel Tier



## Massively Parallel Tier: Manycore Array



- Role of the Massively Parallel Tier
  - Flexibility and improved energy efficiency over the general-purpose tier by massively exploiting parallelism
- In Celerity
  - 496 low power RISC-V Vanilla-5 cores
    - 5-stage, in-order, scalar cores
      - Fully distributed memory model
      - 4KB instruction memory per tile
      - 4KB data memory per tile
    - RV32IM ISA
    - 16x31 tiled mesh array
    - Open source!
  - 80 Gbps full duplex links between each adjacent tile
  - 0.024mm<sup>2</sup> per tile @ 1.05 GHz

## Manycore Array (Cont.)

- XY-dimension network-on-chip (NoC)
  - Unlimited deadlock-free communication
  - Manycore I/O uses same network
- Remote store programming model
  - Word writes into other tile's data memory
  - MIMD programming model
    - Fine-grain parallelism through high-speed communication between tiles
- Token-Queue architectural primitive
  - Reserves buffer space in remote core
  - Ensures buffer is filled before accessed
  - Tight producer-consumer synchronization
  - Streaming programming model
    - Producer-consumer parallelism
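
As a rough illustration of XY-dimension (dimension-ordered) routing on a tiled mesh, the sketch below moves a packet along X until the column matches and only then along Y. The `Coord` struct and the `next_hop` and `hop_count` functions are hypothetical names chosen for this example. They are not taken from the Celerity RTL, and the real router may differ in detail.

```cpp
#include <cstdlib>

// Hypothetical tile coordinate; names are illustrative, not from the Celerity RTL.
struct Coord {
    int x;
    int y;
};

// One step of XY (dimension-ordered) routing: correct X first, then Y.
// Returns the next tile on the path, or `cur` unchanged if already at `dst`.
inline Coord next_hop(Coord cur, Coord dst) {
    if (cur.x != dst.x) {
        cur.x += (dst.x > cur.x) ? 1 : -1;
    } else if (cur.y != dst.y) {
        cur.y += (dst.y > cur.y) ? 1 : -1;
    }
    return cur;
}

// Number of link traversals between two tiles (Manhattan distance).
inline int hop_count(Coord src, Coord dst) {
    return std::abs(dst.x - src.x) + std::abs(dst.y - src.y);
}
```

Because every packet finishes its X movement before turning into Y, no packet ever turns from Y back into X. This ordering is the usual argument for why dimension-ordered routing avoids cyclic channel dependencies on a mesh.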



## Manycore Array (Cont.)

|                                         | Configuration                                                | Normalized<br>Area (32nm)                             | Area<br>Ratio |
|-----------------------------------------|--------------------------------------------------------------|-------------------------------------------------------|---------------|
| Celerity Tile<br>@16nm                  | D-MEM = 4KB<br>I-MEM = 4KB                                   | 0.024 × (32/16)<sup>2</sup><br>= 0.096 mm<sup>2</sup> | 1x            |
| OpenPiton Tile<br>@32nm                 | L1 D-Cache = 8KB<br>L1 I-Cache = 8KB<br>L1.5/L2 Cache = 40KB | 1.17 mm<sup>2</sup> [1]                               | 12x           |
| Raw Tile<br>@180nm                      | L1 D-Cache = 32KB<br>L1 I-SRAM = 96KB                        | 16.0 × (32/180)<sup>2</sup><br>= 0.506 mm<sup>2</sup> | 5.25x         |
| MIAOW GPU<br>Compute Unit Lane<br>@32nm | VRF = 256KB<br>SRF = 2KB                                     | 15.0 / 16<br>= 0.938 mm<sup>2</sup> [2]               | 9.75x         |



## Celerity: Specialization Tier



#### Specialization Tier: Binarized Neural Network

- Role of the Specialization Tier
  - Achieves high energy efficiency through specialization
- In Celerity
  - Binarized Neural Network (BNN)
    - Energy-efficient convolutional neural network implementation
    - 13.4 MB model size with 9 total layers
      - 1 Fixed-point convolutional layer
      - 6 Binary convolutional layers
      - 2 Dense fully connected layers
    - Batch norm calculations done after each layer
  - 0.356 mm<sup>2</sup> @ 625 MHz
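
In the BNN formulation of Courbariaux et al. [4], weights and activations are constrained to +1 or -1, so a dot product can be computed with XNOR and popcount instead of multiply-accumulate. The sketch below shows that arithmetic for 64 elements packed one bit per element. The bit encoding and the function names are assumptions made for this example, and it does not describe how the Celerity compute unit is actually organized.

```cpp
#include <bitset>
#include <cstdint>

// Dot product of two 64-element {-1,+1} vectors packed one bit per element.
// Assumed encoding: bit = 1 means +1, bit = 0 means -1.
inline int binary_dot64(std::uint64_t activations, std::uint64_t weights) {
    // XNOR marks the positions where the signs agree (product = +1).
    const std::uint64_t agree = ~(activations ^ weights);
    const int matches = static_cast<int>(std::bitset<64>(agree).count());
    // Each match contributes +1 and each mismatch contributes -1.
    return 2 * matches - 64;
}

// Binarize a pre-activation sum back to one bit (sign function, 0 mapped to +1).
inline bool binarize(int sum) {
    return sum >= 0;
}
```

Replacing multipliers with bitwise logic in this way is one reason a binarized network is attractive for a fixed-function, energy-constrained accelerator.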

#### Parallel Links Between Tiers












## Case Study: Mapping Flexible Image Recognition to a Tiered Accelerator Fabric



#### Three steps to map applications to tiered accelerator fabric:

- Step 1. Implement the algorithm using the general-purpose tier
- Step 2. Accelerate the algorithm using either the massively parallel tier **OR** the specialization tier
- Step 3. Improve performance by cooperatively using both the specialization **AND** the massively parallel tier



## Step 1: Algorithm to Application **Binarized Neural Networks**



- Training usually uses floating point, while inference usually uses lower precision weights and activations (often 8-bit or lower) to reduce implementation complexity
- Rastegari et al. [3] and Courbariaux et al. [4] have recently shown that single-bit precision weights and activations can achieve an accuracy of 89.8% on CIFAR-10
- Performance target requires ultra-low latency (batch size of one) and high throughput (60 classifications/second)

## Step 1: Algorithm to Application Characterizing BNN Execution



- Using just the general-purpose tier is 200x slower than performance target
- Binarized convolutional layers consume over 97% of dynamic instruction count
- Perfect acceleration of just the binarized convolutional layers is still 5x slower than performance target
- Perfect acceleration of all layers using the massively parallel tier could meet performance target but with significant energy consumption








## **BNN Specialized Accelerator**



1. The accelerator is configured to process a layer through RoCC command messages
2. The memory unit starts streaming the weights into the accelerator and unpacking the binarized weights into the appropriate buffers
3. The binary convolution compute unit processes input activations and weights to produce output activations

## Step 2: Application to Accelerator **Design Methodology**



```
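// DMA request thread: expands one DMA message into per-word memory requests.
void bnn::dma_req() {
  while ( 1 ) {
    // Block until the next DMA message arrives.
    DmaMsg msg = dma_req.get();
    for ( int i = 0; i < msg.len; i++ ) {
      HLS_PIPELINE_LOOP( HARD_STALL, 1 );
      int    req_type = 0;
      word_t data     = 0;
      addr_t addr     = msg.base + i*8;  // address advances by 8 per request
      if ( type == DMA_TYPE_WRITE ) {
        data     = msg.data;
        req_type = MemReqMsg::WRITE;
      } else {
        req_type = MemReqMsg::READ;
      }
      memreq.put( MemReqMsg( req_type, addr, data ) );
    }
    // Signal completion once every request has been issued.
    dma_resp.put( DMA_REQ_DONE );
  }
}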
```

#### **Design Methodology**



- HLS enabled quick implementation of an accelerator for an emerging algorithm
  - Algorithm to initial accelerator in weeks
  - Rapid design-space exploration
- HLS greatly simplified timing closure
  - Improved clock frequency by 43% in a few days
  - Easily mitigated long paths at the interfaces with latency insensitive interfaces and pipeline register insertion
- HLS tools are still evolving
  - Six weeks to debug a tool bug with data-dependent accesses to multi-dimensional arrays

## **General-Purpose Tier for Weight Storage**






- The BNN specialized accelerator can use one of the Rocket cores' caches to load every layer's weights, but this is inefficient due to off-chip traffic
- A large L2 or more storage in the BNN specialized accelerator could improve performance
- Instead, weights can be stored in the massively parallel tier
- Each core in the massively parallel tier executes a remote-load-store program to orchestrate sending weights to the specialization tier via a hardware FIFO
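
A software-only sketch of this idea is shown below: weights are split into contiguous slices, each slice lives in one tile's local data memory, and the tiles stream their slices in order into a FIFO that feeds the accelerator. Everything here is hypothetical (the `Tile` struct, the `WeightFifo` alias, the contiguous partitioning, and the 32-bit word size). It models only the data movement, not the actual manycore program or FIFO hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Software stand-in for the hardware FIFO into the specialization tier (hypothetical).
using WeightFifo = std::queue<std::uint32_t>;

// A manycore tile holding one slice of a layer's weights in its local data memory.
struct Tile {
    std::vector<std::uint32_t> local_weights;

    // Stream the whole slice, one word at a time, into the FIFO.
    void stream_to(WeightFifo& fifo) const {
        for (std::uint32_t word : local_weights) {
            fifo.push(word);
        }
    }
};

// Split `weights` into contiguous slices, one per tile.
// Trailing tiles may receive shorter or empty slices.
inline std::vector<Tile> partition_weights(const std::vector<std::uint32_t>& weights,
                                           std::size_t num_tiles) {
    assert(num_tiles > 0);
    std::vector<Tile> tiles(num_tiles);
    const std::size_t per_tile = (weights.size() + num_tiles - 1) / num_tiles;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        tiles[i / per_tile].local_weights.push_back(weights[i]);
    }
    return tiles;
}
```

Streaming the tiles in index order reproduces the original weight order at the FIFO output, which is what the accelerator's memory unit would need in this simplified model.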

# Performance Benefits of Cooperatively Using the Massively Parallel and the Specialization Tiers

|                           | General-Purpose<br>Tier | Specialization Tier | Specialization +<br>Massively Parallel<br>Tiers |
|---------------------------|-------------------------|---------------------|-------------------------------------------------|
| Runtime per<br>Image (ms) | 4,024                   | 5.8                 | 3.3                                             |
| Speedup                   | 1x                      | ~700x               | ~1,220x                                         |

| Configuration                             | How the runtime was obtained                                                                                    |
|-------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| General-Purpose Tier                      | Software implementation assuming ideal performance estimated with an optimistic one instruction per cycle       |
| Specialization Tier                       | Full-system RTL simulation of the BNN specialized accelerator running with a frequency of 625 MHz               |
| Specialization + Massively Parallel Tiers | Full-system RTL simulation of the BNN specialized accelerator with the weights being streamed from the manycore |










# How to make a complex SoC?

- Reuse
  - Open-source and third-party IP
  - Extensible and parameterizable designs
- Modularize
  - Agile design and development
  - Early interface specification
- Automate
  - Abstracted implementation and testing flows
  - Highly automated design








## Reuse

- Basejump: Open-source polymorphic HW components
  - Design libraries: BSG IP Cores, BGA Package, I/O Pad Ring
  - Test infrastructure: Double Trouble PCB, Real Trouble PCB
  - Available at <u>bjump.org</u>
- RISC-V: Open-source ISA
  - Rocket core: high performance RV64G in-order core
  - Vanilla-5: high efficiency RV32IM in-order core
- RoCC: Open-source on-chip interconnect
  - Common interface to connect all 3 compute tiers
- Extensible designs
  - BSG Manycore: fully parameterized RTL and APR scripts
- Third Party IP
  - ARM Standard Cells, I/O cells, RF/SRAM generators







## Modularize

- Agile design
  - Hierarchical design to reduce tool time
  - Optimize designs at the component level
  - Black-box designs for use across teams
  - SCRUM-like task management
  - Sprinting to "tape-ins"
- Establish interfaces early
  - Establish design interfaces early (RoCC, Basejump)
  - Use latency-insensitive interfaces to remove cross-module timing dependencies
  - Identify specific deliverables between different teams (esp. analog→digital)
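
A latency-insensitive interface can be modeled in software as a bounded channel with a valid/ready handshake: the producer only sends when there is space, and the consumer only receives when data is present, so neither side depends on the other's exact timing. The `Channel` class below is a hypothetical illustration of that contract, not the RoCC or Basejump interface definition.

```cpp
#include <cstddef>
#include <deque>

// Minimal model of a latency-insensitive (valid/ready) channel: a bounded queue.
// The producer may only send when ready(); the consumer may only receive when valid().
template <typename T>
class Channel {
public:
    explicit Channel(std::size_t depth) : depth_(depth) {}

    bool ready() const { return buf_.size() < depth_; }  // space available
    bool valid() const { return !buf_.empty(); }         // data available

    // Returns false (back-pressure) instead of dropping data when full.
    bool try_put(const T& v) {
        if (!ready()) return false;
        buf_.push_back(v);
        return true;
    }

    // Returns false when no data is available yet.
    bool try_get(T& out) {
        if (!valid()) return false;
        out = buf_.front();
        buf_.pop_front();
        return true;
    }

private:
    std::size_t depth_;
    std::deque<T> buf_;
};
```

Because correctness depends only on the handshake, extra buffering or pipeline registers can be added between modules without changing their behavior, which is what makes such interfaces convenient for timing closure across teams.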





## Automate

- Abstract implementation and testing flows
  - Develop implementation flow adaptable to arbitrary designs
  - Use validated IP components to focus only on integration testing
  - Use high-level testing abstractions to speed up test development (PyMTL)
- Automate design using tools
  - Use High-Level Synthesis to speed up design-space exploration and implementation
  - Use digital design flow to create traditionally analog components





# Synthesizable PLL

### Reuse

- Interfaces and some components reused from previous designs

### Modularize

- Controlled via SPI-like interface
- Isolated voltage domain for all 3 PLLs to remove power rail noise

### Automate

- Fully synthesized using digital standard cells
- Manual placement of ring oscillators, auto-placement of other logic
- Very easy to create additional DCOs that cover additional frequency ranges



| Parameter        | Value                 |
|------------------|-----------------------|
| Area             | 0.0059 mm<sup>2</sup> |
| Frequency range* | 20 - 3000 MHz         |
| Frequency step*  | 2%                    |
| Period jitter*   | 2.5 ps                |

<sup>\*</sup> Collected via SPICE on extracted netlist

# Synthesizable LDO

### Reuse

- Taped out and tested in 65nm [5], waiting on 16nm results

- Fully synthesized controller
- Custom power switching transistors
- Post-silicon tunable
- Compared to conventional N-bit digital LDOs:
  - 2<sup>N</sup>/N times smaller
  - 2<sup>N</sup>/N times faster
  - 2<sup>N</sup> times lower power
  - 2<sup>2N</sup>/N better FoM
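
For a sense of scale, the ratios above can be evaluated for an illustrative resolution of N = 8. This value is chosen here only as an example, since the slides do not state N.

```latex
\begin{aligned}
\text{area, speed:}\quad & \frac{2^{N}}{N} = \frac{2^{8}}{8} = 32\times \\
\text{power:}\quad & 2^{N} = 2^{8} = 256\times \\
\text{FoM:}\quad & \frac{2^{2N}}{N} = \frac{2^{16}}{8} = 8192\times
\end{aligned}
```

The FoM factor numerically equals the product of the speed and power factors, since (2^N / N) · 2^N = 2^(2N) / N.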



| Parameter       | Value                   |
|-----------------|-------------------------|
| Controller Area | < 0.0023 mm<sup>2</sup> |
| Decap Area      | < 0.0741 mm<sup>2</sup> |
| Voltage Range   | 0.45 – 0.85 V           |
| Peak Efficiency | > 99.8 %                |










## Conclusion

- Tiered accelerator fabric: an architectural template for embedded workloads that enables performance gains and energy savings without sacrificing programmability

- Celerity: a case study for accelerating low-latency, flexible image recognition using a binarized neural network that illustrates the potential for tiered accelerator fabrics
- Reuse, modularization, and automation enabled an academic-only group to tape out a 16nm ASIC with 511 RISC-V cores and a specialized binarized neural network accelerator in only 9 months

# Acknowledgements

This work was funded by DARPA under the Circuit Realization At Faster Timescales (CRAFT) program



Special thanks to Dr. Linton Salmon for program support and coordination