## An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

### Tao Chen, Shreesha Srinath **Christopher Batten**, G. Edward Suh

Computer Systems Laboratory School of Electrical and Computer Engineering Cornell University

> 51st Int'l Symp. on Microarchitecture Fall 2018

Evaluation

### Accelerating *Static* Parallel Algorithms on Reconfigurable Hardware

| <pre>for (int i=0; i<n; +="" b[i];<="" c[i]="a[i]" i++)="" pre=""></n;></pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |                            | Emerging CPU+FPGA platforms<br>(Xilinx Zynq, Altera Cyclone SoC)                                                                                              |  |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| Image: Weight of the second | High<br>∟evel<br>Synthesis | HLS maps parallelism statically to<br>highly pipelined and parallel PEs                                                                                       |  |
| General<br>Purpose<br>CPU                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | onfig<br>ware              | <pre>kernel bid vvadd(global int* c,    global int* a,    global int* b, int n ) int id = get_global_id(0); if ( id &lt; n )     c[id] = a[id] + b[id];</pre> |  |
| Shared Mem S                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | ys                         |                                                                                                                                                               |  |

# Programmers are increasingly moving from thread- to task-centric programming



- Task-parallel programming frameworks enable creating tasks dynamically as the program executes
  - Intel Cilk Plus, Intel C++ TBB, Microsoft's .NET TPL, Java's Fork/Join, OpenMP
  - Benefits of this approach:
    - hierarchical data structures
    - divide-and-conquer algos
    - adaptive algorithms
  - ▷ arbitrary nesting, composition
  - automatic load balancing
  - efficient in theory and practice

# Accelerating *Dynamic* Parallel Algorithms on Reconfigurable Hardware



**Motivation** 

**Computation Model** 

**Accelerator Architecture** 

**Design Methodology** 

Evaluation

# Accelerating *Dynamic* Parallel Algorithms on Reconfigurable Hardware



### Motivation

## **Computation Model**

**Accelerator Architecture** 

Design Methodology



Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

```
Motivation • Computation Model • Accelerator Architecture Design Methodology Evaluation
```

#### **Example of Explicit Continuation Passing w/ Cilk**

```
int fib( int n )
                          task fib( cont int k, int n )
{
                             if (n < 2)
  if (n < 2)
                               send argument( k, n );
   return n;
  int x = \text{spawn fib}(n-1); else {
  int y = fib(n-2);
                               cont int x, y;
                               spawn next sum( k, ?x, ?y );
  sync;
                               spawn fib(x, n-1);
 return x + y;
}
                                          fib( y, n-2 );
                               spawn
                             }
                           }
                          task sum( cont int k, int x, int y )
                           {
                            send_argument( k, x+y );
                           }
```

- Cilk-1 used explicit continuation passing (JPDC'96)
- Cilk-5 used call/return semantics for parallelism (PLDI'98)
- Explicit continuation passing is an elegant match for hardware

# Accelerating *Dynamic* Parallel Algorithms on Reconfigurable Hardware



Motivation

**Computation Model** 

### Accelerator Architecture

**Design Methodology** 





















Accelerator Architecture

C. Batten

Motivation

**Computation Model** 

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

10 / 18

Evaluation

Design Methodology



Accelerator Architecture

Motivation











Accelerator Architecture

Design Methodology

Evaluation

Motivation

**Computation Model** 









#### "Lite" Architectural Template



Evaluation

# Accelerating *Dynamic* Parallel Algorithms on Reconfigurable Hardware



Motivation

**Computation Model** 

**Accelerator Architecture** 

Design Methodology



Evaluation

### **Design Methodology**





# Accelerating *Dynamic* Parallel Algorithms on Reconfigurable Hardware



Motivation

**Computation Model** 

**Accelerator Architecture** 

Design Methodology

| Motivation   | Computation Model | Accelerator Architecture Design Methodolog | yy • Evaluation • |  |  |
|--------------|-------------------|--------------------------------------------|-------------------|--|--|
| Applications |                   |                                            |                   |  |  |
| Name         | Suite             | Description                                | Pattern           |  |  |
| nw           | in-house          | Needleman-Wunsch Algorithm                 | data-flow         |  |  |
| quicksort    | in-house          | quicksort algorithm                        | fork/join         |  |  |
| cilksort     | Cilk apps         | parallel merge sort algorithm              | fork/join         |  |  |
| queens       | Cilk apps         | N-queens problem                           | fork/join         |  |  |
| knapsack     | Cilk apps         | 0-1 knapsack problem                       | fork/join         |  |  |
| uts          | UTS               | unbalanced tree search                     | fork/join         |  |  |
| bbgemm       | MachSuite         | blocked matrix multiplication              | data-parallel     |  |  |
| bfsqueue     | MachSuite         | breadth first search                       | data-parallel     |  |  |
| spmvcrs      | MachSuite         | sparse matrix-vector mult                  | data-parallel     |  |  |
| stencil2d    | MachSuite         | 3D stencil computation                     | data-parallel     |  |  |

- Optimized software baseline implemented using Intel Cilk Plus with ARM NEON auto-vectorization
- C++ application driver/worker implemented with design methodology

Motivation

### **Current and Future CPU+FPGA Platforms**





#### Current Zynq-7000 SoC Platform

- Prototype using Zedboard
- ▷ Two 667MHz ARM Cortex-A9 cores
- Xilinx 7-series integrated FPGA fabric (modest capacity, 142MHz)
- Xcel uses stream buffers
- $\,\triangleright\,$  Lower BW: FPGA  $\leftrightarrow$  coherent mem sys

#### Future CPU+FPGA Platform

- Simulation study using gem5
- Eight 1GHz ARM 4-way OOO cores
- Xilinx 7-series integrated FPGA fabric (larger capacity, 200MHz)
- Xcel uses coherent 2x-pumped 32KB L1\$
- $\triangleright \ \ \text{Higher BW: FPGA} \leftrightarrow \text{coherent mem sys}$

Motivation

Evaluation •

### Speedup on Future CPU+FPGA Platform



FlexArch is  $4 \times$  faster than 8 cores,  $24 \times$  faster than 1 core (geo mean)

- FlexArch is faster than LiteArch for more dynamic algorithms (load balancing)
- See paper for scalability, resource usage, energy efficiency, cache size study

Evaluation

# Accelerating *Dynamic* Parallel Algorithms on Reconfigurable Hardware



- Importance of exploring techniques for accelerating more complex applications on reconfigurable hardware
- We have described a promising approach to accelerate dynamic parallel algorithms
  - computation model using explicit continuation passing
  - accelerator architecture based on work stealing
  - design methodology combining a PyMTL-based architectural template with high-level synthesis