#### The Case for Malleable Stream Architectures

Christopher Batten<sup>1,3</sup>, Hidetaka Aoki<sup>2</sup>, Krste Asanović<sup>3</sup>

<sup>1</sup> Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA

<sup>2</sup> Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan

<sup>3</sup> Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA

Workshop on Streaming Systems, November 8, 2008

## Key Characteristics of Stream Programs



#### **Types of Parallelism**

- DLP : Data-Level Parallelism
- KLP : Kernel-Level Parallelism (task-level parallelism)
- KLP : Kernel-Level Parallelism (pipeline parallelism)

#### **Other Characteristics**

- Data-dependent control flow
- Communication patterns
- Real-time constraints

## Mapping Stream Programs to Stream Architectures



Temporal Data-Level Parallelism

Temporal Kernel-Level Parallelism


## Comparison of Stream Program Mappings



#### Temporal DLP

- Temporally amortize control & synchronization overheads
- Efficiently saturate off-chip memory bandwidth
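
As a rough illustration of temporal DLP, the sketch below runs each kernel over the whole stream on a single core before switching to the next kernel. The kernels `scale` and `offset` and the `kernel_switches` counter are hypothetical, chosen only to make the amortization of control overhead visible; they do not come from the talk.

```python
def scale(x):
    """Hypothetical kernel: multiply an element by 2."""
    return 2 * x


def offset(x):
    """Hypothetical kernel: add 1 to an element."""
    return x + 1


def run_temporal_dlp(kernels, stream):
    """Run each kernel over the entire stream before moving to the next kernel.

    Returns the output stream and the number of kernel switches, used here
    as a stand-in for control and synchronization overhead.
    """
    kernel_switches = 0
    for kernel in kernels:
        kernel_switches += 1  # one switch is amortized over every element
        stream = [kernel(x) for x in stream]  # the whole intermediate stream is buffered
    return stream, kernel_switches


output, switches = run_temporal_dlp([scale, offset], list(range(8)))
print(output)    # [1, 3, 5, 7, 9, 11, 13, 15]
print(switches)  # 2
```

The switch count depends only on the number of kernels, not on the stream length, at the cost of buffering each intermediate stream in full.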



#### Spatial DLP

- Spatially amortize control & synchronization overheads
- Efficiently saturate off-chip memory bandwidth
- Trivial load-balancing assuming no data-dependent control flow
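
The load-balancing caveat can be made concrete with a small sketch. It assumes a round-robin split of elements across cores and two made-up cost models (`uniform_cost` and `data_dependent_cost`); real partitioning schemes and kernel costs will differ.

```python
def partition_round_robin(stream, num_cores):
    """Give core c every num_cores-th element, starting at index c."""
    return [stream[c::num_cores] for c in range(num_cores)]


def work_per_core(chunks, cost):
    """Total cost of the elements assigned to each core."""
    return [sum(cost(x) for x in chunk) for chunk in chunks]


def uniform_cost(x):
    """Every element takes the same time (no data-dependent control flow)."""
    return 1


def data_dependent_cost(x):
    """Hypothetical: multiples of 4 take a slow path that costs 10x more."""
    return 10 if x % 4 == 0 else 1


chunks = partition_round_robin(list(range(16)), num_cores=4)
print(work_per_core(chunks, uniform_cost))         # [4, 4, 4, 4]
print(work_per_core(chunks, data_dependent_cost))  # [40, 4, 4, 4]
```

With uniform cost an even split of elements is also an even split of work. The second cost model is a deliberately bad case in which every slow element lands on the same core, which then runs ten times longer than the others.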



#### Temporal KLP

- Exploit producer-consumer locality to reduce buffering
- Reduce per-element latency
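
One plausible reading of temporal KLP is that a single core time-multiplexes the kernels, pushing each element through the whole kernel chain before starting the next element. The sketch below follows that reading; the kernels `scale` and `offset` and the `kernel_switches` counter are hypothetical and only serve to expose the trade-off.

```python
def scale(x):
    """Hypothetical kernel: multiply an element by 2."""
    return 2 * x


def offset(x):
    """Hypothetical kernel: add 1 to an element."""
    return x + 1


def run_temporal_klp(kernels, stream):
    """Push one element through every kernel before starting the next element."""
    output = []
    kernel_switches = 0
    for x in stream:
        for kernel in kernels:
            kernel_switches += 1  # the core switches kernels for every element
            x = kernel(x)         # only one intermediate value is live at a time
        output.append(x)          # the element is done once it leaves the last kernel
    return output, kernel_switches


output, switches = run_temporal_klp([scale, offset], list(range(8)))
print(output)    # [1, 3, 5, 7, 9, 11, 13, 15]
print(switches)  # 16
```

Only a single intermediate value is buffered between kernels and each element finishes without waiting for the rest of the stream, but the number of kernel switches now grows with the stream length.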

| A<sub>0</sub> | B<sub>0</sub> |               |               |
|---------------|---------------|---------------|---------------|
| A<sub>1</sub> | B<sub>1</sub> | C<sub>0</sub> |               |
| A<sub>2</sub> | B<sub>2</sub> | C<sub>1</sub> | D<sub>0</sub> |
| A<sub>3</sub> | B<sub>3</sub> | C<sub>2</sub> | D<sub>1</sub> |
| A<sub>4</sub> | B<sub>4</sub> | C<sub>3</sub> | D<sub>2</sub> |
| A<sub>5</sub> | B<sub>5</sub> | C<sub>4</sub> | D<sub>3</sub> |
| A<sub>6</sub> | B<sub>6</sub> | C<sub>5</sub> | D<sub>4</sub> |
| A<sub>7</sub> | B<sub>7</sub> | C<sub>6</sub> | D<sub>5</sub> |
| A<sub>8</sub> | B<sub>8</sub> | C<sub>7</sub> | D<sub>6</sub> |

#### Spatial KLP

- Exploit producer-consumer locality to reduce buffering
- Reduce per-element latency
- Easy to map data-dependent control flow
- Good utilization for stateful kernels
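
A spatial KLP mapping places each kernel on its own core, so downstream kernels start later than their producers and the cores fill up like a pipeline. The sketch below computes such a staggered schedule. The kernel graph (A and B independent, C consuming both, D consuming C) and the one-step-per-stage timing are assumptions made for illustration.

```python
def pipeline_schedule(stage_of, num_steps):
    """Return, per time step, the element index each kernel works on (or None).

    stage_of maps a kernel name to its pipeline depth: a kernel at depth d
    starts d steps after the first kernel because it waits for its producers.
    """
    schedule = []
    for t in range(num_steps):
        row = {}
        for name, depth in stage_of.items():
            row[name] = t - depth if t >= depth else None
        schedule.append(row)
    return schedule


# Assumed kernel graph: A and B are independent, C consumes both, D consumes C.
stage_of = {"A": 0, "B": 0, "C": 1, "D": 2}
for row in pipeline_schedule(stage_of, num_steps=4):
    print(" ".join(f"{k}{i}" if i is not None else "--" for k, i in row.items()))
# A0 B0 -- --
# A1 B1 C0 --
# A2 B2 C1 D0
# A3 B3 C2 D1
```

Once the pipeline is full, every core is busy on a different element, and only the elements in flight between adjacent kernels need to be buffered.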

## **Example Stream Processors**

#### **NVIDIA GTX 200**



- 30 Cores
- 8 Lane Vector Units
- Inter-kernel buffering usually stored in DRAM
- Difficult to exploit KLP spatially

#### SPI Storm-1



- 1 Core
- 16 Lane Vector Unit
- 32b Subword SIMD
- Inter-kernel buffering blocked in stream register file
- Cannot exploit KLP spatially

#### Tilera TILE64



- 64 Cores
- 32b Subword SIMD
- Inter-kernel buffering routed through static network

## Our Position: Exploit DLP First Then KLP

**Programmers and architects should first leverage DLP execution whenever possible**

Energy Efficiency • Memory Bandwidth Utilization • Load Balancing

**Programmers and architects must still be able to efficiently exploit KLP, but only after DLP**

Minimize Buffering • Reduce Latency • Data-Dependent Conditionals

## Maven: Malleable Array of Vector-Thread Engines

