# **Asymmetry-Aware Work-Stealing Runtimes**

# Christopher Torng, Moyang Wang, and Christopher Batten

School of Electrical and Computer Engineering, Cornell University

> Cornell IAP Workshop, October 2015



**Cornell University** 

**Christopher Batten** 

















- Work stealing has good performance, space requirements, and communication overheads in both theory and practice
- Supported in many popular concurrency platforms, including Intel's Cilk Plus, Intel's C++ TBB, Microsoft's .NET Task Parallel Library, Java's Fork/Join Framework, and OpenMP

# Static Asymmetry: Heterogeneous Multicore Systems

[Figure: Samsung Exynos Octa Mobile Processor, showing big ARM cores and little ARM cores]

Processor Architecture and Circuit Design: A Marginal Cost Approach," ISCA, 2010.

# **Dynamic Asymmetry: Voltage/Frequency Scaling**



How can we use asymmetry awareness to improve the performance and energy efficiency of a work-stealing runtime?





# Talk Outline

Motivation

**First-Order Modeling**

Asymmetry-Aware Work-Stealing Runtimes

Evaluation






4B4L system with four active big cores and four active little cores




4B4L system with two active big cores and two active little cores




4B4L system with either one active big core OR one active little core



# Talk Outline

Motivation

First-Order Modeling

**Asymmetry-Aware Work-Stealing Runtimes**

Evaluation

# **AAWS Runtime Techniques**



### **Work-Pacing**

When all cores are busy, tune the voltage and frequency (V&F) of little and big cores (based on offline marginal utility analysis) to increase performance and energy efficiency


### **Work-Sprinting**

Rest waiting cores to generate power slack, then use that slack to tune V&F of busy cores (based on offline marginal utility analysis)


### **Work-Mugging**

Move work from little to big cores by allowing big cores to preemptively mug tasks from little cores
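As a rough illustration of when each technique applies, the following minimal sketch maps a snapshot of per-core activity onto the three techniques. The names `Core`, `Kind`, and `choose_actions` are hypothetical, and the trigger for work-mugging (a waiting big core while a little core is still busy) is an assumption made for this sketch; the slides do not specify the runtime's actual policy.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    BIG = "big"
    LITTLE = "little"


@dataclass
class Core:
    kind: Kind
    busy: bool  # activity hint: True while the core is running a task


def choose_actions(cores):
    """Return the names of the techniques that apply to this activity snapshot."""
    busy = [c for c in cores if c.busy]
    waiting = [c for c in cores if not c.busy]
    if not busy:
        return set()  # nothing is running, so there is nothing to tune
    if not waiting:
        # All cores busy: rebalance V&F between big and little cores.
        return {"work-pacing"}
    # Some cores are waiting: rest them and spend the power slack on busy cores.
    actions = {"work-sprinting"}
    # Assumed trigger: a waiting big core may take over a busy little core's task.
    big_waiting = any(c.kind is Kind.BIG for c in waiting)
    little_busy = any(c.kind is Kind.LITTLE for c in busy)
    if big_waiting and little_busy:
        actions.add("work-mugging")
    return actions
```

In this sketch work-pacing and work-sprinting are mutually exclusive because they are keyed on whether any core is waiting, while work-mugging can accompany work-sprinting.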



# Talk Outline

Motivation

First-Order Modeling

Asymmetry-Aware Work-Stealing Runtimes

**Evaluation**

# **Evaluation Methodology: Modeling**

#### **Work-Stealing Runtime**

- State-of-the-art Intel TBB-inspired work-stealing scheduler
- Chase-Lev task queues with occupancy-based victim selection
- Automatic recursive decomposition of parallel loops into task chunks
- Instrumented with activity hints
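The queue discipline and victim selection listed above can be sketched as follows. This is a single-threaded model for illustration only: it is not the lock-free Chase-Lev algorithm, and the `Worker`, `pick_victim`, and `steal` names are hypothetical. It assumes that "occupancy-based" means a thief targets the worker with the most queued tasks.

```python
from collections import deque


class Worker:
    """A worker with a double-ended task queue (single-threaded model)."""

    def __init__(self, wid):
        self.wid = wid
        self.tasks = deque()

    def push(self, task):
        self.tasks.append(task)  # owner pushes at the bottom

    def pop(self):
        # Owner takes the newest task (LIFO) from the bottom.
        return self.tasks.pop() if self.tasks else None

    def give_up_oldest(self):
        # A thief takes the oldest task (FIFO) from the top.
        return self.tasks.popleft() if self.tasks else None


def pick_victim(workers, thief):
    """Occupancy-based selection: choose the other worker with the most tasks."""
    candidates = [w for w in workers if w is not thief and w.tasks]
    if not candidates:
        return None
    return max(candidates, key=lambda w: len(w.tasks))


def steal(workers, thief):
    """Steal one task for `thief`, or return None if every queue is empty."""
    victim = pick_victim(workers, thief)
    return victim.give_up_oldest() if victim is not None else None
```

Stealing from the opposite end of the queue from the owner tends to hand thieves older, larger chunks of work, which is the usual motivation for this discipline.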

#### **Cycle-Level Modeling**

- Heterogeneous system modeled in gem5 cycle-approximate simulator
- Support for per-core frequency scaling plus a central DVFS controller
- Accounting for inter-core interrupt latency during mugging

#### **Energy Modeling**

- Event-based energy modeling based on McPAT models and detailed RTL/gate-level sims (Synopsys ASIC toolflow, TSMC LP, 65 nm 1.0 V)
- Per-instruction energy benchmarks to isolate event energy (e.g., register-file reads)
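Event-based energy modeling generally amounts to multiplying event counts from simulation by a per-event energy and summing. The sketch below shows that arithmetic under stated assumptions: the function name `total_energy_pj`, the event names, and the picojoule unit are hypothetical and do not come from the models used in the talk.

```python
def total_energy_pj(event_counts, energy_per_event_pj):
    """Sum count * per-event energy over all recorded events (picojoules).

    event_counts: mapping of event name -> number of occurrences.
    energy_per_event_pj: mapping of event name -> energy per occurrence.
    """
    missing = set(event_counts) - set(energy_per_event_pj)
    if missing:
        # Refuse to silently ignore events that have no energy characterization.
        raise KeyError(f"no energy value for events: {sorted(missing)}")
    return sum(
        count * energy_per_event_pj[name]
        for name, count in event_counts.items()
    )
```

Failing on uncharacterized events, rather than treating them as zero, keeps a missing benchmark from quietly lowering the energy estimate.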

# **Performance of Convex Hull Application**

[Figure: Normalized execution time of the convex hull application, with low-parallel and high-parallel regions marked. The baseline is 1.00; annotated reductions are 8% (0.92, labeled work-pacing), 25% (0.75), and 29% (0.71). Other labels on the plot: Waiting; voltages 0.70 V, 0.91 V, 1.0]

# **Energy-Efficiency and Performance Results**

[Figure: plot not recovered. Legend:]

- baseline
- baseline + work-pacing
- baseline + work-pacing + work-sprinting
- baseline + work-pacing + work-sprinting + work-mugging






## **Take-Away Point**

Holistically combining

- work-stealing runtimes
- static asymmetry
- dynamic asymmetry

through the use of

- work-pacing
- work-sprinting
- work-mugging

can improve both performance and energy efficiency in future multicore systems