#### Dynamic Cache Partitioning for Simultaneous Multithreading Systems

G. Edward Suh, Larry Rudolph, Srinivas Devadas (LCS, MIT)

# Simultaneous Multithreading (SMT) Systems

- Combines superscalar and multithreaded architectures
- Low IPC comes from two sources
  - Data dependencies
  - Data delay (memory bottleneck)
- SMT relieves the dependency problem
  - Helps to hide the memory latency
- SMT increases the total footprint
  - Puts more pressure on the memory system

# **Strategy – Partition the Cache**

- Control the amount of cache data held by each thread so as to minimize the number of misses
  - On-line monitoring of thread characteristics
    - Marginal gain g<sub>i</sub>(x): additional hits for thread i when its cache space grows from x blocks to x+1 blocks
  - Deciding cache allocation to each thread
    - Based on the marginal gain of each thread
  - Partitioning mechanism
    - Augmented LRU replacement policy

- Cache: 4-way associative, 8192 sets
- 2 simultaneous threads
- Add 4 counters for each thread
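
The counter idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each thread keeps its own LRU ordering within a set and that counter k counts hits on the thread's k-th most recently used block. The names `ThreadMonitor`, `access`, and `marginal_gain` are hypothetical, and only a single set is modeled.

```python
from dataclasses import dataclass, field

WAYS = 4  # associativity from the slide


@dataclass
class ThreadMonitor:
    """Hit counters for one thread, one counter per LRU stack position."""
    counters: list = field(default_factory=lambda: [0] * WAYS)
    stack: list = field(default_factory=list)  # MRU block first

    def access(self, tag):
        if tag in self.stack:
            # Hit: credit the counter for the block's current MRU position.
            position = self.stack.index(tag)
            self.counters[position] += 1
            self.stack.remove(tag)
        elif len(self.stack) == WAYS:
            # Miss in a full set: drop the LRU block.
            self.stack.pop()
        self.stack.insert(0, tag)  # the accessed block becomes MRU

    def marginal_gain(self, ways):
        # Extra hits expected if the thread's share grew from
        # `ways` to `ways + 1` ways (way-granular estimate).
        return self.counters[ways]


thread1 = ThreadMonitor()
for tag in ["A", "B", "C", "A"]:
    thread1.access(tag)
print(thread1.counters)
```

In this run the last access finds block A in the 3rd MRU position, so only the third counter is incremented. The estimate is per way rather than per block, which is coarser than the per-block definition of g<sub>i</sub>(x).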





[Figure: Thread 1 hits on the 3<sup>rd</sup> MRU block; panels show the counters for Thread 1 and the counters for Thread 2.]




PDCS 2001, Anaheim, CA

Aug 24, 2001



[Figure: allocation example, shown as successive frames.]

- First frame: unassigned blocks: 8192 × 4; allocation to Thread 1: 0; allocation to Thread 2: 0
- Next frame: allocation to Thread 1: 0; allocation to Thread 2: 8192
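
One way to read the allocation frames is as a greedy loop that hands out one way (8192 blocks) at a time to the thread with the largest marginal gain. The sketch below assumes that reading, and it assumes marginal gains shrink as a thread's share grows, which is what makes a greedy choice reasonable. The function name `allocate` and the counter values are made up for illustration.

```python
SETS = 8192  # from the slide
WAYS = 4     # from the slide


def allocate(counters):
    """Greedy split of the cache between threads.

    counters[t][k] is the number of hits thread t saw on its
    (k+1)-th MRU block, used here as its marginal gain for way k+1.
    Returns the number of blocks allocated to each thread.
    """
    ways = [0] * len(counters)
    for _ in range(WAYS):
        # Give the next way to the thread that gains the most hits from it.
        best = max(range(len(counters)), key=lambda t: counters[t][ways[t]])
        ways[best] += 1
    return [w * SETS for w in ways]


# Hypothetical counter values for two threads.
example_counters = [
    [50, 10, 5, 1],   # Thread 1
    [90, 40, 20, 2],  # Thread 2
]
print(allocate(example_counters))
```

With these made-up counters, the first way goes to Thread 2, which matches the frame showing Thread 1 at 0 and Thread 2 at 8192 blocks.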











#### **Example: Augmented LRU**
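
The slides name the mechanism but the figure did not survive, so the following is only one plausible reading of an augmented LRU policy: a thread holding fewer blocks than its allocation evicts the LRU block of an over-allocated thread, and otherwise evicts its own LRU block. The function `choose_victim` and its arguments are hypothetical, and the fallback to plain LRU is an assumption.

```python
def choose_victim(lru_order, owner_of, requester, held, target):
    """Pick the block to evict from a full set on a miss by `requester`.

    lru_order: block ids in the set, least recently used first
    owner_of:  block id -> thread that brought the block in
    held:      thread -> blocks it currently holds in the whole cache
    target:    thread -> blocks allocated to it by the partitioning step
    """
    if held[requester] < target[requester]:
        # Under-allocated: take space from a thread that holds too much.
        candidates = [b for b in lru_order
                      if held[owner_of[b]] > target[owner_of[b]]]
    else:
        # At or over its allocation: replace one of its own blocks.
        candidates = [b for b in lru_order if owner_of[b] == requester]
    # Assumed fallback: plain LRU when no candidate exists in this set.
    return candidates[0] if candidates else lru_order[0]


lru_order = ["a", "b", "c", "d"]
owner_of = {"a": 0, "b": 1, "c": 1, "d": 0}
target = {0: 20, 1: 20}

# Thread 0 is under its target, thread 1 is over it.
print(choose_victim(lru_order, owner_of, 0, {0: 10, 1: 30}, target))
# Both threads are exactly at their targets.
print(choose_victim(lru_order, owner_of, 0, {0: 20, 1: 20}, target))
```

The first call evicts the LRU block of the over-allocated thread, and the second evicts the requester's own LRU block.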



## **Experimental Setup**

- On-line L2 cache partitioning
- Combine SimpleScalar with a cache simulator
- System configuration
  - Executes up to 4 threads simultaneously
  - 4 ALUs and 1 Multiplier
  - 32-KB 8-way L1 caches (latency 1 cycle)
  - Various size 8-way L2 caches (latency 10 cycles)
- Benchmarks
  - SPEC CPU2000; art and mcf

#### **Experimental Results**

**IPC Improvement (Partitioned IPC/LRU IPC)** 



# **Discussion of Results**

- Small caches
  - Nothing helps: the workload itself should change
- Medium caches
  - Partitioning helps
  - Improvement related to latency (more than linear)
- Large caches
  - Partitioning does not help: All workloads fit into the cache

#### **Relevant Cache Sizes**

- Partitioning helps for medium-sized caches
- Relevant cache sizes depend on the characteristics of threads and the number of active threads



**IPC Improvement: 2 threads** 





# Summary

- Simultaneous multithreading can significantly degrade cache performance
- Smart partitioning can relieve the problem for medium-sized caches
  - The relevant size varies with the characteristics and the number of threads
- Cache-aware thread scheduling is needed for small caches