1 Mem+Net: Banked Memory Systems
2 Proc+Net: Message-Passing Systems
3 Proc+Mem+Net: Shared-Memory Systems
4 Memory Synchronization, Consistency, and Coherence
   4.1. Memory Synchronization
   4.2. Memory Consistency
   4.3. Memory Coherence
1. Mem+Net: Banked Memory Systems

- SWMR/MWSR buses used for memory request/response messages
- Addresses usually cache-line interleaved across banks
- Address indicates destination for req message
- Assume 16 × 16B cache lines, address mapping:
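As a rough sketch of cache-line interleaving with 16B lines, the bank can be taken from the address bits just above the 4-bit byte offset. The function name `bank_index` and the `num_banks` parameter are hypothetical, since these notes do not say how many banks there are.

```c
#include <stdint.h>

#define LINE_BYTES 16u  /* 16B cache lines, as stated above */

/* Bank that holds the cache line containing addr.
 * num_banks is a hypothetical parameter and must be a power of two. */
uint32_t bank_index(uint32_t addr, uint32_t num_banks) {
    uint32_t line = addr / LINE_BYTES;   /* drop the 4 byte-offset bits */
    return line & (num_banks - 1u);      /* low line bits select the bank */
}
```

Assuming four banks, 0x1000 maps to bank 0, while 0x1010, 0x1014, and 0x1018 share one cache line and therefore all map to bank 1; 0x1020 maps to bank 2.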

Assume all transactions hit in the cache, unpipelined FSM cache with TC/RD states on hit path, single-cycle request/response bus.

<table>
<thead>
<tr>
<th>Transaction</th>
</tr>
</thead>
<tbody>
<tr>
<td>rd 0x1000</td>
</tr>
<tr>
<td>rd 0x1010</td>
</tr>
<tr>
<td>rd 0x1014</td>
</tr>
<tr>
<td>rd 0x1018</td>
</tr>
<tr>
<td>rd 0x1020</td>
</tr>
</tbody>
</table>

- We can use queues to help decouple the network from the cache banks
- Transactions destined for the same bank (bank conflicts) wait in that bank's queue, which lets the network move on to other transactions

Assume all transactions hit in the cache, unpipelined FSM cache with TC/RD states on hit path, single-cycle request/response bus.

```
rd 0x1000
rd 0x1010
rd 0x1014
rd 0x1018
rd 0x1020
```
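The five reads above can be routed into per-bank queues along the following lines. This is only a sketch: the bank count, the queue depth, and the names `bank_queue_t` and `enqueue_request` are assumptions made for illustration, not values from these notes.

```c
#include <stdint.h>

#define LINE_BYTES  16u   /* 16B cache lines */
#define NUM_BANKS   4u    /* assumption: four banks */
#define QUEUE_DEPTH 4u    /* assumption: four entries per bank queue */

typedef struct {
    uint32_t addr[QUEUE_DEPTH];  /* pending request addresses, oldest first */
    uint32_t count;              /* number of valid entries */
} bank_queue_t;

/* Route a request to the queue of the bank that owns its cache line.
 * Returns the bank index, or -1 if that queue is full and the
 * network would have to stall. */
int enqueue_request(bank_queue_t queues[NUM_BANKS], uint32_t addr) {
    uint32_t bank = (addr / LINE_BYTES) % NUM_BANKS;
    bank_queue_t* q = &queues[bank];
    if (q->count == QUEUE_DEPTH) {
        return -1;
    }
    q->addr[q->count++] = addr;
    return (int)bank;
}
```

Under these assumptions the three reads to 0x1010, 0x1014, and 0x1018 pile up in the queue for bank 1, while the read to 0x1020 goes straight to bank 2 instead of waiting behind them.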
2. Proc+Net: Message-Passing Systems

- Use explicit messages to communicate between the processors
- Each processor has its own local memory that is not accessible by other processors

```
// Assumes four processors and arrays have 64 elements

// Processor 0 executes this function
void vvadd_p0( int* dest, int* src0, int* src1 ) {
    // Pass the address of each 16-element chunk, not the element value
    send( 1, &src0[16], 16 ); send( 1, &src1[16], 16 ); // Distribute
    send( 2, &src0[32], 16 ); send( 2, &src1[32], 16 ); // source
    send( 3, &src0[48], 16 ); send( 3, &src1[48], 16 ); // data

    vvadd_serial( dest, src0, src1, 16 );

    recv( 1, &dest[16], 16 ); // Collect
    recv( 2, &dest[32], 16 ); // result
    recv( 3, &dest[48], 16 ); // data
}

// Processors 1-3 execute this function
void vvadd_pN() {
    int local_dest[16]; int local_src0[16]; int local_src1[16];

    recv( 0, local_src0, 16 );
    recv( 0, local_src1, 16 );

    vvadd_serial( local_dest, local_src0, local_src1, 16 );

    send( 0, local_dest, 16 );
}
```
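The listing above calls `vvadd_serial` without defining it. A minimal sketch, assuming it is a plain element-wise vector add over `size` elements (the body and the `const` qualifiers are assumptions; only the name and call shape come from the listing):

```c
/* Element-wise vector add: dest[i] = src0[i] + src1[i] for i in [0, size). */
void vvadd_serial(int* dest, const int* src0, const int* src1, int size) {
    for (int i = 0; i < size; i++) {
        dest[i] = src0[i] + src1[i];
    }
}
```

Each processor runs this same loop on its own 16-element chunk; only the way the chunks reach the processor differs between the message-passing and shared-memory versions.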
3. Proc+Mem+Net: Shared-Memory Systems

- Processors implicitly communicate through a globally shared memory
- Can map a high-level message passing framework to shared memory

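```c
// Assumes four processors and arrays have 64 elements
// volatile so the spin loop below re-reads the flags from memory
volatile int done[4] = { 0, 0, 0, 0 };

// Processor 0 executes this function
void vvadd_p0(int* dest, int* src0, int* src1) {
    vvadd_serial(dest, src0, src1, 16);
    // Wait for other processors to finish
    for (int i = 1; i < 4; i++) {
        while (done[i] != 1) {
        }
    }
}

// Processors 1-3 execute this function; dest, src0, and src1 point into
// shared memory, and processor_id is this processor's index (1-3)
void vvadd_pN(int* dest, int* src0, int* src1) {
    int idx = 16 * processor_id;
    // Pass the address of this processor's chunk, not the element value
    vvadd_serial(&dest[idx], &src0[idx], &src1[idx], 16);
    done[processor_id] = 1;
}
```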
4. Memory Synchronization, Consistency, and Coherence

- Memory Synchronization: How processors “hand-shake” at certain points to reach an agreement or commit to a certain sequence of actions

- Memory Consistency: The order in which a single processor appears to update memory addresses (consistency is usually focused on how the architecture handles memory transactions to different addresses)

- Memory Coherence: All processors always have the same view of a given memory address (coherence is usually focused on how the architecture handles memory transactions to the same memory address)
4.1. Memory Synchronization

Assume we wish to have processors 0–2 be able to send a single word of data to processor 3.

```
// Processor sends single word of data to processor 3
void send( int data ) {
    while ( *lock_ptr != 0 ) { }
    *lock_ptr = 1;
    while ( *flag_ptr != 0 ) { }
    *buf_ptr = data;
    *flag_ptr = 1;
    *lock_ptr = 0;
}

// Processor 3 receives single word of data
int recv() {
    while ( *flag_ptr != 1 ) { }
    int data = *buf_ptr;
    *flag_ptr = 0;
    return data;
}
```
Multiple processors can try to send data to processor 3 at the same time.

```assembly
# assembly for beginning of send function
addiu r3, r0, 1       # initialize r3 to 1

loop:
lw     r1, 0(r2)     # r2 is lock_ptr
bne    r1, r0, loop  # check if lock is set
sw     r3, 0(r2)     # set lock
```

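An lw/sw interleaving that breaks the lock:
- P0: lw r1 → P1: lw r1 → P0: sw r3 →
  P0: sw r4 → P1: sw r3 → P1: sw r4

Both processors load the lock as 0 before either one stores 1, so both believe they hold the lock. We need to provide **mutual exclusion** to ensure only one processor is updating the flag at any given time.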

- Carefully crafted software solutions
- Hardware support through special instructions

- **Summary** : Atomic fetch & or
- **Assembly** : amo.or r_dst, r_addr, r_src
- **Semantics** :
  ```
  temp = M_4B[ R[r_addr] ]
  M_4B[ R[r_addr] ] = temp | R[r_src]
  R[r_dst] = temp
  ```
- **Format** : R-Type

```
 31    26 25    21 20    16 15    11 10     6 5      0
+--------+--------+--------+--------+--------+--------+
|   op   |   rs   |   rt   |   rd   |   sa   |  cmd   |
| 100111 |  addr  |  src   |  dst   | 00000  | 000100 |
+--------+--------+--------+--------+--------+--------+
```
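As a worked example of the format above, the fields can be packed into a 32-bit instruction word as follows. The helper name `encode_amo_or` is hypothetical; the opcode, field positions, and register roles are taken from the format table.

```c
#include <stdint.h>

/* Encode amo.or r_dst, r_addr, r_src using the field layout above:
 * op[31:26], rs[25:21] = addr, rt[20:16] = src, rd[15:11] = dst,
 * sa[10:6] = 0, cmd[5:0]. Register numbers are masked to 5 bits. */
uint32_t encode_amo_or(uint32_t r_dst, uint32_t r_addr, uint32_t r_src) {
    const uint32_t op  = 0x27u;  /* 100111 */
    const uint32_t cmd = 0x04u;  /* 000100 */
    return (op << 26)
         | ((r_addr & 0x1Fu) << 21)
         | ((r_src  & 0x1Fu) << 16)
         | ((r_dst  & 0x1Fu) << 11)
         | cmd;                  /* sa field stays zero */
}
```

For instance, `amo.or r1, r2, r3` (dst = 1, addr = 2, src = 3) encodes to `0x9C430804` under this layout.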

An atomic instruction performs a series of operations that execute atomically with respect to other memory operations. The amo.or instruction performs a fetch and an OR that appear, to all other memory operations, to happen at once.

```assembly
# assembly for send function
addiu  r3, r0, 1       # initialize r3 to 1
loop:
amo.or r1, r2, r3      # atomic test-and-set
bne    r1, r0, loop    # check if got lock
```
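The same test-and-set loop can be sketched in C, assuming C11 `atomic_fetch_or` from `<stdatomic.h>` as a stand-in for `amo.or`: both return the old value and OR a value into memory in one atomic step. The function names here are hypothetical.

```c
#include <stdatomic.h>

/* One attempt to take the lock: atomically OR 1 into the lock word and
 * inspect the old value. Returns 1 if the lock was free (old value 0). */
int lock_try_acquire(atomic_int* lock) {
    return atomic_fetch_or(lock, 1) == 0;
}

/* Spin until the lock is acquired, mirroring the amo.or / bne loop. */
void lock_acquire(atomic_int* lock) {
    while (!lock_try_acquire(lock)) { }
}

/* Release by storing 0, like the final *lock_ptr = 0 in send(). */
void lock_release(atomic_int* lock) {
    atomic_store(lock, 0);
}
```

Because the read of the old value and the write of 1 cannot be separated, two processors can no longer both observe 0 and both proceed into the critical section.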
4.2. Memory Consistency

- A **memory consistency model** is part of the instruction set, and specifies the valid order in which a microarchitecture can update memory.

- **Sequential consistency** requires that a microarchitecture ensure that all updates to memory appear to happen in *program order*. 

```assembly
# Assembly frag from send
sw r1, 0(r2)  # write buf_ptr
sw r3, 0(r4)  # write flag_ptr

# Assembly frag from recv
lw r2, 0(r3)  # read flag_ptr
lw r4, 0(r5)  # read buf_ptr
```
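The two fragments above only work if the data write becomes visible before the flag write, and the flag read happens before the data read. In C this ordering can be requested explicitly; the sketch below assumes the C11 memory model (release/acquire), which is a language-level stand-in and not the ISA-level consistency model these notes discuss. It also assumes a single sender, so the lock is omitted, and the names `send_word` and `recv_word` are hypothetical.

```c
#include <stdatomic.h>

static int        buf;   /* payload, stands in for *buf_ptr */
static atomic_int flag;  /* 0 = empty, 1 = full, stands in for *flag_ptr */

/* Sender: write the data first, then publish it by setting the flag.
 * The release store keeps the buf write ordered before the flag write. */
void send_word(int data) {
    while (atomic_load_explicit(&flag, memory_order_acquire) != 0) { }
    buf = data;
    atomic_store_explicit(&flag, 1, memory_order_release);
}

/* Receiver: wait for the flag, then read the data.
 * The acquire load keeps the buf read ordered after the flag read. */
int recv_word(void) {
    while (atomic_load_explicit(&flag, memory_order_acquire) != 1) { }
    int data = buf;
    atomic_store_explicit(&flag, 0, memory_order_release);
    return data;
}
```

If the two stores in the sender could be reordered, the receiver might see the flag set and then read a stale value from the buffer, which is exactly the case a sequentially consistent machine rules out.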
4.3. Memory Coherence

- Cache coherence should ideally not be exposed in the instruction set
- **Cache coherence protocol** ensures that all processors see an updated view of any given memory address

```assembly
# Assembly frag from send
sw r1, 0(r2)  # write buf_ptr
sw r3, 0(r4)  # write flag_ptr

# Assembly frag from recv
loop:
lw r2, 0(r3)  # read flag_ptr
bne r2, r1, loop
lw r4, 0(r5)  # read buf_ptr
```