

Cornell University Computer Systems Laboratory



ACCELERATING SEED LOCATION FILTERING IN DNA READ MAPPING USING A COMMERCIAL COMPUTE-IN-SRAM ARCHITECTURE

Courtney Golden<sup>1</sup>, Dan Ilan<sup>2</sup>, Nicholas Cebry<sup>1</sup>, and Christopher Batten<sup>1</sup> <sup>1</sup>Cornell University, <sup>2</sup>GSI Technologies Inc.

## **Genomics Acceleration**

O PE

Wait not

+ I/S/D

## **Compute-in-Memory**



Darwin



Shouji

GenAx





SOAP3



Darwin-WGA



 $(u, V_l, V_D$  M' refers to M(i-1, j)to the base cases.









28nm 6T SRAM



--THE COLUMN TWO -And an and one ---Sec. 1 ------Taking Taplana

Duality Cache



EVE



28nm Vector CRAM

| * 축 축 축 축 축 축 축 축                                    |  |  |  |
|------------------------------------------------------|--|--|--|
| $\begin{array}{cccccccccccccccccccccccccccccccccccc$ |  |  |  |
| T <sub>1</sub> T <sub>2</sub>                        |  |  |  |
| ComputeDRAM                                          |  |  |  |



## USING COMPUTE-IN-MEMORY FOR GENOMICS





#### GenCache [MICRO '19]

- Proposes using tightly-coupled compute-in-SRAM accelerator with hardware extensions for filtering
- Simulation-based study showing 5.26x speedup over basic accelerator with no in-cache operations



- Proposes using loosely-coupled compute-in-DRAM accelerator with custom hardware for filtering
- Simulation-based study showing 1.81-3.65x speedup over CPU

**Our Goal:** Explore the potential for accelerating seed location filtering in DNA read mapping using a general-purpose commercial-scale compute-in-SRAM architecture



## **GSI GEMINI PLATFORM: ASSOCIATIVE PROCESSING UNIT (APU)**



Device DRAM: 16 GB Compute-Enabled SRAM: 262 KB On-Chip Memory: 12 MB On-Chip Memory Bandwidth: 26 TB/s Compute Speed: 400 MHz Power (TDP): 60 W IC Process: TSMC 28nm





compute-enabled SRAM array

Leverages memory bitlines, peripheral logic, and highlyefficient search and update operations



Cornell University Computer Systems Laboratory

Accelerating Seed Location Filtering in DNA Read Mapping Using a Commercial Compute-in-SRAM Architecture

- Motivation
- APU Microarchitecture
- APU Microcoding
- APU for Myers' Acceleration









System Overview



Cornell University Computer Systems Laboratory





APU Core Logical View



Cornell University Computer Systems Laboratory





APU Core Logical View



Cornell University Computer Systems Laboratory





APU Core Logical View



Cornell University Computer Systems Laboratory





APU Core Logical View



Cornell University Computer Systems Laboratory





APU Core Logical View



Cornell University Computer Systems Laboratory





Cornell University Computer Systems Laboratory





Cornell University Computer Systems Laboratory





Cornell University Computer Systems Laboratory





Cornell University Computer Systems Laboratory

APU

APU







**Cornell University** Computer Systems Laboratory







Cornell University Computer Systems Laboratory







Cornell University Computer Systems Laboratory

MOTIVATION • APU MICROARCHITECTURE • APU MICROCODING • APU FOR MYERS' ACCELERATION

Page 6 of 18





Cornell University Computer Systems Laboratory



Accelerating Seed Location Filtering in DNA Read Mapping Using a Commercial Compute-in-SRAM Architecture

- Motivation
- APU Microarchitecture
- APU Microcoding
- APU for Myers' Acceleration











Cornell University Computer Systems Laboratory







Cornell University Computer Systems Laboratory

MOTIVATION • APU MICROARCHITECTURE • APU MICROCODING • APU FOR MYERS' ACCELERATION

Page 8 of 18







Cornell University Computer Systems Laboratory

MOTIVATION • APU MICROARCHITECTURE • APU MICROCODING • APU FOR MYERS' ACCELERATION

Page 8 of 18







Cornell University Computer Systems Laboratory

MOTIVATION • APU MICROARCHITECTURE • APU MICROCODING • APU FOR MYERS' ACCELERATION

Page 8 of 18



















Cornell University Computer Systems Laboratory



APL\_FRAG \_frag\_vadd(vdst, vsrc0, vsrc1):
 // ---- bit 0 --- // vdst = vsrc0 ^ vsrc1
 0x0001: RL = VRF[vsrc0];
 0x0001: RL ^= VRF[vsrc1];
 0x0001: VRF[vdst] = RL;
 // cout = vsrc0 \* vsrc1
 0x0001: RL = VRF[vsrc0, vsrc1];
 ...

| vsrc0 | vsrc1 | vdst | c_out |
|-------|-------|------|-------|
| 0     | 0     | 0    | 0     |
| 0     | 1     | 1    | 0     |
| 1     | 0     | 1    | 0     |
| 1     | 1     | 0    | 1     |





Cornell University Computer Systems Laboratory

MOTIVATION • APU MICROARCHITECTURE • APU MICROCODING • APU FOR MYERS' ACCELERATION

Page 9 of 18



APL\_FRAG \_frag\_vadd(vdst, vsrc0, vsrc1):

```
. . .
// ---- bit 1 ----
// vdst = a \wedge b \wedge cin
(0x0001<<1): RL = VRF[vsrc0];
(0x0001<<1): RL ^= VRF[vsrc1];
(0x0001<<1): RL ^= RL_N;
(0x0001<<1): VRF[vdst] = RL;
// cout = a*b + b*cin + a*cin
(0x0001<<1): RL = VRF[vsrc0, vsrc1];
(0 \times 0001 << 1): VRF[temp_0] = RL;
(0x0001<<1):
              RL = VRF[vsrc1];
(0x0001<<1):
              RL &= RL N:
(0x0001<<1):
              VRF[temp_1] = RL;
(0x0001<<1):
              RL = VRF[vsrc0];
(0x0001<<1):
              RL \&= RL_N;
(0x0001<<1): RL |= VRF[temp_0];
(0x0001<<1): RL |= VRF[temp 1]:
. . .
```





Cornell University Computer Systems Laboratory

MOTIVATION • APU MICROARCHITECTURE • APU MICROCODING • APU FOR MYERS' ACCELERATION

Page 10 of 18



APL\_FRAG \_frag\_vadd(vdst, vsrc0, vsrc1): Bit 0 2 3 Slice 0 . . . // ---- bit 1 ----Decodel  $// vdst = a \wedge b \wedge cin$  $(0 \times 0001 << 1)$ : RL = VRF[vsrc0]; RL ^= VRF[vsrc1]; (0x0001<<1): (0x0001<<1): RL  $\wedge$ = RL\_N; R/W (0x0001<<1): VRF[vdst] = RL;Bit // cout = a\*b + b\*cin + a\*cinSlice 1 (0x0001<<1): RL = VRF[vsrc0, vsrc1];  $(0 \times 0001 << 1): VRF[temp_0] = RL;$ Decoder (0x0001<<1): RL = VRF[vsrc1];(0x0001<<1): RL &= RL N: (0x0001<<1):  $VRF[temp_1] = RL;$ (0x0001<<1): RL = VRF[vsrc0];R/W -R/W R/W R/W (0x0001<<1): RL &=  $RL_N$ ; (0x0001<<1): RL |= VRF[temp\_0]; (0x0001<<1): RL |= VRF[temp\_1]; Bit . . . Slice 15





Cornell University Computer Systems Laboratory



APL\_FRAG \_frag\_vadd(vdst, vsrc0, vsrc1): . . . // ---- bit 1 ----// vdst =  $a \wedge b \wedge cin$  $(0 \times 0001 << 1)$ : RL = VRE[vsrc0]: (0x0001<<1): RL ^= VRF[vsrc1]; (0x0001<<1): RL ^= RL\_N; (0x0001<<1): VRF[vdst] = RL; // cout = a\*b + b\*cin + a\*cin (0x0001<<1): RL = VRF[vsrc0, vsrc1];  $(0 \times 0001 << 1): VRF[temp_0] = RL;$ (0x0001<<1): RL = VRF[vsrc1];(0x0001<<1): RL &= RL N: (0x0001<<1):  $VRF[temp_1] = RL;$ (0x0001<<1): RL = VRF[vsrc0]; (0x0001<<1): RL &= RL\_N; (0x0001<<1): RL |= VRF[temp\_0]; (0x0001<<1): RL |= VRF[temp\_1]; . . .





Cornell University Computer Systems Laboratory



Bit Cell APL\_FRAG \_frag\_vadd(vdst, vsrc0, vsrc1): Bit 2046 2047 0 2 3 4 5 6 Slice 0 Global . . . Vertical // ---- bit 1 ----Lines Decodel ... Read/Write // vdst =  $a \wedge b \wedge cin$ Enables (0x0001<<1): RL = VRF[vsrc0]; (0x0001<<1): RL ^= VRF[vsrc1]; Read/Write Logic (0x0001<<1): RL ^= RL\_N; -<mark>.</mark> /W -<mark>T</mark>W <u>./w</u> -<mark>?/W</mark> -<u>//</u>W - <u>7</u>W <u>R/W</u> ₹/W -<mark>-</mark>/W ζ/W (0x0001<<1): VRF[vdst] = RL; Global Bit Horizontal Slice 1 // cout = a\*b + b\*cin + a\*cin Line **Global Vertical Line (GVL)** (0x0001<<1): RL = VRF[vsrc0, vsrc1];  $(0 \times 0001 << 1): VRF[temp_0] = RL;$ • (0x0001<<1): RL = VRF[vsrc1];(0x0001<<1): RL &= RL N: (0x0001<<1):  $VRF[temp_1] = RL;$ <u>|</u>... (0x0001<<1): RL = VRF[vsrc0];-R/W -R/W R/W R/W R/W HR/W R/W R/W R/W R/W (0x0001<<1): RL &=  $RL_N$ ; (0x0001<<1): RL |= VRF[temp\_0]; : (0x0001<<1): RL |= VRF[temp\_1]; Bit . . . Slice 15 Decoder ...





MOTIVATION • APU MICROARCHITECTURE • APU MICROCODING • APU FOR MYERS' ACCELERATION

...

R/W

**SRAM Bank Physical View** 

R/W



APL\_FRAG \_frag\_vadd(vdst, vsrc0, vsrc1): . . . // ---- bit 1 ----// vdst =  $a \wedge b \wedge cin$ (0x0001<<1): RL = VRF[vsrc0]; (0x0001<<1): RL ^= VRF[vsrc1];  $(0 \times 0001 << 1)$ : RL ^= RL\_N; (0x0001<<1): VRF[vdst] = RL; // cout = a\*b + b\*cin + a\*cin(0x0001<<1): RL = VRF[vsrc0, vsrc1];  $(0 \times 0001 << 1): VRF[temp_0] = RL;$ (0x0001<<1): RL = VRF[vsrc1];(0x0001<<1): RL &= RL N: (0x0001<<1):  $VRF[temp_1] = RL;$ (0x0001<<1): RL = VRF[vsrc0]; (0x0001<<1): RL &= RL\_N; (0x0001<<1): RL |= VRF[temp\_0]; (0x0001<<1): RL |= VRF[temp\_1];





. . .



Bit Cell APL\_FRAG \_frag\_vadd(vdst, vsrc0, vsrc1): Bit 2046 2047 0 2 3 4 5 6 RBL WBL WBLB Slice 0 Global . . . WE0 Vertical // ---- bit 1 ----Lines RE0 Decoder ... Read/Write // vdst =  $a \wedge b \wedge cin$ Bit 0 : Enables (0x0001<<1): RL = VRF[vsrc0]; WE1 RE1 Read/Write (0x0001<<1): RL ^= VRF[vsrc1]; ... Logic  $(0 \times 0001 << 1)$ : RL ^= RL\_N; HR/W HR/W HR/W -R/W R/W R/W Bit 1 (0x0001<<1): VRF[vdst] = RL; Global Bit WE2 Horizontal // cout = a\*b + b\*cin + a\*cin Slice 1 RE2 I I Line **Global Vertical Line (GVL)** RL = VRF[vsrc0, vsrc1]; (0x0001<<1): Bit 2  $(0 \times 0001 << 1): VRF[temp_0] = RL;$ Decoder .... • 24x (0x0001<<1): RL = VRF[vsrc1]; (0x0001<<1): RL &= RL N: WE23 RE23 (0x0001<<1):  $VRF[temp_1] = RL;$ ... (0x0001<<1): RL = VRF[vsrc0];Bit 23 R/W R/W R/W (0x0001<<1): RL &=  $RL_N$ ; (0x0001<<1): RL |= VRF[temp\_0]; : Read Write (0x0001<<1): RL |= VRF[temp\_1]; Logic Logic Bit . . . Slice 15 Read Source Decoder ... Latch Mux  $RL_w$ ... 

R/W

**SRAM Bank Physical View** 



**Cornell University** Computer Systems Laboratory

MOTIVATION • APU MICROARCHITECTURE • APU MICROCODING • APU FOR MYERS' ACCELERATION

R/W

R/W

RL,

RL\_

RL。

Global Horizontal Line (GHL)

Bit Processor Circuitry



APL\_FRAG \_frag\_vadd(vdst, vsrc0, vsrc1): . . . // ---- bit 1 ----// vdst =  $a \wedge b \wedge cin$ (0x0001<<1): RL = VRF[vsrc0]; (0x0001<<1): RL ^= VRF[vsrc1]; (0x0001<<1): RL ^= RL\_N; (0x0001<<1): VRF[vdst] = RL; // cout = a\*b + b\*cin + a\*cinRL = VRF[vsrc0, vsrc1]; (0x0001<<1):  $(0 \times 0001 << 1): VRF[temp_0] = RL;$ (0x0001<<1): RL = VRF[vsrc1];(0x0001<<1): RL &= RL N: (0x0001<<1):  $VRF[temp_1] = RL;$  $(0 \times 0001 << 1)$ : RL = VRF[vsrc0]; (0x0001<<1): RL &= RL\_N; (0x0001<<1): RL |= VRF[temp\_0]; (0x0001<<1): RL |= VRF[temp\_1];

Actual implementation used is a more sophisticated carry-select approach

Ripple-carry:**215 cycles**Carry-select:**12 cycles** 





. . .

Cornell University Computer Systems Laboratory

MOTIVATION • APU MICROARCHITECTURE • APU MICROCODING • APU FOR MYERS' ACCELERATION

Page 10 of 18







key:

~key:

0xFFFF:

Oxffff:

Key:

Result:







Cornell University Computer Systems Laboratory

MOTIVATION • APU MICROARCHITECTURE • APU MICROCODING • APU FOR MYERS' ACCELERATION

key:

~key:

**OXFFFF:** 

OxFFFF:

Key:

Result:







Cornell University Computer Systems Laboratory

MOTIVATION • APU MICROARCHITECTURE • APU MICROCODING • APU FOR MYERS' ACCELERATION





Cornell University Computer Systems Laboratory

MOTIVATION • APU MICROARCHITECTURE • APU MICROCODING • APU FOR MYERS' ACCELERATION







MOTIVATION • APU MICROARCHITECTURE • APU MICROCODING • APU FOR MYERS' ACCELERATION







Accelerating Seed Location Filtering in DNA Read Mapping Using a Commercial Compute-in-SRAM Architecture

- Motivation
- APU Microarchitecture
- APU Microcoding
- APU for Myers' Acceleration





## **GENOMICS PIPELINE**

#### **Reference Genome**

••• CTAGTTCCCTCGGAGAGTCTCCCCTTTTTGGGAAAATCTAGACTCTAGAGGAGACTCTCAAGAGATCTCTGATGTCCTGATAGCTATC •••



#### Final accurate alignment uses gap-affine distance



TGAC

Cornell University Computer Systems Laboratory

L16

MOTIVATION • APU MICROARCHITECTURE • APU MICROCODING • APU FOR MYERS' ACCELERATION



Page 13 of 18

## Myers' BIT-PARALLEL ALGORITHM







Myers' Algorithm:



| Δv <sub>j</sub> | Ρv | Μv |
|-----------------|----|----|
| -1              | 0  | 1  |
| 0               | 0  | 0  |
| +1              | 1  | 0  |

let m = length(query)let n = length(candidate)for q in [0...num\_queries-1]: precompute peq for c in [0...num\_candidates-1]:  $Pv = 1^m$  $Mv = 0^m$ score, min\_score = m for j in [0...n-1]: eq = peq[seed(j), i]Xv = eq | Mv $Xh = ((eq \& Pv) + Pv \land Pv) | eq$  $Ph = Mv | \sim (Xh | Pv)$ Mh = Pv & Xhif last bit of Ph is 1, score += 1 if last bit of Mh is 1, score -= 1 if score < min\_score, min\_score = score</pre> shift Ph save old MSB of Ph shift Mh save old MSB of Mh  $Pv = Mh \mid \sim (Xv \mid Ph)$ Mv = Ph & Xv



Cornell University Computer Systems Laboratory

MOTIVATION • APU MICROARCHITECTURE • APU MICROCODING • APU FOR MYERS' ACCELERATION

1

3

9

10

11

12

13

14

15

16

17

18

19

20

21 22

23

24

25

26

27

28

29

#### DATA LAYOUT ON APU







Cornell University Computer Systems Laboratory











Cornell University Computer Systems Laboratory



```
APL_FRAG search(vdst, vsrc, key):
APL_FRAG _frag_bitwise_or(vdst, vsrc0, vsrc1):
   0xFFFF: RL = VRF[vsrc0];
                                                                      RL = VRF[vsrc];
                                                          key:
                                                                                                                 let m = length(query)
   0xFFFF: RL |= VRF[vsrc1];
                                                          ~key:
                                                                      RL = ~VRF[vsrc];
                                                                                                                 let n = length(candidate)
   0xFFFF: VRF[vdst] = RL;
                                                          Oxffff:
                                                                      GVL = RL;
                                                          OxFFFF:
                                                                      VRF[vdst] = GVL;
                                                                                                               3
                                                                                                                  for g in [0...num_queries-1]:
APL_FRAG vadd(vdst, vsrc0, vsrc1):
                                                                                                                   precompute peq
  // ---- bit 0 ----
  // vdst = vsrc0 XOR vsrc1
                                                                                                                    for c in [0...num_candidates-1]:
  0x0001: RL = VRF[vsrc0];
                                                                                                                      Pv = 1^m
  0x0001: RL ^= VRF[vsrc1]:
                                                                                                                      Mv = 0 \wedge m
  0x0001: VRF[vdst] = RL;
                                                                                                                     score, min_score = m
                                                                                                              9
  // cout = vsrc0 AND vsrc1
                                                                                                              10
  0x0001: RL = VRF[vsrc0, vsrc1];
                                                                                                              11
                                                                                                                     fo<u>r j in [0...n-1]:</u>
                                                                                                                          eq = peq[seed(j), i]
                                                                                                              12
  // ---- bit 1 ----
                                                                                                              13
                                                                                                                          Xv = eq Mv
  // vdst = a \wedge b \wedge cin
                                                                                                              14
                                                                                                                          Xh = ((eq \& Pv) + Pv \land Pv) | eq
  (0x0001<<1): RL = VRF[vsrc0];
                                                                                                              15
                                                                                                                          Ph = MV | \sim (Xh | PV)
  (0x0001<<1): RL ^= VRF[vsrc1];
                                                                                                              16
                                                                                                                          Mh = Pv \& Xh
  (0x0001<<1): RL ^= RL_N;
                                                                                                              17
  (0x0001<<1): VRF[vdst] = RL;
                                                                                                                          if last bit of Ph is 1, score += 1
                                                                                                              18
  // \text{ cout} = a*b + b*cin + a*cin
                                                                                                                          if last bit of Mh is 1, score -= 1
                                                                                                              19
  (0x0001<<1): RL = VRF[vsrc0, vsrc1];
                                                                                                              20
                                                                                                                          if score < min score. min score = score
  (0x0001 << 1): VRF[temp_0] = RL;
                                                                                                              21
  (0x0001<<1): RL = VRF[vsrc1];
                                                                                                                          shift Ph
                                                                                                              22
  (0x0001<<1): RL &= RL_N;
                                                                                                              23
                                                                                                                          save old MSB of Ph
  (0x0001<<1): VRF[temp_1] = RL;
                                                                                                              24
  (0x0001<<1): RL = VRF[vsrc0];
                                                                                                              25
                                                                                                                          shift Mh
  (0x0001<<1): RL &= RL N:
                                                                                                              26
                                                                                                                          save old MSB of Mh
  (0x0001<<1): RL |= VRF[temp_0];
                                                                                                              27
  (0x0001<<1): RL |= VRF[temp_1];
                                                                                                              28
                                                                                                                          Pv = Mh \mid \sim (Xv \mid Ph)
                                                                                                                          Mv = Ph \& Xv
  . . .
                                                                                                              29
```









Cornell University Computer Systems Laboratory



```
APL_FRAG _frag_bitwise_or(vdst, vsrc0, vsrc1):
   0xFFFF: RL = VRF[vsrc0];
   0xFFFF: RL |= VRF[vsrc1];
   0xFFFF: VRF[vdst] = RL;
APL_FRAG vadd(vdst, vsrc0, vsrc1):
  // ---- bit 0 ----
  // vdst = vsrc0 XOR vsrc1
  0x0001: RL = VRF[vsrc0];
  0x0001: RL ^= VRF[vsrc1]:
  0x0001: VRF[vdst] = RL;
  // cout = vsrc0 AND vsrc1
  0x0001: RL = VRF[vsrc0, vsrc1];
  // ---- bit 1 ----
  // vdst = a \wedge b \wedge cin
  (0x0001<<1): RL = VRF[vsrc0];
  (0x0001<<1): RL ^= VRF[vsrc1];
  (0x0001<<1): RL ^= RL_N;
  (0x0001<<1): VRF[vdst] = RL;
  // \text{ cout} = a*b + b*cin + a*cin
  (0x0001<<1): RL = VRF[vsrc0, vsrc1];
  (0x0001 << 1): VRF[temp_0] = RL;
  (0x0001<<1): RL = VRF[vsrc1];
  (0x0001<<1): RL &= RL_N;
  (0x0001<<1): VRF[temp_1] = RL;
  (0x0001<<1): RL = VRF[vsrc0];
  (0x0001<<1): RL &= RL N:
  (0x0001<<1): RL |= VRF[temp_0];
  (0x0001<<1): RL |= VRF[temp_1];
```

```
APL_FRAG search(vdst, vsrc, key):
             RL = VRF[vsrc];
  key:
             RL = ~VRF[vsrc];
  ~key:
  Oxffff:
             GVL = RL;
  OxFFFF:
             VRF[vdst] = GVL;
APL_FRAG set_scalar(vdst, key):
  0xFFFF: RL = 0;
          RL = 1;
  kev:
  0xFFFF: VRF[vdst] = RL;
APL_FRAG save_last_bit(vdst, msk_b16, msk_bit):
  msk_b16: GVL = RL;
  msk_bit: VRF[vdst] = GVL;
```

```
0 let m = length(query)
  let n = length(candidate)
   for q in [0...num_queries-1]:
     precompute peq
     for c in [0...num_candidates-1]:
       Pv = 1^m
       Mv = 0^m
       score, min_score = m
9
10
11
       for j in [0...n-1]:
12
           eq = peq[seed(j), i]
13
           Xv = eq | Mv
14
           Xh = ((eq \& Pv) + Pv \land Pv) | eq
15
           Ph = MV | \sim (Xh | PV)
16
           Mh = Pv \& Xh
17
           if last bit of Ph is 1, score += 1
18
           if last bit of Mh is 1, score -= 1
19
           if score < min_score, min_score = score</pre>
20
21
22
            shift Ph
23
           save old MSB of Ph
24
25
            shift Mh
26
           save old MSB of Mh
27
28
           Pv = Mh \mid \sim(Xv \mid Ph)
           Mv = Ph \& Xv
29
```



. . .

MOTIVATION • APU MICROARCHITECTURE • APU MICROCODING • APU FOR MYERS' ACCELERATION

3



APL\_FRAG \_frag\_bitwise\_or(vdst, vsrc0, vsrc1): 0xFFFF: RL = VRF[vsrc0];0xFFFF: RL |= VRF[vsrc1]; 0xFFFF: VRF[vdst] = RL; APL\_FRAG vadd(vdst, vsrc0, vsrc1): // ---- bit 0 ----// vdst = vsrc0 XOR vsrc1 0x0001: RL = VRF[vsrc0]; 0x0001: RL ^= VRF[vsrc1]: 0x0001: VRF[vdst] = RL; // cout = vsrc0 AND vsrc1 0x0001: RL = VRF[vsrc0, vsrc1]; // ---- bit 1 ----// vdst =  $a \wedge b \wedge cin$ (0x0001<<1): RL = VRF[vsrc0]: (0x0001<<1): RL ^= VRF[vsrc1]; (0x0001<<1): RL ^= RL\_N; (0x0001<<1): VRF[vdst] = RL; // cout = a\*b + b\*cin + a\*cin(0x0001<<1): RL = VRF[vsrc0, vsrc1];  $(0x0001 << 1): VRF[temp_0] = RL;$ (0x0001<<1): RL = VRF[vsrc1]; (0x0001<<1): RL &= RL\_N; (0x0001<<1): VRF[temp\_1] = RL; (0x0001<<1): RL = VRF[vsrc0]; (0x0001<<1): RL &= RL N: (0x0001<<1): RL |= VRF[temp\_0]; (0x0001<<1): RL |= VRF[temp\_1];

```
APL_FRAG search(vdst, vsrc, key):
             RL = VRF[vsrc];
  key:
             RL = ~VRF[vsrc];
  ~key:
  Oxffff:
             GVL = RL;
  OxFFFF:
             VRF[vdst] = GVL;
                                                     3
APL_FRAG set_scalar(vdst, key):
  0xFFFF: RL = 0;
          RL = 1;
  kev:
  0xFFFF: VRF[vdst] = RL;
                                                    9
                                                    10
APL_FRAG save_last_bit(vdst, msk_b16, msk_bit):
                                                    11
  msk_b16: GVL = RL;
                                                    12
  msk_bit: VRF[vdst] = GVL;
                                                    13
                                                    14
                                                    15
APL_FRAG lsl_with_cin(vsrc, cin):
                                                    16
  0xFFFF: RL = VRF[vsrc];
                                                    17
  0xFFFF: VRF[vsrc] = RL_N;
                                                    18
  0x0001: RL = VRF[cin]:
                                                    19
  0x0001: VRF[vsrc] = RL:
                                                    20
                                                    21
                                                    22
                                                    23
                                                    24
                                                    25
                                                    26
                                                    27
                                                    28
```

0 let m = length(guery) let n = length(candidate)for q in [0...num\_queries-1]: precompute peq for c in [0...num\_candidates-1]:  $Pv = 1^m$  $Mv = 0^m$ score, min\_score = m for j in [0...n-1]: eq = peq[seed(j), i]Xv = eq | Mv $Xh = ((eq \& Pv) + Pv \land Pv) | eq$  $Ph = MV | \sim (Xh | PV)$ Mh = Pv & Xhif last bit of Ph is 1. score += 1if last bit of Mh is 1, score -= 1 if score < min score. min score = score shift Ph save old MSB of Ph shift Mh save old MSB of Mh  $Pv = Mh \mid \sim (Xv \mid Ph)$ Mv = Ph & Xv29



. . .

**R**ESULTS







Cornell University Computer Systems Laboratory

MOTIVATION • APU MICROARCHITECTURE • APU MICROCODING • APU FOR MYERS' ACCELERATION

Page 17 of 18

# Accelerating Seed Location Filtering in DNA Read Mapping Using a Commercial Compute-in-SRAM Architecture

- APU is an interesting platform for applications with massive parallelism, bitwise operations, narrow bitwidths, and high data reuse
- 2. APU is a promising way to accelerate filtering and other genomics workloads
- 3. APU's massive parallelism might prompt rethinking of traditional genomics algorithms



