

FROM CHIPS TO SYSTEMS - LEARN TODAY, CREATE TOMORROW

## UMOC: Unified Modular Ordering Constraints to Unify Cycle- and Register-Transfer-Level Modeling

Shunning Jiang, Yanghui Ou, Peitian Pan, Christopher Batten

Computer Systems Laboratory, Cornell University

DEC 5 - 9, 2021 🔶 SAN FRANCISCO, CALIFORNIA

## Hardware Design Trend

- Hardware Specialization!
- Heterogeneous System-on-Chips (SoC)





A12 Bionic – Apple













Figure 3.1.2: Xbox Series X SoC architecture block diagram.












## Cycle-Level Simulators/Models for Design Space Exploration








Cycle-level (CL) modeling:

- Approximate timing behaviors (e.g., cycle-level "N-cycle hit-latency cache")
- Analytical area, energy, timing models

CL models provide valuable insights to help make first-order design decisions.
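As an illustration of the "N-cycle hit-latency cache" mentioned above, here is a minimal Python sketch of a cycle-level model that only approximates timing. The class name `NCycleCache` and its `request`/`tick` methods are hypothetical and chosen for this example; every access is assumed to hit.

```python
class NCycleCache:
    """Hypothetical cycle-level cache: every access hits after a fixed latency."""

    def __init__(self, hit_latency):
        self.hit_latency = hit_latency
        self.data = {}        # addr -> value (functional state)
        self.in_flight = []   # list of (ready_cycle, addr)
        self.cycle = 0

    def request(self, addr):
        # The response becomes visible hit_latency cycles from now.
        self.in_flight.append((self.cycle + self.hit_latency, addr))

    def tick(self):
        # Advance one cycle and return the responses that are ready.
        self.cycle += 1
        ready = [a for (t, a) in self.in_flight if t <= self.cycle]
        self.in_flight = [(t, a) for (t, a) in self.in_flight if t > self.cycle]
        return [(a, self.data.get(a, 0)) for a in ready]


cache = NCycleCache(hit_latency=3)
cache.data[0x10] = 42
cache.request(0x10)
responses = [cache.tick() for _ in range(3)]
print(responses)  # the response appears on the third tick
```

No pipeline registers or tag arrays are modeled; only the latency is, which is what makes such a model cheap to write and fast to simulate.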



## Composing Cycle-Level/RTL Models for Design Space Exploration



CL/RTL Composition:

- Use some CL models for faster overall simulation
- Gradually replace CL models with RTL models







```
void Proc::tick()
{
    writeback();
    mem();
    execute();
    decode();
    fetch();
}
```



```
void Accel::tick()
{
   work();
   interface();
}
```




Figure: Processor pipeline (fetch, decode, execute, memory, writeback) and Accelerator (interface, work); stages communicate through queues via enq/deq.

```
void Tile::tick()
{
    // modular but inaccurate
    accel.tick();
    proc.tick();
}
```
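To see why ticking whole components in a fixed order can be inaccurate, consider the following toy Python sketch. It is an assumption-laden illustration, not the model from the talk: the three processes (`proc_send`, `accel_work`, `proc_recv`) and the two queues are hypothetical, and each queue is meant to add one cycle of latency (dequeue before enqueue within a cycle).

```python
from collections import deque


def round_trip_cycles(schedule):
    """Run the named processes in `schedule` order every cycle.

    Returns the cycle in which the processor receives the response to a
    request issued in cycle 0.
    """
    req_q, resp_q = deque(), deque()
    state = {"sent": False, "done_cycle": None}

    def proc_send(cycle):      # processor issues a single request
        if not state["sent"]:
            req_q.append("req")
            state["sent"] = True

    def accel_work(cycle):     # accelerator turns a request into a response
        if req_q:
            req_q.popleft()
            resp_q.append("resp")

    def proc_recv(cycle):      # processor consumes the response
        if resp_q and state["done_cycle"] is None:
            resp_q.popleft()
            state["done_cycle"] = cycle

    procs = {"proc_send": proc_send, "accel_work": accel_work,
             "proc_recv": proc_recv}
    for cycle in range(10):
        for name in schedule:
            procs[name](cycle)
        if state["done_cycle"] is not None:
            return state["done_cycle"]
    return None


# Fine-grained order: each queue's consumer runs before its producer.
fine_grained = ["proc_recv", "accel_work", "proc_send"]
# Modular orders: all of one component, then all of the other.
accel_first = ["accel_work", "proc_recv", "proc_send"]
proc_first = ["proc_recv", "proc_send", "accel_work"]

for order in (fine_grained, accel_first, proc_first):
    print(order, round_trip_cycles(order))
```

In this toy setup both modular orders let one of the two queues be traversed in zero cycles, so the round trip finishes a cycle early whichever component ticks first.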













- No seamless CL/RTL compositions
- PyMTL: manual CL ordering mixed with event-driven RTL
- SystemC: RTL/CL communication needs to go through a clock edge
- ... other ad-hoc approaches

```
@s.tick_cl
def block():
  # TODO: we might want to see if this ticking order makes sense
  if s.l0_enabled:
    s.icache_mem_req_adapter.xtick()
    s.icache_mem_resp_adapter.xtick()

  req_adapter.xtick()
  resp_adapter.xtick()

  if s.tmu:
    ...
```



## Unified Modular Ordering Constraints (UMOC)

Unified abstraction for signal-based RTL modeling and method-based CL modeling

$$\left. \begin{array}{c} x \text{ is a combinational wire} \\ A \text{ writes signal } x \\ B \text{ reads signal } x \end{array} \right\} \Longrightarrow \begin{array}{c} A \text{ precedes } B \\ (A < B) \end{array}$$




```
subcomponent subcomponent_instance_name (
  .clk     ( clk_sub   ), // input
  .rst_n   ( rst_n     ), // input
  .data_rx ( data_rx_1 ), // input  [9:0]
  .data_tx ( data_tx   )  // output [9:0]
);
```

$$\left. \begin{array}{c} q.\text{dequeue precedes } q.\text{enqueue} \\ A \text{ calls } q.\text{dequeue} \\ B \text{ calls } q.\text{enqueue} \end{array} \right\} \Longrightarrow \begin{array}{c} A \text{ precedes } B \\ (A < B) \end{array}$$

```
void Proc::decode() {
  auto i = FD_q.dequeue();
  if (i.is_accel_inst)
    Accel_q.enqueue(...);
  DX_q.enqueue(...);
}

void Proc::execute() {
  auto i = DX_q.dequeue();
  switch (i.type) {
    ...
  }
  XM_q.enqueue(...);
}

void Proc::tick()
{
  writeback();
  mem();
  execute();
  decode();
  fetch();
}
```
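The two inference rules can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PyMTL3 code: processes are plain strings, and the helper names `implicit_edges` and `explicit_edges` are hypothetical.

```python
def implicit_edges(writes, reads):
    """Signal rule: if A writes combinational wire x and B reads x, then A < B.

    writes / reads: dict mapping process name -> set of signal names.
    """
    edges = set()
    for a, written in writes.items():
        for b, read in reads.items():
            if a != b and written & read:
                edges.add((a, b))
    return edges


def explicit_edges(method_order, calls):
    """Method rule: if m1 precedes m2, A calls m1 and B calls m2, then A < B.

    method_order: set of (m1, m2) pairs; calls: dict process -> set of methods.
    """
    edges = set()
    for m1, m2 in method_order:
        for a, called_a in calls.items():
            for b, called_b in calls.items():
                if a != b and m1 in called_a and m2 in called_b:
                    edges.add((a, b))
    return edges


# Signal-based example: A drives wire x, B samples it.
print(implicit_edges({"A": {"x"}}, {"B": {"x"}}))

# Method-based example modeled on the decode/execute code above.
order = {("DX_q.dequeue", "DX_q.enqueue")}
calls = {
    "decode": {"FD_q.dequeue", "DX_q.enqueue"},
    "execute": {"DX_q.dequeue", "XM_q.enqueue"},
}
print(explicit_edges(order, calls))
```

The second call yields the edge from `execute` to `decode`, which matches the position of `execute()` before `decode()` in the hand-written `Proc::tick()`.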




- We insert a queue with deq < enq into the accelerator
  - The interface process invokes deq
  - Expose its enq method to the parent tile
  - Pass the enq method to the processor
  - The decode process invokes enq
- Global scheduler: interface before decode
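The steps above can be mimicked with plain Python objects. In this hedged sketch the classes `Queue`, `Accel`, `Proc`, and `Tile` are hypothetical stand-ins, and the order inside `Tile.tick` is written by hand to show the schedule that a global scheduler would derive from deq < enq.

```python
class Queue:
    """Hypothetical CL queue whose deq method must precede enq in a cycle."""

    def __init__(self):
        self.items = []

    def enq(self, msg):
        self.items.append(msg)

    def deq(self):
        return self.items.pop(0) if self.items else None


class Accel:
    def __init__(self):
        self.q = Queue()
        self.enq = self.q.enq      # expose enq to the parent tile
        self.received = []

    def interface(self):           # the interface process invokes deq
        msg = self.q.deq()
        if msg is not None:
            self.received.append(msg)


class Proc:
    def __init__(self, accel_enq):
        self.accel_enq = accel_enq  # enq method handed down by the tile
        self.pending = ["inst0", "inst1"]

    def decode(self):              # the decode process invokes enq
        if self.pending:
            self.accel_enq(self.pending.pop(0))


class Tile:
    def __init__(self):
        self.accel = Accel()
        self.proc = Proc(self.accel.enq)  # pass enq to the processor

    def tick(self):
        # deq < enq implies: interface before decode.
        self.accel.interface()
        self.proc.decode()


tile = Tile()
for _ in range(3):
    tile.tick()
print(tile.accel.received)  # ['inst0', 'inst1']
```

Because `interface` runs before `decode`, a message enqueued in one cycle is only seen by the accelerator in the next cycle, giving the queue a one-cycle latency.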




## Seamless CL/RTL Composition

- Creating the Unified Directed Graph (UDG)
  - Edges include implicit and explicit ordering constraints
  - Loops between RTL processes are allowed
  - CL processes are not allowed to appear in any loop
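The loop rule can be checked with a simple reachability test. The sketch below assumes the UDG is given as an adjacency dictionary and that the set of CL processes is known; the function names are hypothetical.

```python
def reachable_from(graph, start):
    """Return every node reachable from `start` by following one or more edges."""
    seen, stack = set(), list(graph.get(start, ()))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(graph.get(v, ()))
    return seen


def cl_processes_in_loops(graph, cl_procs):
    """A process lies on a loop exactly when it can reach itself."""
    return {p for p in cl_procs if p in reachable_from(graph, p)}


# Legal: the loop rtl_a <-> rtl_b contains only RTL processes.
legal = {"rtl_a": ["rtl_b"], "rtl_b": ["rtl_a", "cl_x"], "cl_x": []}
# Illegal: cl_x sits on a loop with rtl_a.
illegal = {"cl_x": ["rtl_a"], "rtl_a": ["cl_x"]}

print(cl_processes_in_loops(legal, {"cl_x"}))
print(cl_processes_in_loops(illegal, {"cl_x"}))
```

An empty result means the graph satisfies the rule; any returned name identifies a CL process whose constraints would have to be revised.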




## Scheduling the Unified Directed Graph

- Some properties of the UDG:
  - CL processes execute exactly once per cycle
  - RTL processes need to execute until values stabilize
- If the UDG has no cycles
  - Topological sort that statically schedules all processes in the DAG
- If the UDG has cycles
  - Strongly connected components (SCC) algorithm
    - Shrink each cycle into a single node
  - Execute the DAG of SCCs
    - Topological sort of the DAG
    - Iteratively execute each SCC
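A compact way to picture this scheduling scheme is the Python sketch below. It uses Tarjan's SCC algorithm as one possible choice (the slides do not name a specific algorithm), and the process names, signal dictionary, and `run_cycle` helper are assumptions made for illustration.

```python
def tarjan_sccs(graph):
    """Return the SCCs of `graph` in topological order of the condensed DAG."""
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs[::-1]  # Tarjan emits SCCs in reverse topological order


def run_cycle(graph, procs, signals):
    """Simulate one cycle: single nodes run once, loops run to a fixed point."""
    for scc in tarjan_sccs(graph):
        if len(scc) == 1 and scc[0] not in graph.get(scc[0], ()):
            procs[scc[0]](signals)
        else:
            while True:
                before = dict(signals)
                for p in scc:
                    procs[p](signals)
                if signals == before:   # values have stabilized
                    break


# Hypothetical UDG: CL source -> RTL loop (r1 <-> r2) -> CL sink.
graph = {"src": ["r1"], "r1": ["r2", "sink"], "r2": ["r1"], "sink": []}
calls = {name: 0 for name in graph}

def src(s):
    calls["src"] += 1
    s["a"] = 3

def r1(s):
    calls["r1"] += 1
    s["b"] = s["a"]
    s["d"] = s["c"] * 2

def r2(s):
    calls["r2"] += 1
    s["c"] = s["b"] + 1

def sink(s):
    calls["sink"] += 1
    s["out"] = s["d"]

signals = {"a": 0, "b": 0, "c": 0, "d": 0, "out": None}
run_cycle(graph, {"src": src, "r1": r1, "r2": r2, "sink": sink}, signals)
print(signals["out"], calls)
```

The two CL processes run exactly once, while the RTL processes in the loop are re-executed until a full pass leaves every signal unchanged.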





## UMOC Implemented in PyMTL3

- PyMTL3 is a state-of-the-art Python-based hardware generation and simulation framework
- PyMTL3 is highly extensible thanks to its modular framework architecture
  - Frontend: Embedded domain specific language (EDSL) modeling primitives
  - IR: Native in-memory intermediate representation (NIMIR)
  - Backend: Passes that systematically manipulate NIMIR
- UMOC implemented in PyMTL3:
  - EDSL modeling primitives
  - NIMIR data structures
  - Graph generation and scheduling passes



## UMOC Implementation of PyMTL3 EDSL Primitives

- UMOC PyMTL3 EDSL primitives:
  - Inherit from Component
  - InPort, OutPort, Wire, CalleePort, CallerPort
  - @update\_ff, @update, @update\_once, @method\_port
  - add\_constraints

```
class RegIncrRTL( Component ):
  def construct( s ):
    s.in_ = InPort (32)
    s.out = OutPort(32)
    s.reg = Wire(32)

    @update_ff
    def seq_reg():
      s.reg <<= s.in_

    @update
    def comb_out():
      s.out @= s.reg + 1

class RegIncrCL( Component ):
  def construct( s ):
    # Model sequential behavior!
    s.add_constraints(
      M(s.read) < M(s.write),
    )

  @method_port
  def read( s ):
    return s.v + 1

  @method_port
  def write( s, v ):
    s.v = v

class RegIncrCLRTL( Component ):
  def construct( s ):
    s.write = CalleePort()
    s.out = OutPort(32)
    s.r1 = RegIncrCL()
    s.r2 = RegIncrRTL()
    connect( s.write, s.r1.write )
    connect( s.out, s.r2.out )

    @update_once
    def send_to_r2():
      s.r2.in_ @= s.r1.read()
```






- Supporting UMOC in PyMTL3 NIMIR elaboration
  - Collecting all the update blocks and ordering constraints
  - Exposing all metadata through APIs
- UMOC passes
  - GenUDGPass to generate the unified directed graph
    - Update blocks as vertices, explicit/implicit constraints as edges
  - UMOCSchedulingPass to schedule the UDG
- Users only need to set local explicit ordering constraints; no global scheduling is required.






- 5-stage RTL processor, 3-stage CL processor
- RTL/CL checksum accelerators
- Manual: execute everything in the accelerator before the processor (or vice versa)

```
void Tile::tick()
{
    // modular
    accel.tick();
    proc.tick();
}
```




| Mechanism             | Composition          | #Cycles | Deviation | Remarks          |
|-----------------------|----------------------|---------|-----------|------------------|
| Event-driven          | RTL Proc + RTL Accel | 565     | -         | baseline         |
| UMOC                  | RTL Proc + RTL Accel | 565     | 0%        | same as baseline |
| UMOC                  | CL Proc + CL Accel   | 541     | 4%        | due to 3-stage   |
| Manual (Proc < Accel) | CL Proc + CL Accel   | 416     | 26%       | modular sub-tick |
| Manual (Accel < Proc) | CL Proc + CL Accel   | 416     | 26%       | modular sub-tick |
| UMOC                  | CL Proc + RTL Accel  | 541     | 4%        | same as CL+CL    |
| UMOC                  | RTL Proc + CL Accel  | 565     | 0%        | same as RTL+RTL  |
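The Deviation column appears consistent with the relative cycle-count difference from the event-driven baseline, rounded to a whole percent. That definition is an assumption inferred from the numbers in the table; the short check below reproduces it.

```python
BASELINE_CYCLES = 565  # event-driven RTL Proc + RTL Accel, from the table


def deviation_percent(cycles, baseline=BASELINE_CYCLES):
    """Relative difference from the baseline, rounded to a whole percent."""
    return round(abs(baseline - cycles) / baseline * 100)


rows = [("UMOC, CL Proc + CL Accel", 541),
        ("Manual, CL Proc + CL Accel", 416),
        ("UMOC, RTL Proc + CL Accel", 565)]
for label, cycles in rows:
    print(f"{label}: {deviation_percent(cycles)}%")
```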



## CL/RTL Composition Helps Chip Tape-outs

- Main memory only needs CL
- CL shared MDU/FPU for DSE
- CL cache for DSE
- CL on-chip networks for DSE
- Processor IP already developed



## Takeaways & Conclusion

- UMOC's explicit ordering constraints achieve model fidelity and scheduling modularity at once.
- UMOC's implicit and explicit constraints achieve seamless CL/RTL composition.
- UMOC has been implemented in PyMTL3. Many IPs have been built using the UMOC scheme.

