# ECE 5745 Complex Digital ASIC Design Topic 10: CMOS Sequential State

## School of Electrical and Computer Engineering Cornell University

revision: 2022-02-22-20-38

| Section | Title                        |
|---------|------------------------------|
| 1       | Basic Flip-Flop              |
| 2       | Delay                        |
| 2.1     | Setup Time                   |
| 2.2     | Clock-to-Q Propagation Delay |
| 2.3     | Hold Time                    |
| 2.4     | Internal Clock Delay         |
| 3       | Energy                       |

Copyright © 2022 Christopher Batten. All rights reserved. This handout was prepared by Prof. Christopher Batten at Cornell University for ECE 5745 Complex Digital ASIC Design. Download and use of this handout is permitted for individual educational noncommercial purposes only. Redistribution either in part or in whole via both commercial or non-commercial means requires written permission.

# 1. Basic Flip-Flop

• Basic transmission gate master/slave flip-flop



- Why do we add extra inverters at the input and output?
- We want to characterize the: setup time, clock-to-Q propagation delay, and hold time
- Logical effort will not directly be of much use since transmission gates couple the delay of various gates together!
- We will need to directly use RC modeling, but first we need an RC model of a transmission gate

#### **RC Model for Transmission Gate**



- Assume size of NMOS and PMOS are both minimum width
- Assume a transistor passing a weak value has 2x worse effective resistance
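
One way to turn these two assumptions into numbers is sketched below. Beyond what is stated above, it assumes that a minimum-width NMOS passing a strong 0 has effective resistance R and that a minimum-width PMOS passing a strong 1 has effective resistance 2R (lower hole mobility). The names `parallel` and `tgate_resistance` are illustrative only.

```python
def parallel(r_a, r_b):
    """Equivalent resistance of two resistors in parallel."""
    return (r_a * r_b) / (r_a + r_b)

# Effective resistances in units of R for minimum-width devices.
R_NMOS_STRONG = 1.0  # NMOS passing a strong 0 (assumption)
R_PMOS_STRONG = 2.0  # PMOS passing a strong 1 (assumption)
WEAK_PENALTY = 2.0   # from the text: a weak value is 2x worse

def tgate_resistance(passing_one):
    """Effective resistance of a min-width transmission gate, in units of R."""
    if passing_one:
        # PMOS passes a strong 1 in parallel with the NMOS passing a weak 1.
        return parallel(R_PMOS_STRONG, WEAK_PENALTY * R_NMOS_STRONG)
    # NMOS passes a strong 0 in parallel with the PMOS passing a weak 0.
    return parallel(R_NMOS_STRONG, WEAK_PENALTY * R_PMOS_STRONG)

r_one = tgate_resistance(True)    # 2R || 2R
r_zero = tgate_resistance(False)  # 1R || 4R
print(f"passing 1: {r_one:.2f} R, passing 0: {r_zero:.2f} R")
```

Under these assumptions the transmission gate looks like roughly R in both directions, so a single resistor of about R is a reasonable hand-calculation model.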






# 2. Delay

**Setup Time** Latest the input can change before the edge while still capturing the value

**Hold Time** Earliest the input can change after the edge while still capturing the value

**Propagation Delay** Delay after the clock rises until the output is stable

**Contamination Delay** Minimum delay after the clock rises until the output changes

## 2.1. Setup Time

• To calculate the setup time we ask ourselves, "How far does the input signal need to propagate so that we can reliably flip the master latch before the clock edge?"





## 2.2. Clock-to-Q Propagation Delay

• To calculate the propagation delay, we ask ourselves, "How long does it take after the rising edge to propagate the internal state?"




## 2.3. Hold Time

• Hold time is how long we need to keep the input stable *after* the rising edge in order to prevent corrupting the state



- If we assume $\theta$ and $\overline{\theta}$ change instantaneously on the rising clock edge, then we actually have a *negative* hold time
- If the input changes right after the edge, then by the time it gets to the first transmission gate (gate B) that gate is already off (an open circuit), so the input signal cannot corrupt the state
- In fact, the input can change a little *before* the edge since it takes some time to propagate through the first inverter

## 2.4. Internal Clock Delay

• What if we assume there is some delay between the clock input pin of our cell and the actual  $\theta$  and  $\overline{\theta}$  signals?



- How does this impact setup time, propagation delay, hold time?
- Remember that all three metrics are defined with respect to the clock pin of the cell *not* the internal  $\theta$  and  $\overline{\theta}$  signals
- First calculate the delay through the local clock tree (clock insertion delay), then factor this into these three metrics

### Calculate the delay through the local clock tree

• What is the output load on the local clock tree?

• Essentially what we have done is shift the sampling window
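
The shift of the sampling window can be sketched as follows. This is a minimal illustration, assuming the internal clock delay is a single fixed value and that the setup, clock-to-Q, and hold numbers were first computed relative to the internal clock signals. The function name and the example numbers are hypothetical.

```python
def shift_to_clock_pin(t_setup_int, t_cq_int, t_hold_int, t_clk_int):
    """Re-reference flip-flop timing from the internal clocks to the clock pin.

    All arguments share one time unit (e.g. ps). t_clk_int is the delay from
    the cell's clock pin to the internal clock signals.
    """
    t_setup_pin = t_setup_int - t_clk_int  # data may arrive later at the pin
    t_cq_pin = t_cq_int + t_clk_int        # output appears later
    t_hold_pin = t_hold_int + t_clk_int    # data must be held longer
    return t_setup_pin, t_cq_pin, t_hold_pin

# Hypothetical values in ps, chosen only for illustration.
setup_pin, cq_pin, hold_pin = shift_to_clock_pin(
    t_setup_int=40.0, t_cq_int=50.0, t_hold_int=-10.0, t_clk_int=15.0
)

# The width of the sampling window (setup + hold) is unchanged by the shift.
window_before = 40.0 + (-10.0)
window_after = setup_pin + hold_pin
print(setup_pin, cq_pin, hold_pin, window_before, window_after)
```

Setup time improves by the internal clock delay while hold time and clock-to-Q delay get worse by the same amount, so a negative internal hold time can become positive at the pin.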



# 3. Energy

- Dynamic energy in a flip-flop comes from two sources:
  - toggling the data lines
  - toggling the clock
- First, let's label all of the gate and parasitic caps



• Estimate worst case energy per write/read access by calculating the maximum switched cap while assuming every node toggles



- Let's assume the following:
  - clock rate of 500MHz
  - data activity factor of 0.1
  - clock activity factor of 2 (toggles twice per cycle!)

$$\begin{split}
E_{\text{data}} &= \alpha \tfrac{1}{2} C V_{dd}^2 = (0.1) \times \tfrac{1}{2} \times 52C \times 0.5\,\text{fF/C} \times (1\,\text{V})^2 = 1.3\,\text{fJ} \\
P_{\text{data}} &= \alpha f \tfrac{1}{2} C V_{dd}^2 = (0.1)(0.5 \times 10^9) \times \tfrac{1}{2} \times 52C \times 0.5\,\text{fF/C} \times (1\,\text{V})^2 = 0.65\,\mu\text{W} \\
E_{\text{clock}} &= \alpha \tfrac{1}{2} C V_{dd}^2 = (2.0) \times \tfrac{1}{2} \times 20C \times 0.5\,\text{fF/C} \times (1\,\text{V})^2 = 10\,\text{fJ} \\
P_{\text{clock}} &= \alpha f \tfrac{1}{2} C V_{dd}^2 = (2.0)(0.5 \times 10^9) \times \tfrac{1}{2} \times 20C \times 0.5\,\text{fF/C} \times (1\,\text{V})^2 = 5\,\mu\text{W}
\end{split}$$
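
The per-flip-flop arithmetic can be checked with a short script. This is a sketch that uses only the values given above (52C of data cap, 20C of clock cap, 0.5 fF per unit C, 1 V supply, 500 MHz clock); the function and constant names are illustrative.

```python
C_UNIT_F = 0.5e-15  # farads per unit capacitance C (0.5 fF/C)
VDD = 1.0           # supply voltage in volts
F_CLK = 0.5e9       # clock frequency in Hz (500 MHz)

def dynamic_energy(alpha, cap_units):
    """Energy per cycle in joules: alpha * 1/2 * C * Vdd^2."""
    return alpha * 0.5 * cap_units * C_UNIT_F * VDD ** 2

def dynamic_power(alpha, cap_units):
    """Average power in watts: energy per cycle times clock frequency."""
    return F_CLK * dynamic_energy(alpha, cap_units)

e_data = dynamic_energy(0.1, 52)
p_data = dynamic_power(0.1, 52)
e_clock = dynamic_energy(2.0, 20)
p_clock = dynamic_power(2.0, 20)

print(f"E_data  = {e_data * 1e15:.2f} fJ, P_data  = {p_data * 1e6:.2f} uW")
print(f"E_clock = {e_clock * 1e15:.2f} fJ, P_clock = {p_clock * 1e6:.2f} uW")
print(f"clock/data ratio = {e_clock / e_data:.1f}x")
```

With these inputs the clock consumes roughly 7.7 times the energy of the data lines, even though the clock's switched capacitance is less than half as large.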

- Once we factor in activity factor, clock energy/power is significantly higher than data energy/power
- Let's estimate the total data/clock energy for a simple processor



• With 512 bits, the total data/clock energy/power is:

$$\begin{split}
E_{\text{data}} &= 512 \times 1.3\,\text{fJ} = 666\,\text{fJ} \\
P_{\text{data}} &= 512 \times 0.65\,\mu\text{W} = 333\,\mu\text{W} \\
E_{\text{clock}} &= 512 \times 10\,\text{fJ} = 5.1\,\text{pJ} \\
P_{\text{clock}} &= 512 \times 5\,\mu\text{W} = 2.6\,\text{mW}
\end{split}$$

• This ignores the clock tree; let's try to estimate the power consumed in the clock tree



- What is the total clock load?
- How should we size the inverters in the clock tree to reduce delay?
- Note that skew is much more important than absolute delay, but we still need a relatively fast tree to avoid very bad slew rates



• What about a four-level clock tree?



• So our four-level clock tree is ~5% faster, but at what cost?

$$\begin{split}
C_{\text{in},L3} &= (1/3.52) \times 96C = 27.3C \approx 27C \\
C_{\text{in},L2} &= (1/3.52) \times 27.3C \times 4 = 31C \approx 30C \\
C_{\text{in},L1} &= (1/3.52) \times 31C \times 4 = 35C \approx 36C \\
C_{\text{in},L0} &= (1/3.52) \times 35C = 9.94C \approx 9C
\end{split}$$

• Total switched cap?

$$\begin{split}
C_{\text{sw},L3} &= 27C \times 2 \times 16 = 864C \\
C_{\text{sw},L2} &= 30C \times 2 \times 4 = 240C \\
C_{\text{sw},L1} &= 36C \times 2 \times 1 = 72C \\
C_{\text{sw},L0} &= 9C \times 2 \times 1 = 18C
\end{split}$$

• Total $C_{\text{sw}}$ for the four-stage tree is 1194C

• So the four-stage tree is ~5% faster but uses ~1.7× more energy. Let's stick with the three-stage tree!
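
The four-level sizing and switched-cap totals can be reproduced with a short script. This is a sketch under stated assumptions: the stage effort of 3.52, the 96C leaf load, the branching of 4, and the rounded input caps (27C, 30C, 36C, 9C) are taken from the notes, while the level labels L3 (leaves) through L0 (root) are chosen here for illustration. The factor of 2 also comes from the notes; one reading, assumed here, is that it counts each inverter's input cap plus an equal parasitic output cap.

```python
STAGE_EFFORT = 3.52  # stage effort used in the notes

def input_cap(load_cap, fanout, stage_effort=STAGE_EFFORT):
    """Input cap (in units of C) of an inverter driving `fanout` equal loads."""
    return fanout * load_cap / stage_effort

# Work backwards from the leaves toward the root, using the same
# intermediate values as the notes.
c_l3 = input_cap(96.0, 1)  # each leaf inverter drives a 96C load
c_l2 = input_cap(27.3, 4)  # each L2 inverter drives four L3 inverters
c_l1 = input_cap(31.0, 4)  # the L1 inverter drives four L2 inverters
c_l0 = input_cap(35.0, 1)  # the root drives the single L1 inverter

# (rounded input cap in C, number of inverters at that level), from the notes.
levels = {"L3": (27, 16), "L2": (30, 4), "L1": (36, 1), "L0": (9, 1)}

switched = {name: cap * 2 * count for name, (cap, count) in levels.items()}
total_switched = sum(switched.values())

print(f"{c_l3:.2f} {c_l2:.2f} {c_l1:.2f} {c_l0:.2f}")
print(switched, total_switched)
```

The leaf level dominates: 864C of the 1194C total comes from the sixteen L3 inverters, which is why adding a level of buffering near the leaves is expensive in energy.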



- Clock gating
  - reduces activity on L2 of clock tree
  - reduces activity on local intra-cell clocking logic
  - increases switched cap when register is enabled
- Estimate extra switched cap due to gating cell on clock path when register is continuously enabled



- Estimate switched cap when continuously disabled
  - Extra 20C switched cap on global clock tree still toggles
  - Final clock inverter in clock tree does not toggle (576C)
  - Intra-cell clock logic does not toggle ( $512 \times 20C = 10,240C$ )
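
The three items above can be combined into a rough estimate of the clock capacitance and power that stop toggling when the register is continuously disabled. This is a sketch that uses only the numbers listed above together with the earlier assumptions (0.5 fF per unit C, 1 V supply, 500 MHz clock, clock activity factor of 2). It ignores any change in data power, and the variable names are illustrative.

```python
C_UNIT_F = 0.5e-15  # farads per unit capacitance C
VDD = 1.0           # supply voltage in volts
F_CLK = 0.5e9       # clock frequency in Hz
ALPHA_CLK = 2.0     # clock toggles twice per cycle

def clock_power(cap_units):
    """Clock power in watts for a given switched cap in units of C."""
    return ALPHA_CLK * F_CLK * 0.5 * cap_units * C_UNIT_F * VDD ** 2

NUM_BITS = 512
intra_cell_cap = NUM_BITS * 20  # intra-cell clock logic, stops toggling
final_level_cap = 576           # final clock inverters, stop toggling
gating_extra_cap = 20           # extra cap on global tree, still toggles

idle_cap = intra_cell_cap + final_level_cap  # cap that no longer switches
net_saved_cap = idle_cap - gating_extra_cap  # relative to no clock gating
p_saved = clock_power(net_saved_cap)

print(f"idle cap = {idle_cap}C, net saved = {net_saved_cap}C")
print(f"clock power saved = {p_saved * 1e3:.2f} mW")
```

The 20C overhead of the gating cell is small compared with the more than 10,000C of clock capacitance that stops switching, so gating pays off whenever the register is disabled for more than a small fraction of cycles.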

• Let's put all of this together to estimate the impact of clock gating

|                  | No Clock Gating    | w/ Clock Gating (enabled)       | w/ Clock Gating (disabled)       |
|------------------|--------------------|---------------------------------|----------------------------------|
| Data             |                    |                                 |                                  |
| Intra Cell Clock |                    |                                 |                                  |
| L0,L1 Clk Tree   |                    |                                 |                                  |
| L2 Clk Tree      |                    |                                 |                                  |
| Gating Logic     |                    |                                 |                                  |
| Total            |                    |                                 |                                  |

- In the worst case where we add clock gating but never actually gate any registers, we add a power overhead of:
- In the best case where we add clock gating and we gate the clock every cycle, we reduce the total power by: