# Impact of Micro-Architectural Optimizations on Post Layout Power

## Major Project Report

Submitted in partial fulfillment of the requirements for the degree of

Master of Technology in Electronics & Communication Engineering (VLSI Design)

By

Jitender (14MECV06)



Electronics & Communication Engineering Branch, Electrical Engineering Department, Institute of Technology, Nirma University, Ahmedabad-382 481

May 2016


Under the guidance of

**External Project Guide:** Mrs. Divya Parihar, Member Consulting Staff, Calypto System Division, Mentor Graphics Corp., Noida

**Internal Project Guide:** Dr. N.M. Devashrayee, PG Coordinator (VLSI Design), Nirma University, Ahmedabad




# Certificate

This is to certify that the Major Project entitled "Impact of Micro-Architectural Optimizations on Post Layout Power" submitted by Jitender (14MECV06), towards the partial fulfillment of the requirements for the degree of Master of Technology in VLSI Design, Nirma University, Ahmedabad is the record of work carried out by him under our supervision and guidance. In our opinion, the submitted work has reached a level required for being accepted for examination. The results embodied in this major project, to the best of our knowledge, haven't been submitted to any other university or institution for award of any degree or diploma.

Date:

Place: Ahmedabad

Internal Guide: **Dr. N.M. Devashrayee** (Professor, EC)

Program Co-ordinator: **Dr. N. M. Devashrayee** (Professor, EC)

Director: **Dr. P. N. Tekwani** (Head of EE Dept.) (Director, IT-NU)



# Declaration

This is to certify that

1. The thesis comprises my original work towards the degree of Master of Technology in VLSI Design at Nirma University and has not been submitted elsewhere for a degree.
2. Due acknowledgment has been made in the text to all other material used.

Jitender (14MECV06)

## Acknowledgements

It gives me immense pleasure to acknowledge **Mr. Abhishek Ranjan**, Engineering Director, and **Mrs. Divya Parihar**, Member Consulting Staff, Calypto System Division, Mentor Graphics Corp., Noida, for providing me with a platform to explore my abilities.

It is my pleasure to express my deep sense of gratitude towards **Dr. N.M. Devashrayee**, PG Coordinator (VLSI Design), Institute of Technology, Nirma University, for being a timely guide and a source of inspiration during the project.

I would also like to thank **Dr. P.N. Tekwani**, Head of the Electrical Engineering Department, for allowing me to take up the thesis work and for his guidance during the review process.

I would like to express my gratitude and sincere thanks to my Calypto System Division team members Ms. Kriti Kumari, Ms. Neha Babel, Mrs. Sakshi Choudhary, Mr. Roopendra Singh Yadav and Mrs. Mahima Jain for helping me carry out my final year project.

Jitender (14MECV06)

## Abstract

Power consumption is one of the top concerns of Very Large Scale Integration (VLSI) circuit design, for which Complementary Metal Oxide Semiconductor (CMOS) is the primary technology. Today's focus on low power is not only due to the recent growing demands of mobile applications; even before the mobile era, power consumption was a fundamental problem. To solve the power dissipation problem, many researchers have proposed ideas ranging from the device level to the architectural level and above. However, there is no universal way to avoid tradeoffs between power, delay and area, so designers are required to choose appropriate techniques that satisfy application and product needs.[1]

In general, low power VLSI design can be achieved at all levels of the design hierarchy (system, algorithm, architecture, circuit, logic, device and technology levels), but optimizations made at higher levels of abstraction result in comparatively higher power savings. This report presents the implementation of the Shift Register to Circular Buffer (SR2CB) micro-architectural optimization for low power VLSI design. The optimization reduces the switching activity of the flops and thus reduces dynamic power consumption. A circular buffer is the functional equivalent of a shift register with fewer flops toggling. In a shift register (Serial In Serial Out) all registers toggle even though only one register is read or written, whereas a circular buffer is implemented such that only one register toggles at any time and therefore consumes less power. In this work the impact of the optimization is analyzed not only on RTL power but also on post-layout power. The implementation, overhead and impact of the SR2CB optimization on post-layout power are presented in this report.

# Contents

- Certificate
- Declaration
- Acknowledgements
- Abstract
- Abbreviations
- 1 Introduction
  - 1.1 Motivation
  - 1.2 Objective
- 2 Literature Survey
  - 2.1 Sources of Power Dissipation
    - 2.1.1 Factors of Power Dissipation
  - 2.2 Clock Gating
    - 2.2.1 Combinational Clock Gating
    - 2.2.2 Sequential Clock Gating
  - 2.3 Clock Gating Efficiency
- 3 Circular Buffer
  - 3.1 Implementation
  - 3.2 Flop's Efficiency Change from SR2CB
- 4 Flow & EDA Tools Used
  - 4.1 Methodology Adopted
  - 4.2 Flow Used
- 5 Experiment Results & Observations
  - 5.1 Experiment Results
  - 5.2 Observations
    - 5.2.1 Clock tree configuration of SR vs CB
    - 5.2.2 Area Overheads from SR2CB Optimization
    - 5.2.3 Combinational Power Overhead from SR2CB Optimization
    - 5.2.4 Analysis of Power Numbers
- 6 Conclusion
  - 6.1 Conclusion
- References

# List of Figures

- 1.1 Shift Register
- 2.1 Combinational Clock Gating
- 2.2 Sequential Clock Gating
- 2.3 Clock Gating Efficiency
- 2.4 Average CG Efficiency for design
- 3.1 Circular Buffer
- 3.2 RTL Snippet of Circular Buffer
- 4.1 Experiment Methodology
- 4.2 RTL to Post Layout Netlist Flow
- 4.3 Power Analysis Flow
- 5.1 Comparison b/w SR & CB Clock Tree
- 5.2 Area Reports of SR & CB
- 5.3 Combinational Power Reports for SR & CB

## Abbreviations

- **SR** Shift Register
- **CB** Circular Buffer
- **SR2CB** Shift Register to Circular Buffer
- **SISO** Serial-In Serial-Out
- **EDA** Electronic Design Automation
- **SPEF** Standard Parasitic Exchange Format
- **PA** Power Analysis

# Introduction

## 1.1 Motivation

Historically, VLSI designers have focused on increasing the speed and reducing the area of digital systems. However, the evolution of portable systems and advanced deep sub-micron fabrication technologies has made power dissipation another critical design factor. Today, power matters from a number of points of view. In portable applications, products normally run off batteries. While battery technology has improved markedly over the years, a battery of a given weight and size still has a fixed energy capacity. For example, a pair of rechargeable AA batteries has an energy capacity of about 7 W-hr, and a good lithium-ion laptop battery has an energy density of about 80 W-hr/lb. Inevitably, the battery runs down and needs recharging or replacement. Product designers want to extend battery life while simultaneously adding features and reducing size, so creating low-power IC designs is key. Low-power design also reduces cooling cost and increases reliability, especially for high-density systems. Moreover, it reduces the weight and size of portable devices.

Low power VLSI design can be achieved at all levels of the design hierarchy (system, algorithm, architecture, circuit, logic, device and technology levels), but optimizations made at higher levels of abstraction generally yield comparatively higher power savings. SR2CB is a micro-architectural level optimization, so a significant amount of power can be saved with it. A Shift Register (Serial In Serial Out) is a group of flops connected in a chain so that the output of one flop becomes the input of the next. All flops are driven by a common clock, and all are set or reset simultaneously. Shift registers are used to store data and delay it by one clock cycle per stage.



Figure 1.1: Shift Register

## 1.2 Objective

The main advantage of the SISO shift register is that it is very easy to implement. From a power consumption perspective, however, its biggest drawback is its high switching activity: in a SISO shift register all registers toggle even though only one register is read or written. Dynamic power is proportional to switching activity:

$$\text{Power} \propto \text{Switching Activity}$$

To get an output from the shift register, the number of flops that toggle is equal to the depth (length) of the shift register. To minimize power consumption, the switching activity therefore needs to be minimized. Applying clock gating to the shift register does not help much, because clock gating is useful only where the flops are disabled most of the time. Shift registers are used in the data paths of VLSI designs, where their flops are enabled all the time and are expected to transfer data on every clock cycle, so clock gating cannot reduce the switching activity of a shift register.

Because the most popular technique for reducing dynamic power is not very effective for shift registers, a different technique is needed to reduce their switching activity. A circular buffer, which is functionally equivalent to the shift register but has lower switching activity, can be used in place of a shift register to achieve a low power VLSI design.

The objective of the experiment is to transform the RTL of the SR into a CB and take both designs through the full VLSI ASIC design flow. Power reports are then generated with a power analysis EDA tool using the RTL, the netlists and their corresponding SPEFs. The power of the transformed design (CB) is expected to be lower than that of the original design (SR) at all levels (RTL, pre-layout and post-layout).

# Literature Survey

## 2.1 Sources of Power Dissipation

Power dissipation in CMOS circuits comes from two components:

- Dynamic dissipation due to
  1. charging and discharging load capacitances as gates switch
  2. "short-circuit" current while both pMOS and nMOS stacks are partially ON
- Static dissipation due to
  1. subthreshold leakage through OFF transistors
  2. gate leakage through gate dielectric
  3. junction leakage from source/drain diffusions
  4. contention current in ratioed circuits

Putting this together gives the total power of a circuit:

$$P_{total} = P_{dynamic} + P_{static}$$

Power can also be considered in active, standby, and sleep modes.

- *Active power* is the power consumed while the chip is doing useful work. It is usually dominated by switching power.
- *Standby power* is the power consumed while the chip is idle. If clocks are stopped and ratioed circuits are disabled, the standby power is set by leakage.
- In sleep mode, the supplies to unneeded circuits are turned off to eliminate leakage. This drastically reduces the *sleep power* required, but the chip requires time and energy to wake up so sleeping is only viable if the chip will idle for long enough.[2]

#### 2.1.1 Factors of Power Dissipation

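$$P_{switching} = \alpha C_L V_{dd}^2 f$$
$$P_{sc} = \frac{\beta}{12} (V_{dd} - 2V_t)^3 \tau f$$
$$P_{leakage} = V_{dd} I_{leakage}$$

Here $\alpha$ is the switching activity factor, $C_L$ the load capacitance, $V_{dd}$ the supply voltage, $f$ the clock frequency, $\beta$ the transistor gain factor, $V_t$ the threshold voltage, $\tau$ the input rise/fall time and $I_{leakage}$ the leakage current.

Thus power dissipation depends on the following factors: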

- Supply Voltage
- Physical Capacitance
- Switching Activity
- Threshold Voltage
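As a rough worked example of the three expressions above, the sketch below plugs in an operating point. Every number in it (activity factor, capacitance, voltages, frequency, gain factor, transition time, leakage current) is an assumption chosen for round arithmetic and is not data from this work; the function names are illustrative as well.

```python
def switching_power(alpha, c_load, vdd, freq):
    """P_switching = alpha * C_L * Vdd^2 * f, in watts."""
    return alpha * c_load * vdd ** 2 * freq


def short_circuit_power(beta, vdd, vt, tau, freq):
    """P_sc = (beta / 12) * (Vdd - 2*Vt)^3 * tau * f, in watts."""
    return (beta / 12.0) * (vdd - 2.0 * vt) ** 3 * tau * freq


def leakage_power(vdd, i_leak):
    """P_leakage = Vdd * I_leakage, in watts."""
    return vdd * i_leak


# Hypothetical operating point (assumed values, for illustration only).
ALPHA = 0.1      # switching activity factor
C_LOAD = 1e-12   # load capacitance: 1 pF
VDD = 1.0        # supply voltage in volts
VT = 0.3         # threshold voltage in volts
FREQ = 100e6     # clock frequency: 100 MHz
BETA = 1e-4      # transistor gain factor in A/V^2
TAU = 1e-10      # input rise/fall time: 100 ps
I_LEAK = 1e-6    # leakage current: 1 uA

p_sw = switching_power(ALPHA, C_LOAD, VDD, FREQ)
p_sc = short_circuit_power(BETA, VDD, VT, TAU, FREQ)
p_leak = leakage_power(VDD, I_LEAK)

# Halving the activity factor halves switching power;
# halving the supply voltage reduces it to a quarter.
p_sw_half_activity = switching_power(ALPHA / 2, C_LOAD, VDD, FREQ)
p_sw_half_vdd = switching_power(ALPHA, C_LOAD, VDD / 2, FREQ)

print(f"switching: {p_sw * 1e6:.2f} uW, short-circuit: {p_sc * 1e6:.4f} uW, "
      f"leakage: {p_leak * 1e6:.2f} uW")
```

The linear dependence on $\alpha$ is the lever that the SR2CB optimization targets: it lowers the activity of the flops without changing voltage or frequency.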

## 2.2 Clock Gating

Clock gating is a popular technique used in many synchronous circuits for reducing dynamic power dissipation. Clock gating saves power by adding more logic to a circuit to prune the clock tree. Pruning the clock disables portions of the circuitry so that the flip-flops in them do not have to switch states. Switching states consumes power. When not being switched, the switching power consumption goes to zero, and only leakage currents are incurred.

#### 2.2.1 Combinational Clock Gating

Combinational clock gating identifies the condition when the data is held in a register and shuts off the clock to the register during that period. This leads to reduction of dynamic power consumed by the register and the clock network driving the register. Consider the following example.

In this example (Figure 2.1), the netlist generated by the RTL synthesis tool from the Verilog code snippet is shown at the top right. Register Q loads new data when the EN signal is ON; otherwise, it holds the data. Opportunities to insert combinational clock gating can be found by looking for conditional assignments in the code. Clock gating logic is substituted when code like `if (cond) out <= in;` is present. Power-aware logic synthesis tools identify such RTL coding patterns and make the appropriate substitution. This is shown in the circuit at the bottom.



Figure 2.1: Combinational Clock Gating

As shown in Figure 2.1, when an "explicit" clock enable exists in the RTL code, synthesis tools may choose between two possible implementations. The implementation as shown in Figure 2.1(a), is a "re-circulating register" implementation, where the enable is used to either select a new data value or re-circulate the previous data value. The implementation as shown in Figure 2.1(b) is a "gated clock" implementation. When the enable is off, the clock is disabled. The output of the two implementations will always be identical, but the timing and power behavior will be different.
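The difference between the two implementations can be sketched with a small behavioural model. This is a cycle-level approximation written for illustration, not the synthesized circuit from the figure; the function names and the stimulus are hypothetical.

```python
def recirculating_register(data, enable, q_init=0):
    """Flop is clocked every cycle; a mux feeds Q back when enable is 0."""
    q, outputs, clock_pulses = q_init, [], 0
    for d, en in zip(data, enable):
        q = d if en else q   # mux selects new data or the old value
        clock_pulses += 1    # the flop's clock pin toggles every cycle
        outputs.append(q)
    return outputs, clock_pulses


def gated_clock_register(data, enable, q_init=0):
    """The clock reaches the flop only in cycles where enable is 1."""
    q, outputs, clock_pulses = q_init, [], 0
    for d, en in zip(data, enable):
        if en:
            q = d
            clock_pulses += 1  # gated clock pulses only when enabled
        outputs.append(q)
    return outputs, clock_pulses


# Hypothetical stimulus: the register is enabled in 3 of 8 cycles.
data = [3, 7, 1, 4, 9, 2, 6, 5]
enable = [1, 0, 0, 1, 0, 1, 0, 0]

out_mux, pulses_mux = recirculating_register(data, enable)
out_gated, pulses_gated = gated_clock_register(data, enable)

print(out_mux == out_gated)      # same functional behaviour
print(pulses_mux, pulses_gated)  # 8 clock pulses versus 3
```

Both models produce the same output sequence, but the gated version delivers a clock pulse only in enabled cycles, which is where the dynamic power saving on the register and its clock network comes from.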

#### 2.2.2 Sequential Clock Gating

Clock gating based on sequential analysis involves identifying new enable conditions and then using them to gate the clock. These enable conditions can be generated to hold the data if new data is not required in the downstream logic or when data is stable or invalid. We will be referring to this transformation as sequential clock gating transformation. Consider the example of such a transformation.



Figure 2.2: Sequential Clock Gating

In the RTL code snippet on the top left corner in Figure 2.2, registers q0 and q1 are latching a new data value every cycle. Hence, when taken through low-power RTL synthesis tools, they would not have clock gating. If we observe carefully, we notice that if en is ON, then the data latched into q1 in the previous cycle is not used. Thus, we can hold the previous data on q1 during that cycle.

By performing this sequential reasoning, we can identify `~SEL` as the new enable condition for register q1. Similar sequential analysis will identify `SEL` as the new enable condition for register q0 and essentially generate the RTL code snippet in the bottom left. When this code is taken through a low-power synthesis tool, it will insert appropriate clock gating logic for registers q0 and q1.
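The reasoning can be checked on a small behavioural model. The circuit of Figure 2.2 is not reproduced here, so the structure below is an assumption: two data registers `q0` and `q1` feed an output mux whose select is a one-cycle-delayed copy of `SEL`. Under that assumption, gating `q0` with `SEL` and `q1` with `~SEL` leaves the observable output unchanged while halving the number of register loads.

```python
import random


def simulate(d0, d1, sel, gated):
    """Return (mux output per cycle, total register loads).

    Assumed structure: q0 <= d0, q1 <= d1, sel_q <= sel on every clock edge,
    and the output mux picks q0 when sel_q is 1, otherwise q1.
    With gated=True, q0 loads only when sel is 1 and q1 only when sel is 0.
    """
    q0 = q1 = sel_q = 0
    outputs, loads = [], 0
    for a, b, s in zip(d0, d1, sel):
        outputs.append(q0 if sel_q else q1)  # value seen before the clock edge
        load_q0 = bool(s) if gated else True
        load_q1 = (not s) if gated else True
        if load_q0:
            q0 = a
            loads += 1
        if load_q1:
            q1 = b
            loads += 1
        sel_q = s
    return outputs, loads


rng = random.Random(0)  # fixed seed so the run is repeatable
CYCLES = 200
d0 = [rng.randrange(16) for _ in range(CYCLES)]
d1 = [rng.randrange(16) for _ in range(CYCLES)]
sel = [rng.randrange(2) for _ in range(CYCLES)]

out_plain, loads_plain = simulate(d0, d1, sel, gated=False)
out_gated, loads_gated = simulate(d0, d1, sel, gated=True)

print(out_plain == out_gated, loads_plain, loads_gated)
```

A random simulation like this is only a sanity check; establishing equivalence for all input sequences requires a formal sequential equivalence check.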

## 2.3 Clock Gating Efficiency

System and sequential clock gating offer higher power saving potential as they tend to be more global in nature. Moreover, sequential clock gating is the most effective way of reducing peak power. Historically, such changes have mainly been made manually. This process is often difficult and error-prone, partly due to the difficulty in recognizing such opportunities, and partly due to the difficulty in implementing the gating logic without introducing functional errors. Also, there is no easy way to verify functional correctness, since most functional testbenches may not yield adequate coverage. PowerPro's tight integration with SLEC (Sequential Logic Equivalence Checker) provides the technology and integration required to resolve this impasse.

A typical metric used to measure the effectiveness of clock gating is the percentage of registers in the design that are clock gated. While this gives designers an indication of the clock gating in the design, it correlates poorly with power savings. Dynamic power consumption depends on the switching activity of a given node over a simulation testbench. Clock gating efficiency takes this aspect into account, making it a better indicator of the actual dynamic power consumption.

Figure 2.3 shows a block with a single register. Since the only register in the block is clock gated, the block is 100% clock gated. However, since the enable signal is gating the clock input to the register and the clock is inactive for 3 of the 10 cycles, the clock gating efficiency is 30%. The percent of clock gated registers is not as good an indicator of power savings as is the clock gating efficiency.


Figure 2.3: Clock Gating Efficiency

Estimating power depends on representative switching activity.



Figure 2.4: Average CG Efficiency for design

A simulator can generate a switching activity file based on a given testbench. This is only as representative as the testbench itself, so selection of a representative testbench is critical to good power estimation. Clock gating efficiency is defined as the percentage of time a register is gated for a given switching activity. When looking at an entire design, the average clock gating efficiency can be computed as the average of the clock gating efficiencies of all registers in the design for a given simulation testbench.
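The two definitions above can be sketched as follows. The enable traces and register names are made up for illustration; a real flow would take them from the simulator's switching activity file.

```python
def clock_gating_efficiency(enable_trace):
    """Percentage of cycles in which the register's clock is gated (enable == 0)."""
    gated_cycles = sum(1 for en in enable_trace if not en)
    return 100.0 * gated_cycles / len(enable_trace)


def average_clock_gating_efficiency(traces):
    """Average of the per-register efficiencies over all registers in a design."""
    efficiencies = [clock_gating_efficiency(t) for t in traces.values()]
    return sum(efficiencies) / len(efficiencies)


# Single register whose clock is inactive for 3 of 10 cycles.
single_register = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

# Hypothetical three-register design: every register is clock gated,
# yet the registers differ widely in how often the gate actually closes.
design = {
    "reg_a": single_register,                 # gated 3 of 10 cycles
    "reg_b": [1] * 10,                        # gate never closes
    "reg_c": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],  # gated 6 of 10 cycles
}

print(clock_gating_efficiency(single_register))  # 30.0
print(average_clock_gating_efficiency(design))   # 30.0
```

In this made-up design 100% of the registers are clock gated, but the average clock gating efficiency is only 30%, which is why the efficiency metric tracks dynamic power better than the percentage of gated registers.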

Figure 2.4 shows the average clock gating efficiency for the entire design over a simulation trace. Improving the clock gating efficiency in turn means reduced switching, which can save dynamic power. A designer's goal is to improve the average clock gating efficiency as much as possible. Achieving 100% is not practical, because it would mean the design is idle and non-functional all the time.

# Circular Buffer

## 3.1 Implementation

A circular buffer is a sequential circuit that is functionally equivalent to a SISO shift register and is used to store data and delay it by one clock cycle per stage. The input data is fed to the inputs of all flops, but a counter and a decoder determine which flop is written to and read from. Only the flop selected by the counter and decoder is enabled for writing, and the flop from which the output data is read is selected in the same way, so that the output remains cycle-accurate with respect to the shift register's output.
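The cycle-accurate equivalence and the difference in flop activity can be illustrated with a behavioural sketch. It is a simplified software model, not the RTL used in the experiment: the function names are hypothetical, all flops are assumed to start at 0, and "writes" counts flop load events rather than actual bit toggles. The counter flops of the circular buffer are not included in the count.

```python
import random


def shift_register(inputs, depth):
    """SISO shift register: every flop loads its neighbour on every clock."""
    flops = [0] * depth
    outputs, writes = [], 0
    for d in inputs:
        outputs.append(flops[-1])  # last stage, read before the clock edge
        flops = [d] + flops[:-1]   # all stages shift by one position
        writes += depth            # every flop is loaded
    return outputs, writes


def circular_buffer(inputs, depth):
    """Circular buffer: a counter selects the single flop to read and write."""
    flops = [0] * depth
    counter = 0
    outputs, writes = [], 0
    for d in inputs:
        outputs.append(flops[counter])  # oldest entry, about to be overwritten
        flops[counter] = d              # only the addressed flop is loaded
        writes += 1
        counter = counter + 1 if counter < depth - 1 else 0  # wrap around
    return outputs, writes


rng = random.Random(1)  # fixed seed so the run is repeatable
DEPTH, CYCLES = 4, 100
stimulus = [rng.randrange(16) for _ in range(CYCLES)]

sr_out, sr_writes = shift_register(stimulus, DEPTH)
cb_out, cb_writes = circular_buffer(stimulus, DEPTH)

print(sr_out == cb_out)      # identical output on every cycle
print(sr_writes, cb_writes)  # 400 flop loads versus 100
```

For a depth of 4 the shift register loads four flops per cycle while the circular buffer loads one, which matches the intent of the optimization; the extra counter, decoder and output mux are the price paid for it.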

## 3.2 Flop's Efficiency Change from SR2CB

From a low power VLSI design perspective, a flop's efficiency can be defined as the percentage of time the flop is not active. A flop efficiency of 0% therefore means the flop is enabled all the time.

$$\text{CB flop efficiency} = 100\% - \text{CB flop ON-time percentage}$$
$$\text{CB flop ON-time percentage} = \frac{\text{SR flop ON-time percentage}}{\text{number of SR flops}}$$

Suppose we have a shift register of depth 4 in which each flop has 0% efficiency, i.e. all flops are ON for the entire simulation time. If we implement the corresponding circular buffer of depth 4, each flop will be ON for only 25% of the simulation time, so each flop becomes 75% efficient.
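The two relations above can be written as a small helper. The function name is illustrative; the arithmetic follows the definitions in this section and covers only the data flops of the circular buffer.

```python
def cb_flop_efficiency(sr_efficiency_pct, num_flops):
    """Efficiency of each CB data flop, given the SR flop efficiency and depth."""
    sr_on_time = 100.0 - sr_efficiency_pct  # percentage of time an SR flop is ON
    cb_on_time = sr_on_time / num_flops     # ON time is shared between the flops
    return 100.0 - cb_on_time


# Depth-4 designs at the four SR efficiency points used in the experiment.
for sr_eff in (0, 25, 50, 75):
    print(sr_eff, cb_flop_efficiency(sr_eff, num_flops=4))
```

For a depth of 4 this gives 75%, 81.25%, 87.5% and 93.75%. The absolute gain shrinks as the shift register itself becomes more efficient.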



Figure 3.1: Circular Buffer

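```
// Circular buffer; module, port and parameter names are illustrative, NUM_FLOPS >= 2.
module circular_buffer #(parameter WIDTH = 4, NUM_FLOPS = 4) (
    input  wire             clk,
    input  wire [WIDTH-1:0] input_data,
    output wire [WIDTH-1:0] output_data       // "output" is a reserved word
);
    reg [WIDTH-1:0] cb_flops [0:NUM_FLOPS-1];
    reg [$clog2(NUM_FLOPS)-1:0] counter = 0;  // initial value assumed for simulation
    always @(posedge clk) begin
        cb_flops[counter] <= input_data;      // write only the addressed flop
        if (counter < (NUM_FLOPS - 1))
            counter <= counter + 1;
        else
            counter <= 0;                     // wrap around
    end
    assign output_data = cb_flops[counter];   // read the oldest entry
endmodule
```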

Figure 3.2: RTL Snippet of Circular Buffer

# Flow & EDA Tools Used

## 4.1 Methodology Adopted

The basic idea behind the experiment is to transform the RTL of the SR into a CB and take both designs through the full VLSI ASIC design flow. Power reports are then generated with a power analysis EDA tool using the RTL, the netlists and their corresponding SPEFs.

The power of the transformed design (CB) is expected to be lower than that of the original design (SR) at all levels (RTL, pre-layout and post-layout).



Figure 4.1: Experiment Methodology

## 4.2 Flow Used

The following flow has been used to perform the experiment:

1. A SISO shift register and its functionally equivalent circular buffer were designed in Verilog HDL, and random traces (input vectors for the SR design) were generated.
2. To make sure that the circular buffer is functionally equivalent to the shift register, the designs were verified with the formal equivalence checking EDA tool SLEC.
3. The SR and CB designs were passed to the RTL synthesis EDA tool RealTime Designer, and the synthesized netlists of both SR and CB were generated through its standard flow.
4. The synthesized netlists of SR and CB, along with the timing and area constraints of the designs, were provided as input to the place & route EDA tool Olympus-SoC, and the post-layout netlists and SPEFs of both SR and CB were generated through its standard flow.
5. The RTL designs of SR and CB, along with their corresponding SPEFs and the SR trace file, were given to the RTL power optimization and power analysis EDA tool PowerPro, and the power reports of SR and CB were generated through its standard power analysis flow.
6. For the synthesized and post-layout netlists of SR and CB, power reports were generated from the netlists, their corresponding SPEFs and the SR trace files using PowerPro's standard power analysis flow for netlists.



Figure 4.2: RTL to Post Layout Netlist Flow

The SR and CB designs are implemented as parameterized modules so that the experiment can be performed for various combinations of width and length (number of flops in the SR/CB). To observe how the power savings from the SR2CB optimization behave when the SR designs are 0%, 25%, 50% and 75% efficient, the enable signal trace is custom modified using shell scripting.
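The shell scripts used to modify the enable trace are not reproduced in this report. As a stand-in, the sketch below shows one way an enable sequence with a target efficiency could be produced. It is written in Python rather than shell, uses a repeating 4-cycle pattern, and its function name and pattern are assumptions, not the method used in the experiment.

```python
def make_enable_trace(num_cycles, efficiency_pct):
    """Return a list of 0/1 enable values whose share of 0s equals efficiency_pct.

    A repeating 4-cycle pattern is used, so only 0, 25, 50 and 75 are accepted.
    """
    if efficiency_pct not in (0, 25, 50, 75):
        raise ValueError("efficiency_pct must be one of 0, 25, 50, 75")
    idle_cycles = efficiency_pct // 25                    # disabled cycles per period
    pattern = [0] * idle_cycles + [1] * (4 - idle_cycles)
    return [pattern[i % 4] for i in range(num_cycles)]


# One trace per efficiency point used in the experiment.
traces = {eff: make_enable_trace(100, eff) for eff in (0, 25, 50, 75)}

for eff, trace in traces.items():
    measured = 100.0 * trace.count(0) / len(trace)
    print(eff, measured)
```

A periodic pattern is the simplest choice. A randomized placement of the disabled cycles would give the same efficiency but different switching activity on the enable net itself.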



Figure 4.3: Power Analysis Flow

# Experiment Results & Observations

## 5.1 Experiment Results

The experiment for the SR2CB optimization is performed over various combinations of width, length (number of flops in the SR/CB) and flop efficiency (0%, 25%, 50% and 75%). The individual power and area reports are processed through shell scripting, and the useful data is extracted in Excel format.

The next few pages contain printouts of the Excel-formatted data from the various reports. Graphs for analyzing the experiment data are also attached in the coming sheets.

The graphs clearly show that as the flops of the SR become more efficient, the power savings from the SR2CB transformation decrease.

## 5.2 Observations

#### 5.2.1 Clock tree configuration of SR vs CB

| Component  | SR Clock Tree | CB Clock Tree |
|------------|---------------|---------------|
| Buffers    | … by CGICs. A few buffers are directly … | Most of the buffers are directly driven by the clock. A few buffers are driven by CGICs, which is why an efficiency change has less impact on clock network power for the CB design. |
| Clock Nets | … | Switching activity on clock nets and their length directly impact their power. |
| CGICs      | A common enable drives all the flops of the SR design, so only one CGIC is inserted in the SR design. | The enable condition is different for every flop, which is why an individual CGIC is inserted for every flop in the CB design (if the CG min size criterion is satisfied). |

Figure 5.1: Comparison b/w SR & CB Clock Tree

#### 5.2.2 Area Overheads from SR2CB Optimization

Although the SR2CB optimization reduces the dynamic power of the design by minimizing the switching activity of the flops, it comes with an area overhead. The overhead arises because the circular buffer implementation contains additional elements such as a counter, a decoder and a mux, which are not present in the shift register design.

| SR_Design            | SR_Total_Area | CB_Design            | CB_Total_Area | Difference (CB - SR) |
|----------------------|---------------|----------------------|---------------|------------|
| rtl_sr_param.0_4_2   | 71.68         | rtl_cb_param.0_4_2   | 122.04        | 50.36      |
| rtl_sr_param.0_4_4   | 136.96        | rtl_cb_param.0_4_4   | 251.84        | 114.88     |
| rtl_sr_param.0_4_6   | 200.96        | rtl_cb_param.0_4_6   | 396.16        | 195.2      |
| rtl_sr_param.0_4_8   | 264.96        | rtl_cb_param.0_4_8   | 466.88        | 201.92     |
| rtl_sr_param.0_4_12  | 395.52        | rtl_cb_param.0_4_12  | 762.24        | 366.72     |
| rtl_sr_param.0_4_16  | 526.08        | rtl_cb_param.0_4_16  | 907.2         | 381.12     |
| rtl_sr_param.0_4_32  | 1044.48       | rtl_cb_param.0_4_32  | 1684.16       | 639.68     |
| rtl_sr_param.0_4_50  | 1628.16       | rtl_cb_param.0_4_50  | 2718.08       | 1089.92    |
| rtl_sr_param.0_4_64  | 2078.72       | rtl_cb_param.0_4_64  | 3208.64       | 1129.92    |
| rtl_sr_param.0_4_80  | 2598.4        | rtl_cb_param.0_4_80  | 4546.56       | 1948.16    |
| rtl_sr_param.0_4_100 | 3246.08       | rtl_cb_param.0_4_100 | 5250.56       | 2004.48    |

Figure 5.2: Area Reports of SR & CB

The name rtl\_sr\_param.0\_4\_2 signifies that the RTL design is parameterized with a flop efficiency of 0%, a flop width of 4 and a length (number of flops) of 2.

The above table shows the area reports of the SR designs and their corresponding CB designs, along with the difference between the CB and SR design areas. The experimental results make it evident that the power reduction from the SR2CB optimization comes at the cost of an increase in design area.
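To put the absolute differences into perspective, the area values from the table above can be converted into a relative overhead per design. The sketch below only post-processes the reported numbers; the dictionary layout and function name are illustrative.

```python
# depth: (SR total area, CB total area), copied from the area table
# for the 0% efficiency, width 4 designs.
AREA = {
    2: (71.68, 122.04),
    4: (136.96, 251.84),
    6: (200.96, 396.16),
    8: (264.96, 466.88),
    12: (395.52, 762.24),
    16: (526.08, 907.2),
    32: (1044.48, 1684.16),
    50: (1628.16, 2718.08),
    64: (2078.72, 3208.64),
    80: (2598.4, 4546.56),
    100: (3246.08, 5250.56),
}


def relative_overhead_pct(sr_area, cb_area):
    """Area overhead of the CB design as a percentage of the SR design area."""
    return 100.0 * (cb_area - sr_area) / sr_area


overhead = {depth: relative_overhead_pct(sr, cb) for depth, (sr, cb) in AREA.items()}

for depth, pct in overhead.items():
    print(f"depth {depth:3d}: {pct:5.1f}% area overhead")
```

Expressed this way, the overhead for the reported designs lies roughly between 54% and 97% of the SR area.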

#### 5.2.3 Combinational Power Overhead from SR2CB Optimization

The major contributor to the area overhead is the combinational part of the CB design (combinational counter, decoder and output mux).

These extra combinational elements, which are present in the CB design but not in the SR design, result in a combinational power overhead.

| SR Design (eff_wid_len) | SR Comb Power (uW) | CB Design (eff_wid_len) | CB Comb Power (uW) | Difference, CB - SR (uW) |
|-------------------------|--------------------|-------------------------|--------------------|--------------------------|
| rtl_sr_param.0_4_2      | 0.129168           | rtl_cb_param.0_4_2      | 31.1522            | 31.023                   |
| rtl_sr_param.0_4_4      | 0.232865           | rtl_cb_param.0_4_4      | 56.7259            | 56.493                   |
| rtl_sr_param.0_4_6      | 0.336455           | rtl_cb_param.0_4_6      | 64.3336            | 63.9971                  |
| rtl_sr_param.0_4_8      | 0.440046           | rtl_cb_param.0_4_8      | 80.3565            | 79.9165                  |
| rtl_sr_param.0_4_12     | 0.647227           | rtl_cb_param.0_4_12     | 116.328            | 115.681                  |
| rtl_sr_param.0_4_16     | 0.854477           | rtl_cb_param.0_4_16     | 141.543            | 140.689                  |
| rtl_sr_param.0_4_32     | 1.68302            | rtl_cb_param.0_4_32     | 234.522            | 232.839                  |
| rtl_sr_param.0_4_50     | 2.61473            | rtl_cb_param.0_4_50     | 320.39             | 317.775                  |
| rtl_sr_param.0_4_64     | 3.33939            | rtl_cb_param.0_4_64     | 387.209            | 383.87                   |
| rtl_sr_param.0_4_80     | 4.16747            | rtl_cb_param.0_4_80     | 489.749            | 485.582                  |
| rtl_sr_param.0_4_100    | 5.20244            | rtl_cb_param.0_4_100    | 675.042            | 669.84                   |

Figure 5.3: Combinational Power Reports for SR & CB

The above table shows the combinational power overhead when each SR is transformed into its corresponding CB. Although the SR2CB optimization incurs a combinational power overhead, the experimental results show that the register power saving compensates for it.

#### 5.2.4 Analysis of Power Numbers

Clock network power numbers for the synthesized netlist are far off from the RTL & post layout netlist power numbers.

*Reason*: In any sequential design, the major component of power is usually clock network power, as the clock network has the highest toggle density.

Within clock network power, the major component is the buffers: they have the highest toggle density and, to maintain minimum skew, they need high driving strength, which is why their load capacitance is also high.

The synthesized netlist, however, contains no information about buffers because synthesis only inserts CGICs, so its clock network power will always be much lower than that of the RTL & post layout netlists.
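The role of toggle density and load capacitance in the reasoning above can be related to the standard first-order expression for CMOS dynamic (switching) power. The symbols below are not used elsewhere in this report and are introduced only for illustration: $\alpha$ is the switching activity (toggle density), $C_L$ the load capacitance, $V_{DD}$ the supply voltage and $f$ the clock frequency.

$$
P_{dyn} = \alpha \, C_L \, V_{DD}^{2} \, f
$$

Clock buffers combine a high $\alpha$ with a high $C_L$, so leaving them out, as the synthesized netlist does, removes what is usually the largest term of the clock network power.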

As expected, the change in register power from SR to CB is not exactly proportional to the change in efficiency from SR to CB.

*Reason*: As we move from SR to CB, the flops become more efficient, so power should drop in line with the efficiency change. However, the CB design adds wr\_ptr\_reg (which stores the data from the counter), and this register is not as efficient as the other CB flops.

The efficiency of wr\_ptr\_reg is equal to that of the SR flops, and its load capacitance is higher than that of the other CB flops because it drives a decoder of size $\log_2 N \times N$, where N is the total number of flops in the SR design.

Thus, because of wr\_ptr\_reg, the register power saving is not exactly proportional to the efficiency change from the SR flops to the CB flops.
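The effect of wr\_ptr\_reg can be illustrated with a toy, unitless model. This is a sketch under stated assumptions, not the power model used by the tool: "efficiency" is taken as the fraction of cycles in which a flop's clock is gated off, each CB entry is assumed to be written once every `length` enabled cycles, and `ptr_cap_factor` is a hypothetical multiplier standing in for the higher load capacitance of the write pointer.

```python
import math


def register_power_model(length, width, sr_eff, ptr_cap_factor=2.0):
    """Toy, unitless register power for SR and CB (illustrative only).

    Returns (sr_power, cb_power, cb_data_power), where cb_data_power
    excludes the write pointer register.
    """
    # Fraction of cycles in which the SR flops are actually clocked.
    active = 1.0 - sr_eff
    sr_power = length * width * active
    # Assumption: only one of the `length` CB entries is clocked per write.
    cb_data_power = length * width * active / length
    # Assumption: the pointer has ceil(log2(length)) bits, is clocked as
    # often as an SR flop, and sees a larger load (ptr_cap_factor).
    ptr_bits = max(1, math.ceil(math.log2(length)))
    ptr_power = ptr_bits * active * ptr_cap_factor
    return sr_power, cb_data_power + ptr_power, cb_data_power


sr, cb, cb_data = register_power_model(length=16, width=4, sr_eff=0.0)
print("saving without wr_ptr_reg:", sr - cb_data)
print("saving with wr_ptr_reg:   ", sr - cb)
```

In this sketch the saving with the pointer included is always smaller than the saving predicted from the data flops alone, which mirrors the observation that the register power saving does not track the efficiency change exactly.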

The clock network power saving from SR to CB is less than the register power saving.

*Reason*: The entire clock network (i.e. buffers, clock nets & CGICs) does not become more efficient as we move from SR to CB; only the CGICs of the CB clock network become slightly more efficient, because in CB very few buffers are at the fanout of CGICs.

Most of the buffers in the CB clock network, which are the dominant component of clock network power, are driven directly by the clock and hence do not become more efficient.

We still get a clock network power saving from SR to CB because, in most cases, CB has fewer buffers than SR, and the tech cells used for the CB buffers have smaller area and lower driving strength. With these smaller buffers, the CB clock network consumes less power than the SR clock network.

The clock network power saving from SR to CB is further reduced when the SR flops are more efficient, compared to the scenario in which the SR flops are 0% efficient.

*Reason*: In SR, most of the buffers are at the fanout of a CGIC, whereas most CB buffers are driven directly by the clock.

So when the SR flops are more efficient, the clock tree configuration keeps almost all SR buffers off for most of the simulation time; this is not the case for the CB buffers, as they are not at the fanout of CGICs.

Thus, because of the buffer configuration, the SR clock network power reduces significantly as the flops go from 0% efficiency to higher efficiency, but the CB clock network power does not reduce as much. That is why the difference between SR and CB clock network power is smaller at higher efficiency than at 0% efficiency.
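The trend described above can be sketched with a toy, unitless buffer model. All numbers and names here (`sr_buffers`, `cb_buffers`, `cb_buffer_strength`) are hypothetical and chosen only to show the direction of the effect. The model assumes every SR buffer sits behind a CGIC and every CB buffer is driven directly by the clock, which simplifies the "most of the buffers" situation described in the text.

```python
def clock_buffer_power(sr_eff, sr_buffers=10, cb_buffers=6,
                       cb_buffer_strength=0.8):
    """Toy, unitless clock buffer power for SR and CB (illustrative only)."""
    # Assumption: SR buffers are gated, so they toggle only while the
    # clock is enabled, i.e. for a (1 - sr_eff) fraction of the cycles.
    sr_power = sr_buffers * (1.0 - sr_eff)
    # Assumption: CB buffers are ungated and toggle every cycle, but are
    # fewer and weaker (smaller cells) than the SR buffers.
    cb_power = cb_buffers * cb_buffer_strength
    return sr_power, cb_power


def clock_saving(sr_eff):
    """Clock buffer power saved by moving from SR to CB in the toy model."""
    sr_power, cb_power = clock_buffer_power(sr_eff)
    return sr_power - cb_power


for eff in (0.0, 0.2, 0.4):
    print(f"SR efficiency {eff:.0%}: saving = {clock_saving(eff):.2f}")
```

With these hypothetical numbers the saving shrinks as the SR efficiency rises, because only the SR side of the comparison benefits from gating.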

# Conclusion

## 6.1 Conclusion

The experiment was performed on unit designs of SR and CB, and through it a methodology has been developed with promising results. Analysis of the experimental results clearly shows register power savings for all combinations of design width & length. SR2CB optimization leads to register power saving, and the saving increases with the number of flops in the SR design.

Even though there is a combinational power overhead associated with SR2CB optimization, that overhead is compensated by the register power savings, and for most SR designs we get an overall design power saving. SR2CB optimization yields the most power saving when the SR flops are least efficient, i.e. when the SR flops are ON all the time. As the SR flops become more efficient, the power saving reduces.

# References

- [1] N. Vyagrheswarudu, S. Das and A. Ranjan, "*PowerAdviser: An RTL Power Platform for Interactive Sequential Optimizations*", Design Automation and Test in Europe Conference (DATE), pp. 550-553.
- [2] J. Sukumar, S. Das and A. Ranjan, "*RTL Power Optimization in Sequential Analysis Platforms*", poster presentation in DAC, 2010.
- [3] PowerPro CG, Calypto Design Systems Inc. http://www.calypto.com/