Development of an implementation-centric
energy-evaluation framework for MIPS-I
pipelines

Master’s thesis in Embedded Electronic System Design

Daniel Moreau

Department of Computer Science & Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2016


© Daniel Moreau, 2016.

Supervisor: Per Larsson-Edefors, Department of Computer Science & Engineering
Examiner: Sven Knutsson, Department of Computer Science & Engineering

Department of Computer Science & Engineering
Division of Computer Engineering
VLSI Research Group
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Placed and routed five-stage pipeline built at Chalmers University of
Technology.

Typeset in LaTeX
Gothenburg, Sweden 2016



Daniel Moreau
Master’s thesis in Embedded Electronic System Design
Chalmers University of Technology

Abstract
An RTL-based energy evaluation framework dubbed CREEP (Chalmers Energy
Evaluation framework for Pipelines) is implemented and evaluated. The frame-
work consists of pipeline RTL and the architectural simulator SimpleScalar. Power
estimates are extracted from the RTL and combined with performance counters
generated by SimpleScalar. The combination lends SimpleScalar accurate energy
estimates otherwise reserved for low-level circuit analysis. The framework has been
used to characterize several different embedded processor configurations. Additionally,
in a case study the framework was used to implement and evaluate a speculative
way-halting technique called SHA, which pointed to a 25.6% energy reduction in a
conventional four-way data cache.



Acknowledgements
I want to thank my supervisor Per Larsson-Edefors for his guidance and support
throughout my work. I also want to thank Alen Bardizbanyan; without his help and
technical expertise this work would not have been possible. Lastly, I want
to thank my family and girlfriend for their continued support and encouragement
during hard times.

Daniel Moreau, Gothenburg, January 2016



Contents

List of Figures x

List of Tables xi

Acronyms xiii

1 Introduction 1
1.1 Goals and challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1.1 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Background 7
2.1 CMOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1.1 Power dissipation . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 IC design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Pipeline design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3.1 MIPS I instruction set architecture . . . . . . . . . . . . . . . 13
2.3.2 A MIPS I pipeline . . . . . . . . . . . . . . . . . . . . . . . . 15

2.3.2.1 Caching . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Existing pipeline evaluation method . . . . . . . . . . . . . . . . . . . 24

2.4.1 Architectural simulator . . . . . . . . . . . . . . . . . . . . . . 25
2.4.2 RTL design and verification . . . . . . . . . . . . . . . . . . . 26

2.4.2.1 Design and verification flow . . . . . . . . . . . . . . 27
2.4.3 Ad-hoc combination of RTL and simulator . . . . . . . . . . . 28

3 A unified evaluation framework 29
3.1 Framework workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4 Implementation 33
4.1 Implementation of framework components . . . . . . . . . . . . . . . 33

4.1.1 RTL modifications . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.1.1 Design verification flow . . . . . . . . . . . . . . . . . 34
4.1.1.2 Design power estimations . . . . . . . . . . . . . . . 36

4.1.2 SimpleScalar . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38


4.1.3 Configurability . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Combining RTL and SimpleScalar . . . . . . . . . . . . . . . . . . . . 40
4.3 Framework automation . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5 Results and discussion 45
5.1 User-centric overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Demonstration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

5.2.1 MiBench execution time . . . . . . . . . . . . . . . . . . . . . 48
5.2.2 Power and performance . . . . . . . . . . . . . . . . . . . . . . 48
5.2.3 Energy distribution . . . . . . . . . . . . . . . . . . . . . . . . 52

5.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3.1 Evaluation of verification methods . . . . . . . . . . . . . . . . 55
5.3.2 Achievement of goals . . . . . . . . . . . . . . . . . . . . . . . 56

5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

6 Case study 61
6.1 SHA - practical way-halting . . . . . . . . . . . . . . . . . . . . . . . 61

7 Conclusion 67

Bibliography 72



List of Figures

2.1 Schematic view of CMOS inverter . . . . . . . . . . . . . . . . . . . . 8
2.2 Example of a CMOS circuit consisting of multiple inverters . . . . . . 10
2.3 MIPS I R-type instruction format. . . . . . . . . . . . . . . . . . . . . 14
2.4 MIPS I I-type instruction format. . . . . . . . . . . . . . . . . . . . . 14
2.5 MIPS I J-type instruction format. . . . . . . . . . . . . . . . . . . . . 14
2.6 MIPS I memory is byte addressable [1]. . . . . . . . . . . . . . . . . . 15
2.7 A MIPS I 5SP [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.8 A MIPS I 5SP augmented with a control unit [2] . . . . . . . . . . . 17
2.9 Instruction sequence in a 5SP. . . . . . . . . . . . . . . . . . . . . . . 18
2.10 The 5SP augmented with a hazard detection unit [2] . . . . . . . . . 19
2.11 The 5SP with stall support [2] . . . . . . . . . . . . . . . . . . . . . . 20
2.12 The 5SP with branch resolution in the ID stage [2] . . . . . . . . . . 20
2.13 The 5SP with branch resolution in ID stage with added forwarding

paths [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.14 A conceptual overview of a memory-hierarchy [1] . . . . . . . . . . . 23
2.15 Modular structure of SimpleScalar . . . . . . . . . . . . . . . . . . . . 25
2.16 Microarchitectural overview of the 5SP. . . . . . . . . . . . . . . . . . 27
2.17 Memory hierarchy of the 5SP. . . . . . . . . . . . . . . . . . . . . . . 27

3.1 The methodology embodied in the CREEP framework. . . . . . . . . 30

5.1 CREEP package overview . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 The workflow of the framework showing the RTL and simulator com-

ponents and the central CREEP.pl script. . . . . . . . . . . . . . . . 46
5.3 Per-benchmark execution time for standard 16kB 4-4 configuration. . 49
5.4 Miss rates of the different configurations. . . . . . . . . . . . . . . . . 49
5.5 Execution time versus power of the different configurations. . . . . . . 50
5.6 Absolute energy of the different configurations. . . . . . . . . . . . . . 53
5.7 Energy distribution for a) Unscaled and without way-prediction b)

Scaled and without way-prediction c) scaled with way-prediction . . . 53
5.8 Energy distribution of three 8kB 1-1 configurations with different L2

latencies, a) 8kB 1-1 10 cycles b) 8kB 1-1 12 cycles c) 8kB 14 cycles 54
5.9 Energy distribution of three 8kB 2-2 configurations with different L2

latencies, a) 8kB 2-2 10 cycles b) 8kB 2-2 12 cycles c) 8kB 2-2 14
cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54


5.10 Energy distribution of three 8kB 4-4 configurations with different L2
latencies, a) 8kB 4-4 10 cycles b) 8kB 4-4 12 cycles c) 8kB 4-4 14
cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

5.11 Energy distribution of three 8kB configurations, a) 8kB 1-1 b) 8kB
1-4 c) 8kB 4-4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5.12 Energy distribution the different 16kB configurations: a) 16kB 4-4
b) 16kB 2-4 c) 16kB 2-2 . . . . . . . . . . . . . . . . . . . . . . . . . 55

5.13 Energy distribution 32kB configuration. . . . . . . . . . . . . . . . . . 55

6.1 Overview of SHA. [3] . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.2 AGU address calculation showing the address fields of interest. [3] . . 62
6.3 SHA energy for the MiBench suite used in the CREEP framework. [3] 64
6.4 SHA energy compared to STA and a conventional baseline cache. [3] . 65



List of Tables

4.1 Summary of relevant SimpleScalar configurable settings . . . . . . . . 39
4.2 MiBench benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.1 Cache parameters for the selected configurations . . . . . . . . . . . . 48
5.2 Power estimates obtained from the RTL during previous projects [4] . 56

6.1 L1 DC Component Energy. [3] . . . . . . . . . . . . . . . . . . . . . . 63
6.2 Components Accessed for Each Case. [3] . . . . . . . . . . . . . . . . 63
6.3 Components Accessed on Miss Events. [3] . . . . . . . . . . . . . . . . 64



Acronyms

5SP 5-stage pipeline. 2, 3, 4, 21, 24, 26, 33, 38, 67

AGU address generation unit. 41, 61
ALU arithmetic logic unit. 4, 26, 36, 37, 41, 58

CAM content addressable memory. 4
CMOS complementary metal–oxide–semiconductor. 1, 7, 8, 9, 10, 12
CPI cycles per instruction. 16, 18, 19
CREEP Chalmers RTL-based energy evaluation framework for pipelines. 2, 3, 4,

5, 6, 7, 33, 42, 43, 45, 54, 55, 57, 67

DC Synopsys Design Compiler. 35, 42
DTLB data translation lookaside buffer. 54, 55

EDA electronic design automation. 10, 11, 12, 34, 35, 36
EX execute. 16, 18, 19, 21, 26, 41

GP general purpose. 11

HDL hardware description language. 10, 11, 34, 45

IC integrated circuit. 1, 7, 10, 11, 12, 33, 35
ID instruction decode. 16, 18, 19, 20, 21, 26, 41
IF instruction fetch. 16, 18, 19, 26
ILP instruction-level parallelism. 15, 21
IPC instructions per cycle. 16, 48
ISA instruction set architecture. 13, 15, 16, 22, 33, 34, 67
ITRS International Technology Roadmap for Semiconductors. 4

L1 level-one. 22, 23, 47, 48, 50, 67
L1DC level-one data cache. 2, 22, 26, 33, 35, 36, 39, 47, 50, 51, 52, 53, 54, 55, 56,

62, 67
L1IC level-one instruction cache. 2, 22, 26, 33, 39, 42, 47, 50, 51, 52, 53
L2 level-two. 3, 22, 23, 24, 26, 39, 47, 48, 50, 51, 57
LP low power. 11, 35
LRU least recently used. 23, 26, 54
LSU load store unit. 41

MEM memory access. 16, 18, 26, 41


nMOS n-type metal–oxide–semiconductor. 7, 8, 10
NOP no-operation. 18, 26
NP nondeterministic polynomial time. 12

OOM orders of magnitude. 22
OP operation. 13

PC program counter. 15, 19, 26
pMOS p-type metal–oxide–semiconductor. 7, 8, 10
PnR place and route. 11, 12, 27, 29, 35, 37, 41, 58, 67
PT Synopsys PrimeTime. 36, 37, 40, 42

RAM random access memory. 4
RAW read after write. 18, 21
RISC reduced instruction set computing. 13
RT register transfer. 11, 67
RTL register-transfer level. 1, 2, 3, 4, 5, 10, 11, 12, 24, 25, 26, 27, 28, 29, 30, 31,

33, 34, 35, 36, 38, 39, 41, 42, 43, 45, 46, 54, 57, 58, 61, 67

SAIF switching activity interchange format. 35, 36, 42, 45
SPEF standard parasitic exchange format. 57
SRAM static random access memory. 27, 33, 34, 35, 37, 46, 50, 51, 52, 57, 61

VCD value change dump. 35, 57
VLSI very-large-scale integration. 1, 4

WB write-back. 16, 18, 26, 41



1
Introduction

In the early days of integrated circuit (IC) design, computer architectures were developed
with a focus on achieving high performance. Other design factors such as cost,
area and power were also considered, but only as limiting factors. However, in the late
1990s it became apparent that this design philosophy was unsustainable. Complementary
metal–oxide–semiconductor (CMOS) technology scaling allowed for higher
densities and increasing clock frequencies, but performance-centered designs that
tried to leverage these advances became hard or impossible to cool cost-effectively
[5].

Currently, energy efficiency is, next to performance, one of the major focal points in
very-large-scale integration (VLSI) design. The driving forces behind this are increased
portability, the power wall and environmental concerns. For portable battery-powered
devices, lower energy consumption directly translates into a better-received
product. The power wall, a direct consequence of discontinued Dennard
scaling, means that technology scaling is no longer the obvious answer to increased
performance and lower power [6][7]. Lastly, it is becoming painfully obvious that
the rate at which the global energy consumption increases is not sustainable. ICs
contribute a considerable share of this increase [8].

To facilitate energy-efficient design, evaluation frameworks at the software and
register-transfer level (RTL) are required to make vital early estimations. Early estimations
are perhaps the most important estimations, as changes at the architectural level
have a larger impact on the final energy and performance numbers than changes
at the circuit level. As such, these frameworks allow for accurate estimation and
thereby more predictable prototyping results. Furthermore, these tools are significantly
faster than those available at the circuit level, which is essential when
exploring a complex design space [9][10]. These frameworks trade accuracy for
speed, often adopting parameterizable models obtained through analytical or empirical
studies of the underlying hardware. However, by adopting such models many
of the existing frameworks neglect the impact of design integration, i.e., the synergy
between the integrated parts of a design. We propose an open source energy
evaluation framework for pipelines that facilitates software and hardware co-design.


The framework, named Chalmers RTL-based energy evaluation framework for pipelines
(CREEP), extends down to the RTL which yields high accuracy and allows for de-
tailed pipeline studies at the system level.

1.1 Goals and challenges

The aim of this thesis is to develop and demonstrate CREEP, a framework that esti-
mates energy usage of integrated processor pipelines. The framework is based on an
existing methodology that has been used in several published papers at Chalmers.
The two major components that will be used to create the framework are: 1) pipeline
and cache RTL [11] and 2) a version of the SimpleScalar simulator [12]. The RTL
and the simulator exist prior to the framework development but they are two sepa-
rate and incoherent components combined in an ad-hoc manner. Hence, there are
many challenges to address throughout the work, some of which are listed below:

• Create a scalable energy estimation framework from an ad-hoc methodology.
Trade-offs between energy estimation accuracy and scalability are required
throughout the development.

• The existing pipeline RTL code represents an in-order 5-stage pipeline (5SP)
augmented with a level-one instruction cache (L1IC) and a level-one data cache
(L1DC). The RTL code has been used in several projects that necessitated
changes and quick patches in the code. As such, the code needs to be cleaned
to make the RTL presentable and to reintroduce some features, such as configurable
cache sizes, that are necessary for the framework. Additionally, a scalable
power estimation methodology that includes the impact on power due to
integration aspects needs to be implemented.

• The SimpleScalar simulator represents a more complex pipeline than the RTL.
Thus, the simulator shall be modified to match the RTL code as closely as
possible. This requires changes to the source code which must be verified
to work as intended. Furthermore, the simulator will be modified to track
additional resource usage information.

• The pipeline RTL and SimpleScalar components will then be combined by
mapping resource usage obtained from the simulator to the RTL power es-
timations. The challenge here is to do the mapping in such a way that the
framework estimates the processor energy adequately.

• Lastly, the framework will be automated to make it more approachable. The
automation also serves the purpose of keeping the components coherent, which
will make the results generated by the framework reproducible.
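
The mapping between simulator resource usage and RTL power estimations described in the list above can be illustrated with a minimal sketch. It assumes that every tracked component has an average energy per access and that the simulator reports one access count per component. The function name, the component names and all numbers are hypothetical placeholders and are not taken from the framework.

```python
# Hypothetical sketch: combine simulator access counts with per-component
# energy figures. Names and values are illustrative assumptions only.

def total_energy_pj(energy_per_access_pj, access_counts):
    """Return the sum of count * energy-per-access over all counted components."""
    missing = set(access_counts) - set(energy_per_access_pj)
    if missing:
        # A counter without an energy estimate would silently be ignored
        # otherwise, so fail loudly instead.
        raise KeyError(f"no energy estimate for: {sorted(missing)}")
    return sum(energy_per_access_pj[name] * count
               for name, count in access_counts.items())


# Placeholder per-access energies in picojoules (assumed, not measured).
energy_per_access_pj = {"l1ic_read": 10.0, "l1dc_read": 12.0, "alu_op": 2.0}
# Placeholder access counts as a simulator run might report them.
access_counts = {"l1ic_read": 1000, "l1dc_read": 300, "alu_op": 700}

print(total_energy_pj(energy_per_access_pj, access_counts), "pJ")
```

Raising an error for a counter without a matching energy estimate is a design choice of this sketch: it keeps the two components coherent, which is one of the stated purposes of the automation.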


1.1.1 Goals

Several concrete goals related to CREEP have been identified and these are sum-
marized below:

• Present a coherent and scalable framework with accuracy close to a placed and
routed pipeline design.

• The framework should be automated through scripts which will make CREEP
more user-friendly.

• The framework should support limited configuration, e.g., different cache con-
figurations and processor speeds.

• Evaluate the applicability of the framework in a case study.

• Present the framework in a suitable forum to introduce it to the community.

1.2 Limitations

The framework development is complex and limitations need to be imposed on the
development.

• The framework is only guaranteed to work as is. As such, any modifications
made by the user to any of the framework components are not covered by the
standard framework workflow.

• CREEP will be limited to the provided 5SP. Any changes to the RTL code
are not guaranteed to work and the user needs to verify the changes in the
context of the framework.

• CREEP does not include a level-two (L2) cache and does not attempt to
approximate the impact of lower levels in the memory hierarchy on power
dissipation and performance.

• CREEP is limited to the SimpleScalar source and configuration provided with
the framework. Any changes to these components need to be verified and
integrated into the framework by the user.

• This thesis will use the RTL for the pipeline and will only modify it to suit
the needs of the framework. No performance enhancements will be done and
the CREEP configurations are limited to run at 400 MHz.

• The RTL will not be fabricated and no silicon of said design will be produced.


1.3 Related works

Energy evaluation frameworks have over time evolved from small frameworks limited
to specific structures within a processor, to large and complex system-level frame-
works. Depending on how the frameworks obtain circuit-level energy estimations
they can be divided into analytical or empirical frameworks [5]. Analytical tools
generally have the advantage of being more generally applicable to different archi-
tectures whilst empirical methods are best suited for the type of architectures from
which they were derived. In this section previous energy evaluation frameworks and
methods are presented.

The first high-level energy framework, CACTI, was released in 1996, specifically targeting
cache structures [13]. CACTI uses analytical models to estimate both power and
delay within the cache structure. It has since its release been updated regularly to
include leakage power, other types of memory cells, device scaling effects based on
International Technology Roadmap for Semiconductors (ITRS) predictions and wire
effects on delay and power [14]. The reason it targeted caches was that a significant
amount of total chip power, up to 40%, was dissipated by the caches in embedded
processors [9]. Furthermore, caches are highly regular structures thus less complex
analytical models are needed to accurately estimate energy consumption and delay.
It has allowed computer architects to explore trade-offs in the memory hierarchy de-
sign [13]. In contrast to CACTI, CREEP also models the datapath with which the
caches are integrated. Hence CREEP provides an estimate for a complete integrated
pipeline.

WATTCH and SimplePower, both released in 2000, analytically modeled power for
a whole processor. WATTCH was one of the first tools to link a traditional archi-
tectural performance simulator, SimpleScalar [12], to analytical power models [15].
It bases its power estimations on a collection of parameterized power models for
different hardware structures (for example random access memory (RAM), content
addressable memory (CAM), other array structures, latches, buses, caches and
arithmetic logic units (ALUs)) and on per-cycle resource usage counts generated through
cycle-level simulations using the SimpleScalar architectural simulator [9][15].
SimplePower is an execution-driven, cycle-accurate RTL energy estimation tool that
uses a combination of analytical and transition-sensitive energy models [16][17].
The SimplePower framework is built around a five stage datapath with instruction
fetch, decode, execution, memory and write-back stages [16]. Transition-sensitive
models are defined for each functional unit in the datapath and the models contain
switch capacitance on a per-input basis obtained from VLSI layouts and extensive
circuit simulation [17]. Models are provided for several technology nodes. Simple-
Power uses a combination of analytical and transition-sensitive energy models for the
memory system. The analytical models are reserved for the memory arrays whereas
the transition-sensitive models are used for the connecting buses [17]. In contrast to
the functional units, the switching capacitance of these buses is based on pessimistic
assumptions rather than HSPICE simulations [17]. The control path of the 5SP has
been neglected because developing transition-sensitive models for it was considered
extremely difficult. SimplePower leverages the SimpleScalar simulator by
using the same ISA and compiler. The framework simulates the generated executables,
providing cycle-by-cycle energy values based on the aforementioned models [16].
Both tools were fast and sufficiently accurate to quantify potential power savings in
architecture design. However, WATTCH and SimplePower, while more flexible than
CREEP, fail to capture the integration aspect that CREEP addresses.

McPAT, another analytical tool, is an abbreviation of multicore power area and
timing. The framework was released in 2009 and estimates power, area and timing,
which enables architects to use metrics that relate performance to both area and
power [10]. In contrast to SimplePower and WATTCH, McPAT is compatible with
any performance simulator through an XML interface. Furthermore, McPAT is built
on more accurate analytical models compared to WATTCH and these models also
include static and short-circuit power. Just as the name implies it also handles
the complexities of multicore architectures. Similarly to McPAT, CREEP provides
a system perspective but does so more accurately as power estimates are obtained
from an RTL implementation and not analytical models. However, CREEP supports
less complex systems as it targets simple embedded processors.

One empirical framework of interest is IBM’s PowerTimer [15]. The major difference
from the previous approaches that are based on analytical models is primarily the
formation of the energy models. PowerTimer’s models are based on empirical data
collected from existing microprocessors. These models are then scaled to capture
device scaling. PowerTimer takes a bottom-up approach and the energy models
are derived from circuit-level power simulation data. Low-level circuit macros are
analyzed and used to generate higher-level energy models for microarchitectural
units [15]. These models are then controlled by two sets of parameters: 1) technology
and circuit parameters and 2) microarchitectural parameters such as buffer sizes, pipeline
latencies and bandwidth values. The microarchitectural parameters are also used in a
stand-alone performance simulator. By connecting the performance simulator with
the energy models a total or cycle-by-cycle energy evaluation can be performed.
IBM’s PowerPC architecture was used to create the energy models and, as such,
PowerTimer is best suited for design exploration within that microarchitecture. CREEP is
likewise limited to the specific architecture implemented in RTL. Both frameworks
work at the system level but PowerTimer chooses to distance itself from the physical
implementation through parameterized models which lends it greater flexibility at
the expense of accuracy.

Yet another example of an empirical framework was proposed by Aziz et al. [18].
This framework is used for marginal-cost analysis. Their approach was to first
create architectural models using design space sampling and statistical inference to
capture the multi-dimensional space of microarchitectural parameters. The energy-
delay trade-offs of the composing circuit blocks that formed the architectures were
then stored in a circuit library. The created joint architecture-circuit design space
was then combined with an exploration engine, which is given an optimization objective
and resource budgets [18]. The exploration engine then searches the design space to
find the most efficient configuration under the given constraints. As it is a high-level
framework, it is more flexible than CREEP but trades aspects such as accuracy and
integration to achieve this flexibility.

Rance Rodrigues et al. conducted a study in [19] on the usage of performance counters
and how they can be used to estimate power in microprocessors. While performance
counters have been widely used to estimate power online in situ, the counters used
vary widely between processor architectures. Rodrigues et al. attempt to
identify a set of architecture-agnostic counters that estimate processor power with low
error. Two architectures, Intel Atom and Nehalem, at opposite ends of the design
spectrum were used to select performance counters which in both architectures
showed a strong correlation to power. Using the SESC architectural performance simulator
and WATTCH as reference they concluded that the #Fetched instructions, #L1 hit
and #Dispatch stall counters were sufficient to approximate processor power with an
average error of 5%. Furthermore, using the same set of counters across processor
architectures had only a small impact, 3%, on the estimation accuracy. While
the objective of this work is different from CREEP, it indicates what performance
counters are relevant for a selection of architectures, albeit at a higher-performance
design point, and can serve as an inspiration for CREEP.

A high-level estimation methodology and the associated tool, SoftExplorer, were
presented in [20]. The methodology models a processor through functional analysis
and a parametric software model is used to capture the software’s impact on power.
The processor model can be as coarse grained as a functional block diagram. The
parametric software model accepts relevant algorithmic parameters such as cache
miss rate. The first step in the methodology is to cluster the processor model
into functional blocks that are concurrently activated when code is running. The
relevant consumption parameters are chosen as the links between the functional
blocks. The second step is to characterize the processor model’s power consumption
as the architectural and algorithmic parameters are varied. Lastly, a curve fitting
of the graphical representation of the characterized power is performed through
regression analysis. SoftExplorer was compared to SimplePower, where the tool was
found to be significantly faster and within 2.4% of SimplePower’s estimates. Compared to
CREEP, SoftExplorer sacrifices accuracy for flexibility and speed and neglects the
integration aspect covered by CREEP.
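
The curve-fitting step of such a methodology can be sketched with an ordinary least-squares line fit. The sketch below is an assumption-laden illustration: the text does not state which model form SoftExplorer uses, so a straight line of power versus cache miss rate is assumed, and the data points are synthetic, generated from a known line so that the fit can be checked.

```python
# Hypothetical sketch of a regression-based curve fit. The linear model
# form and all data are assumptions made for illustration.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic characterization data: power (mW) as a function of miss rate.
miss_rates = [0.00, 0.02, 0.04, 0.06, 0.08]
power_mw = [50.0 + 200.0 * m for m in miss_rates]

slope, intercept = fit_line(miss_rates, power_mw)
print(f"power_mw ~= {slope:.1f} * miss_rate + {intercept:.1f}")
```

Once fitted, the two coefficients stand in for the characterized hardware, which is what makes this style of model fast to evaluate.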



2
Background

This chapter will provide the reader with the basic knowledge to understand the
concepts used to develop Chalmers RTL-based energy evaluation framework for
pipelines (CREEP). First, the basics of CMOS logic with a focus on implemen-
tation, i.e., power and speed will be discussed in Sec. 2.1. Since complementary
metal–oxide–semiconductor (CMOS) is the primary fabrication technology used to
implement integrated circuits (ICs), CMOS speed and power are central to this
work. Secondly, the basics of IC design with a focus on cell-based CMOS designs
are presented in Sec. 2.2. The foundations of computer architecture with a focus on
pipeline design and physical implementation are presented in Sec. 2.3. Lastly, the
existing ad-hoc methodology which this work is based on is presented in Sec. 2.4.

2.1 CMOS

The abbreviation CMOS stems from the structure of the device, as it is composed
of at least one n-type metal–oxide–semiconductor (nMOS) and one
p-type metal–oxide–semiconductor (pMOS) transistor [21]. The simplest CMOS
circuit, the CMOS inverter, is shown in Fig. 2.1. The arrangement of the inverter
is such that the input of the CMOS inverter is connected to the gate terminal of
both transistors. Whilst the transistors’ behavior in reality is more complex, they
can ideally be viewed as switches that close and open when a voltage transition is
detected on the gate. The behaviors of the nMOS and pMOS are opposite to
each other, i.e., when a potential VDD (supply voltage, logic 1) is asserted on the gate
terminal the nMOS closes and the pMOS opens. Conversely, when no potential
or GND (ground, logic 0) is present on the gate the nMOS opens and the pMOS
closes.
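
The ideal switch view of the inverter can be written down as a small sketch. It assumes the standard inverter topology, in which the pMOS connects the output to VDD and the nMOS connects the output to GND; the function name is hypothetical.

```python
# Minimal sketch of the ideal switch model of a CMOS inverter.
# True represents VDD (logic 1), False represents GND (logic 0).

def cmos_inverter(vin_is_high: bool) -> bool:
    """Return the logic level on the output for a given input level."""
    nmos_closed = vin_is_high        # nMOS conducts when the gate is at VDD
    pmos_closed = not vin_is_high    # pMOS conducts when the gate is at GND
    # In the ideal model exactly one of the two switches is closed.
    assert nmos_closed != pmos_closed
    # A closed pMOS pulls the output to VDD, a closed nMOS pulls it to GND.
    return pmos_closed


for level in (False, True):
    print(f"in={int(level)} -> out={int(cmos_inverter(level))}")
```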

2.1.1 Power dissipation

The power dissipation of a CMOS circuit is generally considered to be composed of
three components: 1) dynamic power, 2) short-circuit power and 3) static power [22].
The total gate power dissipation is given as the sum of these components as shown
in Eq. 2.1.


[Figure placeholder. Labels in the schematic: VDD, GND, In, Out, Cout, Ishort, In, Ip]

Figure 2.1: Schematic view of CMOS inverter

Ptotal = Pdynamic + Pshort + Pstatic (2.1)

The dynamic power dissipation is by far the dominant source of power consumption
in a CMOS circuit [22]. Dynamic power is also called switching power
because power is consumed when the gate is switching, i.e., charging or discharging
the gate output capacitance Cout to VDD or GND [21]. The output capacitance
consists of several components, Cint, Cwire and Cload, as shown in Eq. 2.2 [22].

Cout = Cint + Cwire + Cload (2.2)

The internal capacitance Cint is related to the structure of the gate and includes
parasitic capacitances. Cwire is the capacitance of the wire that connects the output
of the device to the input of another CMOS gate, which in turn constitutes the
Cload capacitance. Consider Fig. 2.1, where a voltage transition from VDD to GND is
asserted on the input. The nMOS transistor opens and the pMOS transistor closes.
A current Ip flows from the voltage supply to the output capacitance, which charges
the capacitance. The amount of charge pulled from the supply is given by $C_{out}V_{DD}$
and the energy drawn from it by $C_{out}V_{DD}^{2}$. However, half of the energy drawn from
the supply is dissipated as heat in the resistance posed by the pMOS transistor,
so the energy stored in the output capacitance is given by $E_c = \frac{1}{2}C_{out}V_{DD}^{2}$. When the
input voltage later is increased to VDD the pMOS opens, the nMOS closes, and
the output capacitance is discharged as a current In flows to ground. The stored
energy in the capacitance Ec is dissipated in the resistance posed by the nMOS
transistor. If this circuit is operated at a clock frequency f and the output
switches with a probability of α, the total dynamic power drawn from the supply is
given by Eq. 2.3 [21].

$P_{dynamic} = C_{out} V_{DD}^{2} \alpha f$ (2.3)
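
As a numeric illustration of Eq. 2.3, the sketch below evaluates the dynamic power for one assumed operating point and shows the quadratic dependence on the supply voltage. The capacitance, supply voltage and switching probability are assumed example values; only the 400 MHz clock corresponds to the frequency named in Sec. 1.2.

```python
# Minimal sketch of Eq. 2.3. All operating-point values except the
# 400 MHz clock are assumptions chosen for illustration.

def dynamic_power_w(c_out_f, v_dd_v, alpha, f_hz):
    """Dynamic power in watts: C_out * V_DD^2 * alpha * f."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is a probability and must lie in [0, 1]")
    return c_out_f * v_dd_v ** 2 * alpha * f_hz


# Assumed: 10 fF output capacitance, switching probability 0.1.
p_nominal = dynamic_power_w(10e-15, 1.0, 0.1, 400e6)
p_half_vdd = dynamic_power_w(10e-15, 0.5, 0.1, 400e6)

print(f"P_dynamic at 1.0 V: {p_nominal * 1e6:.3f} uW")
print(f"P_dynamic at 0.5 V: {p_half_vdd * 1e6:.3f} uW")
print(f"ratio: {p_nominal / p_half_vdd:.1f}")
```

Halving the supply voltage reduces the dynamic power by a factor of four, whereas halving the frequency or the switching probability only halves it.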


The short-circuit power dissipation Pshort is nowadays considered a small component
of the total power [22]. The power is dissipated when the output of the gate
switches. In reality, the nMOS and pMOS devices do not behave as ideal switches
and require a finite time to open and close. This time is determined by how long the
input voltage remains between Vtn and VDD − Vtp,
where Vtn and Vtp are the threshold voltages of the nMOS and pMOS transistors,
respectively. The threshold voltage is the minimum gate-to-source potential that is needed
to create a conducting path in the transistor, i.e., to close the switch. Consequently,
there is a small period of time when both transistors are on and a current Ishort,
shown in Fig. 2.1, is allowed to pass from the supply to ground, which consumes a
small amount of power.

The last component is the static power dissipation, which is intermediate in size
compared to the previous components [22]. It is smaller than the dynamic power and
has historically been negligible. It is called static because it is omnipresent in
all CMOS circuits that are powered. The static power stems from a collection of
different leakage currents passing between the various terminals of the devices, most
notably from source to drain. The leakage power is closely connected to the threshold
voltage and the temperature of the device [21]. As the feature size of the transistor
shrinks below 65 nm, leakage power increases, and in more recent technology nodes it
has become a considerable contributor to the total power dissipation.

2.1.2 Speed

As discussed in the previous section, a transition on the output of a CMOS gate does
not happen instantaneously as the current would have to be infinite in magnitude.
Naturally, this is not the case in a real CMOS circuit. Instead, the current is
determined by the transistor’s ability to drive it, which due to nonlinear I-V and C-V
characteristics is not simple to model [21]. However, the transistors can be approximated
by an RC-delay model that allows the transistors to be viewed as simple RC circuits,
which most electrical engineers are familiar with. R is the transistor’s effective
resistance, i.e., the ratio of Vds to Ids, where Vds is the potential between the drain
and source terminals and Ids is the current flowing from drain to source [21].
The capacitance C is the output capacitance of the CMOS circuit (see Sec. 2.1.1).
The transfer function of the equivalent RC circuit is given in Eq. 2.4 and the step
response in Eq. 2.5.

$H(s) = \frac{1}{1 + sRC}$ (2.4)

$V_{out}(t) = V_{DD} e^{-t/\tau}$ (2.5)

Solving the step response for Vout(t) = 1/2VDD, with the time constant τ = RC, gives
the propagation delay through the CMOS circuit shown in Eq. 2.6.


tpd = RC ln 2 (2.6)
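The step from Eq. 2.5 to Eq. 2.6 can be checked numerically. The sketch below is only an illustration: the helper names and the values of R and C (10 kΩ, 10 fF) are assumptions, not figures from the thesis.

```python
import math


def v_out(t, v_dd, r, c):
    """Step response of Eq. 2.5 with time constant tau = R * C."""
    return v_dd * math.exp(-t / (r * c))


def propagation_delay(r, c):
    """Propagation delay of Eq. 2.6: t_pd = R * C * ln 2."""
    return r * c * math.log(2)


# Assumed example values: 10 kOhm effective resistance, 10 fF output capacitance.
r, c, v_dd = 10e3, 10e-15, 1.0
t_pd = propagation_delay(r, c)
print(f"t_pd = {t_pd * 1e12:.1f} ps")

# At t = t_pd the output has fallen to half the supply voltage.
print(f"Vout(t_pd) = {v_out(t_pd, v_dd, r, c):.3f} V")
```

Doubling either R or C doubles the delay, which is why the transistor sizing discussed next trades drive strength against load capacitance.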

The propagation delay is an approximation of how fast the output of the CMOS
circuit transitions from VDD to 1/2VDD when a step is asserted on the input
of the circuit. A non-trivial CMOS circuit is composed of many CMOS gates
which are connected as shown in Fig. 2.2 and the propagation delay from input In1
to the final output Out n can become significant.

Figure 2.2: Example of a CMOS circuit consisting of multiple inverters (node labels
In1, In2, In n and Out n)

To manage the delay, the current drive capability of the transistors in the CMOS
gate can be increased. This is done through transistor sizing whereby the widths
of the pMOS and nMOS transistors are increased [21]. Essentially this reduces
the effective resistance experienced by the current and a larger current is allowed
through the circuit. However, increasing the transistor size also causes an increase of
the gate capacitance, i.e., the output capacitance experienced by the driving gate in
the circuit resulting in higher power dissipation. Moreover, gates that are unsized
and connected to the resized gates will have to charge a larger load capacitance,
which slows down unsized parts of the design.

2.2 IC design

Modern ICs are immensely complicated circuits often composed of several millions,
if not billions, of transistors. Designing such complex beasts without computer
aid is simply beyond the capabilities of a human designer. To facilitate IC development,
software assistance is key throughout the design process. The software tools
providing this assistance are collectively referred to as electronic design automation
(EDA). The term EDA spans a wide range of functionality required throughout the
design of an IC, which will be the focus of this section.

Designing ICs is complex and it was discovered early on that doing so at the gate
level, even with the aid of EDA tools specializing in the practice, was too cumbersome. As
a response, tools were developed to create gate-level representations, called netlists,
from a specification at a higher level of abstraction through a process called logic
synthesis. These abstractions are usually expressed in a hardware description language
(HDL) such as Verilog or VHDL. These design languages allow the designers
to express the behavior of the logic circuits at the register-transfer level (RTL) in
the sense that an assignment to a register expresses functionality.

The process of designing an IC is composed of several stages and for digital circuits
these are design, functional verification, logic synthesis and place and route (PnR).
The initial design stage is followed by functional verification, which is first done
at the register transfer (RT) level and involves testing that the design described in HDL
matches the expected functional behavior. This is normally done at the cycle level
by applying stimuli to the design whereby the logic transitions of the output can
be observed and compared with the desired behavior. The stimuli are commonly
supplied by a testbench that provides input from a set of test vectors [23]. The
test vectors can be selected with the intent of testing specific functionality (directed
testing) or randomly to test corner cases [23]. A key concern when selecting test
vectors is coverage, which can be defined as how large a part of the design has been
tested (in percent). The RTL verification is facilitated by an HDL simulator tool.
There are many different HDL simulators available such as ModelSim from Mentor
Graphics, IES from Cadence and VCS from Synopsys [24][25][26].

After the RTL has been verified the design is brought through a cell-based logic
synthesis with the aid of a synthesis tool. The designer supplies the RTL design
together with design constraints with regard to timing, which guide the synthesis
tool through the multiple-stage process that is cell-based logic synthesis [23]. The cell
library, which contains the standard gates (cells) used for synthesis, is provided by
silicon foundries such as STMicroelectronics or TSMC. The cell libraries are unique
to each manufacturer as they are tightly coupled to their manufacturing processes. For
each cell in the library, parameters such as size, internal power dissipation, leakage
power and input pin capacitance are defined [23]. Normally several libraries are
necessary to fully evaluate a process technology. The libraries are optimized for
different design points, e.g., low power (LP) and general purpose (GP). The GP cell
library is optimized for performance and the LP cell library for low power designs.
Furthermore, the GP and LP libraries are further divided into sub-libraries with
different threshold voltages, which allows for fine grained control of performance and
power dissipation. Higher performance can be achieved by using a low-threshold
voltage version but at the price of higher leakage power dissipation. Conversely,
for designs where power dissipation is a cause for concern, a high-threshold voltage
version is a good choice as its cells are slower but have lower leakage-power dissipation.
It is up to the designer to choose a library that suits the application at hand. Small
variations in the manufactured design can have a large impact on cells’ behavior.
To capture these variations, design corners are used. The worst-case corner library
characterizes cells with the worst possible variations that still produce working
devices, i.e., those that affect speed most negatively. Conversely, the best-case
corner cell library has the best variations. Naturally, there is a nominal cell library
that falls in between the two. Moreover, the cell libraries have been characterized
for different temperatures and voltages. Temperature and voltage depend on in situ
conditions and also affect the behavior of the final circuit. As such, every cell
library exists in several models with different temperatures and voltages.


Synthesis is a complex process and EDA tools that specialize in the practice are available
from different suppliers such as Encounter RTL Compiler from Cadence, Design
Compiler from Synopsys and HDL Designer from Mentor Graphics [27][28][29]. The
different tools provide similar functionality but differ in the algorithms and heuristics
used during the synthesis process. The synthesis results in a gate-level netlist,
a set of interconnected standard cell logic gates realizing the functionality of the RTL code.
In contrast to the HDL description of the design that solely captures the intended
functionality, the netlist also includes parameters such as area, timing and power.
The synthesis tool strives to meet the imposed timing constraint using the cells from
the specified libraries. It accomplishes this through static timing analysis, which allows
it to find and balance the critical paths in the design [23]. This balancing
act entails selecting gates with sufficient current drive capabilities for the entire circuit
to switch within the timing constraint. As such, the same design will produce
different gate-level netlists with different area and power depending on the timing
constraint. At the synthesis level,
the functional verification amounts to ensuring that the netlist behaves the same as
the RTL design. This is achieved through equivalence checking or simulation-based
methods as described for RTL verification.

Lastly, the design netlist is brought through PnR, which is a physical design phase
composed of three steps: 1) Floorplanning where the design’s blocks are organized,
2) Placement of standard cells and iterative optimization of placement and 3) Routing
of standard cell interconnects, power lines and clock tree [23]. The process is
strictly guided by design rules imposed to ensure that the placed and routed design
is manufacturable. The most significant changes to the netlist are the addition of
wires and clock tree. Wires constitute a part of the nodal capacitance described in
Sec. 2.1.1, which in some cases necessitates larger, more powerful gates to be used.
The addition of wire capacitance and larger gates with higher internal capacitances
increases the power of the design. Furthermore, the clock tree is a significant contributor
to the design power and is only included after the design has been placed
and routed. This stage relies on EDA tools, such as Encounter from Cadence, that
specialize in PnR as the burden of placing thousands of gates is simply beyond the
capability of a human designer [30].

Implementation power closure is important for power-constrained circuits, e.g.,
portable embedded processors. As such, methods for obtaining power closure are
also included in many IC design flows. The power dissipation of CMOS-based circuits
comes from active device switching and leakage, where the former is the main
contributor as discussed in Sec. 2.1.1. The switching power is summed over
all capacitive nodes in the design. While the power estimates could be done prior to
PnR, the power would be underestimated as the nodal capacitance is greatly increased
by the wire capacitances. The power also depends on the switching activity
(see Sec. 2.1.1) of these nodes and there are different techniques used to approximate
it. One such technique is probabilistic testing where the input statistics, asserted
by a designer, are propagated to each node in the circuit [22]. However, this creates
a nondeterministic polynomial time (NP) complete problem, so the scope at which
this is done must be limited, e.g., parts of the design are analyzed instead of the
whole design. Another technique is use-case-based switching activity, which is obtained
through simulation-based methods as described for RTL verification [22].
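The summation of switching power over capacitive nodes can be sketched as follows. This is a minimal illustration and not the flow used by any particular EDA tool: the node records, the field names (`c_gate`, `c_wire`, `rising`) and all numbers are assumptions, and the switching activity of a node is taken to be the number of supply-charging transitions observed in simulation divided by the number of simulated cycles.

```python
def total_switching_power(nodes, v_dd, f, cycles, include_wires):
    """Sum C * V_DD^2 * alpha * f (Eq. 2.3) over all nodes of a design.

    Each node is a dict with hypothetical keys:
      c_gate  gate-related capacitance in farads
      c_wire  wire capacitance in farads (only known after place and route)
      rising  number of supply-charging transitions seen in simulation
    """
    total = 0.0
    for node in nodes:
        c = node["c_gate"] + (node["c_wire"] if include_wires else 0.0)
        alpha = node["rising"] / cycles  # use-case-based switching activity
        total += c * v_dd ** 2 * alpha * f
    return total


# Assumed toy design with two nodes, simulated for 1000 cycles at 1 GHz and 1.0 V.
nodes = [
    {"c_gate": 2e-15, "c_wire": 3e-15, "rising": 250},
    {"c_gate": 4e-15, "c_wire": 1e-15, "rising": 100},
]
pre_pnr = total_switching_power(nodes, 1.0, 1e9, 1000, include_wires=False)
post_pnr = total_switching_power(nodes, 1.0, 1e9, 1000, include_wires=True)
print(f"pre-PnR estimate:  {pre_pnr * 1e6:.2f} uW")
print(f"post-PnR estimate: {post_pnr * 1e6:.2f} uW")
```

Leaving out the wire capacitance lowers the estimate, which mirrors the underestimation that occurs when power is evaluated before PnR.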

2.3 Pipeline design

The most fundamental parts of computer architecture are the instructions that define
what a computer is capable of and the microarchitecture that decides how it executes
the instructions. To that end the structure of a complete set of instructions, called an
instruction set architecture (ISA), and more specifically the MIPS I ISA, is presented
in Sec. 2.3.1. Microarchitectural concepts relevant to this thesis such as pipelining
and caching are then discussed in Sec. 2.3.2.

2.3.1 MIPS I instruction set architecture

All computer programs are made up of instructions which are the basic operations
carried out by a processor [2]. Instructions are usually very frugal and each of them
normally does one basic operation, e.g., memory access, arithmetic or flow control.
To make up a complex computer program many different instructions are needed.
The instructions are grouped together to form a set of instructions possibly unique
to their implementing architecture thus forming an ISA.

The MIPS I ISA, first released in 1982, was developed by John Hennessy and his colleagues
at Stanford [31][2]. MIPS I was one of the first successful reduced instruction
set computing (RISC) ISAs, built on four main principles: simplicity favors regularity,
make the common case fast, smaller is faster, and good design demands good
compromises. Derivatives of the MIPS ISA are still used today by Cisco (routers),
Nintendo and Sony (handheld gaming consoles) and Silicon Graphics among others.

MIPS I instructions can have three different encoding formats referred to as R, I
and J-type instructions in the literature [2]. By only allowing a limited number of
instruction formats the ISA gains regularity, which simplifies the instruction decoding
[2]. Each instruction has its own operation (OP) code which is encoded in the
op-field shown in Figs. 2.3-2.5. Besides the OP-code, the main difference between the
instruction formats is the number of operands that are encoded in the instruction.
R-type instructions, however, need two extra fields (shamt and funct) to characterize
each operation, which includes mathematical or logical operations such as addition,
subtraction and shift operations. R-type instructions require two source operands
encoded in the rs and rt fields shown in Fig. 2.3, while the rd field names the
destination register. In contrast, I-type instructions such as loads require just one
register source operand, encoded in the rs field (see Fig. 2.4), and lastly J-type
instructions require no register operands.

R-type: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)

Figure 2.3: MIPS I R-type instruction format.

I-type: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)

Figure 2.4: MIPS I I-type instruction format.

J-type: op (6 bits) | addr (26 bits)

Figure 2.5: MIPS I J-type instruction format.

The operands are fetched from a small register-file whose modest size lends it speed.
The size of the register-file and how the memory is addressed are the parameters,
besides instruction width, that the MIPS I ISA enforces on the underlying microarchitecture.
MIPS is a load-store ISA because all operands are fetched from the
register-file [1]. I-type instructions substitute one operand for a value, called an
immediate value, which is encoded directly in the instruction itself. Examples of I-type
instructions are load and store operations. Stores read data from the register-file
and store the data in data memory. Loads on the other hand read data from data
memory and store the data in the register-file. Similarly, all R-type instructions also
store the result of the operation back into the register-file to a location indicated by
the rd field. However, other I-type instructions called control flow instructions, e.g.,
branch instructions, do not write to the register-file. Instead, the control flow instruction
decides the order in which the instructions in the program are executed. Lastly,
J-type instructions, which mainly include another type of flow control instructions
called jump instructions, trade both operands for a larger immediate value. MIPS
I also defines the format of the operands. Operands of 8-bit (ASCII characters),
16-bit (Unicode characters, half word), 32-bit (integers, word), 64-bit (long integer,
double word) and IEEE 754 floating point in 32-bit (single precision) and 64-bit
(double precision) are allowed [1].
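The field layouts of Figs. 2.3-2.5 can be made concrete with a short decoding sketch. The function name `decode` and the returned dictionary are hypothetical; the bit positions follow the field widths given in the figures, and the two example words are the standard MIPS encodings of `add $t0, $t1, $t2` and `lw $t0, 4($sp)`.

```python
def decode(word):
    """Split a 32-bit MIPS I instruction word into every field of Figs. 2.3-2.5.

    All fields are extracted regardless of format; which of them are
    meaningful depends on the op-code (R, I or J-type).
    """
    return {
        "op": (word >> 26) & 0x3F,      # 6 bits, all formats
        "rs": (word >> 21) & 0x1F,      # 5 bits, R and I-type
        "rt": (word >> 16) & 0x1F,      # 5 bits, R and I-type
        "rd": (word >> 11) & 0x1F,      # 5 bits, R-type
        "shamt": (word >> 6) & 0x1F,    # 5 bits, R-type
        "funct": word & 0x3F,           # 6 bits, R-type
        "immediate": word & 0xFFFF,     # 16 bits, I-type
        "addr": word & 0x03FFFFFF,      # 26 bits, J-type
    }


add_fields = decode(0x012A4020)  # add $t0, $t1, $t2
lw_fields = decode(0x8FA80004)   # lw $t0, 4($sp)
print(add_fields["rs"], add_fields["rt"], add_fields["rd"], hex(add_fields["funct"]))
print(lw_fields["op"], lw_fields["rs"], lw_fields["rt"], lw_fields["immediate"])
```

Because the op, rs and rt fields sit at the same bit positions in the R and I formats, a hardware decoder can read the register-file before it knows the exact instruction type, which is the regularity the text refers to.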

If the aforementioned register-file were the only storage available to a processor,
computer programs would be very limited in size. However, as implied above, another
type of memory that is larger in size is usually available. The MIPS I ISA defines
how the processor interfaces with memory by specifying how the memory is addressable.
MIPS I specifies two ways of addressing memory: 1) byte-addressable or 2)
word-addressable. This means that the smallest addressable data unit is a byte (8
bits) while the largest is a word (4 bytes) as illustrated in Fig. 2.6. All memory
accesses must be aligned to either a byte or word access, otherwise the access is
unaligned and erroneous [1]. In the same figure to the left, the address of the corresponding
data is shown. The address used to access the memory is generated by
the memory operation (I-type instruction) by adding the sign-extended immediate
value to the register indicated by the rs field.


Figure 2.6: MIPS I memory is byte addressable [1].

Furthermore, MIPS I supports five ways of generating memory addresses through
so-called addressing modes: register-only addressing, base addressing, immediate
addressing, PC-relative addressing and pseudo-direct addressing [2]. Register-only
addressing has already been described; all R-type instructions use this addressing
mode. Base addressing is used by some I-type instructions, such as stores and loads,
and has likewise been described. Immediate addressing is similar to base addressing,
but the immediate value is used directly as an operand instead of being added to a
base register to form a memory address (I-type). Program counter
(PC) relative addressing is used by conditional branch instructions (I-type) where, if
a condition holds true, the immediate field is added to the PC to produce the final
address. Lastly, pseudo-direct addressing is used by J-type instructions where the
larger address field (see Fig. 2.5) is shifted left by two bits and concatenated with
the four most significant bits of the PC.
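The three address-forming modes can be illustrated with a small sketch. The function names are hypothetical. The details that the text leaves implicit follow the standard MIPS definition: immediates are sign-extended, branch offsets are counted in words relative to the instruction after the branch (PC + 4), and the jump address field is shifted left by two bits before being combined with the upper PC bits.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm):
    """Interpret a 16-bit immediate field as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def base_address(base_value, imm):
    """Base addressing (loads and stores): base register plus sign-extended immediate."""
    return (base_value + sign_extend16(imm)) & MASK32


def branch_target(pc, imm):
    """PC-relative addressing: word offset relative to the instruction after the branch."""
    return (pc + 4 + (sign_extend16(imm) << 2)) & MASK32


def jump_target(pc, addr26):
    """Pseudo-direct addressing: upper four PC bits joined with the 26-bit field times four."""
    return ((pc + 4) & 0xF0000000) | ((addr26 & 0x03FFFFFF) << 2)


print(hex(base_address(0x1000, 0xFFFC)))      # base register 0x1000 with offset -4
print(hex(branch_target(0x00400000, 0xFFFF))) # branch offset of -1 word
print(hex(jump_target(0x00400000, 0x0100000)))
```

A branch offset of -1 word targets the branch itself, because the offset is applied to PC + 4 rather than to the address of the branch.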

2.3.2 A MIPS I pipeline

An ISA does not define the implemented hardware besides register-file and addressing
modes. A distinction is made between an architecture and a microarchitecture
where the latter includes implementation details. This means that two different
processors can support the same ISA while being fundamentally different
at the microarchitecture level. In this section a MIPS I compliant microarchitecture
will be presented. The microarchitecture utilizes pipelining, a concept
used in most modern processors that offers higher performance at the expense of
design complexity.

The speed of a processor, and systems in general, depends on latency and throughput
of the data passing through it [2]. Low latency is preferred for systems that are re-
quired to be responsive and deliver results in a timely manner. In contrast, through-
put is beneficial for systems that prioritize computational performance over time-
liness. Latency and throughput are often contradictory in the sense that measures
that improve one degrade the other [2]. In general-purpose computing, through-
put has historically been more important than latency. Throughput can mainly be
improved by exploiting instruction-level parallelism (ILP), i.e., by executing multiple
instructions at the same time. Parallelism can be divided into spatial and
temporal parallelism [2]. Spatial parallelism entails executing more instructions si-
multaneously by utilizing increased computational resources. In contrast, temporal
parallelism implies dividing the existing computational resources into discrete steps
where each step is utilized by different instructions. Spatial parallelism has the
benefit of increasing throughput with little or no impact on latency [2]. However,
spatial parallelism requires additional hardware resources and results in larger and
more complex designs. Conversely, temporal parallelism sacrifices latency to in-
crease throughput while only requiring limited hardware additions to the design in
the form of a few registers and control logic.

In the context of microarchitecture temporal parallelism is more commonly referred
to as pipelining and the concept has been used in most processors for the last three
decades [32]. Pipelining is implemented by dividing a processor’s data path, i.e.,
computational resources, into stages separated by pipeline registers which limit the
logic paths of the design to that between two consecutive pipeline registers. In effect,
the design can meet stricter timing constraints and thus run at a significantly reduced
cycle time. The reduced cycle time allows the design to be clocked at a higher rate,
which causes a reduction of the execution time since the pipelined processor executes
(ideally) one instruction each cycle. The discrete stages allow the pipelined processor
to achieve temporal parallelism with several in-flight instructions.

The goal of a pipeline design is to evenly distribute the datapath’s logic between
the different pipeline stages. In a perfectly balanced n-stage pipeline the cycle
time of the design is roughly 1/n of the cycle time of a corresponding un-pipelined
design [1]. However, in practice the stages in the pipeline are seldom balanced
perfectly, resulting in some stages requiring more time to finish their execution. The
critical path, which imposes the lower bound on the design cycle time, is thus found
in these stages. Furthermore, pipelining also introduces some performance overhead.
A small part of this overhead is the delay introduced by the inserted pipeline registers,
but the by far more substantial overhead is caused by dependencies between
instructions moving down the pipeline [2]. These dependencies are called hazards
and will be discussed in greater detail later in this section. In short, hazards increase
the cycles per instruction (CPI), or equivalently decrease the instructions per cycle
(IPC), which has a detrimental effect on performance. The overhead caused by the
pipeline registers and hazards
increases the latency of each individual instruction in a pipelined processor [1].
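The effect of stage balance, register overhead and hazards on execution time can be sketched numerically. The model below is a simplification under stated assumptions: the cycle time is set by the slowest stage plus a fixed pipeline-register overhead, and execution time is the instruction count times CPI times cycle time. All delays (in ns) and CPI values are invented for illustration.

```python
def cycle_time(stage_delays, t_reg):
    """Cycle time is bounded by the slowest stage plus the pipeline-register overhead."""
    return max(stage_delays) + t_reg


def execution_time(n_instructions, cpi, t_cycle):
    """Execution time = instruction count * cycles per instruction * cycle time."""
    return n_instructions * cpi * t_cycle


t_reg = 0.1                               # assumed register overhead in ns
balanced = [1.0, 1.0, 1.0, 1.0, 1.0]      # 5 ns of logic split evenly
unbalanced = [0.8, 0.9, 1.4, 1.0, 0.9]    # same 5 ns of logic, uneven split

t_bal = cycle_time(balanced, t_reg)
t_unbal = cycle_time(unbalanced, t_reg)
print(f"balanced cycle time:   {t_bal:.1f} ns")
print(f"unbalanced cycle time: {t_unbal:.1f} ns")

# Hazards raise the CPI above the ideal value of 1 and lengthen execution.
ideal = execution_time(1000, 1.0, t_unbal)
with_hazards = execution_time(1000, 1.3, t_unbal)
print(f"ideal: {ideal:.0f} ns, with hazards: {with_hazards:.0f} ns")
```

Both pipelines contain the same total logic delay, yet the unbalanced one must be clocked by its slowest stage, which is why the critical path is found there.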

An example of a pipeline implementing the MIPS ISA is shown in Fig 2.7. The
pipelined processor performs operations in five discrete stages separated by pipeline
registers as shown in the figure. The stages are instruction fetch (IF), instruction
decode (ID), execute (EX), memory access (MEM) and write-back (WB).

Fig 2.8 shows the same pipeline as Fig 2.7 but it also shows the control unit. The
control unit generates control signals in the decode stage and the signals are propagated
alongside the instruction in a control-path that in each stage reflects the
instruction’s individual needs.


Figure 2.7: A MIPS I 5SP [2]

Figure 2.8: A MIPS I 5SP augmented with a control unit [2]

All instructions traverse the datapath one stage at a time and need five clock cycles
to fully traverse the pipeline. Fig 2.9 illustrates an example of an instruction flow.
The first instruction is fetched in the first cycle and stored in the subsequent pipeline
register. In cycle two a new instruction is fetched while the preceding instruction is
decoded; both of the instructions are stored in the respective pipeline register after
the stage they passed through. In the third and fourth cycle yet another instruction
is fetched while the earlier instructions proceed through the pipeline. In cycle five the
pipeline is utilized fully with an instruction being executed in all stages. The first
instruction has now reached the WB stage and is written back (if R-type or load) to
the register-file. After cycle 5 the pipeline should ideally remain fully utilized until
the program is completed.
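The ideal instruction flow described above can be reproduced with a few lines of Python. This is only a sketch of the hazard-free case; the helper names are hypothetical, instructions are numbered from 0 and cycles from 1.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_of(instr_index, cycle):
    """Stage occupied by an instruction in a given cycle of an ideal pipeline, or None."""
    position = cycle - 1 - instr_index
    if 0 <= position < len(STAGES):
        return STAGES[position]
    return None


def occupancy(n_instructions, cycle):
    """Map each occupied stage to the index of the instruction it holds in this cycle."""
    result = {}
    for i in range(n_instructions):
        stage = stage_of(i, cycle)
        if stage is not None:
            result[stage] = i
    return result


# Print one row per cycle showing where the first five instructions are.
for cycle in range(1, 8):
    row = [stage_of(i, cycle) or "-" for i in range(5)]
    print(cycle, row)
```

In cycle five all stages are occupied, with instruction 0 in WB and instruction 4 in IF, which corresponds to the fully utilized pipeline in the text.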


Figure 2.9: Instruction sequence in a 5SP.

However, in reality pipelined processors do not achieve full utilization in most cases
because of the occurrence of pipeline hazards. Hazards are defined as dependencies
between consecutive instructions in the pipeline. Hazards are divided into three cat-
egories; data hazards introduced by arithmetic and load/store instructions, control
hazards introduced by flow-control instructions, e.g., branches, and lastly structural
hazards where in-flight instructions compete for pipeline resources. Data and con-
trol hazards will be explained in greater detail but structural hazards, which are
non-existent by design, will not be elaborated on.

Data hazards occur when a subsequent instruction needs data generated by a previous
instruction. For instance, an add instruction is followed by a subtraction that
uses the value produced by the addition. The addition instruction is unable to reach
the WB stage before the subtraction clears the ID stage and finishes the register-file
access, thus entering the EX stage with incorrect operands. This is called a read
after write (RAW) hazard and if unaddressed would lead to program errors. A less
elegant solution would be to stall the IF stage and wait for the instruction causing the
dependency to write back its result to the register-file. While simple, stalling the
pipeline increases the CPI and incurs performance losses. A more elegant solution
is forwarding. It is possible for the addition to provide the correct value to the
subtraction by passing it to the subtraction as it enters the EX stage. The addition
forwards the data causing the dependency to the subtraction. The pipeline shown
in Fig 2.10 has been augmented with a forwarding unit that controls the added forwarding
paths between the EX, MEM and WB stages. The forwarding unit reads
the Rs and Rt registers of the instruction entering the execute stage and compares them
to the Rd of the instructions entering the MEM and WB stages. If such an instruction is
an R-type instruction and the registers match, data is forwarded as needed.
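The comparison performed by the forwarding unit can be sketched as a small selection function. This follows the common textbook formulation rather than the exact logic of the pipeline in the figure: the function and argument names are hypothetical, the MEM stage takes priority because it holds the most recent result, and register 0 is never forwarded because it is hard-wired to zero in MIPS.

```python
def forward_select(src_reg, mem_dest, mem_writes, wb_dest, wb_writes):
    """Choose the source of one EX-stage operand.

    src_reg     register number read by the instruction in EX (Rs or Rt)
    mem_dest    destination register of the instruction in MEM
    mem_writes  True if the instruction in MEM writes the register-file
    wb_dest     destination register of the instruction in WB
    wb_writes   True if the instruction in WB writes the register-file
    """
    if src_reg != 0 and mem_writes and src_reg == mem_dest:
        return "MEM"       # newest value, produced one instruction earlier
    if src_reg != 0 and wb_writes and src_reg == wb_dest:
        return "WB"        # value produced two instructions earlier
    return "REGFILE"       # no hazard, use the value read in ID


# add $8, $9, $10 followed directly by sub $11, $8, $12:
# when the sub is in EX, the add is in MEM and its result is forwarded.
print(forward_select(8, mem_dest=8, mem_writes=True, wb_dest=0, wb_writes=False))
print(forward_select(12, mem_dest=8, mem_writes=True, wb_dest=0, wb_writes=False))
```

The function would be evaluated twice per cycle in this sketch, once for the Rs operand and once for the Rt operand.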

Figure 2.10: The 5SP augmented with a hazard detection unit [2]

Forwarding does not solve RAW hazards where a dependency exists between a load
and a subsequent instruction. Assume a load followed by an addition: the load
needs to propagate to the WB stage before the data is brought from memory. At
this point, however, the addition has already passed the EX stage and is entering
the MEM stage. The only solution to this problem is to stop the addition from
propagating in the pipeline by stalling it. This allows the load to propagate to the
WB stage where the load is able to forward data to the addition waiting in the
ID stage. The necessary additions to the pipeline and hazard detection are shown in
Fig 2.11. The pipeline register between the IF and ID stages now has an enable signal
that when asserted forces it to hold its contents. The pipeline register between
the ID and EX stages has an additional clear signal that sets the register contents
to zero, which effectively stops random data from propagating down the pipeline
after the load instruction. Instead, the pipeline stages after the load are idle or
conceptually executing a no-operation (NOP) instruction. Additional inputs to the
hazard detection unit are added to allow it to detect hazards that require stalls.
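A stall condition of this kind can be sketched as follows. The sketch uses the textbook-style check with the load in the EX stage and the dependent instruction in the ID stage, which may differ in timing from the pipeline in the figure. The function names and the signal names `hold_if_id` and `clear_id_ex` are hypothetical stand-ins for the enable and clear signals described above. A MIPS load writes the register named by its rt field.

```python
def load_use_hazard(id_rs, id_rt, ex_is_load, ex_rt):
    """True if the instruction in ID reads the register a load in EX will write."""
    return ex_is_load and ex_rt != 0 and ex_rt in (id_rs, id_rt)


def stall_controls(id_rs, id_rt, ex_is_load, ex_rt):
    """Derive the pipeline-register controls for one cycle.

    hold_if_id   keep the IF/ID register contents (the dependent instruction waits)
    clear_id_ex  zero the ID/EX register so a NOP follows the load
    """
    hazard = load_use_hazard(id_rs, id_rt, ex_is_load, ex_rt)
    return {"hold_if_id": hazard, "clear_id_ex": hazard}


# lw $8, 0($29) followed directly by add $10, $8, $9: a stall is required.
print(stall_controls(id_rs=8, id_rt=9, ex_is_load=True, ex_rt=8))

# The same add after an R-type instruction needs no stall, forwarding suffices.
print(stall_controls(id_rs=8, id_rt=9, ex_is_load=False, ex_rt=8))
```

Holding the earlier pipeline register while clearing the later one is what inserts the NOP bubble between the load and its consumer.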

Control hazards are caused by branch and jump instructions because they update
the PC. Assume that a branch instruction is fetched from instruction memory. The
pipeline is oblivious to the fact that the branch could redirect the IF to a different
portion of the program and erroneously continues to fetch instructions sequentially.
When the branch is resolved to be taken in the EX stage (see Fig 2.8) two instructions
from the wrong execution path have already been fetched. The pipeline would
then need to be flushed (pipeline registers emptied). Alternatively the pipeline could
be stalled, i.e., instruction fetch halted. Both solutions would degrade performance
by increasing the CPI. A better solution is instead to use a delayed branch slot.
The delayed branch slot scheme relies on the compiler to move an instruction originally
placed before the branch to immediately behind it. The compiler must be
able to ensure that no dependencies are created when moving the instruction [1].
This scheme works reasonably well in the pipeline in Fig 2.8, but would still require
one stall cycle for the branch to be resolved in time for the instruction after the
delayed branch slot. However, this stall cycle can be avoided by moving the branch
resolution to the ID stage as depicted in Fig 2.12 below.


Figure 2.11: The 5SP with stall support [2]

A dedicated comparator has been added in the ID stage that operates immediately
on the fetched register contents. Likewise, the sign extension and address generation
have also been moved to the ID stage.

Figure 2.12: The 5SP with branch resolution in the ID stage [2]


While the stall cycle is eliminated in the pipeline in Fig 2.12 the early branch
resolution introduces additional RAW hazards. The branch condition could possibly
depend on a preceding instruction about to enter the EX stage and the lack of
forwarding paths from the EX stage to the ID stage where the branch is about to
be resolved could result in erroneous branching. However, forwarding paths can be
added and the hazard-detection unit could be expanded to detect and handle this
forwarding as shown in Fig 2.13.

Figure 2.13: The 5SP with branch resolution in ID stage with added forwarding
paths [2]

In this section a simple MIPS I 5-stage pipeline (5SP) was outlined. More advanced
pipelines are in use today and these are usually deeper than five stages. However,
deeper pipelining increases the occurrence of data hazards, which necessitates a
more complex control path. Adding more stages further decreases the logic per
stage, but increases the number of dependencies at the same time and ultimately
deeper pipelines will be stalled more than their simpler counterparts. Furthermore,
because of the minute logic in each stage, the setup time and input-to-output delay
of the pipeline registers become more prominent [2]. This causes diminishing returns
and a minimum in execution time can be found at a specific number of pipeline
stages. If energy is also considered, determining the number of stages becomes even
more daunting because power increases linearly with frequency (see Sec. 2.1.1), which
in turn grows higher with the number of pipeline stages. Optimum pipeline depth
is dependent on the architecture and the specific program being executed; therefore
there is no way to determine a general optimal number of pipeline stages [2].
Historically, processor pipelines grew deep to exploit the available ILP in the running
instruction stream. However, ILP is limited and exceedingly deep pipelines,
called super pipelines, only yielded marginally more performance while significantly
increasing the power dissipation. Approaching the power wall, a practical upper
limit on power dissipation due to discontinued Dennard scaling, and the advent
of portable computing made energy efficiency an important design goal [33]. The
design space became more complex as performance and energy efficiency on most
occasions warrant different design choices.
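The existence of a minimum in execution time can be illustrated with a toy model. The model is an assumption made purely for illustration and is not derived in the thesis: the logic delay is divided evenly over the stages, each stage adds a fixed register overhead, and the CPI grows linearly with depth to mimic the increasing number of stalls. All parameter values are invented.

```python
def time_per_instruction(n_stages, t_logic, t_reg, hazard_penalty):
    """Average time per instruction for an n-stage pipeline under the toy model."""
    t_cycle = t_logic / n_stages + t_reg          # less logic per stage, fixed overhead
    cpi = 1.0 + hazard_penalty * (n_stages - 1)   # assumed: stalls grow with depth
    return cpi * t_cycle


def best_depth(max_stages, t_logic, t_reg, hazard_penalty):
    """Pipeline depth with the lowest time per instruction in the range 1..max_stages."""
    return min(
        range(1, max_stages + 1),
        key=lambda n: time_per_instruction(n, t_logic, t_reg, hazard_penalty),
    )


# Assumed parameters: 10 ns of logic, 0.5 ns register overhead, 0.1 CPI per extra stage.
params = (10.0, 0.5, 0.1)
depth = best_depth(40, *params)
for n in (1, 5, depth, 40):
    print(n, round(time_per_instruction(n, *params), 3))
```

With these numbers the time per instruction first falls steeply, flattens, and then rises again as register overhead and stalls dominate. The location of the minimum depends entirely on the assumed parameters, in line with the observation that no general optimum exists.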

2.3.2.1 Caching

As mentioned before, if the register-file were the only memory available to the pro-
cessor, program complexity would be limited. However, as implied above, more
memory is available and the ISA defines how the memory is interfaced with the
processor. More available memory allows for more complex and useful programs to
be created. However, this memory would need to remain fast even though its size is
increased to not slow down the processor. Such memory, if it existed, would be very
expensive. A better solution can be found by taking into account how a program
is executed. Only a limited part of the program is executed at any given time and
due to code constructs such as loops the same parts of the program are likely to be
executed in the near future. The insights of spatial and temporal locality can be
used to construct a memory hierarchy that delivers on both speed and capacity at
less expense [1].

A conceptual view of a memory hierarchy is shown in Fig. 2.14. In the figure,
access times of the structures and their size are shown. As can be seen, smaller
memory is generally faster and kept closer to the processor. The register-file, as
discussed, is a very small and fast structure integrated into the datapath. The
cache is likewise integrated into the pipeline and is larger than the register-file and
consequently slower, but it is still fast enough to be accessed without imposing
intolerable performance losses [1]. Fig. 2.14 depicts the cache as several layers where
the level-one (L1) cache is small and fast followed by a larger, but slower level-two
(L2) cache. Modern designs stretch further with larger lower-level caches located
off chip or possibly integrated onto the chip [1]. Succeeding the caches is a larger
and slower (two orders of magnitude (OOM)) main memory. Lastly, the largest and
slowest part of the memory hierarchy is disk memory. The memory hierarchy is a
very complex system and describing it in its entirety is beyond the scope of this
report. Instead, the report is limited to describing the highest part of the hierarchy,
i.e., the caches.

In load-store architectures, such as MIPS I described in Sec. 2.3.1, the processor
is only allowed to interact with the memory through dedicated load and store in-
structions [2]. Consequently, if the processor needs data that is not present in the
register-file, a load instruction in the program instruction flow must bring it into the
register-file. This load instruction is directed to the cache. If the data is found in
the cache, a hit is generated and the data is sent to the processor. However, because
caches are small, it is likely that the data is not present and a miss is generated.
The memory access then continues searching in lower levels in the hierarchy until
the data is found. To support overlapping instruction fetch and data access, the L1
cache is usually split into separate caches, i.e., a level-one instruction cache
(L1IC) and a level-one data cache (L1DC), and this approach is referred to as the
Harvard cache model. Lower levels in the memory hierarchy are usually unified,
containing both data and instructions, according to the Princeton cache model [34].

Figure 2.14: A conceptual overview of a memory-hierarchy [1]

Caches are arrayed structures where each row, usually referred to as a cache line, is
addressable using the address generated by a store or load instruction [1]. The
simplest way to structure a cache is called direct-mapped, where each memory address
maps to one specific line within the cache. As caches are small, memory addresses
will overlap and map to the same location in the cache. Thus, the processor needs to
be able to distinguish between the addresses. This is achieved by storing the higher-
order address bits, called a tag, alongside the data and comparing these with the
address used to access the cache [1]. The portion of the cache that stores the tags is
referred to as the tag-array and the part that stores the data as the data-array. If the tag
matches the address used to access the cache a hit is generated. Conversely, if the
tag does not match the address a miss is generated and the cache line is replaced
by the requested data brought in from lower levels in the memory hierarchy.
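
The tag comparison in a direct-mapped cache can be sketched in a few lines. The model below stores tags only and ignores the data-array; the class name, the default of 128 lines and the 32-byte line size are illustrative assumptions.

```python
class DirectMappedCache:
    """Direct-mapped cache model that stores tags only (no data)."""

    def __init__(self, n_lines=128, line_bytes=32):
        self.n_lines = n_lines
        self.line_bytes = line_bytes
        self.tags = [None] * n_lines  # tag-array; None marks an empty line

    def split(self, addr):
        """Split a byte address into (tag, index); the line offset is dropped."""
        line_addr = addr // self.line_bytes
        return line_addr // self.n_lines, line_addr % self.n_lines

    def access(self, addr):
        """Return True on a hit; on a miss the line is replaced."""
        tag, index = self.split(addr)
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag  # fill from the lower level
        return False
```

With these defaults, addresses 0 and 4096 share index 0 but carry different tags, so alternating between them produces a miss on every access. This is the conflict behavior that more flexible mappings try to avoid.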

A cache that supports a more flexible address mapping is called n-way set-associative,
where n denotes the flexibility of the mapping [1]. A four-way associative cache is
effectively split into four ways, each able to store data mapped to one cache
address. At the extreme end of associativity is a fully associative cache, where
every address maps freely into the cache. Associative caches require more hardware
because every way needs to be searched for the proper cache line by comparing its
tag value with the requested address [1]. Furthermore, when an address
maps to a full set and generates a miss, a victim selection mechanism needs to be
in place. There are several techniques available but the least recently used (LRU)
or random selection schemes, or variations thereof, are usually employed. Again, this
adds to the hardware overhead of using a set-associative cache.
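
A minimal sketch of an associative lookup with LRU victim selection is given below. It models tags only, and the class name and default dimensions are illustrative assumptions rather than a description of any particular hardware implementation.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """n-way set-associative cache model with LRU replacement (tags only)."""

    def __init__(self, n_sets=128, n_ways=4, line_bytes=32):
        self.n_sets = n_sets
        self.n_ways = n_ways
        self.line_bytes = line_bytes
        # One ordered dict per set: keys are tags, oldest entry first.
        self.sets = [OrderedDict() for _ in range(n_sets)]

    def access(self, addr):
        """Return True on a hit; on a miss evict the LRU way if the set is full."""
        line_addr = addr // self.line_bytes
        index = line_addr % self.n_sets
        tag = line_addr // self.n_sets
        ways = self.sets[index]
        if tag in ways:
            ways.move_to_end(tag)  # mark as most recently used
            return True
        if len(ways) == self.n_ways:
            ways.popitem(last=False)  # victim selection: least recently used
        ways[tag] = True
        return False
```

The ordered dictionary stands in for the LRU bookkeeping that real hardware has to implement with extra state bits per set, which is part of the overhead mentioned above.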

Direct-mapped and associative caches both have their advantages and disadvantages.
Direct-mapped caches are simple in terms of hardware but generally suffer from a lower
hit rate than their associative counterparts [1]. In contrast, the higher hit rate of
associative caches comes at the expense of more hardware and thus higher power
dissipation. Which scheme to use depends on the application [1].

Irrespective of which cache scheme is used, store instructions pose problems
when it comes to memory coherency [1]. This problem is present in all layers in the
memory hierarchy, but is more acute in caches, which are latency sensitive. A store
operation will update the cache content in the L1 cache and unless this is reflected
in lower levels, e.g., the L2 cache, the memory is said to be incoherent. Should the
L1 cache line be evicted, the line cannot be restored and data is irreversibly lost.
Thus memory coherency is required to ensure program correctness. The simplest
solution, known as write-through, is to let every store propagate down the memory
hierarchy [1]. While simple, this solution increases the cache latency and hence the
execution time of the running application. A less penalizing scheme is write-back,
which writes the cache line back to lower levels in the hierarchy only when it is
evicted. The write-back approach requires extra bookkeeping hardware, called a dirty
bit, to indicate whether a write-back operation is necessary. Naturally, variations
and enhancements of these approaches exist but they will not be discussed.
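
The difference between the two store policies can be made concrete by counting the writes that reach the next level for a single cache line. The sketch below uses the common names write-through (every store propagates) and write-back; the event names and the function are illustrative assumptions.

```python
def lower_level_writes(trace, policy):
    """Count writes sent to the next level for a single cache line.

    trace  -- sequence of "store" and "evict" events on that line
    policy -- "write-through" or "write-back"
    """
    writes = 0
    dirty = False  # dirty bit, only meaningful for write-back
    for event in trace:
        if event == "store":
            if policy == "write-through":
                writes += 1  # every store propagates immediately
            else:
                dirty = True  # remember that the line was modified
        elif event == "evict":
            if policy == "write-back" and dirty:
                writes += 1  # one write-back covers all earlier stores
            dirty = False
    return writes
```

For a line that is stored to three times and then evicted, the first policy issues three lower-level writes and the second only one, at the cost of the dirty bit.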

Because misses are expensive, the performance of a memory hierarchy is to a large
extent determined by the miss rate as shown in Eq. 2.7 [1].

\[ \text{Cache access} = H_t + M_r \cdot M_p \tag{2.7} \]

The hit time ($H_t$) is the time paid for a successful cache access, the miss rate ($M_r$)
is the fraction of misses to the total access count and, lastly, the miss penalty ($M_p$)
is the cost to access lower levels in the hierarchy. This formula can easily be extended
to include more layers in the hierarchy, as shown in Eq. 2.8 where an L2 cache is
included.

\[ \text{Cache access} = H_{t,L1} + M_{r,L1} \cdot (H_{t,L2} + M_{r,L2} \cdot M_{p,L2} \ldots) \tag{2.8} \]

Depending on the workload, an increased average access time can be very detrimental
to performance [1]. As such, great care must be taken when designing the
memory hierarchy.
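
As a worked example of Eq. 2.7 and Eq. 2.8, the sketch below evaluates both expressions. The function names and the numbers used in the usage note are illustrative assumptions.

```python
def access_time(hit_time, miss_rate, miss_penalty):
    """Eq. 2.7: average access time for a single cache level."""
    return hit_time + miss_rate * miss_penalty


def access_time_two_level(ht_l1, mr_l1, ht_l2, mr_l2, mp_l2):
    """Eq. 2.8: the L1 miss penalty is the average access time of the L2."""
    return access_time(ht_l1, mr_l1, access_time(ht_l2, mr_l2, mp_l2))
```

With a one-cycle hit time, a 5% miss rate and a 100-cycle miss penalty, the single-level average is about 6 cycles. Adding an L2 with a 10-cycle hit time and a 20% miss rate in front of the same 100-cycle penalty reduces the average to about 2.5 cycles.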

2.4 Existing pipeline evaluation method

As stated in Sec. 1.1 this work aims at implementing a methodology that has been
used with success at Chalmers University of Technology. The methodology builds
on two components, an architectural simulator and pipeline and cache RTL. This
section will elaborate on these components, starting with the simulator in Sec. 2.4.1
followed by the RTL in Sec. 2.4.2 and lastly how they have previously been combined
in Sec. 2.4.3.

2.4.1 Architectural simulator

SimpleScalar is an execution-driven functional simulator capturing both the behavior
and performance of the simulated architecture [12]. The fact that it is execution-
driven is essential as this captures the dynamic behavior caused by branches and
cache misses of the underlying architecture, which can have a dramatic impact on
performance and energy. Furthermore, because SimpleScalar captures both the func-
tionality and performance of the architecture, correct program behavior is ensured
and accurate resource usage and time measurements (execution time in clock cycles)
are possible [12]. It can be argued that the simulator is too old and limited to single
core designs in an age of multi-core processors. Other more modern tools, with equal
or a super-set of SimpleScalar’s features, such as McPAT or Gem5 are available but
were turned down in favor of SimpleScalar because SimpleScalar was sufficient for
the relatively simple 5SP design that it was used to model.

SimpleScalar provides several different simulators of varying detail and speed [12].
The simplest and fastest simulator, called sim-fast, is a purely functional simulator
that does not account for time (cycles). In contrast, the most complex simulator,
sim-outorder, supports out-of-order issue, speculative execution and multiple issue
while also accounting for time.

The structure of SimpleScalar is shown in Fig. 2.15. The bpred block defines the
branch predictor behavior, the cache block defines cache behavior (cache size, asso-
ciativity and replacement technique), the regs block defines register related behavior
and the memory block the memory related behavior. The simulator core defines the
datapath’s architecture and it is by far the most substantial block.

Figure 2.15: Modular structure of SimpleScalar

The simulators support configuration through configuration files that are provided
to the simulator of choice when calling it from the command line [12]. The config-
uration files allow features such as branch resolution, cache parameters, speculative
execution, decode width, issue width and number of functional units to be tweaked
without the need to rebuild the simulator.
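
To illustrate what such a configuration might contain, the sketch below generates a few option lines. It assumes the option names and the `<name>:<nsets>:<bsize>:<assoc>:<repl>` cache description used by the stock sim-outorder; these should be checked against the help output of the simulator build in use. The helper names and default values are hypothetical.

```python
def cache_option(name, size_bytes, line_bytes, n_ways, repl="l"):
    """Build a '<name>:<nsets>:<bsize>:<assoc>:<repl>' cache description."""
    n_sets, rest = divmod(size_bytes, line_bytes * n_ways)
    if rest or n_sets == 0 or n_sets & (n_sets - 1):
        raise ValueError("size must give a power-of-two number of sets")
    return f"{name}:{n_sets}:{line_bytes}:{n_ways}:{repl}"


def config_lines(l1_size=16 * 1024, line_bytes=32, n_ways=4):
    """Lines for a configuration file; option names are assumed, see above."""
    return [
        f"-cache:il1 {cache_option('il1', l1_size, line_bytes, n_ways)}",
        f"-cache:dl1 {cache_option('dl1', l1_size, line_bytes, n_ways)}",
        "-issue:inorder true",
        "-issue:width 1",
        "-decode:width 1",
    ]


if __name__ == "__main__":
    print("\n".join(config_lines()))
```

Deriving the number of sets from the total size keeps the description consistent when the cache size or associativity is changed.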

The simulator used in the methodology is based on a modified version of the
sim-outorder simulator. The modifications were implemented to reduce the out-of-order
pipeline modeled by the simulator to an in-order pipeline similar to the RTL pipeline
described in Sec. 2.4.2 below. The base simulator was then augmented with specialized
performance counters that tracked the usage of pipeline resources relevant to each
project. Because of frequent use in various projects, the base simulator had been
modified, sometimes extensively, to fit the needs of each new project.

2.4.2 RTL design and verification

The RTL used in the methodology captures a 5SP MIPS I pipeline design developed
at Chalmers. The 5SP has been enhanced with L1IC and L1DC caches with one-cycle
access latency, also developed at Chalmers. This setup has since been used in several
well-received publications [4][35].

The implemented microarchitecture features some 50 instructions including different
branches, logic and memory instructions and a register-file with 32 general-purpose
32-bit registers. This microarchitecture does not include a floating-point unit,
which was motivated by the targeted embedded market
where floating-point operations usually are replaced by fixed-point calculations.

An overview of the microarchitecture is shown in Fig. 2.16 and it is similar to the
pipelines discussed in Sec. 2.3.2. The instructions are processed in five stages; IF,
ID, EX, MEM, and WB. In the IF stage, instructions are read from the instruction
cache from an address pointed to by the PC register, which is updated to point
to consecutive instructions or to branch target addresses. During the ID stage the
register-file is accessed and control signals for later stages are set based on the
instruction type. Branch and jump instructions are resolved in the ID stage, but by
the time they are resolved the next instruction has already been fetched. A branch
delay slot is utilized to solve this problem and is accounted for by the compiler
(see Sec. 2.3.2). In the EX stage arithmetic and logic operations are executed in
an arithmetic logic unit (ALU). A dedicated two-stage multiplication unit is also
available, spanning the EX and MEM stages. In the memory access stage, loads
and stores access the L1DC. Finally, in the WB stage, results are written back
to the register-file. In the RTL code the MEM and WB stages were combined to
simplify the implementation. However, the combined stage logically functions as
two separate stages.

A hazard-detection unit, which physically resides in the ID stage but is shown sep-
arately in Fig. 2.16, detects any potential hazards and stops the pipeline by stalling
the IF stage. In this manner, NOPs are inserted into the pipeline. The cache also
produces a stall signal, which is asserted upon a cache miss. In contrast to the
hazard stalls, cache misses stall the entire pipeline. In Fig. 2.16, the arrows
pointing to the pipeline registers denote the stall signals. The microarchitecture
does not support exceptions, but these are by design rare events. Exceptions are
necessary to support I/O, system calls and recovery from errors (Invalid Opcode
etc.).

Figure 2.16: Microarchitectural overview of the 5SP.

Figure 2.17: Memory hierarchy of the 5SP.

The design includes L1IC and L1DC caches. These caches are separate from each
other according to the Harvard architecture to avoid structural hazards as explained
in Sec. 2.3.2.1. No L2 cache is included in the design; instead, an ideal memory module
serves as a replacement for the lower levels of the memory hierarchy as shown in
Fig. 2.17. The data cache is available for read and write accesses, while the instruction
cache only serves reads. However, the instruction cache still needs to access
external memory on cache fills and in the case of a cache miss. The two caches share
one memory bus to the external memory and a memory controller (arbiter) decides
which of the caches is allowed to access the external memory. Both
caches were designed to be flexible and allow for any size the user desires. However,
because the RTL has been used in several projects, which sometimes required extensive
modifications to the RTL code, much of this flexibility was lost. Previously
the associativity could be set to zero (effectively direct-mapped cache), two-way
or four-way with replacement algorithms LRU or pseudo-random. The cache also
supported banking whereby cache lines are split across separate memory macros.

2.4.2.1 Design and verification flow

The existing evaluation method loosely defines an RTL design and verification flow
which has been used to verify and extract power from the pipeline RTL. The RTL
design was first brought through a functional verification as described in Sec. 2.2
followed by a cell-based synthesis and then PnR. From the placed and routed netlist,
power estimates of the design were obtained. As the power estimates are obtained
from a complete pipeline design that has been synthesized and placed and routed to
meet a set timing constraint, they capture the synergy between the different
components in the pipeline. The synergy is due to the fact that logic paths stretch
over several components, i.e., the speed of one component imposes speed requirements
on subsequent components. These logic paths are then adapted to meet the
imposed timing constraint, which is achieved through transistor sizing. However,
the evaluation method focused solely on the caches of the RTL design, which allowed
for probabilistic testing. The probabilistic approach was used to obtain power
estimates of peripheral units of the cache such as the DTLB, arbiter and replacement
logic. The power of the actual static random access memory (SRAM) cuts
was obtained from the library files used during synthesis. The existing evaluation
methodology strikes a balance between quick prototyping and estimation accuracy
but neglects scalability. Changes to the cache would require a complete reiteration
of the design flow, with a lot of effort spent on PnR and power estimates.

2.4.3 Ad-hoc combination of RTL and simulator

The RTL and SimpleScalar components of the methodology were then combined in a
way specific to each project. Performance counters were introduced for each project
and power estimates were extracted from RTL structures that were represented by
these performance counters. An example of a prior application is the STA (Spec-
ulative Tag Access) project where power estimates were extracted from the caches
through probabilistic techniques and the simulator was augmented with performance
counters that monitored cache access patterns [4]. Similar approaches were used in
several other publications with the main exception being the introduction of additional
RTL structures and different performance counters [35][36][37]. However, the
new RTL was not integrated into the pipeline but instead analyzed separately.

3
A unified evaluation framework

The goal of this work is to develop an energy estimation framework for pipelines
that captures the ad-hoc methodology outlined in Sec. 2.4. As discussed, the pipeline
register-transfer level (RTL) code and SimpleScalar simulator constitute the two
major components of the framework. The methodology was established prior
to this work, albeit in an ad-hoc manner, so this section will instead elaborate
on the methods that allow the two components to be integrated into one coherent
framework.

3.1 Framework workflow

The conceptual workflow of the framework is shown in Fig. 3.1 where the RTL and
architecture simulator constitute the two branches of the flow. The RTL branch
consists of an RTL verification flow similar to the one described in Sec. 2.4.2.1,
which will be implemented to verify the RTL and later netlists. Slight alterations to
this approach are necessary for it to be scalable and applicable to a range of designs.
For instance, the power estimation previously based on probabilistic methods is
replaced by a use-case-based method, which allows the whole design to be studied
on a pipeline unit basis. Whether the power estimates are averaged or time-based
is similarly a matter of scalability. Time-based analysis produces considerably more
data than averaged analysis but allows for detailed analysis of the power dissipation.
In contrast, averaged analysis is easier to integrate into a scalable framework since less
data are produced. However, as the power is averaged over a unit of time, the power
dissipation for units with low utilization is amortized over the estimation interval.
The issue can be addressed by introducing resource counters in the testbenches used
in the verification flow and scaling the final average power according to the counters
as shown in Eq. 3.1.

\[ P_{\text{scaled}} = \frac{P_{\text{unscaled}}}{\text{Utilization}} \tag{3.1} \]

where Utilization is the fraction of the total execution time, in either seconds
or cycles, during which the unit is used. Furthermore, doing the power estimates after
the place and route (PnR) creates an issue of scalability as all designs are different
and would need to be placed and routed manually. Instead, one design will be placed
and routed and serve as an indication for how much the power increases post PnR.

Figure 3.1: The methodology embodied in the CREEP framework. (Flowchart with the
boxes CREEP: START, RTL verification, Architecture simulation, Power estimation,
Resource:Power mapping and CREEP: END.)
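
The scaling in Eq. 3.1 can be sketched as follows. The function name, the milliwatt unit and the example values in the usage note are assumptions made for illustration.

```python
def scaled_power(avg_power_mw, active_cycles, total_cycles):
    """Eq. 3.1: power of a unit while it is active, from its averaged power."""
    if not 0 < active_cycles <= total_cycles:
        raise ValueError("active_cycles must be in (0, total_cycles]")
    utilization = active_cycles / total_cycles
    return avg_power_mw / utilization
```

For example, a unit that reports 0.2 mW on average but is only active in 5,000 of 100,000 cycles has a utilization of 0.05, which gives a scaled power of roughly 4 mW for the cycles in which it is actually used.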

The simulator branch consists of SimpleScalar as described in Sec. 2.4.1. The sim-
ulator is used to acquire accurate per cycle resource usage using workloads that
are impractical (impossible) to use during RTL power estimation due to their com-
plexity and size. These resource usage statistics are then combined with the power
estimates obtained from the RTL.

There are two issues that arise in the combination of the power estimates and simulator
statistics. Firstly, the mismatch between resource counters and RTL power
estimates must be bridged and secondly, the pipeline power must be mapped to resource
counters in a manner that does not systematically over- or underestimate the
energy. The first issue stems from the format of the power reports: the units used, the
grouping into different sources of power dissipation, e.g., leakage and switching
power, and whether the power reports are time-based or averaged. Unit conversions are
trivial and will be done to obtain energy per cycle in order to combine the power estimates
with resource counters. The power grouping should likewise pose no issues but will
require some consideration of scalability as dealing with different types of power
dissipation sources will complicate the framework workflow. The second issue is
brought about by the different structure and granularity of the architectural simulator
and the RTL code. The resource mapping will be done by carefully grouping
the RTL pipeline units and introducing selective performance counters in the architectural
simulator where it is necessary. The granularity of the resource mapping
will be refined incrementally from coarse to fine.
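
The unit conversion and the combination with resource counters can be sketched as below. The dictionary layout, unit names and all values in the usage note are hypothetical; only the arithmetic (power times clock period gives energy per cycle, which is then multiplied by the number of cycles a unit is used) follows the text.

```python
def energy_per_cycle_pj(power_mw, clock_period_ns):
    """Convert power to energy per clock cycle: mW * ns = pJ."""
    return power_mw * clock_period_ns


def total_energy_pj(unit_power_mw, active_cycles, clock_period_ns):
    """Sum, over all units, energy per active cycle times cycles of use."""
    return sum(
        energy_per_cycle_pj(unit_power_mw[unit], clock_period_ns) * cycles
        for unit, cycles in active_cycles.items()
    )
```

With an assumed 2.5 ns clock period, a unit dissipating 2 mW for 600 cycles and another dissipating 10 mW for 300 cycles would contribute 3,000 pJ and 7,500 pJ, respectively.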

The combination of said energy estimates and performance counters will be auto-
mated in order for the framework to be able to reproduce results consistently, which
is stated as a goal in Sec. 1.1. Furthermore, to meet the configurability goal the user
shall be able to configure the RTL and simulator components. Settings related to
the cache and possibly to the pipeline will be exposed to the user, but the user will
not make any changes to the components themselves. Instead, this process will also
be automated thus ensuring that the components are combined consistently. Inconsistencies
between the two components would needlessly degrade the energy estimates.

3.2 Verification

Verification of the framework will be done at the component level. For the RTL
the functional verification is inherent to the established RTL verification and power
estimation flow. However, the power estimates themselves need to be verified. As
the RTL has been used in previous projects, estimates for parts of the design, more
specifically the caches, are available. These estimates will be used to do a rudimentary
evaluation of the power estimates.

For the simulator the modifications as well as the added resource counters need
to be verified. The modifications can be verified by comparing the default per-
formance counters to the counters produced by the modified simulator. Several
counters should indicate in-order, non-speculative and single-issue behavior. The
added resource counters can similarly be validated by comparing them to default
counters produced by SimpleScalar.
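
A check of this kind could look like the sketch below. It assumes that the counters have already been parsed into a dictionary and that they carry the names used by the stock sim-outorder statistics (`sim_num_insn`, `sim_total_insn`, `sim_cycle`); both the names and the function are assumptions, not the checks actually used in this work.

```python
def check_in_order_single_issue(stats):
    """Return a list of violated expectations for an in-order, single-issue run.

    stats -- dict of counter name -> value, e.g. parsed from simulator output
    """
    problems = []
    if stats["sim_total_insn"] != stats["sim_num_insn"]:
        problems.append("executed != committed: speculative work was done")
    if stats["sim_num_insn"] > stats["sim_cycle"]:
        problems.append("IPC above 1: more than one instruction per cycle")
    return problems
```

An empty list indicates that the counters are at least consistent with non-speculative, single-issue behavior; it does not prove that the modified simulator is correct.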

4
Implementation

This chapter will outline the implementation of Chalmers RTL-based energy eval-
uation framework for pipelines (CREEP), starting with the implementation of the
register-transfer level (RTL) flow in Sec. 4.1.1 followed by the simulator in Sec. 4.1.2.
How these two are integrated into the framework workflow is then discussed in
Sec. 4.2. Lastly, the work related to automating the framework will be outlined in
Sec. 4.3.

4.1 Implementation of framework components

The RTL of the framework describes a 5-stage pipeline (5SP) that supports the
integer subset of the 32-bit MIPS I instruction set architecture (ISA) (see Sec. 2.4.2).
However, in order for the RTL to fulfill the framework’s needs, several modifications
to it were necessary. Furthermore, an integrated circuit (IC) design and verification flow
with emphasis on scalability was implemented in order to obtain power estimates
from a wide range of designs. The SimpleScalar simulator was similarly modified to
fit into the framework. For clarity this section is split up into three parts. The
first part, Sec. 4.1.1, discusses the RTL modifications and establishment of the
design and verification flow. The second part, Sec. 4.1.2, elaborates on the simulator
modifications. The last part, Sec. 4.1.3, deals with the framework’s configurability.

4.1.1 RTL modifications

As previously mentioned in Sec. 2.4.2 the caches were originally designed to have
adjustable dimensions, associativity and replacement techniques. However, the
pipeline design was used in other projects prior to the framework development, during
which the level-one instruction cache (L1IC) and level-one data cache (L1DC)
were fixed to a 16 kB, 4-way associative configuration with LRU as replacement
algorithm. To meet the goal of configurability stated in Sec. 1.1, it was decided that
the caches should be restored to their original flexible condition.

The main limitation of the cache components in their initial condition was
set by the use of 1024x32-bit static random access memory (SRAM) memories for the
data-arrays and 128x32-bit SRAM memories for the tag-arrays. This corresponded
to 128 sets with a line size of 8 instructions for the L1IC or 8 words for the L1DC.
Furthermore, the setup was locked to a 4-way configuration producing a cache of
16 kB in total (4 × 1024 × 32 bits). The tag arrays were optimized to fit four 21-bit
tags (tag + dirty bit) into three SRAM memories rather than the conventional four
memories (one per way). The rationale behind this was that three tag arrays were
sufficient to hold four tags (21 × 4 = 84 < 96). The code was modified to instead
map the tags into four tag-arrays. This change allowed the associativity to be set
within the original bounds of direct-mapped to 4-way associative. However, this
caused an increase in the bit overhead in the tag-arrays as only 21 of 32 bits were
used. This was deemed unavoidable since SRAM memory matching the tag width
of 21 bits was unavailable.
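
The figures quoted above can be re-derived from the cache organisation. The sketch below assumes 32-bit addresses, which is an assumption made here because it reproduces the stated 21-bit stored tag; the function name and the dictionary keys are hypothetical.

```python
def cache_geometry(n_sets=128, words_per_line=8, n_ways=4,
                   word_bits=32, addr_bits=32):
    """Derive size and tag width for the cache organisation described above."""
    line_bytes = words_per_line * word_bits // 8
    size_bytes = n_sets * line_bytes * n_ways
    offset_bits = (line_bytes - 1).bit_length()
    index_bits = (n_sets - 1).bit_length()
    tag_bits = addr_bits - index_bits - offset_bits
    return {"size_bytes": size_bytes, "tag_bits": tag_bits,
            "stored_tag_bits": tag_bits + 1}  # + dirty bit
```

With the defaults this gives 16,384 bytes and a stored tag of 21 bits. Four such tags occupy 84 bits and therefore fit into three 32-bit words, whereas one tag per 32-bit word leaves 11 bits unused in each of the four tag-arrays.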

Additional SRAM memories of different sizes were introduced to allow for additional
cache sizes of 8kB and 32kB. Furthermore, banking was reintroduced which can be
used to divide the cache lines between several SRAM memories. In addition to the
LRU replacement algorithm, the pseudo-random replacement algorithm was also
reintroduced.

Because of licensing issues no SRAM memories can be shipped with the frame-
work. To make the RTL available without SRAMs the caches were augmented with
the ability to use logic-based memory (flip-flops). Compared to SRAM, logic-based
memory produces larger caches (in area) that dissipate more power.

Other ways of allowing more flexibility in the pipeline design were considered, but
the remaining options were related to the datapath. Changes to the datapath,
such as a wider issue width, speculative execution or different branch resolution
techniques, would all essentially warrant a complete redesign of all or some
pipeline stages. Furthermore, allowing such flexibility was outside the scope of the
framework, which targets simpler embedded processors.

4.1.1.1 Design verification flow

A verification flow built on the methods described in Sec. 2.4.2.1 was established.
Cadence IES was the electronic design automation (EDA) tool of choice for hardware
description language (HDL) simulations. More specifically, NCVHDL was used
to compile the RTL code, NCELAB was used to elaborate the design and NCSIM
was used for simulations. To verify the design, a testbench was constructed around
the pipeline design. As the design is complex, test vectors were chosen as stimuli for
the design. More specifically, the vectors were directed to test the design’s implementation
of the MIPS I ISA through instruction set simulations using executables
compiled for the MIPS I ISA. RTL simulations of large designs are time
consuming but can be facilitated through the use of small and effective workloads
with ample test coverage. A good match was found in the EEMBC benchmark suite,
which targets embedded processors. The benchmarks in
the suite are lightweight and utilize fixed-point arithmetic. The benchmarks used
in the framework are listed below:

• Autocorrelation

• Convolutional Encoder

• FFT/IFFT

• Viterbi Decoder

• RGBCMY01 (Consumer RGB to CMYK)

The next stage in the verification flow is synthesis. The EDA tool of choice for
synthesis was Synopsys Design Compiler (DC), which was chosen because it is the
de facto standard tool used by the community. The synthesis was done using the
compile_ultra command and cells from the 65 nm low power (LP) low threshold
worst-case corner library (1.1 V, 125 °C) provided by ST Microelectronics. A corresponding
library was used for the SRAM memories in the caches. These libraries
represent the worst-case process corner and use scenario (extreme temperature and
low voltage) and were used because of the strict performance requirements placed
on ICs. Additionally, automatic clock-gating was enabled in an effort to reduce the
design’s dynamic power dissipation [38]. Clock-gating is widely used in the industry
because of its potentially large energy savings at little extra design effort. Most
EDA tools specializing in synthesis support it, DC included. The synthesis was carried
out for increasingly strict timing constraints to find the maximum achievable clock
frequency of the design, and the design was established to meet a timing constraint
of 2.5 ns, producing a netlist running at 400 MHz. The netlist verification was done
using the same testbench developed for RTL verification.

The final stage is place and route (PnR) for which Cadence Encounter was used.
As described in Sec. 2.4.2.1 PnR produces yet another netlist, but this netlist has
now been subjected to a number of structural changes to facilitate physical implementation.
PnR was necessary to include in the verification flow for two
reasons. Firstly, the utilized SRAM memories are already placed and routed and
thus dissipate significantly more power than the rest of the design unless this is also
placed and routed. Secondly, a placed and routed design allows for more accurate
power estimations than a post-synthesis design (see Sec. 2.2). However, the PnR
stage is unique to each design, which conflicted with the desired scalability of the
verification flow. A more scalable approach was implemented that estimated the
PnR impact on power dissipation by comparing the power of one placed and routed
netlist to a post-synthesis netlist and from this comparison a scaling factor could
be deduced. The rationale behind this approach was that the pipeline was not sub-
jected to any modifications (see Sec. 4.1.1) and remains relatively unaffected by the
configuration of the caches.
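
The scaling-factor approach amounts to a single ratio that is reused for other configurations. The sketch below uses hypothetical function names, and the values in the usage note are illustrative rather than measured.

```python
def pnr_scaling_factor(post_pnr_mw, post_synthesis_mw):
    """Ratio between placed-and-routed and post-synthesis power of one design."""
    return post_pnr_mw / post_synthesis_mw


def estimate_post_pnr(post_synthesis_mw, factor):
    """Apply the reference factor to another design's post-synthesis power."""
    return post_synthesis_mw * factor
```

If the reference design dissipated 10 mW after synthesis and 12 mW after PnR, the factor would be 1.2, and a second configuration with 8 mW post-synthesis power would be estimated at about 9.6 mW. The estimate is only as good as the assumption that PnR affects all configurations similarly.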

4.1.1.2 Design power estimations

It was not possible to use the power-estimation method used in the ad-hoc methodology
in order to obtain power estimates of the design (see Sec. 2.4.2.1). The main
reason was the larger scope of the power estimates, which previously were limited
to the L1DC, but now included the entire pipeline. As such, using a probabilistic
approach was infeasible. Instead, the power dissipation of the design was estimated
by using use-case statistical simulations whereby switching activities for the nodes
in the design were obtained. Two different statistical methods were considered. The
first was switching activity interchange format (SAIF) generation.
During SAIF generation the average switching activities of the nodes in the
design are recorded throughout simulation, which allows an average power estimate
of the design to be produced. The second was value change dump (VCD) generation,
which tracks the nodes’ switching on a per-cycle basis and thereby allows for time-based
power analysis. The VCD-based method allows for detailed analysis of the power
dissipation, e.g., maximum power dissipation analysis, but the usage of VCD is computationally
complex and hence less scalable than the SAIF-based method. Thus,
VCD generation was dropped in favor of SAIF generation. Cadence NCSIM was
used to simulate the netlist using the aforementioned RTL testbench and the previously
listed EEMBC benchmarks as stimuli. A total of five SAIF generations
were done (one per EEMBC benchmark).

Synopsys PrimeTime (PT) was the EDA tool used to generate the final power
estimates [39]. PT was first used to remap the gate netlist to a different cell
library from the one used during synthesis. In contrast to the synthesis, which
was done with the worst-case high-temperature corner library, the nominal corner
library (1.2 V, NOM, 25 °C) was used to generate the power reports. The nominal
process corner, voltage and temperature were chosen to provide nominal power
estimates for the pipeline design and thus allow different pipeline
configurations to be compared under normal circumstances. The power estimation
was done by reading the netlist and each of the aforementioned SAIF files. Thus,
a total of five power reports were produced and averaged to create the final
design power estimate. Hierarchical reports were produced for the design, with
the granularity initially set to reveal the major pipeline units within each
pipeline stage. The granularity was later adjusted to better suit the
performance counters generated by the simulator component (Sec. 4.1.2). PT
reported the power divided into three categories: 1) switching power, 2)
internal power and 3) leakage power. To simplify the workflow, the sum of these
three powers was used for each reported pipeline unit.
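The post-processing of the power reports can be sketched as follows. This is a minimal illustration under stated assumptions, not the scripts used in this work: the data layout, the function names and all numeric values are hypothetical, and parsing of the actual PT report format is left out. For each unit, the three power categories are summed per report and the totals are then averaged across the reports.

```python
from statistics import mean


def unit_total(categories):
    """Sum the switching, internal and leakage power of one unit."""
    return categories["switching"] + categories["internal"] + categories["leakage"]


def average_unit_power(reports):
    """Average the per-unit total power over all reports.

    `reports` holds one dict per benchmark, mapping a unit name to its
    three power categories (hypothetical layout, all values in the same unit).
    """
    units = reports[0].keys()
    return {unit: mean(unit_total(r[unit]) for r in reports) for unit in units}


# Two made-up reports for illustration only; the thesis uses five.
example_reports = [
    {"alu": {"switching": 1.0, "internal": 2.0, "leakage": 0.5}},
    {"alu": {"switching": 2.0, "internal": 2.0, "leakage": 0.5}},
]
print(average_unit_power(example_reports))
```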

As discussed in Sec. 3.1, average power reports amortize the power of certain
units over the power estimation interval. Units such as the arithmetic logic unit
(ALU), multiplier and L1DC are associated with enable signals that prompt them to
activate, i.e., start switching and dissipating power. Unless the power of these
units is scaled according to their usage, the framework would greatly
underestimate their contribution to the final energy results. The solution to
this problem, which was discussed in Sec. 3.1, required information on how many
cycles the affected units were used during the power estimation interval. The
usage information was obtained by augmenting the RTL testbench used during
verification and power estimation with counters that were incremented whenever
the enable signal of the corresponding unit was asserted. Each counter was then
divided by the total number of cycles, which was also tracked by the testbench,
giving the utilization factor shown in Eq. 4.1.

\[ \mathrm{Utilization} = \frac{\mathrm{active\_cycles}}{\mathrm{total\_cycles}} \tag{4.1} \]

The power dissipation reported by PT was then divided by this utilization factor,
as shown in Eq. 4.2, to obtain the final power values used in the framework.

\[ P_{\mathrm{scaled}} = \frac{P_{\mathrm{unscaled}}}{\mathrm{Utilization}} \tag{4.2} \]
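As a worked illustration of Eq. 4.1 and Eq. 4.2, the following minimal Python sketch computes the utilization of a unit from its enable-cycle counter and rescales the amortized average power accordingly. The function names and the numbers are hypothetical and only serve to show the arithmetic.

```python
def utilization(active_cycles, total_cycles):
    """Eq. 4.1: fraction of cycles in which the unit's enable signal was asserted."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return active_cycles / total_cycles


def scale_power(p_unscaled, util):
    """Eq. 4.2: recover the power of the unit while it is active."""
    if util <= 0:
        raise ValueError("utilization must be positive to scale the power")
    return p_unscaled / util


# Hypothetical example: a unit enabled in 250 of 1000 cycles with an
# amortized average power of 2.0 (arbitrary unit).
util = utilization(250, 1000)
print(util, scale_power(2.0, util))
```

A unit that is active for a quarter of the interval thus gets an active power four times its reported average, which is the value to be multiplied by the number of active cycles counted by the simulator.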

As discussed briefly in Sec. 3.1 and Sec. 4.1.1.1, a scalable approach to power
estimation of a placed and routed design was necessary to meet the scalability
goal stated in Sec. 1.1. An attempt was made at running the post-PnR netlist
through the implemented verification flow, but this was met with technical
issues that proved hard to solve. Instead, a probabilistic approach was adopted,
which limited the scope at which the design could be analyzed. Since the SRAM
memories come placed and routed, the PnR scaling should only be applied to
combinatorial pipeline units. Hence, the ALU was chosen as a representative
combinatorial unit and switching activities were assigned to the ALU inputs. The
power dissipation was then extracted from both a synthesized and a post-PnR
netlist. The PnR factor was then derived from the ratio between the post-PnR
power and the power of the synthesized design, as shown in Eq. 4.3. The
PnR-scaling factor was then applied to all combinatorial units in the pipeline.

\[ \mathrm{PnR}_{\mathrm{scaling}} = \frac{\mathrm{ALU}_{\mathrm{synth}}}{\mathrm{ALU}_{\mathrm{PnR}}} \tag{4.3} \]

A similar estimation was done for the clock-tree power, which is small in a
synthesized netlist. The limited clock power dissipation accounted for in a post-synthesis
design is related to the clock pins on registers in the design. Hence, the major-
ity of the difference in clock power dissipation between a synthesized netlist and a
post-PnR netlist is due to