Development of an implementation-centric energy-evaluation framework for MIPS-I pipelines

Master’s thesis in Embedded Electronic System Design

Daniel Moreau

Department of Computer Science & Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2016

© Daniel Moreau, 2016.

Supervisor: Per Larsson-Edefors, Department of Computer Science & Engineering
Examiner: Sven Knutsson, Department of Computer Science & Engineering

Department of Computer Science & Engineering
Division of Computer Engineering
VLSI Research Group
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Placed and routed five-stage pipeline built at Chalmers University of Technology.

Typeset in LaTeX

Abstract

An RTL-based energy evaluation framework dubbed CREEP (Chalmers Energy Evaluation framework for Pipelines) is implemented and evaluated. The framework consists of pipeline RTL and the architectural simulator SimpleScalar. Power estimates are extracted from the RTL and combined with performance counters generated by SimpleScalar. The combination lends SimpleScalar accurate energy estimates otherwise reserved for low-level circuit analysis. The framework has been used to characterize several different embedded processor configurations. Additionally, in a case study the framework was used to implement and evaluate a speculative way-halting technique called SHA, which pointed to a 25.6% energy reduction in a conventional four-way data cache.

Acknowledgements

I want to thank my supervisor Per Larsson-Edefors for his guidance and support throughout my work. I also want to thank Alen Bardizbanyan; without his help and technical expertise this work would not have been possible. Lastly, I want to thank my family and girlfriend for their continued support and encouragement during hard times.
Daniel Moreau, Gothenburg, January 2016

Contents

List of Figures
List of Tables
Acronyms
1 Introduction
  1.1 Goals and challenges
    1.1.1 Goals
  1.2 Limitations
  1.3 Related works
2 Background
  2.1 CMOS
    2.1.1 Power dissipation
    2.1.2 Speed
  2.2 IC design
  2.3 Pipeline design
    2.3.1 MIPS I instruction set architecture
    2.3.2 A MIPS I pipeline
      2.3.2.1 Caching
  2.4 Existing pipeline evaluation method
    2.4.1 Architectural simulator
    2.4.2 RTL design and verification
      2.4.2.1 Design and verification flow
    2.4.3 Ad-hoc combination of RTL and simulator
3 A unified evaluation framework
  3.1 Framework workflow
  3.2 Verification
4 Implementation
  4.1 Implementation of framework components
    4.1.1 RTL modifications
      4.1.1.1 Design verification flow
      4.1.1.2 Design power estimations
    4.1.2 SimpleScalar
    4.1.3 Configurability
  4.2 Combining RTL and SimpleScalar
  4.3 Framework automation
5 Results and discussion
  5.1 User-centric overview
  5.2 Demonstration
    5.2.1 MiBench execution time
    5.2.2 Power and performance
    5.2.3 Energy distribution
  5.3 Evaluation
    5.3.1 Evaluation of verification methods
    5.3.2 Achievement of goals
  5.4 Discussion
6 Case study
  6.1 SHA - practical way-halting
7 Conclusion
Bibliography

List of Figures

2.1 Schematic view of CMOS inverter
2.2 Example of a CMOS circuit consisting of multiple inverters
2.3 MIPS I R-type instruction format.
2.4 MIPS I I-type instruction format.
2.5 MIPS I J-type instruction format.
2.6 MIPS I memory is byte addressable [1].
2.7 A MIPS I 5SP [2]
2.8 A MIPS I 5SP augmented with a control unit [2]
2.9 Instruction sequence in a 5SP.
2.10 The 5SP augmented with a hazard detection unit [2]
2.11 The 5SP with stall support [2]
2.12 The 5SP with branch resolution in the ID stage [2]
2.13 The 5SP with branch resolution in ID stage with added forwarding paths [2]
2.14 A conceptual overview of a memory-hierarchy [1]
2.15 Modular structure of SimpleScalar
2.16 Microarchitectural overview of the 5SP.
2.17 Memory hierarchy of the 5SP.
3.1 The methodology embodied in the CREEP framework.
5.1 CREEP package overview
5.2 The workflow of the framework showing the RTL and simulator components and the central CREEP.pl script.
5.3 Per-benchmark execution time for standard 16kB 4-4 configuration.
5.4 Miss rates of the different configurations.
5.5 Execution time versus power of the different configurations.
5.6 Absolute energy of the different configurations.
5.7 Energy distribution for a) Unscaled and without way-prediction b) Scaled and without way-prediction c) scaled with way-prediction
5.8 Energy distribution of three 8kB 1-1 configurations with different L2 latencies, a) 8kB 1-1 10 cycles b) 8kB 1-1 12 cycles c) 8kB 14 cycles
5.9 Energy distribution of three 8kB 2-2 configurations with different L2 latencies, a) 8kB 2-2 10 cycles b) 8kB 2-2 12 cycles c) 8kB 2-2 14 cycles
5.10 Energy distribution of three 8kB 4-4 configurations with different L2 latencies, a) 8kB 4-4 10 cycles b) 8kB 4-4 12 cycles c) 8kB 4-4 14 cycles
5.11 Energy distribution of three 8kB configurations, a) 8kB 1-1 b) 8kB 1-4 c) 8kB 4-4
5.12 Energy distribution the different 16kB configurations: a) 16kB 4-4 b) 16kB 2-4 c) 16kB 2-2
5.13 Energy distribution 32kB configuration.
6.1 Overview of SHA. [3]
6.2 AGU address calculation showing the address fields of interest. [3]
6.3 SHA energy for the MiBench suite used in the CREEP framework. [3]
6.4 SHA energy compared to STA and a conventional baseline cache. [3]

List of Tables

4.1 Summary of relevant SimpleScalar configurable settings
4.2 MiBench benchmarks
5.1 Cache parameters for the selected configurations
5.2 Power estimates obtained from the RTL during previous projects [4]
6.1 L1 DC Component Energy. [3]
6.2 Components Accessed for Each Case. [3]
6.3 Components Accessed on Miss Events. [3]

Acronyms

5SP: 5-stage pipeline
AGU: address generation unit
ALU: arithmetic logic unit
CAM: content addressable memory
CMOS: complementary metal–oxide–semiconductor
CPI: cycles per instruction
CREEP: Chalmers RTL-based energy evaluation framework for pipelines
DC: Synopsys Design Compiler
DTLB: data translation lookaside buffer
EDA: electronic design automation
EX: execute
GP: general purpose
HDL: hardware description language
IC: integrated circuit
ID: instruction decode
IF: instruction fetch
ILP: instruction-level parallelism
IPC: instruction per cycle
ISA: instruction set architecture
ITRS: International Technology Roadmap for Semiconductors
L1: level-one
L1DC: level-one data cache
L1IC: level-one instruction cache
L2: level-two
LP: low power
LRU: least recently used
LSU: load store unit
MEM: memory access
nMOS: n-type metal–oxide–semiconductor
NOP: no-operation
NP: nondeterministic polynomial time
OOM: orders of magnitude
OP: operation
PC: program counter
pMOS: p-type metal–oxide–semiconductor
PnR: place and route
PT: Synopsys PrimeTime
RAM: random access memory
RAW: read after write
RISC: reduced instruction set computing
RT: register transfer
RTL: register-transfer level
SAIF: switching activity interchange format
SPEF: standard parasitic exchange format
SRAM: static random access memory
VCD: value change dump
VLSI: very-large-scale integration
WB: write-back

1 Introduction

In the early days of integrated circuit (IC) design, computer architectures were developed with a focus on achieving high performance. Other design factors such as cost, area and power were also considered but only as limiting factors. However, in the late 1990’s it became apparent that this design philosophy was unsustainable.
Complementary metal–oxide–semiconductor (CMOS) technology scaling allowed for higher densities and increasing clock frequencies, but performance-centered designs that tried to leverage these advances became hard or impossible to cool cost-effectively [5].

Currently, energy efficiency is, next to performance, one of the major focal points in very-large-scale integration (VLSI) design. The driving forces behind this are increased portability, the power wall and environmental concerns. For portable battery-powered devices, lower energy consumption directly translates into a better-received product. The power wall, a direct consequence of discontinued Dennard scaling, means that technology scaling is no longer the obvious answer to increased performance and lower power [6][7]. Lastly, it is becoming painfully obvious that the rate at which global energy consumption increases is not sustainable. ICs contribute a considerable share of this increase [8].

To facilitate energy-efficient design, evaluation frameworks at the software and register-transfer level (RTL) are required to make vital early estimations. Early estimations are perhaps the most important estimations, as changes at the architectural level have a larger impact on the final energy and performance numbers than changes at the circuit level. As such, these frameworks allow for accurate estimation and thereby more predictable prototyping results. Furthermore, these tools are significantly faster than those available at the circuit level, which is essential when exploring a complex design space [9][10]. These frameworks have traded accuracy for speed, often adopting parameterizable models obtained through analytical or empirical studies of the underlying hardware. However, by adopting such models many of the existing frameworks neglect the impact of design integration, i.e., the synergy between the integrated parts of a design.
We propose an open source energy evaluation framework for pipelines that facilitates software and hardware co-design. The framework, named Chalmers RTL-based energy evaluation framework for pipelines (CREEP), extends down to the RTL, which yields high accuracy and allows for detailed pipeline studies at the system level.

1.1 Goals and challenges

The aim of this thesis is to develop and demonstrate CREEP, a framework that estimates energy usage of integrated processor pipelines. The framework is based on an existing methodology that has been used in several published papers at Chalmers. The two major components that will be used to create the framework are 1) pipeline and cache RTL [11] and 2) a version of the SimpleScalar simulator [12]. The RTL and the simulator exist prior to the framework development, but they are two separate and incoherent components combined in an ad-hoc manner. Hence, there are many challenges to address throughout the work, some of which are listed below:

• Create a scalable energy estimation framework from an ad-hoc methodology. Trade-offs between energy estimation accuracy and scalability are required throughout the development.
• The existing pipeline RTL code represents an in-order 5-stage pipeline (5SP) augmented with a level-one instruction cache (L1IC) and a level-one data cache (L1DC). The RTL code has been used in several projects that necessitated changes and quick patches in the code. As such, the code needs to be cleaned to make the RTL presentable and to reintroduce some features, such as configurable cache sizes, that are necessary for the framework. Additionally, a scalable power estimation methodology that includes the impact on power due to integration aspects needs to be implemented.
• The SimpleScalar simulator represents a more complex pipeline than the RTL. Thus, the simulator shall be modified to match the RTL code as closely as possible.
  This requires changes to the source code, which must be verified to work as intended. Furthermore, the simulator will be modified to track additional resource usage information.
• The pipeline RTL and SimpleScalar components will then be combined by mapping resource usage obtained from the simulator to the RTL power estimations. The challenge here is to do the mapping in such a way that the framework estimates the processor energy adequately.
• Lastly, the framework will be automated to make it more approachable. The automation also serves the purpose of keeping the components coherent, which will make the results generated by the framework reproducible.

1.1.1 Goals

Several concrete goals related to CREEP have been identified and these are summarized below:

• Present a coherent and scalable framework with accuracy close to a placed and routed pipeline design.
• The framework should be automated through scripts, which will make CREEP more user-friendly.
• The framework should support limited configuration, e.g., different cache configurations and processor speeds.
• Evaluate the applicability of the framework in a case study.
• Present the framework in a suitable forum to introduce it to the community.

1.2 Limitations

The framework development is complex, so limitations need to be imposed on it.

• The framework is only guaranteed to work as is. As such, any modifications made by the user to any of the framework components are not covered by the standard framework workflow.
• CREEP will be limited to the provided 5SP. Any changes to the RTL code are not guaranteed to work and the user needs to verify the changes in the context of the framework.
• CREEP does not include a level-two (L2) cache and does not attempt to approximate the impact of lower levels in the memory hierarchy on power dissipation and performance.
• CREEP is limited to the SimpleScalar source and configuration provided with the framework.
  Any changes to these components need to be verified and integrated into the framework by the user.
• This thesis will use the RTL for the pipeline and will only modify it to suit the needs of the framework. No performance enhancements will be done and the CREEP configurations are limited to run at 400 MHz.
• The RTL will not be fabricated and no silicon of said design will be produced.

1.3 Related works

Energy evaluation frameworks have over time evolved from small frameworks limited to specific structures within a processor to large and complex system-level frameworks. Depending on how the frameworks obtain circuit-level energy estimations, they can be divided into analytical or empirical frameworks [5]. Analytical tools generally have the advantage of being more generally applicable to different architectures, whilst empirical methods are best suited for the type of architectures from which they were derived. In this section previous energy evaluation frameworks and methods are presented.

The first high-level energy framework, CACTI, was released in 1996 specifically targeting cache structures [13]. CACTI uses analytical models to estimate both power and delay within the cache structure. Since its release it has been updated regularly to include leakage power, other types of memory cells, device scaling effects based on International Technology Roadmap for Semiconductors (ITRS) predictions and wire effects on delay and power [14]. The reason it targeted caches was that a significant amount of total chip power, up to 40%, was dissipated by the caches in embedded processors [9]. Furthermore, caches are highly regular structures, so less complex analytical models are needed to accurately estimate energy consumption and delay. It has allowed computer architects to explore trade-offs in the memory hierarchy design [13]. In contrast to CACTI, CREEP also models the datapath with which the caches are integrated.
Hence CREEP provides an estimate for a complete integrated pipeline.

WATTCH and SimplePower, both released in 2000, analytically modeled power for a whole processor. WATTCH was one of the first tools to link a traditional architectural performance simulator, SimpleScalar [12], to analytical power models [15]. It bases its power estimations on a collection of parametrized power models for different hardware structures (for example random access memory (RAM), content addressable memory (CAM), other array structures, latches, buses, caches and arithmetic logic units (ALUs)) and per-cycle resource usage counts generated through cycle-level simulations using the SimpleScalar architectural simulator [9] [15]. SimplePower is an execution-driven, cycle-accurate RTL energy estimation tool that uses a combination of analytical and transition-sensitive energy models [16] [17]. The SimplePower framework is built around a five-stage datapath with instruction fetch, decode, execution, memory and write-back stages [16]. Transition-sensitive models are defined for each functional unit in the datapath, and the models contain switch capacitance on a per-input basis obtained from VLSI layouts and extensive circuit simulation [17]. Models are provided for several technology nodes. SimplePower uses a combination of analytical and transition-sensitive energy models for the memory system. The analytical models are reserved for the memory arrays whereas the transition-sensitive models are used for the connecting buses [17]. In contrast to the functional units, the switching capacitance of these buses is based on pessimistic assumptions rather than HSPICE simulations [17]. The control path of the 5SP has been neglected because developing transition-sensitive models for it was considered extremely difficult. SimplePower leverages the SimpleScalar simulator by using the same ISA and compiler.
The framework simulates the generated executables, providing cycle-by-cycle energy values based on the aforementioned models [16]. Both tools were fast and accurate enough to be useful for quantifying potential power savings in architecture design. However, compared to CREEP, WATTCH and SimplePower, while being more flexible, again fail to capture the integration aspect that CREEP addresses.

McPAT, another analytical tool, takes its name from multicore power, area and timing. The framework was released in 2009 and it estimates power, area and timing, which enables architects to use metrics that relate performance to both area and power [10]. In contrast to SimplePower and WATTCH, McPAT is compatible with any performance simulator through an XML interface. Furthermore, McPAT is built on more accurate analytical models compared to WATTCH, and these models also include static and short-circuit power. Just as the name implies, it also handles the complexities of multicore architectures. Similarly to McPAT, CREEP provides a system perspective but does so more accurately, as power estimates are obtained from an RTL implementation and not analytical models. However, CREEP supports less complex systems as it targets simple embedded processors.

One empirical framework of interest is IBM’s PowerTimer [15]. The major difference from the previous approaches, which are based on analytical models, is the formation of the energy models. PowerTimer’s models are based on empirical data collected from existing microprocessors. These models are then scaled to capture device scaling. PowerTimer takes a bottom-up approach and the energy models are derived from circuit-level power simulation data. Low-level circuit macros are analyzed and used to generate higher-level energy models for microarchitectural units [15].
These models are then controlled by two sets of parameters: 1) technology and circuit parameters and 2) microarchitectural parameters such as buffer sizes, pipeline latencies and bandwidth values. The microarchitectural parameters are also used in a stand-alone performance simulator. By connecting the performance simulator with the energy models, a total or cycle-by-cycle energy evaluation can be performed. IBM’s PowerPC architecture was used to create the energy models, and as such the tool was best suited for design exploration within that microarchitecture. CREEP is likewise limited to the specific architecture implemented in RTL. Both frameworks work at the system level, but PowerTimer chooses to distance itself from the physical implementation through parameterized models, which lends it greater flexibility at the expense of accuracy.

Yet another example of an empirical framework was proposed by Aziz et al. [18]. This framework is used for marginal-cost analysis. Their approach was to first create architectural models using design space sampling and statistical inference to capture the multi-dimensional space of microarchitectural parameters. The energy-delay trade-offs of the circuit blocks that formed the architectures were then stored in a circuit library. The resulting joint architecture-circuit design space was then combined with an exploration engine, which is given an optimization objective and resource budgets [18]. The exploration engine then searches the design space to find the most efficient configuration under the given constraints. As it is a high-level framework, it is more flexible than CREEP but trades aspects such as accuracy and integration to achieve this flexibility.

Rance Rodrigues et al. conducted a study in [19] on the usage of performance counters and how they can be used to estimate power in microprocessors.
While performance counters have been widely used to estimate power online and in situ, the counters used vary widely between processor architectures. Rance Rodrigues et al. attempt to identify a set of architecture-agnostic counters that estimate processor power with low error. Two architectures at opposite ends of the design spectrum, Intel Atom and Nehalem, were used to select performance counters that showed a strong correlation to power in both architectures. Using the SESC architectural performance simulator and WATTCH as reference, they concluded that the #Fetched instructions, #L1 hit and #Dispatch stall counters were sufficient to approximate processor power with an average error of 5%. Furthermore, for the chosen set of counters, variation between processor architectures had only a small impact, 3%, on the estimation accuracy. While the objective of this work is different from CREEP, it indicates which performance counters are relevant for a selection of architectures, albeit at a higher-performance design point, and can serve as an inspiration for CREEP.

A high-level estimation methodology and the associated tool, SoftExplorer, was presented in [20]. The methodology models a processor through functional analysis, and a parametric software model is used to capture the software’s impact on power. The processor model can be as coarse-grained as a functional block diagram. The parametric software model accepts relevant algorithmic parameters such as cache miss rate. The first step in the methodology is to cluster the processor model into functional blocks that are concurrently activated when code is running. The relevant consumption parameters are chosen as the links between the functional blocks. The second step is to characterize the processor model’s power consumption as the architectural and algorithmic parameters are varied. Lastly, a curve fitting of the graphical representation of the characterized power is performed through regression analysis.
SoftExplorer was compared to SimplePower and was found to be significantly faster, with estimates within 2.4% of those of SimplePower. Compared to CREEP, SoftExplorer sacrifices accuracy for flexibility and speed and neglects the integration aspect covered by CREEP.

2 Background

This chapter provides the reader with the basic knowledge needed to understand the concepts used to develop Chalmers RTL-based energy evaluation framework for pipelines (CREEP). First, the basics of CMOS logic with a focus on implementation, i.e., power and speed, are discussed in Sec. 2.1. Since complementary metal–oxide–semiconductor (CMOS) is the primary fabrication technology used to implement integrated circuits (ICs), CMOS speed and power are central to this work. Secondly, the basics of IC design with a focus on cell-based CMOS designs are presented in Sec. 2.2. The foundations of computer architecture with a focus on pipeline design and physical implementation are presented in Sec. 2.3. Lastly, the existing ad-hoc methodology on which this work is based is presented in Sec. 2.4.

2.1 CMOS

The abbreviation CMOS stems from the structure of the device, as it is composed of at least one n-type metal–oxide–semiconductor (nMOS) and one p-type metal–oxide–semiconductor (pMOS) transistor [21]. The simplest CMOS circuit, the CMOS inverter, is shown in Fig. 2.1. The arrangement of the inverter is such that the input of the CMOS inverter is connected to the gate terminal of both transistors. Whilst the transistors’ behavior in reality is more complex, they can ideally be viewed as switches that close and open when a voltage transition is detected on the gate. The behaviors of the nMOS and pMOS are opposite to each other, i.e., when a potential $V_{DD}$ (supply voltage, logic 1) is asserted on the gate terminal, the nMOS closes and the pMOS opens. Conversely, when no potential or GND (ground, logic 0) is present on the gate, the nMOS opens and the pMOS closes.
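The ideal-switch view of the inverter described above can be captured in a few lines of Python. This is only an illustrative sketch: the function name `cmos_inverter` and the boolean encoding (True for the supply voltage or logic 1, False for ground or logic 0) are assumptions made here and are not part of the thesis or the CREEP framework.

```python
def cmos_inverter(gate_high: bool) -> bool:
    """Ideal-switch model of a CMOS inverter (illustrative sketch).

    gate_high is True when the supply voltage (logic 1) is asserted on the
    shared gate terminal and False when ground (logic 0) is present.
    Returns the logic level seen on the output.
    """
    nmos_closed = gate_high        # the nMOS switch closes for a high gate
    pmos_closed = not gate_high    # the pMOS switch closes for a low gate

    # In the ideal model exactly one of the two switches is closed.
    assert nmos_closed != pmos_closed

    # A closed pMOS ties the output to the supply; a closed nMOS ties it to ground.
    return pmos_closed


if __name__ == "__main__":
    for level in (False, True):
        print(f"in={int(level)} -> out={int(cmos_inverter(level))}")
```

The model deliberately ignores the finite switching time of real transistors, which is exactly the effect that gives rise to short-circuit power and propagation delay.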
2.1.1 Power dissipation

The power dissipation of a CMOS circuit is generally considered to be composed of three components: 1) dynamic power, 2) short-circuit power and 3) static power [22]. The total gate power dissipation is given as the sum of these components as shown in Eq. 2.1.

$$P_{total} = P_{dynamic} + P_{short} + P_{static} \tag{2.1}$$

Figure 2.1: Schematic view of CMOS inverter, showing the input In, the output Out, the supply $V_{DD}$, ground GND, the output capacitance $C_{out}$ and the currents $I_p$, $I_n$ and $I_{short}$.

The dynamic power dissipation is by far the dominant source of power consumption in a CMOS circuit [22]. Dynamic power is also called switching power because power is consumed when the gate is switching, i.e., charging or discharging the gate output capacitance $C_{out}$ to $V_{DD}$ or GND [21]. The output capacitance consists of several components, $C_{int}$, $C_{wire}$ and $C_{load}$, as shown in Eq. 2.2 [22].

$$C_{out} = C_{int} + C_{wire} + C_{load} \tag{2.2}$$

The internal capacitance $C_{int}$ is related to the structure of the gate and includes parasitic capacitances. $C_{wire}$ is the capacitance of the wire that connects the output of the device to the input of another CMOS gate, which in turn constitutes the $C_{load}$ capacitance.

Consider Fig. 2.1, where a voltage transition from $V_{DD}$ to GND is asserted on the input. The nMOS transistor opens and the pMOS transistor closes. A current $I_p$ flows from the voltage supply to the output capacitance, which charges the capacitance. The amount of charge pulled from the supply is given by $C_{out}V_{DD}$ and the energy drawn from it by $C_{out}V_{DD}^2$. However, half of the energy drawn from the supply is dissipated as heat in the resistance posed by the pMOS transistor, so the energy stored in the output capacitance is given by $E_c = \frac{1}{2}C_{out}V_{DD}^2$. When the input voltage is later increased to $V_{DD}$, the pMOS opens, the nMOS closes, and the output capacitance is discharged as a current $I_n$ flows to ground. The energy $E_c$ stored in the capacitance is dissipated in the resistance posed by the nMOS transistor.
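The charge and energy bookkeeping for one charge and discharge cycle described above can be checked numerically. The following Python sketch is illustrative only: the function name `switching_energy` and the example values (an output capacitance of 10 fF and a supply voltage of 1.2 V) are hypothetical and do not come from the thesis.

```python
def switching_energy(c_out: float, v_dd: float) -> dict:
    """Energy bookkeeping for one charge/discharge cycle of the output node.

    c_out is the output capacitance in farads and v_dd the supply voltage in
    volts. Returned energies are in joules and the charge in coulombs.
    """
    charge = c_out * v_dd               # charge pulled from the supply
    drawn = c_out * v_dd ** 2           # energy drawn from the supply
    stored = 0.5 * c_out * v_dd ** 2    # energy left in the output capacitance
    return {
        "charge_C": charge,
        "drawn_J": drawn,
        "stored_J": stored,
        "pmos_heat_J": drawn - stored,  # dissipated in the pMOS while charging
        "nmos_heat_J": stored,          # dissipated in the nMOS while discharging
    }


if __name__ == "__main__":
    # Hypothetical example values: 10 fF output capacitance, 1.2 V supply.
    for name, value in switching_energy(c_out=10e-15, v_dd=1.2).items():
        print(f"{name}: {value:.3e}")
```

Over a full cycle the two heat terms add up to the energy drawn from the supply, so all of that energy is eventually dissipated in the two transistors.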
If this circuit is operated at a clock frequency $f$ and the output switches with a probability of $\alpha$, the total dynamic power drawn from the supply is given by Eq. 2.3 [21].

$$P_{dynamic} = C_{out}V_{DD}^2\alpha f \tag{2.3}$$

The short-circuit power dissipation $P_{short}$ is nowadays considered a small component of the total power [22]. The power is dissipated when the output of the gate switches. The nMOS and pMOS devices do not in reality behave as ideal switches and require a finite time to open and close. This time is determined by how long the input voltage remains between $V_{tn}$ and $V_{DD} - V_{tp}$, where $V_{tn}$ and $V_{tp}$ are the threshold voltages of the nMOS and pMOS transistors respectively. The threshold voltage is the minimum gate-to-source potential that is needed to create a conducting path in the transistor, i.e., to close the switch. Consequently, there is a small period of time when both transistors are on and a current $I_{short}$, shown in Fig. 2.1, is allowed to pass from the supply to ground, which consumes a small amount of power.

The last component is the static power dissipation, which is intermediate in size compared to the previous components [22]. It is smaller than the dynamic power and has historically been negligible. It is called static because it is omnipresent in all CMOS circuits that are powered. The static power stems from a collection of different currents passing between the various terminals of the devices, most notably from source to drain. The leakage power is closely connected to the threshold voltage and the temperature of the device [21]. As the feature size of the transistor shrinks below 65 nm, leakage power increases, and in more recent technology nodes it has become a considerable contributor to the total power dissipation.

2.1.2 Speed

As discussed in the previous section, a transition on the output of a CMOS gate does not happen instantaneously, as the current would have to be infinite in magnitude.
Naturally, this is not the case in a real CMOS circuit. Instead, the current is determined by the transistor's ability to drive it, which due to nonlinear I-V and C-V characteristics is not simple to model [21]. However, the transistors can be approximated by an RC-delay model that allows them to be viewed as simple RC circuits, which most electrical engineers are familiar with. $R$ is the transistor's effective resistance, i.e., the ratio of $V_{ds}$ to $I_{ds}$, where $V_{ds}$ is the potential between the drain and source terminals and $I_{ds}$ is the current passing from drain to source [21]. The capacitance $C$ is the output capacitance of the CMOS circuit (see Sec. 2.1.1). The transfer function of the equivalent RC circuit is given in Eq. 2.4 and the step response, with time constant $\tau = RC$, in Eq. 2.5.

\[ H(s) = \frac{1}{1 + sRC} \tag{2.4} \]

\[ V_{out}(t) = V_{DD}\, e^{-t/\tau} \tag{2.5} \]

Solving the step response for $V_{out}(t) = \frac{1}{2}V_{DD}$ gives the propagation delay through the CMOS circuit shown in Eq. 2.6.

\[ t_{pd} = RC \ln 2 \tag{2.6} \]

The propagation delay is an approximation of how fast the output of the CMOS circuit transitions from $V_{DD}$ to $\frac{1}{2}V_{DD}$ when a step is asserted on the input of the circuit. A non-trivial CMOS circuit is composed of many CMOS gates which are connected as shown in Fig. 2.2, and the propagation delay from input In1 to the final output Out n can become significant.

Figure 2.2: Example of a CMOS circuit consisting of multiple inverters

To manage the delay, the current drive capability of the transistors in the CMOS gate can be increased. This is done through transistor sizing, whereby the widths of the pMOS and nMOS transistors are increased [21]. Essentially this reduces the effective resistance experienced by the current, and a larger current is allowed through the circuit. However, increasing the transistor size also increases the gate capacitance, i.e., the output capacitance experienced by the driving gate in the circuit, resulting in higher power dissipation.
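As a rough numerical illustration of Eq. 2.3 and Eq. 2.6, the following sketch evaluates the dynamic power and the RC propagation delay of a gate and sums the delay over a chain of gates such as the one in Fig. 2.2. All component values are hypothetical and chosen only for illustration; they are not taken from any cell library.

```python
import math


def dynamic_power(c_out, v_dd, alpha, f):
    """Dynamic power according to Eq. 2.3: C_out * V_DD^2 * alpha * f."""
    return c_out * v_dd ** 2 * alpha * f


def propagation_delay(r, c_out):
    """Propagation delay according to Eq. 2.6: R * C * ln(2)."""
    return r * c_out * math.log(2)


def chain_delay(stages):
    """Total delay of a chain of gates, each given as an (R, C_out) pair."""
    return sum(propagation_delay(r, c) for r, c in stages)


if __name__ == "__main__":
    # Hypothetical values: 10 kOhm effective resistance, 2 fF output load,
    # 1.0 V supply, 10 % switching probability, 500 MHz clock.
    r, c, v_dd, alpha, f = 10e3, 2e-15, 1.0, 0.1, 500e6
    print("t_pd      =", propagation_delay(r, c), "s")
    print("P_dynamic =", dynamic_power(c, v_dd, alpha, f), "W")
    print("3 stages  =", chain_delay([(r, c)] * 3), "s")
```

The two functions make the sizing trade-off visible: lowering R shortens the delay, whereas any capacitance added to a node raises both the delay of the gate driving it and the dynamic power.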
Moreover, gates that are unsized and connected to the resized gates will have to charge a larger load capacitance, which slows down unsized parts of the design.

2.2 IC design

Modern ICs are immensely complicated circuits often composed of several millions, if not billions, of transistors. Designing such complex beasts without computer aid is simply beyond the capabilities of a human designer. To facilitate IC development, software assistance is key throughout the design process. The software tools providing this assistance are collectively referred to as electronic design automation (EDA). The term EDA spans a wide range of functionality required throughout the design of an IC, which will be the focus of this section.

Designing ICs is complex and it was discovered early on that doing so at the gate level, even with the aid of EDA tools specializing in the practice, was too cumbersome. As a response, tools were developed to create gate-level representations, called netlists, from a specification at a higher level of abstraction through a process called logic synthesis. These abstractions are usually expressed in a hardware description language (HDL) such as Verilog or VHDL. These design languages allow the designers to express the behavior of the logic circuits at the register-transfer level (RTL) in the sense that an assignment to a register expresses functionality.

The process of designing an IC is composed of several stages, and for digital circuits these are design, functional verification, logic synthesis and place and route (PnR). The initial design stage is followed by functional verification, which is first done at the register transfer (RT) level and entails testing that the design described in HDL matches the expected functional behavior. This is normally done at the cycle level by applying stimuli to the design, whereby the logic transitions of the output can be observed and compared with the desired behavior.
The stimuli are commonly supplied by a testbench that provides input from a set of test vectors [23]. The test vectors can be selected with the intent of testing specific functionality (directed testing) or randomly to test corner cases [23]. A key concern when selecting test vectors is coverage, which can be defined as the percentage of the design that has been tested. The RTL verification is facilitated by an HDL simulator tool. There are many different HDL simulators available, such as ModelSim from Mentor Graphics, IES from Cadence and VCS from Synopsys [24][25][26].

After the RTL has been verified, the design is brought through a cell-based logic synthesis with the aid of a synthesis tool. The designer supplies the RTL design together with timing constraints, which guide the synthesis tool through the multiple-stage process that is cell-based logic synthesis [23]. The cell library, which contains the standard gates (cells) used for synthesis, is provided by silicon foundries such as STMicroelectronics or TSMC. The cell libraries are unique to each manufacturer as they are tightly coupled to their manufacturing processes. For each cell in the library, parameters such as size, internal power dissipation, leakage power and input pin capacitance are defined [23].

Normally several libraries are necessary to fully evaluate a process technology. The libraries are optimized for different design points, e.g., low power (LP) and general purpose (GP). The GP cell library is optimized for performance and the LP cell library for low power designs. The GP and LP libraries are further divided into sub-libraries with different threshold voltages, which allows for fine-grained control of performance and power dissipation. Higher performance can be achieved by using a low-threshold voltage version, but at the price of higher leakage power dissipation.
Conversely, for designs where power dissipation is a cause for concern, a high-threshold voltage version is a good choice as these cells are slower but have lower leakage-power dissipation. It is up to the designer to choose a library that suits the application at hand.

Small variations in the manufactured design can have a large impact on the cells' behavior. To capture these variations, design corners are used. The worst-case corner contains cells that have the worst possible variations (while still producing working devices) that affect speed negatively. Conversely, the best-case corner cell library has the best variations. Naturally, there is a nominal cell library that falls in between the two. Moreover, the cell libraries have been characterized for different temperatures and voltages. Temperature and voltage depend on in situ conditions and also affect the behavior of the final circuit. As such, every cell library exists in several models with different temperatures and voltages.

Synthesis is a complex process and EDA tools that specialize in the practice are available from different suppliers, such as Encounter RTL Compiler from Cadence, Design Compiler from Synopsys and HDL Designer from Mentor Graphics [27][28][29]. The different tools provide similar functionality but differ in the algorithms and heuristics used during the synthesis process. The synthesis results in a gate-level netlist, a sequence of standard cell logic gates realizing the functionality of the RTL code. In contrast to the HDL description of the design, which solely captures the intended functionality, the netlist also includes parameters such as area, timing and power. The synthesis tool strives to meet the imposed timing constraint using the cells from the specified libraries. It accomplishes this through static timing analysis, which allows it to find and balance the critical paths in the design [23].
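The core idea of static timing analysis can be sketched as a longest-path computation over the gate-level netlist. The following is a heavily simplified illustration and not how any of the tools named above are implemented: it assumes one fixed delay per gate, ignores wire delay, slew and setup times, and all gate names and delays are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def arrival_times(gate_delay, fanin):
    """Latest arrival time at the output of every gate.

    gate_delay maps a gate name to its delay; fanin maps a gate name to the
    gates driving its inputs (an empty list for gates fed by primary inputs).
    """
    arrival = {}
    for gate in TopologicalSorter(fanin).static_order():
        latest_input = max((arrival[src] for src in fanin.get(gate, ())),
                           default=0.0)
        arrival[gate] = latest_input + gate_delay[gate]
    return arrival


def critical_path(gate_delay, fanin):
    """Return the slowest path through the netlist and its total delay."""
    arrival = arrival_times(gate_delay, fanin)
    gate = max(arrival, key=arrival.get)
    total = arrival[gate]
    path = [gate]
    while fanin.get(gate):
        # Walk backwards through the input that arrives last.
        gate = max(fanin[gate], key=arrival.get)
        path.append(gate)
    return list(reversed(path)), total


if __name__ == "__main__":
    delays = {"g1": 1.0, "g2": 2.0, "g3": 1.5, "g4": 0.5}  # hypothetical, ns
    netlist = {"g1": [], "g2": [], "g3": ["g1", "g2"], "g4": ["g3"]}
    path, delay = critical_path(delays, netlist)
    clock_period = 5.0  # hypothetical timing constraint, ns
    print(path, delay, "slack:", clock_period - delay)
```

A positive slack means the constraint is met. A synthesis tool would react to negative slack by, for instance, selecting stronger cells along the reported path.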
This balancing act entails selecting gates with sufficient current drive capabilities for the entire circuit to switch within the timing constraint. As such, the same design will produce different gate-level netlists with different area and power, depending on the timing constraint. At the synthesis level, the functional verification amounts to ensuring that the netlist behaves the same as the RTL design. This is achieved through equivalence checking or simulation-based methods as described for RTL verification.

Lastly, the design netlist is brought through PnR, which is a physical design phase composed of three steps: 1) floorplanning, where the design's blocks are organized, 2) placement of standard cells and iterative optimization of placement and 3) routing of standard cell interconnects, power lines and clock tree [23]. The process is strictly guided by design rules imposed to ensure that the placed and routed design is manufacturable. The most significant changes to the netlist are the addition of wires and clock tree. Wires constitute a part of the nodal capacitance described in Sec. 2.1.1, which in some cases necessitates larger, more powerful gates to be used. The addition of wire capacitance and larger gates with higher internal capacitances increases the power of the design. Furthermore, the clock tree is a significant contributor to the design power and is only included after the design has been placed and routed. This stage relies on EDA tools, such as Encounter from Cadence, that specialize in PnR, as the burden of placing thousands of gates is simply beyond the capability of a human designer [30].

Implementation power closure is important for power constrained circuits, e.g., portable embedded processors. As such, methods for obtaining power closure are also included in many IC design flows. The power dissipation of CMOS-based circuits comes from active device switching and leakage, where the former is the main contributor as discussed in Sec. 2.1.1.
To estimate the total, the switching power is summed over all capacitive nodes in the design. While the power estimates could be done prior to the PnR, the power would be underestimated as the nodal capacitance is greatly increased by the wire capacitances. The power also depends on the switching activity (see Sec. 2.1.1) of these nodes and there are different techniques used to approximate it. One such technique is probabilistic testing, where the input statistics, asserted by a designer, are propagated to each node in the circuit [22]. However, this is a nondeterministic polynomial time (NP) complete problem, so the scope at which this is done must be limited, e.g., parts of the design are analyzed instead of the whole design. Another technique is use-case based switching activity, which is facilitated through simulation-based methods as described for RTL verification [22].

2.3 Pipeline design

The most fundamental parts of computer architecture are the instructions that define what a computer is capable of and the microarchitecture that decides how it executes the instructions. To that end the structure of a complete set of instructions, called an instruction set architecture (ISA), and more specifically the MIPS I ISA, is presented in Sec. 2.3.1. Microarchitectural concepts relevant to this thesis, such as pipelining and caching, are then discussed in Sec. 2.3.2.

2.3.1 MIPS I instruction set architecture

All computer programs are made up of instructions, which are the basic operations carried out by a processor [2]. Instructions are usually very frugal and each of them normally does one basic operation, e.g., memory access, arithmetic or flow control. To make up a complex computer program many different instructions are needed. The instructions are grouped together to form a set of instructions, possibly unique to their implementing architecture, thus forming an ISA.
The MIPS I ISA, first released in 1982, was developed by John Hennessy and his colleagues at Stanford [31][2]. MIPS I was one of the first successful reduced instruction set computing (RISC) ISAs, built on four main principles: simplicity favors regularity, make the common case fast, smaller is faster, and good design demands good compromises. Derivatives of the MIPS ISA are still used today by Cisco (routers), Nintendo and Sony (hand held gaming consoles) and Silicon Graphics among others.

MIPS I instructions can have three different encoding formats, referred to as R, I and J-type instructions in the literature [2]. By only allowing a limited number of instruction formats the ISA gains regularity, which simplifies the instruction decoding [2]. Each instruction has its own operation (OP) code, which is encoded in the op-field shown in Figs 2.3-2.5. Besides the OP-code, the main difference between the instruction formats is the number of operands that are encoded in the instruction. R-type instructions, however, need two extra fields (shamt and funct) to characterize each operation, which includes mathematical or logical operations such as addition, subtraction and shift operations. R-type instructions require two source operands, encoded in the rs and rt fields shown in Fig. 2.3. In contrast, I-type instructions encode one source register in the rs field and use the rt field (see Fig 2.4) as the destination register or, for stores and branches, as a second source. Lastly, J-type instructions require no register operands. The operands are fetched from a small register-file whose modest size lends it speed. The size of the register-file and how the memory is addressed are the parameters, besides instruction width, that the MIPS I ISA enforces on the underlying microarchitecture.

Figure 2.3: MIPS I R-type instruction format.
op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)

Figure 2.4: MIPS I I-type instruction format.
op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)

Figure 2.5: MIPS I J-type instruction format.
op (6 bits) | addr (26 bits)

MIPS is a load-store ISA because all operands are fetched from the register-file [1]. I-type instructions substitute one operand for a value, called the immediate value, which is encoded directly in the instruction itself. Examples of I-type instructions are load and store operations. Stores read data from the register-file and store the data in data memory. Loads, on the other hand, read data from data memory and store the data in the register-file. Similarly, all R-type instructions also store the result of the operation back into the register-file, to a location indicated by the rd field. However, other I-type instructions called control flow instructions, e.g., branch instructions, do not write to the register-file. Instead, the control flow instruction decides the order in which the instructions in the program are executed. Lastly, J-type instructions, which mainly include another type of flow control instructions called jump instructions, trade both operands for a larger immediate value.

MIPS I also defines the format of the operands. Operands of 8-bit (ASCII characters), 16-bit (Unicode characters, half word), 32-bit (integers, word), 64-bit (long integer, double word) and IEEE 754 floating point in 32-bit (single precision) and 64-bit (double precision) are allowed [1].

If the aforementioned register-file were the only storage available to a processor, computer programs would be very limited in size. However, as implied above, another type of memory that is larger in size is usually available. The MIPS I ISA defines how the processor interfaces with memory by specifying how the memory is addressable. MIPS I specifies two ways of addressing memory: 1) byte-addressable or 2) word-addressable. This means that the smallest addressable data unit is a byte (8 bits) while the largest is a word (4 bytes), as illustrated in Fig. 2.6. All memory accesses must be aligned to either a byte or word access, otherwise the access is unaligned and erroneous [1].
In the same figure to the left, the address of the corresponding data is shown. The address used to access the memory is generated by the memory operation (I-type instruction) by adding the immediate value to the register indicated by the rs field.

Figure 2.6: MIPS I memory is byte addressable [1].

Furthermore, MIPS I supports five ways of generating memory addresses through so-called addressing modes: register-only addressing, base addressing, immediate addressing, PC-relative addressing and pseudo-direct addressing [2]. Register-only addressing has already been described; all R-type instructions use this addressing mode. Base addressing is used by some I-type instructions, such as stores and loads, and has likewise been described. Immediate addressing (I-type) is similar to base addressing in that a 16-bit immediate is encoded in the instruction, but the immediate is used directly as an operand instead of being added to a base register to form a memory address. Program counter (PC) relative addressing is used by conditional branch instructions (I-type) where, if a condition holds true, the immediate field is added to the PC to produce the final address. Lastly, pseudo-direct addressing is used by J-type instructions, where the larger address field (see Fig. 2.5) is first concatenated with the four most significant bits of the PC.

2.3.2 A MIPS I pipeline

An ISA does not define the implemented hardware besides register-file and addressing modes. A distinction is made between an architecture and a microarchitecture, where the latter includes implementation details. This means that two different processors can support the same ISA while being fundamentally different at the microarchitecture level. In this section a MIPS I compliant microarchitecture will be presented. The microarchitecture utilizes pipelining, a concept that is used in most modern processors and that offers higher performance at the expense of design complexity.
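The instruction formats of Figs 2.3-2.5 and the address-generating modes of Sec. 2.3.1 can be sketched in a few lines of code. This is only an illustration: the function names are hypothetical, and two details not spelled out in the text above are assumed from the standard MIPS convention, namely that branch and jump targets are formed relative to PC + 4 and that the immediate or address field is shifted left by two bits because instructions are word aligned.

```python
def decode(instr):
    """Split a 32-bit instruction word into the fields of Figs 2.3-2.5."""
    return {
        "op": (instr >> 26) & 0x3F,      # all formats
        "rs": (instr >> 21) & 0x1F,      # R- and I-type
        "rt": (instr >> 16) & 0x1F,      # R- and I-type
        "rd": (instr >> 11) & 0x1F,      # R-type only
        "shamt": (instr >> 6) & 0x1F,    # R-type only
        "funct": instr & 0x3F,           # R-type only
        "imm": instr & 0xFFFF,           # I-type only
        "addr": instr & 0x3FFFFFF,       # J-type only
    }


def sign_extend16(imm):
    """Interpret a 16-bit field as a two's complement number."""
    return imm - 0x10000 if imm & 0x8000 else imm


def base_address(base_reg_value, imm):
    """Base addressing: register content plus sign-extended immediate."""
    return (base_reg_value + sign_extend16(imm)) & 0xFFFFFFFF


def pc_relative_target(pc, imm):
    """PC-relative addressing (assumed convention: PC + 4 + imm * 4)."""
    return (pc + 4 + (sign_extend16(imm) << 2)) & 0xFFFFFFFF


def pseudo_direct_target(pc, addr):
    """Pseudo-direct addressing: upper four PC bits joined with addr * 4."""
    return ((pc + 4) & 0xF0000000) | (addr << 2)


if __name__ == "__main__":
    fields = decode(0x02324020)          # add $t0, $s1, $s2
    print({k: fields[k] for k in ("op", "rs", "rt", "rd", "funct")})
    fields = decode(0x8E680020)          # lw $t0, 32($s3)
    print({k: fields[k] for k in ("op", "rs", "rt", "imm")})
```

Because every format keeps the op field in the same six bits, and rs and rt in the same positions when present, the decoder can extract all fields unconditionally and let the OP-code decide which of them are meaningful.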
The speed of a processor, and systems in general, depends on the latency and throughput of the data passing through it [2]. Low latency is preferred for systems that are required to be responsive and deliver results in a timely manner. In contrast, high throughput is preferred for systems that prioritize computational performance over timeliness. Latency and throughput are often contradictory in the sense that measures that improve one degrade the other [2]. In general-purpose computing, throughput has historically been more important than latency. Throughput can mainly be improved by exploiting instruction-level parallelism (ILP), i.e., by executing multiple instructions at the same time. Parallelism can be divided into spatial and temporal parallelism [2]. Spatial parallelism entails executing more instructions simultaneously by utilizing increased computational resources. In contrast, temporal parallelism implies dividing the existing computational resources into discrete steps where each step is utilized by different instructions. Spatial parallelism has the benefit of increasing throughput with little or no impact on latency [2]. However, spatial parallelism requires additional hardware resources and results in larger and more complex designs. Conversely, temporal parallelism sacrifices latency to increase throughput while only requiring limited hardware additions to the design in the form of a few registers and control logic.

In the context of microarchitecture, temporal parallelism is more commonly referred to as pipelining and the concept has been used in most processors for the last three decades [32]. Pipelining is implemented by dividing a processor's data path, i.e., computational resources, into stages separated by pipeline registers, which limit the logic paths of the design to those between two consecutive pipeline registers.
In effect, the design can meet stricter timing constraints and thus run at a significantly reduced cycle time. The reduced cycle time allows the design to be clocked at a higher rate, which reduces the execution time since the pipelined processor executes (ideally) one instruction each cycle. The discrete stages allow the pipelined processor to achieve temporal parallelism with several in-flight instructions.

The goal of a pipeline design is to evenly distribute the datapath's logic between the different pipeline stages. In a perfectly balanced n-stage pipeline the cycle time of the design is roughly 1/n of the cycle time of a corresponding un-pipelined design [1]. However, in practice the stages in the pipeline are seldom balanced perfectly, resulting in some stages requiring more time to finish their execution. The critical path, which imposes the lower bound on the design cycle time, is thus found in these stages. Furthermore, pipelining also introduces some performance overhead. A small part of this overhead is the delay introduced by the inserted pipeline registers, but the by far more substantial overhead is caused by dependencies between instructions moving down the pipeline [2]. These dependencies are called hazards and will be discussed in greater detail later in this section. In short, hazards increase the cycles per instruction (CPI), or equivalently decrease the instructions per cycle (IPC), which has a detrimental effect on performance. The overhead caused by the pipeline registers and hazards increases the latency of each individual instruction in a pipelined processor [1].

An example of a pipeline implementing the MIPS ISA is shown in Fig 2.7. The pipelined processor performs operations in five discrete stages separated by pipeline registers as shown in the figure. The stages are instruction fetch (IF), instruction decode (ID), execute (EX), memory access (MEM) and write-back (WB). Fig 2.8 shows the same pipeline as Fig 2.7 but it also shows the control unit.
The control unit generates control signals in the decode stage and the signals are propagated alongside the instruction in a control-path that in each stage reflects the instruction's individual needs.

Figure 2.7: A MIPS I 5SP [2]

Figure 2.8: A MIPS I 5SP augmented with a control unit [2]

All instructions traverse the datapath one stage at a time and need five clock cycles to fully traverse the pipeline. Fig 2.9 illustrates an example of an instruction flow. The first instruction is fetched in the first cycle and stored in the subsequent pipeline register. In cycle two a new instruction is fetched while the preceding instruction is decoded; both instructions are stored in the respective pipeline register after the stage they passed through. In the third and fourth cycle yet another instruction is fetched while the earlier instructions proceed through the pipeline. In cycle five the pipeline is utilized fully with an instruction being executed in all stages. The first instruction has now reached the WB stage and its result is written back (if R-type or load) to the register-file. After cycle five the pipeline should ideally remain fully utilized until the program is completed.

Figure 2.9: Instruction sequence in a 5SP.

However, in reality pipelined processors do not achieve full utilization in most cases because of the occurrence of pipeline hazards. Hazards are defined as dependencies between consecutive instructions in the pipeline. Hazards are divided into three categories: data hazards introduced by arithmetic and load/store instructions, control hazards introduced by flow-control instructions, e.g., branches, and lastly structural hazards where in-flight instructions compete for pipeline resources. Data and control hazards will be explained in greater detail, but structural hazards, which are non-existent by design, will not be elaborated on.
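The ideal, hazard-free instruction flow of Fig. 2.9 can be reproduced with a small sketch that records which stage each instruction occupies in every cycle. It is a toy model under the assumption that one instruction is fetched per cycle and nothing ever stalls; the function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def ideal_cycles(n_instructions, depth=len(STAGES)):
    """Cycles needed to run n instructions when no hazards occur."""
    return n_instructions + depth - 1


def pipeline_diagram(n_instructions, stages=STAGES):
    """One row per instruction and one column per cycle.

    Instruction i enters the first stage in cycle i (counting from zero)
    and advances by one stage per cycle; empty slots are marked '..'.
    """
    total = ideal_cycles(n_instructions, len(stages))
    rows = []
    for i in range(n_instructions):
        row = [".."] * total
        for s, name in enumerate(stages):
            row[i + s] = name
        rows.append(row)
    return rows


if __name__ == "__main__":
    for i, row in enumerate(pipeline_diagram(5), start=1):
        print(f"instr {i}: " + " ".join(f"{slot:>3}" for slot in row))
    n = 1000
    print("ideal CPI for", n, "instructions:", ideal_cycles(n) / n)
```

Reading the fifth column from top to bottom gives WB, MEM, EX, ID and IF, i.e., the fully utilized pipeline in cycle five. For long programs the fill time of four cycles becomes negligible and the ideal CPI approaches one.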
Data hazards occur when a subsequent instruction needs data generated by a previous instruction. For instance, an add instruction is followed by a subtraction that uses the value produced by the addition. The addition instruction is unable to reach the WB stage before the subtraction clears the ID stage and finishes the register-file access, so the subtraction enters the EX stage with incorrect operands. This is called a read after write (RAW) hazard and, if unaddressed, would lead to program errors. A simple but less elegant solution would be to stop the IF and wait for the instruction causing the dependency to write back its result to the register-file. While simple, stalling the pipeline increases the CPI and incurs performance losses. A more elegant solution is forwarding: the addition passes the value causing the dependency directly to the subtraction as the latter enters the EX stage. The pipeline shown in Fig 2.10 has been augmented with a forwarding unit that controls the added forwarding paths between the EX, MEM and WB stages. The forwarding unit reads the Rs and Rt registers of the instruction entering the execute stage and compares them to the Rd of the instructions entering the MEM and WB stages; if such an instruction is an R-type instruction, data is forwarded as needed.

Figure 2.10: The 5SP augmented with a hazard detection unit [2]

Forwarding does not solve RAW hazards where a dependency exists between a load and a subsequent instruction. Assume a load followed by an addition: the load needs to propagate to the WB stage before the data is brought from memory. At this point, however, the addition has already passed the EX stage and is entering the MEM stage. The only solution to this problem is to stop the addition from propagating in the pipeline by stalling it. This allows the load to propagate to the WB stage, where the load is able to forward data to the addition waiting in the ID stage. The necessary addition to the pipeline and hazard detection is shown in Fig 2.11. The pipeline register between the IF and ID stages now has an enable signal that, when asserted, forces it to hold its contents. The pipeline register between the ID and EX stages has an additional clear signal that sets the register contents to zero, which effectively stops spurious data from propagating down the pipeline after the load instruction. Instead, the pipeline stages after the load are idle or conceptually executing a no-operation (NOP) instruction. Additional inputs to the hazard detection unit are added to allow it to detect hazards that require stalls.

Control hazards are caused by branch and jump instructions because they update the PC. Assume that a branch instruction is fetched from instruction memory. The pipeline is oblivious to the fact that the branch could redirect the IF to a different portion of the program and erroneously continues to fetch instructions sequentially. When the branch is resolved to be taken in the EX stage (see Fig 2.8), two instructions from the wrong execution path have already been fetched. The pipeline would then need to be flushed (pipeline registers emptied). Alternatively the pipeline could be stalled, i.e., instruction fetch halted. Both solutions would degrade performance by increasing the CPI. A better solution is instead to use a delayed branch slot. The delayed branch slot scheme relies on the compiler to move an instruction originally placed before the branch to immediately after it. The compiler must be able to ensure that no dependencies are created when moving the instruction [1]. This scheme works reasonably well in the pipeline in Fig 2.8, but would still require one stall cycle for the branch to be resolved in time for the instruction after the delayed branch slot.
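Returning briefly to the data hazards described above, the decisions taken by a forwarding unit and a stall-generating hazard detection unit can be sketched as follows. The conditions follow the common textbook formulation and are an assumption here, not a description of the RTL used in this work. The class and function names are hypothetical, and register 0 is excluded because it is hardwired to zero in MIPS.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    name: str
    dest: Optional[int] = None      # register written, None if no write
    srcs: Tuple[int, ...] = ()      # registers read
    is_load: bool = False


def forward_sources(ex: Instr, mem: Optional[Instr], wb: Optional[Instr]):
    """Select, per source register of the instruction in EX, where its
    value comes from: the MEM stage, the WB stage or the register-file."""
    choice = {}
    for reg in ex.srcs:
        if mem is not None and reg != 0 and mem.dest == reg:
            choice[reg] = "MEM"      # youngest producer takes priority
        elif wb is not None and reg != 0 and wb.dest == reg:
            choice[reg] = "WB"
        else:
            choice[reg] = "REGFILE"
    return choice


def must_stall(id_instr: Instr, ex_instr: Optional[Instr]) -> bool:
    """Load-use hazard: the instruction in ID reads a register that a load
    currently in EX has not yet fetched from memory."""
    return (ex_instr is not None and ex_instr.is_load
            and ex_instr.dest not in (None, 0)
            and ex_instr.dest in id_instr.srcs)


if __name__ == "__main__":
    add = Instr("add", dest=8, srcs=(9, 10))
    sub = Instr("sub", dest=11, srcs=(8, 12))
    lw = Instr("lw", dest=8, srcs=(9,), is_load=True)
    print(forward_sources(ex=sub, mem=add, wb=None))  # register 8 from MEM
    print(must_stall(sub, lw))                        # load-use: stall
    print(must_stall(sub, add))                       # forwarding suffices
```

Checking the MEM stage before the WB stage matters when both hold a write to the same register: the instruction in MEM is the more recent one, so its value is the one the program expects.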
However, this stall cycle can be avoided by moving the branch resolution to the ID stage as depicted in Fig 2.12 below.

Figure 2.11: The 5SP with stall support [2]

A dedicated comparator has been added in the ID stage that operates immediately on the fetched register contents. Likewise, the sign extension and address generation have also been moved to the ID stage.

Figure 2.12: The 5SP with branch resolution in the ID stage [2]

While the stall cycle is eliminated in the pipeline in Fig 2.12, the early branch resolution introduces additional RAW hazards. The branch condition could depend on a preceding instruction about to enter the EX stage, and the lack of forwarding paths from the EX stage to the ID stage, where the branch is about to be resolved, could result in erroneous branching. However, forwarding paths can be added and the hazard-detection unit can be expanded to detect and handle this forwarding as shown in Fig 2.13.

Figure 2.13: The 5SP with branch resolution in ID stage with added forwarding paths [2]

In this section a simple MIPS I 5-stage pipeline (5SP) was outlined. More advanced pipelines are in use today and these are usually deeper than five stages. However, deeper pipelining increases the occurrence of data hazards, which necessitates a more complex control path. Adding more stages further decreases the logic per stage, but increases the number of dependencies at the same time, and ultimately deeper pipelines will be stalled more than their simpler counterparts. Furthermore, because of the minute logic in each stage, the setup time and input to output delay of the pipeline register become more prominent [2]. This causes diminishing returns and a minimum in execution time can be found at a specific number of pipeline stages. If energy is also considered, determining the number of stages becomes even more daunting because power increases linearly with frequency (see Sec. 2.1.1), which in turn grows higher with the number of pipeline stages. The optimum pipeline depth depends on the architecture and the specific program being executed; therefore there is no way to determine a general optimal number of pipeline stages [2].

Historically, processor pipelines grew deep to exploit the available ILP in the running instruction stream. However, ILP is limited and exceedingly deep pipelines, called super pipelines, only yielded marginally more performance while significantly increasing the power dissipation. The approach of the power wall, a practical upper limit on power dissipation due to discontinued Dennard scaling, and the advent of portable computing made energy efficiency an important design goal [33]. The design space became more complex as performance and energy efficiency on most occasions warrant different design choices.

2.3.2.1 Caching

As mentioned before, if the register-file were the only memory available to the processor, program complexity would be limited. However, as implied above, more memory is available and the ISA defines how the memory is interfaced with the processor. More available memory allows for more complex and useful programs to be created. However, this memory would need to remain fast even though its size is increased, so as not to slow down the processor. Such memory, if it existed, would be very expensive. A better solution can be found by taking into account how a program is executed. Only a limited part of the program is executed at any given time and, due to code constructs such as loops, the same parts of the program are likely to be executed in the near future. The insights of spatial and temporal locality can be used to construct a memory hierarchy that delivers on both speed and capacity at less expense [1]. A conceptual view of a memory hierarchy is shown in Fig. 2.14. In the figure, access times of the structures and their size are shown.
As can be seen, smaller memory is generally faster and kept closer to the processor. The register-file, as discussed, is a very small and fast structure integrated into the datapath. The cache is likewise integrated into the pipeline; it is larger than the register-file and consequently slower, but it is still fast enough to be accessed without imposing intolerable performance losses [1]. Fig. 2.14 depicts the cache as several layers, where the level-one (L1) cache is small and fast, followed by a larger but slower level-two (L2) cache. Modern designs stretch further, with larger lower-level caches located off chip or possibly integrated onto the chip [1]. Succeeding the caches is a larger and slower (two orders of magnitude (OOM)) main memory. Lastly, the largest and slowest part of the memory hierarchy is disk memory.

Figure 2.14: A conceptual overview of a memory hierarchy [1]

The memory hierarchy is a very complex system and describing it in its entirety is beyond the scope of this report. Instead, the report is limited to describing the highest part of the hierarchy, i.e., the caches.

In load-store architectures, such as MIPS I described in Sec. 2.3.1, the processor is only allowed to interact with the memory through dedicated load and store instructions [2]. Consequently, if the processor needs data that is not present in the register-file, a load instruction in the program instruction flow must bring it into the register-file. This load instruction is directed to the cache. If the data is found in the cache, a hit is generated and the data is sent to the processor. However, because caches are small, it is likely that the data is not present and a miss is generated. The memory access then continues searching in lower levels of the hierarchy until the data is found. To support overlapping instruction fetch and data access, the L1 cache is usually split into two separate caches, i.e., a level-one instruction cache (L1IC) and a level-one data cache (L1DC); this approach is referred to as the Harvard cache model. Lower levels in the memory hierarchy are usually unified, containing both data and instructions, according to the Princeton cache model [34].

Caches are arrayed structures where each row, usually referred to as a cache line, is addressable using the address generated by a store or load instruction [1]. The simplest way to structure a cache is called direct-mapped, where each memory address maps to one specific line within the cache. As caches are small, several memory addresses will map to the same location in the cache. Thus, the processor needs to be able to distinguish between the addresses. This is achieved by storing the higher-order address bits, called a tag, alongside the data and comparing these with the address used to access the cache [1]. The portion of the cache that stores the tags is referred to as the tag-array and the part that stores the data as the data-array. If the tag matches the address used to access the cache, a hit is generated. Conversely, if the tag does not match the address, a miss is generated and the cache line is replaced by the requested data brought in from lower levels in the memory hierarchy.

A cache that supports a more flexible address mapping is called n-way set-associative, where n denotes the flexibility of the mapping [1]. Assuming a four-way associative cache, the cache is effectively split into four ways, each able to store data mapped to a given cache address. At the extreme end of associativity is a fully associative cache, where every address maps freely into the cache. Associative caches require more hardware because each way needs to be searched for the proper cache line and the stored tag values compared with the requested address [1]. Furthermore, when an address maps to a full set and generates a miss, a victim selection mechanism needs to be in place.
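The direct-mapped lookup described above can be sketched in a few lines of Python. This is only an illustration of the tag comparison, not the RTL discussed in this report: the line size and line count are assumed values, the class and method names are hypothetical, and the data-array is omitted.

```python
LINE_BYTES = 32   # assumed line size: 8 words of 4 bytes
NUM_LINES = 128   # assumed number of cache lines


class DirectMappedCache:
    """Models only the tag-array of a direct-mapped cache."""

    def __init__(self):
        # One stored tag per cache line; None marks an empty line.
        self.tags = [None] * NUM_LINES

    @staticmethod
    def split(addr):
        """Split a byte address into (tag, index, offset)."""
        offset = addr % LINE_BYTES
        index = (addr // LINE_BYTES) % NUM_LINES
        tag = addr // (LINE_BYTES * NUM_LINES)
        return tag, index, offset

    def access(self, addr):
        """Return True on a hit; on a miss, replace the line and return False."""
        tag, index, _ = self.split(addr)
        if self.tags[index] == tag:
            return True
        # Miss: the line is replaced by data from lower levels of the hierarchy.
        self.tags[index] = tag
        return False


cache = DirectMappedCache()
first = cache.access(0x1000)     # cold miss
second = cache.access(0x1004)    # same line, so a hit
conflict = cache.access(0x2000)  # same index, different tag, so a miss
```

With these assumed dimensions, two addresses that are 4 kB apart share an index and evict each other, which is the behavior that motivates the more flexible associative mappings.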
Several victim selection techniques are available, but least recently used (LRU) or random selection schemes, or variations thereof, are usually employed. Again, this adds to the hardware overhead of using a set-associative cache.

Direct-mapped and associative caches both have their advantages and disadvantages. Direct-mapped caches are simple in terms of hardware but generally suffer from a lower hit rate than their associative counterparts [1]. In contrast, the higher hit rate of associative caches comes at the expense of more hardware and thus power dissipation overhead. Which scheme to use depends on the application [1].

Irrespective of which cache scheme is used, store instructions pose problems when it comes to memory coherency [1]. This problem is present in all layers of the memory hierarchy, but is more acute in caches, which are latency sensitive. A store operation will update the content of the L1 cache and, unless this is reflected in lower levels, e.g., the L2 cache, the memory is said to be incoherent. Should the L1 cache line be evicted, the line cannot be restored and data is irreversibly lost. Thus memory coherency is required to ensure program correctness. The simplest solution, known as write-through, is to let stores propagate down the memory hierarchy [1]. While simple, this solution increases the cache latency and therefore the execution time of the running application. A less penalizing alternative is the write-back scheme, which writes the cache line back to lower levels in the hierarchy only when it is evicted. The write-back approach requires extra bookkeeping hardware, called a dirty bit, to indicate whether a write-back operation is necessary. Naturally, variations and enhancements of these approaches exist, but they will not be discussed.

Because misses are expensive, the performance of a memory hierarchy is to a large extent determined by the miss rate, as shown in Eq. 2.7 [1].
$$\text{Cache access} = H_t + M_r \cdot M_p \qquad (2.7)$$

The hit time ($H_t$) is the time paid for a successful cache access, the miss rate ($M_r$) is the fraction of misses relative to the total access count, and the miss penalty ($M_p$) is the cost of accessing lower levels in the hierarchy. This formula can easily be extended to include more layers in the hierarchy, as shown in Eq. 2.8, where an L2 cache is included.

$$\text{Cache access} = H_{t,L1} + M_{r,L1} \cdot (H_{t,L2} + M_{r,L2} \cdot M_{p,L2} \ldots) \qquad (2.8)$$

Depending on the workload, an increased average access time can be very detrimental to performance [1]. As such, great care must be taken when designing the memory hierarchy.

2.4 Existing pipeline evaluation method

As stated in Sec. 1.1, this work aims at implementing a methodology that has been used with success at Chalmers University of Technology. The methodology builds on two components: an architectural simulator, and pipeline and cache RTL. This section will elaborate on these components, starting with the simulator in Sec. 2.4.1, followed by the RTL in Sec. 2.4.2 and lastly how they have previously been combined in Sec. 2.4.3.

2.4.1 Architectural simulator

SimpleScalar is an execution-driven functional simulator capturing both the behavior and performance of the simulated architecture [12]. The fact that it is execution-driven is essential, as this captures the dynamic behavior of the underlying architecture caused by branches and cache misses, which can have a dramatic impact on performance and energy. Furthermore, because SimpleScalar captures both the functionality and performance of the architecture, correct program behavior is ensured and accurate resource usage and time measurements (execution time in clock cycles) are possible [12]. It can be argued that the simulator is too old and limited to single-core designs in an age of multi-core processors.
Other more modern tools, with equal features or a superset of SimpleScalar’s features, such as McPAT or Gem5, are available, but were turned down in favor of SimpleScalar because SimpleScalar was sufficient for the relatively simple 5SP design that it was used to model.

SimpleScalar provides several different simulators of varying detail and speed [12]. The simplest and fastest simulator, called sim-fast, is a purely functional simulator that does not account for time (cycles). In contrast, the most complex simulator, sim-outorder, supports out-of-order issue, speculative execution and multiple issue while also accounting for time.

The structure of SimpleScalar is shown in Fig. 2.15. The bpred block defines the branch predictor behavior, the cache block defines cache behavior (cache size, associativity and replacement technique), the regs block defines register-related behavior and the memory block defines memory-related behavior. The simulator core defines the datapath’s architecture and is by far the most substantial block.

Figure 2.15: Modular structure of SimpleScalar

The simulators support configuration through configuration files that are provided to the simulator of choice when calling it from the command line [12]. The configuration files allow features such as branch resolution, cache parameters, speculative execution, decode width, issue width and number of functional units to be tweaked without the need to rebuild the simulator.

The simulator used in the methodology is based on a modified version of the sim-outorder simulator. The modifications were implemented to reduce the out-of-order pipeline modeled by the simulator to an in-order pipeline similar to the RTL pipeline described in Sec. 2.4.2 below. The base simulator was then augmented with specialized performance counters that tracked the usage of pipeline resources relevant to the project.
Because of frequent use in various projects, the base simulator had been modified, sometimes extensively, to fit the needs of each new project.

2.4.2 RTL design and verification

The RTL used in the methodology captures a 5SP MIPS I pipeline design developed at Chalmers. The 5SP has been enhanced with L1IC and L1DC caches with one-cycle access latency, also developed at Chalmers. This setup has since been used in several well-received publications [4][35].

The implemented microarchitecture features some 50 instructions, including different branch, logic and memory instructions, and a register-file with 32 general-purpose 32-bit registers. This microarchitecture does not include a floating-point unit to provide floating-point support, which was motivated by the targeted embedded market, where floating-point operations usually are replaced by fixed-point calculations.

An overview of the microarchitecture is shown in Fig. 2.16 and it is similar to the pipelines discussed in Sec. 2.3.2. The instructions are processed in five stages: IF, ID, EX, MEM, and WB. In the IF stage, instructions are read from the instruction cache at the address pointed to by the PC register, which is updated to point to consecutive instructions or to branch target addresses. During the ID stage the register-file is accessed and control signals for later stages are set based on the instruction type. Branch and jump instructions are resolved in the ID stage, but by the time they are resolved the next instruction has already been fetched. A branch delay slot is utilized to solve this problem and is accounted for by the compiler (see Sec. 2.3.2). In the EX stage, arithmetic and logic operations are executed in an arithmetic logic unit (ALU). A dedicated two-stage multiplication unit is also available, spanning the EX and MEM stages. In the MEM stage, loads and stores access the L1DC. Finally, in the WB stage, results are written back to the register-file.
In the RTL code the MEM and WB stages were combined to simplify the implementation. However, the combined stage logically functions as two separate stages.

A hazard-detection unit, which physically resides in the ID stage but is shown separately in Fig. 2.16, detects any potential hazards and stops the pipeline by stalling the IF stage. In this manner, NOPs are inserted into the pipeline. The cache also produces a stall signal, which is asserted upon a cache miss. In contrast to the hazard stalls, cache misses stall the entire pipeline; in Fig. 2.16, the arrows pointing to the pipeline registers denote the stall signals.

The microarchitecture does not support exceptions, but these are by design rare events. Exceptions are necessary to support I/O, system calls and recovery from errors (invalid opcode etc.).

Figure 2.16: Microarchitectural overview of the 5SP.

Figure 2.17: Memory hierarchy of the 5SP.

The design includes L1IC and L1DC caches. These caches are separate from each other according to the Harvard architecture to avoid structural hazards, as explained in Sec. 2.3.2.1. No L2 cache is included in the design; instead, an ideal memory module serves as a replacement for the lower levels of the memory hierarchy, as shown in Fig. 2.17. The data cache is available for read and write accesses, while the instruction cache only serves reads. However, the instruction cache still needs to access external memory on cache fills and in the case of a cache miss. The two caches share one memory bus to the external memory, and a memory controller (arbiter) decides which of the caches is allowed to access the external memory.

Both caches were designed to be flexible and allow for any size the user desires. However, because the RTL has been used in several projects, which sometimes required extensive modifications to the RTL code, much of this flexibility was lost.
Previously the associativity could be set to zero (effectively a direct-mapped cache), two-way or four-way, with LRU or pseudo-random as replacement algorithm. The cache also supported banking, whereby cache lines are split across separate memory macros.

2.4.2.1 Design and verification flow

The existing evaluation method loosely defines an RTL design and verification flow which has been used to verify and extract power from the pipeline RTL. The RTL design was first brought through a functional verification as described in Sec. 2.2, followed by a cell-based synthesis and then PnR. From the placed and routed netlist, power estimates of the design were obtained. As the power estimates are obtained from a complete pipeline design that has been synthesized and placed and routed to meet a set timing constraint, they capture the synergy between the different components in the pipeline. The synergy is due to the fact that logic paths stretch over several components, i.e., the speed of one component imposes speed requirements on subsequent components. These logic paths are then adapted to meet the imposed timing constraint, which is achieved through transistor sizing. However, the evaluation method focused solely on the caches of the RTL design, which allowed for probabilistic testing. The probabilistic approach was used to obtain power estimates of peripheral units of the cache such as the DTLB, arbiter and replacement logic. The power of the actual static random access memory (SRAM) memory cuts was obtained from the library files used during synthesis.

The existing evaluation methodology strikes a balance between quick prototyping and estimation accuracy but neglects scalability. Changes to the cache would require a complete reiteration of the design flow, with a lot of effort spent on PnR and power estimates.

2.4.3 Ad-hoc combination of RTL and simulator

The RTL and SimpleScalar components of the methodology are then combined in a way specific to each project.
Performance counters were introduced for each project and power estimates were extracted from RTL structures that were represented by these performance counters. An example of a prior application is the STA (Speculative Tag Access) project, where power estimates were extracted from the caches through probabilistic techniques and the simulator was augmented with performance counters that monitored cache access patterns [4]. Similar approaches were used in several other publications, with the main exception being the introduction of additional RTL structures and different performance counters [35][36][37]. However, the new RTL was not integrated into the pipeline but instead analyzed separately.

3 A unified evaluation framework

The goal of this work is to develop an energy estimation framework for pipelines that captures the ad-hoc methodology outlined in Sec. 2.4. As discussed, the pipeline register-transfer level (RTL) code and the SimpleScalar simulator constitute the two major components of the framework. The methodology was established prior to this work, albeit in an ad-hoc manner, so this section will instead elaborate on the methods that allow the two components to be integrated into one coherent framework.

3.1 Framework workflow

The conceptual workflow of the framework is shown in Fig. 3.1, where the RTL and the architecture simulator constitute the two branches of the flow.

The RTL branch consists of an RTL verification flow similar to the one described in Sec. 2.4.2.1, which will be implemented to verify the RTL and later netlists. Slight alterations to this approach are necessary for it to be scalable and applicable to a range of designs. For instance, the power estimation previously based on probabilistic methods is replaced by a use-case based method, which allows the whole design to be studied on a pipeline unit basis. Whether the power estimates are averaged or time-based is similarly a matter of scalability.
Time-based analysis produces considerably more data than averaged analysis but allows for detailed analysis of the power dissipation. In contrast, averaged analysis is easier to integrate into a scalable framework since less data are produced. However, as the power is averaged over a unit of time, the power dissipation of units with low utilization is amortized over the estimation interval. The issue can be addressed by introducing resource counters in the testbenches used in the verification flow and scaling the final average power according to the counters, as shown in Eq. 3.1.

$$P_{scaled} = \frac{P_{unscaled}}{\text{Utilization}} \qquad (3.1)$$

where Utilization is the fraction of the total execution time, in either seconds or cycles, during which the unit is used. Furthermore, doing the power estimates after the place and route (PnR) creates an issue of scalability, as all designs are different and would need to be placed and routed manually. Instead, one design will be placed and routed and serve as an indication of how much the power increases post PnR.

Figure 3.1: The methodology embodied in the CREEP framework. [Flowchart blocks: CREEP: START; RTL Verification; Architecture Simulation; Power estimation; Resource:Power mapping; CREEP: END]

The simulator branch consists of SimpleScalar as described in Sec. 2.4.1. The simulator is used to acquire accurate per-cycle resource usage using workloads that are impractical (impossible) to use during RTL power estimation due to their complexity and size. These resource usage statistics are then combined with the power estimates obtained from the RTL.

There are two issues that arise in the combination of the power estimates and simulator statistics. Firstly, the mismatch between resource counters and RTL power estimates must be bridged and secondly, the pipeline power must be mapped to resource counters in a manner that does not systematically over- or underestimate the energy.
The first issue stems from the form of the power reports: their units, their grouping into different sources of power dissipation, i.e., leakage and switching power, and whether they are time-based or averaged. Unit conversions are trivial and will be done to obtain energy per cycle in order to combine the power estimates with resource counters. The power grouping should likewise pose no issues but will require some consideration of scalability, as dealing with different types of power dissipation sources will complicate the framework workflow. The second issue is brought about by the different structure and granularity of the architectural simulator and the RTL code. The resource mapping will be done by carefully grouping the RTL pipeline units and introducing selective performance counters in the architectural simulator where necessary. The granularity of the resource mapping will be refined incrementally, from coarse to fine.

The combination of said energy estimates and performance counters will be automated in order for the framework to be able to reproduce results consistently, which is stated as a goal in Sec. 1.1. Furthermore, to meet the configurability goal, the user shall be able to configure the RTL and simulator components. Settings related to the cache and possibly to the pipeline will be exposed to the user, but the user will not make any changes to the components themselves. Instead, this process will also be automated, thus ensuring that the components are combined consistently. Inconsistencies between the two components would needlessly affect the energy estimates negatively.

3.2 Verification

Verification of the framework will be done at the component level. For the RTL, the functional verification is inherent to the established RTL verification and power estimation flow. However, the power estimates themselves need to be verified.
As the RTL has been used in previous projects, estimates for parts of the design, more specifically the caches, are available. These estimates will be used to do a rudimentary evaluation of the power estimates.

For the simulator, the modifications as well as the added resource counters need to be verified. The modifications can be verified by comparing the default performance counters to the counters produced by the modified simulator. Several counters should indicate in-order, non-speculative and single-issue behavior. The added resource counters can similarly be validated by comparing them to default counters produced by SimpleScalar.

4 Implementation

This chapter will outline the implementation of the Chalmers RTL-based energy evaluation framework for pipelines (CREEP), starting with the implementation of the register-transfer level (RTL) flow in Sec. 4.1.1, followed by the simulator in Sec. 4.1.2. How these two are integrated into the framework workflow is then discussed in Sec. 4.2. Lastly, the work related to automating the framework will be outlined in Sec. 4.3.

4.1 Implementation of framework components

The RTL of the framework describes a 5-stage pipeline (5SP) that supports the integer subset of the 32-bit MIPS I instruction set architecture (ISA) (see Sec. 2.4.2). However, in order for the RTL to fulfill the framework’s needs, several modifications to it were necessary. Furthermore, an integrated circuit (IC) design and verification flow with emphasis on scalability was implemented in order to obtain power estimates from a wide range of designs. The SimpleScalar simulator was similarly modified to fit into the framework.

For clarity this section is split up into three parts: The first part, Sec. 4.1.1, discusses the RTL modifications and establishment of the design and verification flow. The second part, Sec. 4.1.2, elaborates on the simulator modifications.
The last part, Sec. 4.1.3, deals with the framework’s configurability.

4.1.1 RTL modifications

As previously mentioned in Sec. 2.4.2, the caches were originally designed to have adjustable dimensions, associativity and replacement techniques. However, the pipeline design was used in other projects prior to the framework development, during which the level-one instruction cache (L1IC) and level-one data cache (L1DC) were fixed to a 16 kB, 4-way associative configuration with LRU as replacement algorithm. To meet the goal of configurability stated in Sec. 1.1, it was decided that the caches should be restored to their original flexible condition.

The main limitation of the cache components in their initial condition was set by the use of 1024x32-bit static random access memory (SRAM) memories for the data-arrays and 128x32-bit SRAM memories for the tag-arrays. This corresponded to 128 sets with a line size of 8 instructions for the L1IC or 8 words for the L1DC. Furthermore, the setup was locked to a 4-way configuration, producing a cache of 16 kB in total (4 × 1024 × 32 bits). The tag-arrays were optimized to fit four 21-bit tags (tag + dirty bit) into three SRAM memories rather than the conventional four memories (one per way). The rationale behind this was that three tag-arrays were sufficient to hold four tags (21 × 4 = 84 < 96). The code was modified to instead map the tags into four tag-arrays. This change allowed the associativity to be set within the original bounds of direct-mapped to 4-way associative. However, this caused an increase in the bit overhead in the tag-arrays, as only 21 of 32 bits were used. This was deemed unavoidable since SRAM memories matching the tag width of 21 bits were unavailable. Additional SRAM memories of different sizes were introduced to allow for additional cache sizes of 8 kB and 32 kB. Furthermore, banking was reintroduced, which can be used to divide the cache lines between several SRAM memories.
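The cache dimensions quoted above can be checked with a short calculation. The sketch below uses only the figures stated in the text (128 sets, 8 words per line, 4 ways, 21-bit tag entries, 32-bit wide SRAMs); the constant and function names are hypothetical and chosen purely for illustration.

```python
WORD_BITS = 32        # width of one data word and of one SRAM row
WORDS_PER_LINE = 8    # line size of 8 instructions / words
SETS = 128            # number of sets
WAYS = 4              # associativity of the fixed configuration
TAG_ENTRY_BITS = 21   # tag + dirty bit
TAG_SRAM_WIDTH = 32   # row width of one 128x32-bit tag SRAM


def data_words_per_way():
    """Rows needed in one data-array SRAM (one way)."""
    return SETS * WORDS_PER_LINE


def cache_bytes(ways=WAYS):
    """Total data capacity in bytes."""
    return ways * data_words_per_way() * WORD_BITS // 8


def packed_tag_srams(ways=WAYS):
    """Tag SRAMs needed when the tags of all ways are packed side by side."""
    total_bits = ways * TAG_ENTRY_BITS
    return -(-total_bits // TAG_SRAM_WIDTH)  # ceiling division


def unused_tag_bits(ways=WAYS):
    """Unused bits per set when each way gets its own tag SRAM."""
    return ways * (TAG_SRAM_WIDTH - TAG_ENTRY_BITS)


print(data_words_per_way())  # 1024 rows, matching the 1024x32-bit data SRAM
print(cache_bytes())         # 16384 bytes, i.e. 16 kB
print(packed_tag_srams())    # 3 SRAMs for the packed layout (84 <= 96 bits)
print(unused_tag_bits())     # 44 unused bits per set with one SRAM per way
```

The last figure quantifies the bit overhead accepted in exchange for a configurable associativity: 11 of every 32 tag bits are unused in each way.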
In addition to the LRU replacement algorithm, the pseudo-random replacement algorithm was also reintroduced.

Because of licensing issues, no SRAM memories can be shipped with the framework. To make the RTL available without SRAMs, the caches were augmented with the ability to use logic-based memory (flip-flops). Compared to SRAM, logic-based memory produces caches that are larger in area and dissipate more power.

Other ways of allowing more flexibility in the pipeline design were considered, but the remaining options were related to the datapath. Changes to the datapath, such as allowing wider issue width, speculative execution or different branch resolution techniques, would all essentially warrant a complete redesign of all or some pipeline stages. Furthermore, allowing such flexibility was outside the scope of the framework, which targets simpler embedded processors.

4.1.1.1 Design verification flow

A verification flow built on the methods described in Sec. 2.4.2.1 was established. Cadence IES was the electronic design automation (EDA) tool of choice for hardware description language (HDL) simulations. More specifically, NCVHDL was used to compile the RTL code, NCELAB was used to elaborate the design and NCSIM was used for simulations. To verify the design, a testbench was constructed around the pipeline design. As the design is complex, test vectors were chosen as stimuli for the design. More specifically, the vectors were directed to test the design’s implementation of the MIPS I ISA through instruction set simulations using executables compiled for the MIPS I ISA. RTL simulations of large designs are time consuming, but can be made tractable through the use of small and effective workloads with ample test coverage. A good match was found in the EEMBC benchmark suite, which targets embedded processors. The benchmarks in the suite are lightweight and utilize fixed-point arithmetic. The benchmarks used in the framework are listed below:

• Autocorrelation
• Convolutional Encoder
• FFT/IFFT
• Viterbi Decoder
• RGBCMY01 (Consumer RGB to CMYK)

The next stage in the verification flow is synthesis. The EDA tool of choice for synthesis was Synopsys Design Compiler (DC), which was chosen because it is the de-facto standard tool used by the community. The synthesis was done using the compile_ultra command and cells from the 65 nm low-power (LP), low-threshold, worst-case corner library (1.1 V, 125 °C) provided by ST Microelectronics. A corresponding library was used for the SRAM memories in the caches. These libraries represent the worst-case process corner and use scenario (extreme temperature and low voltage) and were used because of the strict performance requirements placed on ICs. Additionally, automatic clock-gating was enabled in an effort to reduce the design’s dynamic power dissipation [38]. Clock-gating is widely used in the industry because of its potentially large energy savings at little extra design effort. Most EDA tools specializing in synthesis support it, DC included. The synthesis was carried out for increasingly strict timing constraints to find the maximum achievable clock frequency of the design, and the design was established to meet a timing constraint of 2.5 ns, producing a netlist running at 400 MHz. The netlist verification was done using the same testbench developed for RTL verification.

The final stage is place and route (PnR), for which Cadence Encounter was used. As described in Sec. 2.4.2.1, PnR produces yet another netlist, but this netlist has now been subjected to a number of structural changes to facilitate physical implementation. PnR was necessary to include in the verification flow for two reasons. Firstly, the utilized SRAM memories are already placed and routed and thus dissipate significantly more power than the rest of the design unless this is also placed and routed.
Secondly, a placed and routed design allows for more accurate power estimations than a post-synthesis design (see Sec. 2.2). However, the PnR stage is unique to each design, which conflicted with the desired scalability of the verification flow. A more scalable approach was implemented that estimated the PnR impact on power dissipation by comparing the power of one placed and routed netlist to that of a post-synthesis netlist; from this comparison a scaling factor could be deduced. The rationale behind this approach was that the pipeline was not subjected to any modifications (see Sec. 4.1.1) and remains relatively unaffected by the configuration of the caches.

4.1.1.2 Design power estimations

It was not possible to use the power-estimation method of the ad-hoc methodology to obtain power estimates of the design (see Sec. 2.4.2.1). The main reason was the larger scope of the power estimates, which previously were limited to the L1DC, but now included the entire pipeline. As such, using a probabilistic approach was infeasible. Instead, the power dissipation of the design was estimated by using use-case statistical simulations, whereby switching activities for the nodes in the design were obtained. Two different statistical methods were considered. The first was switching activity interchange format (SAIF) generation. During SAIF generation the average switching activities of the nodes in the design are recorded throughout simulation, which allows an average power estimate of the design to be produced. The second was value change dump (VCD) generation, which tracks the nodes’ switching on a per-cycle basis and thus allows for time-based power analysis. The VCD-based method allows for detailed analysis of the power dissipation, e.g., maximum power dissipation analysis, but the usage of VCD is computationally complex and hence less scalable than the SAIF-based method.
Thus, VCD generation was dropped in favor of SAIF generation. Cadence NCSIM was used to simulate the netlist using the aforementioned RTL testbench and the previously listed EEMBC benchmarks as stimuli. A total of five SAIF generations were done (one per EEMBC benchmark).

Synopsys PrimeTime (PT) was the EDA tool used to generate the final power estimates [39]. PT was first used to remap the gate netlist to a different cell library from the one used during synthesis. In contrast to the synthesis, which was done with the worst-case high-temperature corner library, the nominal-nominal variation (1.2 V, nominal, 25 °C) was used to generate the power reports. The reason for using the nominal corner and nominal voltage and temperature library was to provide nominal power estimates for the pipeline design and thus allow different pipeline configurations to be compared under normal circumstances. The power estimation was done by reading the netlist and each of the aforementioned SAIF files. Thus, a total of five power reports were produced and averaged to create the final design power estimate. Hierarchical reports were produced for the design and the granularity of these reports was tweaked to reveal major pipeline units within each pipeline stage. However, the granularity of these reports was later tweaked to better suit the performance counters generated by the simulator component (Sec. 4.1.2). PT reported the power divided into three different categories: 1) switching power, 2) internal power and 3) leakage power. To simplify the workflow, the sum of all these powers was used for each reported pipeline unit.

As discussed in Sec. 3.1, average power reports amortize the power of certain units over the power estimation interval. Units such as the arithmetic logic unit (ALU), multiplier and L1DC are associated with enable signals that prompt them to activate, i.e., start switching and dissipating power.
Unless the power of these units is scaled according to their usage, the framework would greatly underestimate their contribution to the final energy results. The solution to this problem, which was discussed in Sec. 3.1, required information on how many cycles the affected units were used during the power estimation interval. The usage information was obtained by augmenting the RTL testbench used during verification and power estimation with counters that were incremented when the enable signal for these structures was asserted. These counters were then divided by the total number of cycles, also tracked by the testbench. The scaling factor was then computed as shown in Eq. 4.1.

$$\text{Utilization} = \frac{\text{active\_cycles}}{\text{total\_cycles}} \qquad (4.1)$$

The power dissipation reported by PT was then divided by this utilization factor, as shown in Eq. 4.2, to obtain the final power values used in the framework.

$$P_{scaled} = \frac{P_{unscaled}}{\text{Utilization}} \qquad (4.2)$$

As discussed briefly in Sec. 3.1 and Sec. 4.1.1.1, a scalable approach to power estimation of a placed and routed design was necessary for the scalability goal stated in Sec. 1.1. An attempt was made at running the post-PnR netlist through the implemented verification flow, but this was met with technical issues that proved hard to solve. Instead a probabilistic approach was adopted, which limited the scope at which the design could be analyzed. Since the SRAM memories come placed and routed, the PnR scaling should only be applied to combinatorial pipeline units. Hence, the ALU was chosen as a representative combinatorial unit and switching activities were assigned to the ALU input. The power dissipation was then extracted from a synthesized and a post-PnR netlist. Then the PnR factor was derived from the fraction of the PnR power to the power based on the synthesized design as shown in Eq. 4.3. The PnR-scaling factor was then applied to all combinatorial units in the pipeline.
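The utilization scaling of Eq. 4.1 and Eq. 4.2 can be sketched as follows. This is a minimal illustration rather than the framework's actual scripts; the function names and the numbers in the example are hypothetical.

```python
def utilization(active_cycles, total_cycles):
    """Eq. 4.1: fraction of cycles in which the unit's enable signal was asserted."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    if not 0 <= active_cycles <= total_cycles:
        raise ValueError("active_cycles must lie within [0, total_cycles]")
    return active_cycles / total_cycles


def scale_power(p_unscaled, util):
    """Eq. 4.2: undo the amortization of an averaged power figure."""
    if not 0 < util <= 1:
        raise ValueError("utilization must lie within (0, 1]")
    return p_unscaled / util


# Hypothetical unit: 2.0 mW average power, enabled in 250 of 1000 cycles.
util = utilization(active_cycles=250, total_cycles=1000)
p_active = scale_power(p_unscaled=2.0, util=util)
print(util, p_active)
```

In this made-up example the unit is active a quarter of the time, so its power while active is four times the averaged value. A unit that is never enabled has a utilization of zero and cannot be scaled this way, which is why the sketch rejects that case.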
$$PnR_{scaling} = \frac{ALU_{synth}}{ALU_{PnR}} \qquad (4.3)$$

A similar estimation was done for the clock-tree power, which is small in a synthesized netlist. The limited clock power dissipation accounted for in a post-synthesis design is related to the clock pins on registers in the design. Hence, the majority of the difference in clock power dissipation between a synthesized netlist and a post-PnR netlist is due to