Anti-Aging Accelerators
A study on the effects of GPU degradation in AI applications

Master's thesis in Computer Science and Engineering

Björn Forssén

Department of Computer Science and Engineering
Chalmers University of Technology
University of Gothenburg
Gothenburg, Sweden 2026

© Björn Forssén, 2026.

Supervisor & Examiner: Pedro Petersen Moura Trancoso, Department of Computer Science and Engineering

Master's Thesis 2026
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2026

Abstract

By using a model based on first principles and relaxing some constraints, it is possible to reason about the sustainability effects of aging AI accelerator hardware. First-order principles have their limits, however, and despite this effort to reason about the replacement of AI accelerator hardware, significant work remains to be explored. This work presents some common considerations for reasoning about the sustainability of computer hardware; some of these considerations are based on current trends and therefore reflect the situation at the time of writing. A model based on first principles is then explained and used to reason about the replacement of AI accelerator hardware, combined with software simulation of faulty hardware in place of actual faulty hardware. Finally, we present some of the lessons learned from testing the first-order model and conclude with suggestions for further work on the sustainability of AI accelerators.

Keywords: Computer, science, computer science, engineering, project, thesis, DNN, reliability, carbon footprint, sustainability.

Acknowledgements

I wish to thank my supervisor and examiner Pedro Petersen Moura Trancoso for his patient guidance and support. Additionally, many thanks to my family for their continued encouragement and support.

Björn Forssén, Gothenburg, 2026-02-01

Contents

List of Figures
1 Introduction
  1.1 Research questions
  1.2 Limitations
  1.3 Ethics considerations
2 Background
  2.1 Carbon footprint of computers
    2.1.1 Embodied footprint
    2.1.2 Operational footprint
    2.1.3 Scope 1, scope 2, scope 3 emissions
    2.1.4 Data uncertainty
    2.1.5 Scaling trends
  2.2 Estimating carbon footprint with a simple first order model
    2.2.1 First order embodied footprint
    2.2.2 First order operational footprint
    2.2.3 Carbon footprint model
    2.2.4 Carbon footprint model with replacement
  2.3 Errors in accelerators
    2.3.1 Hard errors
    2.3.2 Soft errors
    2.3.3 Error mitigation
3 Method
  3.1 Implementing the carbon footprint model
  3.2 Error injection
    3.2.1 Error injection in software
    3.2.2 Error injection in hardware
    3.2.3 Error injection with PyTorch
4 Results
  4.1 Experiment setup
    4.1.1 AlexNet
    4.1.2 ResNet18
    4.1.3 Results and graphs
  4.2 Fault injection in ResNet18's convolutional layers
    4.2.1 Converting errors to loss in accuracy
  4.3 Effect of model parameters
    4.3.1 Balance between embodied and operational footprint
    4.3.2 Increased operational footprint with faulty hardware
  4.4 Replacement parameters
    4.4.1 Different levels of fault tolerance
  4.5 Summary
5 Conclusions
  5.1 Accelerators towards end of life
  5.2 Limitations encountered
  5.3 Further work
    5.3.1 Long-term studies
    5.3.2 Better methods for error mitigation
    5.3.3 Better understanding of user attitude towards faulty hardware
Bibliography
A Algorithms used

List of Figures

2.1 Every 3rd year the chip is replaced, which incurs the embodied footprint; the operational footprint is then split over the 3 year lifespan.
2.2 Every time the chip is replaced the increase in embodied footprint is taken into account.
2.3 An example with an altered replacement policy which replaces its chip less often.
2.4 Normal adder and a faulty adder.
4.1 AlexNet with errors inserted into the images.
4.2 AlexNet with errors inserted during inference.
4.3 ResNet18 with errors inserted during inference.
4.4 Execution results are interpreted and the statistics get turned into graphs.
4.5 Increase in error rate for ResNet18.
4.6 Approximated error to accuracy curve.
4.7 Carbon model simulation example.
4.8 Low α model.
4.9 High α model.
4.10 Varied α model.
4.11 Increased operational footprint model.
4.12 Varied fault tolerance level.

1 Introduction

Since the start of the AI boom, interest in Deep Neural Networks (DNNs) has kept increasing. While early DNNs relied on Graphics Processing Units (GPUs) for computing power, more modern DNNs can leverage specialized accelerators for even better performance. With more AI accelerators being announced and released each year, it is safe to say that the demand for and usage of accelerator hardware remains high and is likely to continue for some time [1].

With this increasing interest, more DNNs are being trained and more hardware is needed to support them [2]. The growing number of DNN users means more hardware units, more power to support the user base, and so on. The hardware itself is also having an increasingly large effect on the environment, with more complex production processes requiring rare materials to be used in power-intensive processing steps. These factors combine to set an upward trend for the carbon footprint of the Information and Communication Technology (ICT) sector, which is estimated on the high end to be 2.1% to 3.9% of global Greenhouse Gas (GHG) emissions [3]. This share is not massive in and of itself, but because the sector is connected to other, more polluting sectors, it is important to keep the carbon footprint trend in mind.

With the objective of keeping the carbon footprint in mind for future developments of AI accelerators, we propose an extension to the model presented by Eeckhout [4]. The extension takes advantage of DNNs' ability to resist errors in order to mitigate some amount of degradation in the accelerator hardware, allowing users to reduce the number of times they need to replace their hardware and thus lowering the demand for accelerator hardware. This error resilience is similar to how synapses in the human brain are able to tolerate noisy signals; the effect is limited, however, and is determined by network design, which may introduce spare capacity in the model to reach the correct answer even given some errors [5]. In order to implement this extension, we propose some new variables and discuss some scenarios to demonstrate the potential effect of an altered replacement policy. Some of the proposed variables were approximated using available data, but we also attempted to quantify the drop in application accuracy caused by hardware errors by using pre-trained versions of ResNet18 and AlexNet for image classification on the ImageNet Large Scale Visual Recognition Challenge 2012 dataset [6].

1.1 Research questions

1. Accelerator hardware is designed to last for a specific amount of time; how does it start to behave towards the end of its lifetime and beyond?
2. When the hardware starts causing errors, what effects does this have on AI applications?
3. For larger users of accelerator hardware, the decision to replace ageing hardware is not a simple one. With respect to environmental impact, what are some of the most important considerations?

1.2 Limitations

Due to the availability of hardware and simulator software, this paper focuses on the usage of GPUs in AI applications.
This excludes hardware such as FPGA accelerators, NPUs, and other processors designed for AI applications; as a result, different patterns or limits of degradation depending on hardware type are not considered. The only application considered is image recognition, due to the availability of test data and the limited computational power available; this means that the resilience of different networks based on application type is not considered.

1.3 Ethics considerations

The energy usage of DNNs can be significant, but since this project uses common pre-trained models, the resource cost of training is divided among many users, and together with a limited amount of inference the energy usage of this project is kept within reasonable limits. Although the ILSVRC2012 dataset is quite old and does not necessarily live up to recent standards, it is only used to determine model degradation through inference, so there should be minimal disruption from the dataset despite its age.

2 Background

In this chapter we present some common considerations when reasoning about the sustainability of computer hardware. This includes the footprint incurred from production and usage, represented here and later as the embodied and operational footprint. We also present and discuss some of the ways to calculate emissions, alongside some considerations for uncertainty in the data, as well as some overarching trends. We also present the carbon footprint model and the proposed extension to explore different options for replacement policies. As part of this proposed extension there are several new variables, where some have been estimated using available data and others will be quantified with error simulation. The variables that are to be quantified relate to the effect that errors in the hardware have on the application results; in particular, by using ResNet18 for image classification we study the decrease in accuracy as a function of an increasing number of errors.

Carbon footprint refers to the release of GHGs, in this case by accelerators, during various parts of their lifetimes. In this chapter we discuss the embodied and operational footprint, but a significant part of an accelerator's life is its end-of-life. That part is not covered here; this project instead focuses on the footprint up until a chip reaches its end-of-life.

2.1 Carbon footprint of computers

The carbon footprint of computers encompasses a broad range of environmental impacts associated with their lifecycle, from production to disposal. This section explores key components of a computer's carbon footprint, beginning with the embodied footprint tied to material extraction, manufacturing, and transport, followed by the operational footprint, which reflects emissions from energy consumed during usage. It further examines Scope 1, Scope 2, and Scope 3 emissions in section 2.1.3, providing a framework to categorize direct and indirect emissions from production and use. Given the complexity of tracking accurate emissions data, data uncertainty remains a critical challenge, impacting efforts to accurately assess and reduce emissions. Finally, this section addresses scaling trends as computing demand rises, considering how emissions evolve alongside increasing production and use.

2.1.1 Embodied footprint

The embodied footprint of computers refers to the total environmental impact that occurs throughout the entire lifetime of a computer system, from the extraction of raw materials to the disposal of the system. This includes energy consumption and resource use during manufacturing, transportation, and end-of-life disposal, as well as the waste generated in these stages. The concept contrasts with operational energy use, which accounts only for the energy consumed while the device is in use.

Computers rely on a wide range of materials, including metals like aluminum, copper, and gold, Rare Earth Elements (REEs) such as neodymium, and tantalum, all of which can be found in computer systems [2]. These materials are essential for various components like circuit boards, semiconductors, and batteries. Mining these resources is energy-intensive and often involves significant environmental degradation, including habitat destruction, soil and water pollution, and high carbon emissions. Additionally, the extraction of REEs frequently produces hazardous waste that can contaminate surrounding ecosystems.

The manufacturing phase contributes significantly to a computer's embodied footprint. The process involves energy-intensive activities, including the fabrication of semiconductors, assembly of components, and production of peripherals. Manufacturing facilities often rely on energy from fossil fuels, contributing to greenhouse gas emissions. The complex supply chains involved in producing computers also increase their overall environmental impact, as materials are often shipped across the globe multiple times before the final product is assembled [7].

Once manufactured, computers must be transported from production sites to distribution centers and retailers around the world. The embodied footprint at this stage includes the carbon emissions associated with the transportation of goods via air, sea, and land. The energy used to store and package computers also adds to their environmental cost [2].

The final stage of a computer's life is often problematic from an environmental standpoint. Many computers are disposed of in landfills or improperly recycled, leading to the release of toxic substances such as lead, mercury, and cadmium into the environment. Electronic waste can have a long-lasting impact, polluting water sources, soil, and air. Furthermore, the low recycling rates of critical materials like REEs mean that new extraction processes are continually required, perpetuating the environmental damage associated with raw material extraction [8]. As an example, Google's TPU v4i has an estimated embodied footprint of 386 kg CO2e, the TPU v5e 402 kg CO2e, and the TPU v6e 692 kg CO2e [9].

2.1.2 Operational footprint

The operational footprint of computers relates to the environmental impact associated with their use over their functional lifetime; mainly this is made up of the energy consumed and the resulting greenhouse gas emissions. While the embodied footprint accounts for environmental impacts incurred before and after use, the operational footprint encompasses the energy demands, cooling requirements, and other resources consumed during active usage. It is also likely the most significant footprint: for specialised AI hardware it makes up 70% to 90% of total emissions [9].

Computers consume a significant amount of electricity while in operation. This electricity is generated from a variety of sources, such as coal, oil, natural gas, solar, and wind, and its production still emits carbon dioxide and other greenhouse gases. High-performance computers, such as those used in data centers, are particularly energy-intensive and contribute substantially to global electricity consumption [10].
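To give a sense of the magnitudes involved, operational emissions can be estimated from power draw, usage time, and grid carbon intensity. The figures in the short sketch below are hypothetical and chosen only for illustration; they are not measurements from this project.

# Hypothetical illustration: yearly operational footprint of one accelerator.
# All numbers are assumed placeholders, not measurements from this project.
power_kw = 0.4                         # assumed average board power draw in kW
hours_per_year = 24 * 365              # assume continuous operation
grid_intensity_kg_per_kwh = 0.4        # assumed grid carbon intensity

energy_kwh = power_kw * hours_per_year                        # about 3504 kWh
operational_kg_co2e = energy_kwh * grid_intensity_kg_per_kwh
print(f"{operational_kg_co2e:.0f} kg CO2e per year")          # about 1400 kg CO2e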
Besides consuming electricity, computers also generate considerable amounts of heat. This creates a need for cooling systems in order to maintain optimal performance and prevent overheating. Cooling systems, especially in data centers, are often energy-intensive, which adds to the overall operational footprint. Large-scale facilities sometimes use water for cooling, part of which is consumed; this creates an additional demand for water that can have an environmental impact when a facility is located in a water-scarce region.

A computer's operational footprint also includes the power needed to run connected devices, such as monitors and external storage devices, as well as networking infrastructure. Networked devices often operate continuously to maintain connectivity, consuming energy even when they are not in active use. In addition, the type and efficiency of software running on computers also affect their operational footprint. Poorly optimized software can increase a computer's processing load, leading to greater energy consumption. And most importantly, the way individuals use computers can have a substantial impact on their operational footprint. Users who fail to adjust power settings or use resource-intensive applications unnecessarily can significantly increase the energy demand. Education around sustainable practices, such as not using computer resources unnecessarily, can help reduce the operational footprint.

2.1.3 Scope 1, scope 2, scope 3 emissions

Based on the GHG protocol, both Scope 1 and Scope 2 emissions contribute to an entity's overall carbon footprint, but they represent different types of environmental impact. Scope 1 emissions come directly from an entity's own activities and offer more immediate feedback on emissions. Scope 2 emissions, however, are caused indirectly by an entity's actions and often depend on the carbon footprint of third parties; this makes them more challenging to control or understand, but they can still be influenced by long-term strategies and selective cooperation with parties who have a lower carbon footprint.

The most fundamental distinction between scope 1 and scope 2 emissions lies in whether the emissions are directly or indirectly caused by an entity's activities. Scope 1 emissions are direct emissions that occur from sources controlled or owned by the entity. In the computer industry, this can include emissions from manufacturing processes that require fuel combustion, transportation of products or intermediate products, or emissions from facilities operated by the entity. Scope 2 emissions, on the other hand, are indirect emissions from the generation of energy that an entity purchases. For example, when a computer manufacturer buys electricity to power its data centers or facilities, the emissions generated by the power plant providing that electricity are counted as scope 2 emissions. (Emissions generated by the providers of purchased materials for manufacturing, by contrast, fall under scope 3.)

Scope 3 emissions represent the most extensive and complex category of emissions in the carbon footprint of computers, encompassing all indirect emissions that occur across an entity's value chain, outside of its direct operations and purchased energy. These emissions originate from a variety of sources that are harder to measure and control, including upstream activities like material sourcing and downstream activities like product disposal. One important difference from scope 2 emissions is that scope 3 emissions also include downstream emissions, for example the emissions generated after a computer has been sold and is used by any number of entities with different policies and practices until its end of life.

2.1.4 Data uncertainty

Data uncertainty is a significant challenge in accurately assessing the carbon footprint of computers, affecting both embodied and operational emissions. The complexity of modern supply chains, variations in production processes, and differences in energy sources contribute to uncertainty at nearly every stage of a computer's lifecycle. This uncertainty complicates efforts to calculate and compare carbon footprints, since the figures can vary across types of computers and across users with different requirements for manufacturing and usage.

Computers are built from many different components, each of which can require different raw materials, manufacturing processes, and global transportation routes. For example, a single computer may contain components sourced from multiple countries, each with distinct energy sources and emissions. Collecting accurate emissions data for each stage of material extraction, transportation, and production is challenging, and small variations can lead to substantial differences in calculated emissions. Supply chain complexity thus introduces high variability and uncertainty into embodied emissions calculations. Different manufacturing techniques and technologies further contribute to data uncertainty. Semiconductor fabrication, for example, is an energy-intensive process with emissions that can vary widely depending on the technology, the age of the equipment, and the specific energy source used. Even within the same company, manufacturing emissions can differ significantly across facilities or production lines. Inconsistent practices or a lack of precise emissions data at the factory level make it challenging to obtain reliable data for specific products or batches, further complicating accurate carbon footprint assessments [2].

The operational footprint of computers heavily depends on the energy sources available in the regions where they are used and produced. Regions with a high reliance on fossil fuels, for instance, can contribute more emissions than those that draw from renewables like wind, solar, or hydroelectric power. However, companies and users in many regions do not have full control over the exact energy mix powering their facilities, leading to uncertainty in accurately attributing emissions. Standardized methods for tracking and reporting emissions data are still developing, especially within the rapidly changing tech industry. While frameworks like the Greenhouse Gas Protocol provide guidelines, there is often variability in how companies apply these standards. For example, indirect emissions (Scope 2 and Scope 3) are sometimes estimated based on industry averages rather than precise data from specific suppliers or energy providers, which can introduce inaccuracies. Smaller suppliers or manufacturers may also lack the resources or systems to measure their emissions accurately, forcing companies to rely on generalized assumptions. The operational carbon footprint of a computer is also influenced by the way end-users interact with their devices. User behavior such as idle time, frequency of use, and power settings varies widely, making it difficult to standardize operational emissions data.
Moreover, variations in device lifespans, such as extended use through upgrades versus early disposal and replacement, further add to uncertainty. Tracking emissions based on assumptions about average usage patterns may not reflect real-world use, leading to either overestimations or underestimations of a computer's operational footprint [2].

The end-of-life stage introduces significant uncertainty due to differing practices in recycling, disposal, and refurbishment. Many computers end up in informal recycling sectors where emissions data is not tracked or reported accurately. Additionally, the environmental impact of e-waste varies depending on whether devices are recycled for materials, refurbished for reuse, or discarded in landfills. Inconsistent data on recycling rates, disposal methods, and emissions associated with these activities adds further uncertainty to the total carbon footprint.

2.1.5 Scaling trends

As global reliance on computing grows, scaling trends within the computer industry have significant implications for its carbon footprint. The increasing demand for devices, from personal electronics to industrial servers, has led to a rise in production, which in turn amplifies both embodied and operational emissions. Additionally, the growth of cloud computing, artificial intelligence, and data analytics has accelerated the expansion of data centers, many of which are energy-intensive and contribute substantially to operational emissions. Efforts to improve energy efficiency in both hardware and data centers have mitigated some environmental impacts; however, overall emissions continue to increase as computing demand outpaces efficiency gains [3] [11].

2.2 Estimating carbon footprint with a simple first order model

The work of Eeckhout [4] presents a model to assess computer architecture sustainability based on first principles, and in this project we expand upon it to assess long-term sustainability with regard to the replacement of chips.

2.2.1 First order embodied footprint

The chosen proxy for embodied footprint in Eeckhout [4] is chip area. This is based on the fact that a main component of semiconductor production is the silicon wafer, and with a formula from de Vries [12] the area can be used as a proxy for the entire embodied footprint.

2.2.2 First order operational footprint

The operational footprint can be understood as the footprint caused by the operation of a chip during its lifetime and can be considered in two scenarios, fixed-work and fixed-time. The fixed-work scenario is when a chip does a fixed amount of work during its lifetime; this implies that a more efficient chip doing the same work could lower the operational footprint. This however does not take Jevons' paradox [13] into account, so a common scenario is that the efficiency gains are offset by the user issuing more work. The fixed-time scenario takes this into account and provides a different way to characterize the operational footprint. Fixed-time assumes that the amount of time a chip is used remains constant regardless of which chip is used; with this in mind, the chosen proxy for characterizing the operational footprint is the power of the chip [4].

2.2.3 Carbon footprint model

Putting the embodied footprint and the fixed-time operational footprint together we get equation 2.1, where A is the chip area and P is the chip power. The parameter α is chosen in (0, 1) to represent the balance between the embodied and operational footprint, and an appropriate value has to be chosen for different applications.
This equation gives the total footprint for a chip over its lifetime and can be extended to include additional chip characteristics, such as increasing power usage in more powerful chips or different manufacturing processes for different generations of chips.

F_fixed-time = α · A + (1 − α) · P    (2.1)

2.2.4 Carbon footprint model with replacement

To characterize the carbon footprint caused by replacement policies it is necessary to model several generations of chips. The footprint of each generation is characterized by equation 2.1 and can be separated into the embodied footprint α · A and the operational footprint (1 − α) · P. Since the operational footprint represents the footprint over the chip's lifetime, it is divided into equal parts over the lifetime, with one part incurred each year. Meanwhile, the entire embodied footprint is incurred in the first year of operation of the chip, and once the chip is replaced the entire embodied footprint is incurred again. An example of this model with α = 50% and the chip being replaced every 3rd year is shown in Figure 2.1.

Figure 2.1: Every 3rd year the chip is replaced, which incurs the embodied footprint; the operational footprint is then split over the 3 year lifespan.

Another aspect the model allows us to characterize is that, with increasingly complicated manufacturing processes, the embodied footprint is increasing every year, with several requirements of the manufacturing process expected to grow by around 10% each year [14]; this model therefore assumes a 10% increase in the embodied footprint each year. Since insufficient data is available, the operational footprint is assumed to be constant, but different scenarios will be discussed in chapter 4. The model as described, with a 10% increase in the embodied footprint each year, α = 50%, and the chip being replaced every 3rd year, is shown in Figure 2.2.

The final variable to be addressed is when to replace a chip: after a fixed amount of time or by some more flexible criterion. Replacing a chip after a certain amount of time is intended to be analogous to replacing a chip once it reaches its end-of-life and correct execution is no longer guaranteed by the manufacturer. For this project a constant replacement time of 3 years was chosen based on the data from the Titan supercomputer [15]. Besides fixed time, this project has looked into replacing a chip after a certain amount of accuracy has been lost due to errors. In this case there is a need to determine when errors start to appear (T_time_to_errors), at what rate errors appear (T_error_appear_rate), what impact these errors have on the accuracy (f_error_to_accuracy), and at what loss of accuracy the chip is replaced (R_replacement_threshold). For this project, some of these variables will need to be assumed and explored, since not a lot of data is available to estimate them. What will be estimated, however, is the effect of errors on the accuracy, and this will be done through error injection.

Figure 2.2: Every time the chip is replaced the increase in embodied footprint is taken into account.

Putting these considerations into the carbon footprint model, we get a slightly more complicated formula, but the principle is the same. First, every time a chip is replaced the embodied footprint is incurred, and as before it keeps growing each year.
Secondly, the operational footprint that previously covered the entire lifetime of the chip is now expressed as an operational footprint per year, F_op_per_year = (1 − α) · P / lifetime_in_years, and each year of operation incurs this footprint. This means that once errors start to appear, the operational footprint is assumed to remain the same. An example of this model with T_time_to_errors = 3 years, T_error_appear_rate = 10 per year, f_error_to_accuracy = x → x/100, and R_replacement_threshold = 20% is shown in Figure 2.3.

Figure 2.3: An example with an altered replacement policy which replaces its chip less often.

2.3 Errors in accelerators

In accelerators, errors are considered at a very low level: the individual bit level. Hardware errors typically occur in the transistors, whose effect is typically only a single bit, and as such either a piece of data or a control signal is corrupted. These errors then propagate throughout the accelerator until they are countered by an error correction system, absorbed by the application, or affect the final result. In this project we focused on injecting errors in software and as such are not modelling any specific type of hardware error; however, some suggestions for how to implement the different types of errors are provided.

While these errors can occur in any part of the accelerator, in this project we are interested in the memory and the compute units. These errors occur because of several physical phenomena such as Negative-Bias Temperature Instability, Hot Carrier Injection, and Electromigration. The effects of these errors can be categorized into two groups: soft errors and hard errors.

2.3.1 Hard errors

Hard errors refer to errors that always produce the same result regardless of the intended behavior. One type of hard error is the stuck-at fault, where a circuit element always outputs the same value. With this type of error in an adder circuit, a bit of the input could be stuck at 1, which makes some but not all outputs faulty, see Figure 2.4. Hard errors can affect the memory as well, and similarly to compute elements they behave in a constant manner. This constant characteristic of hard errors makes them not too difficult to spot with unit tests, and they can to an extent be worked around. Given that the accelerator supports the functionality, it is possible to create a fault-map of known hard errors and their behavior; this map can then be leveraged to assign data and computations in such a way that the effects of the hard errors are masked and the behavior appears correct.

Figure 2.4: Normal adder and a faulty adder.
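To make the stuck-at behaviour in Figure 2.4 concrete, the following minimal sketch (our own illustration in Python, not taken from any particular accelerator) emulates an 8-bit adder where one bit of an input operand is stuck at 1. Some sums still come out correct while others do not, which matches the constant but input-dependent behaviour of hard errors described above.

def faulty_add(a: int, b: int, stuck_bit: int = 2) -> int:
    """8-bit adder where bit `stuck_bit` of operand `a` is stuck at 1."""
    a_faulty = a | (1 << stuck_bit)   # the stuck-at-1 fault forces this bit to 1
    return (a_faulty + b) & 0xFF      # wrap around like a real 8-bit adder

print(faulty_add(0b00000100, 1))  # 5: correct, bit 2 of the input was already 1
print(faulty_add(0b00000000, 1))  # 5: wrong, the fault-free result would be 1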
2.3.2 Soft errors

Soft errors refer to errors that only produce the wrong result some of the time. An example of this is timing errors, where the circuit has degraded and no longer works at the speed it was designed to operate at. As a result, when the next clock cycle starts it is possible that a circuit has not reached its steady state and an erroneous value is propagated. This uncertainty makes soft errors more difficult to spot since, unlike hard errors, they do not have the same constant characteristic; this means that unit tests may not be enough to detect them and an online error detection system may have to be used instead. To counter soft errors it is possible to do the same as with hard errors and create a fault-map, but since soft errors do not always produce wrong results, introducing a fault-map can be excessive. Another common countermeasure is to simply lower the operating frequency in order to relax the timing requirements on the circuit.

2.3.3 Error mitigation

In modern computing there are a variety of methods used to mitigate these errors, such as workload-aware redundancy, which focuses on assigning redundant computational resources based on the criticality and sensitivity of workloads. Instead of applying uniform redundancy across all tasks, systems can selectively replicate or verify computations for high-priority or error-sensitive workloads while relaxing redundancy for tolerant ones, such as AI inference tasks. This targeted approach minimizes the performance and energy costs associated with redundancy while maintaining system reliability where it matters most [16].

Another method is fleet-wide error correlation, which leverages large-scale monitoring and data aggregation across large numbers of accelerators to detect patterns of degradation and failure. By correlating error occurrences across similar hardware units, systems can predict which components are likely to experience hard errors before they manifest critically. This enables proactive mitigation, such as reassigning workloads, adjusting operating parameters, or scheduling maintenance, before the error occurs [16].

Finally, task distribution serves as a complementary strategy by intelligently routing workloads away from unreliable or degraded hardware. Rather than immediately decommissioning aging accelerators, systems can assign them less critical or error-tolerant tasks, maximizing resource utilization while minimizing the risk of computation failure. Combined, these strategies can create a resilient computing system in which the impact of hard errors is minimized and hardware lifespans are extended [17] [18].

3 Method

3.1 Implementing the carbon footprint model

For this project we implemented the model as a Python script that estimates the carbon footprint over 20 years and compares switching the chip after 3 years against switching after a certain loss of application accuracy. Time is counted discretely, year by year: each year the model checks how long the chip has been in use and, depending on the predicted reduction in accuracy and the current replacement policy, may replace the chip if accuracy has dropped below acceptable levels. At each timestep the operational footprint is added to the carbon footprint, and if the chip is replaced a unit of the embodied carbon footprint is added as well; the model also takes into account the increasing embodied footprint of such chips over time.

In order to ascertain the loss of application accuracy over time it is necessary to know how a given number of errors affects the application and its accuracy. In this project we decided to assume the number of errors affecting the application, rather than figuring out the rate at which errors accumulate, and focus on a method for determining what effect a certain number of errors has on application accuracy; how this was implemented is explained in the sections below.
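As a rough illustration of the yearly accounting just described, the sketch below implements a simplified version of the model with hypothetical parameter values and the example error-to-accuracy mapping from section 2.2.4. It is a minimal sketch for orientation only, not the project's actual script.

def simulate_footprint(years=20, alpha=0.5, lifetime=3, embodied_growth=0.10,
                       time_to_errors=3, errors_per_year=10,
                       error_to_accuracy=lambda e: e / 100, tolerance=0.20):
    """Accumulate yearly footprint under an accuracy-based replacement policy.

    Units are normalised: the embodied footprint of a chip bought in year 0 is
    `alpha`, and the operational footprint per year is (1 - alpha) / lifetime.
    """
    op_per_year = (1 - alpha) / lifetime
    total, age = 0.0, 0
    for year in range(years):
        embodied_cost = alpha * (1 + embodied_growth) ** year  # chips get costlier to make
        if age == 0:
            total += embodied_cost        # buying a chip incurs its embodied footprint
        total += op_per_year              # each year of operation adds this share
        age += 1
        errors = max(0, age - time_to_errors) * errors_per_year
        if error_to_accuracy(errors) >= tolerance:  # accuracy loss beyond tolerance
            age = 0                                  # replace the chip next year
    return total

print(simulate_footprint())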
Similar to real errors we can choose to implement the injection at the compute units or in the memory, regardless of where it is prudent to modify a correctly functioning program to add the error injection functionality. To inject errors into the memory an alternative is to pause the execution, modify the variables, and then continue the execution. The functionality required for this is present in debugging software such as GDB [19]. When modifying the memory we can use a fault-map to always modify the same memory locations with either a hard error by modifying it to a consistent value or 15 3. Method if it is a soft error we can do so randomly. On the other hand, to inject errors in the compute units we could modify the code itself so that the calculations are erroneous on purpose, similarly as with memory we can use a fault-map and characterize the errors similarly in order to mimic hard and soft errors. 3.2.2 Error injection in hardware Injecting errors in hardware can be very dependent on which hardware architecture is targeted since it is up to the vendor to provide sufficient APIs or other interfaces to the hardware. An example of this is NVBitFI [20] which is an error injection tool based on NVBit [21] which allows users to modify the GPU’s assembly code. NVBitFI has quite strict requirements for what architectures are compatible along with what compilers can be used. This makes for a rather tricky development environment since applications also have to be compiled and linked with specific versions. And this is not limited to NVBitFI, its successor SASSIFI [22] also has the same strict requirements for architectures and compiler which lead to error injection in hardware being unavailable due to lack of compatible hardware. 3.2.3 Error injection with PyTorch The chosen tool for error injection in this project ended up being the tool PyTorchFI [23], the tool is integrated with the popular deep learning platform PyTorch which makes it easy to use and integrate into existing python scripts. PyTorchFI uses hooks which is a tool provided by PyTorch which allows PyTorchFI to modify both neurons and weights in a lightweight manner, it also allows an easy way to write custom functions which can be leveraged to model different types of errors. The function we implemented only targets a single layer at a time and when doing so injects errors into random neurons each time selected by python’s pseudo random number generator random.py, the injected error each time was a random bitflip where the chosen bit also was selected by random.py. This function most closely models hard errors since errors are always encountered compared to soft errors which may or may not affect a calculation. 16 4 Results 4.1 Experiment setup The chosen dataset for this project was the ILSVRC2012 [6], in particular it was the validation set along with its solution. Due to restrictions in computing power only the first 1000 images were used for error injection with PyTorchFI. All the code along with results is available at https://git.chalmers.se/bjornfo/msc_thesis and a description of project workflow is available below. 4.1.1 AlexNet Firstly, AlexNet was run with altered images where a number of bitflips had been introduced into the images to simulate errors, see workflow in Figure 4.1. 
The scripts modify_images.py and modify_images_parallel.py were used to randomly insert a specified number of bitflips into the images; both use the same method to flip bits, but the parallel version was created to speed up execution for larger numbers of bitflips.

Figure 4.1: AlexNet with errors inserted into the images.

Secondly, AlexNet was also run with an error injection framework using PyTorchFI. This allowed us to attach a custom function to layers, run during inference, which simulated a specified number of bitflips, see workflow in Figure 4.2.

Figure 4.2: AlexNet with errors inserted during inference.

4.1.2 ResNet18

For ResNet18 we used the same PyTorchFI framework as for AlexNet but parallelised it in order to speed up execution, see workflow in Figure 4.3. The improved execution speed was necessary in order to test the effect of an increasing number of errors for every layer of ResNet18.

Figure 4.3: ResNet18 with errors inserted during inference.

4.1.3 Results and graphs

The execution results for each scenario were compared against the solution, and the accuracy for each layer was calculated and then graphed, see workflow in Figure 4.4. The execution results were in the form of the filename followed by five tags corresponding to the five best guesses for what the image contained; the script stats_run.py compared the guesses to the solution, calculated the error rate when allowing the 1-5 best guesses, and output the error rates for graphing. The scripts draw_graph_alexnet.py and draw_graph_resnet.py both compare the degraded error rates to the error rate without any fault injection and then draw a graph using Matplotlib. Then the script carbon_model.py implements the carbon footprint model as described in section 2.2, using the calculated error rates and some user parameters to draw the graphs shown below; as an example, see Figure 4.8.

Figure 4.4: Execution results are interpreted and the statistics get turned into graphs.

4.2 Fault injection in ResNet18's convolutional layers

In Figure 4.5 we see the increase in error rate from injecting errors into a layer; the increase is shown as a percentage, starting from the unaffected error rate. The figure shows the effect for the top-1 and top-5 scenarios, which start out with error rates of 30.24% and 10.92% respectively. The horizontal red line shows where the application has an error rate of 50%, which corresponds to the application having 50% accuracy in classifying the images.

Perhaps unsurprisingly, we can see that increasing the number of injected errors significantly increases the error rate, which corresponds to a decrease in the application's classification accuracy. Only one layer in the top-5 case manages to retain an error rate below 50% at 100 errors.

Figure 4.5: Increase in error rate for ResNet18.

4.2.1 Converting errors to loss in accuracy

In Figure 4.6 we see the curves obtained by averaging the results in Figure 4.5 and then using the numpy.polyfit() function to fit an n-th degree polynomial, where the degree is chosen with regard to the data points from Figure 4.5. The graph is limited to 100 errors, since an error rate above 50% is beyond reasonable for any application modelled in the carbon footprint model. The curve can now be referenced by the carbon footprint model to approximate the loss of accuracy given a certain number of errors.
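To illustrate the fitting step, the sketch below applies numpy.polyfit to made-up data points; the values are placeholders, not the error rates measured in this project.

import numpy as np

# Hypothetical averaged data: number of injected errors vs. measured error rate (%).
errors     = np.array([0, 10, 25, 50, 75, 100])
error_rate = np.array([30.2, 33.0, 38.5, 46.0, 52.0, 57.5])  # made-up values

# Fit a low-degree polynomial; the degree is chosen by eye from the data points.
coeffs = np.polyfit(errors, error_rate, deg=2)
error_to_rate = np.poly1d(coeffs)

# The carbon footprint model can then query the fitted curve, e.g. at 40 errors:
print(f"predicted error rate at 40 injected errors: {error_to_rate(40):.1f}%")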
In this section, figures such as Figure 4.7 will be used to compare and contrast different aspects and implications of the carbon footprint model described in section 2.2.4. The top graph plots two scenarios, replacement after a fixed amount of time and replacement once a certain increase in error rate has been reached; the bars indicate the footprint each year and the lines indicate the total footprint up until that year. The second graph shows the difference between these two scenarios, which indicates whether the altered replacement policy gives a larger or smaller carbon footprint. Unless otherwise specified the parameters are set to α = 50% and fault_tolerance = 20%; also note that the scale of the y-axis is not consistent across graphs.

4.3 Effect of model parameters

In this section we discuss two model parameters, the α value and the increasing footprint over time. Although these are present in the original model as presented in section 2.2.4, here we discuss their interaction in the context of a DNN application.

Figure 4.6: Approximated error to accuracy curve.

4.3.1 Balance between embodied and operational footprint

One of the strengths of the model proposed by Eeckhout [4] is that by varying the α value we can reason about the balance between embodied and operational footprint and what implications this has for a proposed chip design. As a reminder, a low α value means that the operational footprint is more prominent, while a high α value means that the embodied footprint is more prominent.

F_fixed-time = α · A + (1 − α) · P

The low α scenario is presented in Figure 4.8 with α values 10%, 20%, and 30%. This scenario demonstrates an increase in carbon footprint with a lowered α value. The high α scenario is presented in Figure 4.9 with α values 70%, 80%, and 90%. This scenario demonstrates a decrease in carbon footprint with a higher α value. In Figure 4.10 we see a wide range of different α values and a strong correlation between higher α values and a lowered carbon footprint. Since the proposed alternative is one with a different replacement policy, it is reasonable that the more prominent the embodied footprint is, the more impact we can have with a lowered rate of replacement.

Figure 4.7: Carbon model simulation example.

Figure 4.8: Low α model.

4.3.2 Increased operational footprint with faulty hardware

The model assumes that the embodied footprint increases over time but that the operational footprint remains constant; here we explore the effects of an increased operational footprint when using hardware with faults. The idea is that the application needs to be rerun to get the correct result, or that the faulty hardware consumes more resources to operate, such as increased power or increased cooling. In the model this is implemented such that when a chip reaches its end of life and starts accumulating errors, it also increases its operational footprint; in Figure 4.11 we see scenarios with 100%, 300%, and 500%. The 100% scenario has the same parameters as the example in Figure 4.7 and can be seen as the baseline in this case. For the 300% scenario we can see that for some years, such as year 4 and year 5, the altered policy has a higher footprint due to the increased operational footprint, but the accumulated footprint is still never lower than for the fixed replacement policy. With the 500% operational footprint, however, there is quite some time where the altered replacement policy performs worse than the fixed replacement, and even if it performs better in some years and goes into the positive in years when the fixed policy replaces the chip, the majority of the time is spent in the negative.
On the other hand, it seems that the increased operational footprint is being offset by the increasing embodied footprint, and that by the end of the simulation it has almost caught up. The reason that the increased operational footprint gets amortized over time is likely that the embodied footprint rises by 10% each year while the operational footprint remains the same. If the operational footprint were to rise each year similarly to the embodied footprint, it is likely that even the 300% scenario could put the altered replacement policy in the negative. It would create a scenario where the embodied footprint of acquiring a new chip would have to be higher than both the increased operational footprint due to faulty hardware and the accumulated increase in embodied footprint that would come from acquiring a new chip a few years later.

4.4 Replacement parameters

In this project we have proposed a simple policy for chip replacement, and in this section we discuss two aspects of how that policy is implemented.

Figure 4.9: High α model.

4.4.1 Different levels of fault tolerance

Depending on the use case of the DNN, a different fault tolerance may be accepted, and in Figure 4.12 we explore three scenarios: 15%, 10%, and 5% fault tolerance. In this case the 15% and 10% scenarios end up being identical, likely because the timesteps are large enough that both scenarios have the same outcome; a smaller timestep would likely give different results. For the 5% scenario, however, we see a difference in that the difference in footprint is lower at its peak, which is reasonable since the chip is being replaced more often.

An interesting thing to note, however, is that in this simulation all scenarios end with a sharp decrease in the carbon footprint difference. This is because all scenarios end up incurring the embodied footprint in the final year, but the implication is interesting. By using the chip for longer we also acquire a new chip later, that is, a chip whose embodied footprint has risen by 10% each year. If the increase in embodied footprint were not linear and instead varied from year to year, it could even be that by waiting we are unlucky and end up with an even higher embodied footprint. However, the suggested replacement policy allows for the alternative of acquiring a new chip at the 3-year mark if the trend is favorable, and then keeping the old chip in use until it hits the tolerance level before doing the actual chip replacement; on the other hand, if the trend at 3 years is unfavorable, we can wait until the tolerance level is hit to acquire the new chip. In other words, the proposed replacement policy allows for some flexibility to adapt to the embodied footprint trend.

Figure 4.10: Varied α model.

Also, for the 5% scenario the difference is negative during two years, year 4 and year 8, and the simulation ends very close to 0; an implication of this is that the savings from not acquiring a new chip can be negated by the increase in embodied footprint over time.

4.5 Summary

In summary, it is possible to lower the footprint somewhat by using an altered replacement policy where some reduction in model accuracy is tolerated. However, in the scenarios where faulty hardware has a higher operational footprint, the fault-tolerant policy only yields a lower footprint in the long run.
Additionally, when the fault tolerance is varied we see an effect where, by using hardware only a little longer before replacement, the trend of increasing embodied footprint may cause the fault-tolerant policy to incur a higher footprint upon replacement due to the increased embodied footprint.

Figure 4.11: Increased operational footprint model.

Figure 4.12: Varied fault tolerance level.

5 Conclusions

5.1 Accelerators towards end of life

In conclusion, while accelerators tend to show an increased rate of errors as they approach the end of their lifecycle, they can still offer a positive impact when carefully managed. By balancing the replacement rate with performance requirements, organizations can extend the useful life of these accelerators while maintaining acceptable output. This strategy allows for a gradual replacement approach, optimizing resources and cost while achieving the desired computational performance. Proper planning and maintenance can thus ensure that aging accelerators continue to contribute effectively, minimizing the environmental and financial impacts associated with frequent replacements.

Certain AI applications, particularly those involving tasks like image recognition, can tolerate moderate levels of computational errors without significant degradation in performance. Many machine learning algorithms are designed to be resilient to minor inaccuracies due to their probabilistic nature. That is, errors in individual computations may introduce noise but often don't prevent the model from producing useful results. This tolerance allows older hardware, even with a higher error rate, to remain viable in such applications.

However, as error rates increase beyond a certain threshold, they begin to disrupt the consistency and reliability of model outputs, ultimately deteriorating performance to a point where the hardware is no longer useful. High error rates can lead to incorrect predictions, decreased accuracy, and misclassification in AI tasks, affecting the overall quality and trustworthiness of the application. For this reason, even with error-tolerant AI workloads, it is essential to monitor error rates and establish replacement policies that keep performance within acceptable limits while maximizing the hardware's lifecycle.

Allowing a controlled level of error in AI applications can yield substantial long-term advantages by fostering robustness and adaptability in model development. This approach also offers sustainability benefits, as it reduces the frequency of hardware replacements, allowing resources to be redirected toward refining and optimizing algorithms rather than continuous hardware upgrades. In the long term, this can cultivate a more sustainable and efficient technology ecosystem, where AI models become progressively more resilient and adaptable to real-world variability and hardware inconsistencies. By planning for and accepting manageable errors, the industry can enhance the durability and robustness of AI, ensuring its capability to evolve alongside and capitalize on hardware improvements over time.

5.2 Limitations encountered

The biggest limitation encountered was the available tooling for error injection. All of the explored hardware injectors were for GPU architectures or manufacturers other than what was available, or, in the case that an injector was available, it was for significantly older software.
One way to solve this would have been to search for an available GPU compatible with current hardware injectors on a larger scale than this project did, which put little energy into finding new hardware beyond immediately available institutional resources and personal hardware. Another solution would have been to set up a dedicated lab machine with an older software suite, alongside a rewrite of some project code to be compatible. Another limitation was the small scope of models, datasets, and error characteristics considered, which limited the project's ability to explore more alternatives. For example, the ability to vary the predictive model could provide more insights into the inherent reliability of DNNs, and varying the datasets would have given better results, as the guidelines and principles for assembling datasets have evolved since ILSVRC2012 [6] was created. And finally, due to the difficulty of implementation, very few types of errors were implemented, which restricted the insights gained into the error resilience of DNNs.

5.3 Further work

5.3.1 Long-term studies

Long-term studies on aging accelerators offer valuable insights into the lifecycle and reliability of these critical computing components, with the potential to influence both hardware design and operational strategies. By analyzing how accelerators perform as they approach the end of their lifecycle, we can uncover trends in error rates, energy efficiency, and computational output. These insights are essential for developing predictive models that help identify when performance degradation reaches a point where replacement or intervention is necessary.

During the background reading for this paper, we found few sources describing efforts to keep a significant number of accelerators running for a long time. The benefit of such an effort would be to provide more data regarding the lifetime of accelerators, where accelerators are most likely to break first, the effect of different loads on accelerator lifetime, and the effect of degraded accelerators on tasks. These are questions that are difficult to answer with current tools, as they either require non-existent data or sophisticated techniques and access to hardware schematics.

5.3.2 Better methods for error mitigation

One aspect not explored in this paper is the potential effect of error mitigation techniques on AI accelerators. Since AI applications have some resilience towards errors, it would be interesting to know whether error mitigation techniques have any impact at all, whether they extend the existing resilience of AI applications, or whether they provide little effect.

5.3.3 Better understanding of user attitude towards faulty hardware

It is generally understood that potential errors in any application are unwanted and, if possible, should be removed. However, with AI applications it is necessary to accept some risk that the result is incorrect; therefore it would be interesting to know whether users would be willing to accept a reduction in correctness in exchange for improvements elsewhere.

Bibliography

[1] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, "Lincoln AI computing survey (LAICS) update," in 2023 IEEE High Performance Extreme Computing Conference (HPEC), 2023, pp. 1–7. doi: 10.1109/HPEC58863.2023.10363568.

[2] K. Kirkpatrick, "The carbon footprint of artificial intelligence," Commun. ACM, vol. 66, no. 8, pp. 17–19, Jul. 2023, ISSN: 0001-0782. doi: 10.1145/3603746. [Online]. Available: https://doi.org/10.1145/3603746.
[3] C. Freitag, M. Berners-Lee, K. Widdicks, B. Knowles, G. S. Blair, and A. Friday, “The real climate and transformative impact of ICT: A critique of estimates, trends, and regulations,” Patterns (New York, N.Y.), vol. 2, no. 9, p. 100340, Sep. 2021, issn: 2666-3899. doi: 10.1016/j.patter.2021.100340. [Online]. Available: https://europepmc.org/articles/PMC8441580.

[4] L. Eeckhout, “A first-order model to assess computer architecture sustainability,” IEEE Computer Architecture Letters, vol. 21, no. 2, pp. 137–140, 2022. doi: 10.1109/LCA.2022.3217366.

[5] C. Torres-Huitzil and B. Girau, “Fault and error tolerance in neural networks: A review,” IEEE Access, vol. 5, pp. 17322–17341, 2017. doi: 10.1109/ACCESS.2017.2742698.

[6] O. Russakovsky, J. Deng, H. Su, et al., “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015. doi: 10.1007/s11263-015-0816-y.

[7] M. Ruberti, “The chip manufacturing industry: Environmental impacts and eco-efficiency analysis,” Science of The Total Environment, vol. 858, p. 159873, 2023, issn: 0048-9697. doi: 10.1016/j.scitotenv.2022.159873. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S004896972206973X.

[8] K. Liu, Q. Tan, J. Yu, and M. Wang, “A global perspective on e-waste recycling,” Circular Economy, vol. 2, no. 1, p. 100028, 2023, issn: 2773-1677. doi: 10.1016/j.cec.2023.100028. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2773167723000055.

[9] I. Schneider, H. Xu, S. Benecke, et al., Life-cycle emissions of AI hardware: A cradle-to-grave approach and generational trends, 2025. arXiv: 2502.01671 [cs.AR]. [Online]. Available: https://arxiv.org/abs/2502.01671.

[10] J. Malmodin, N. Lövehagen, P. Bergmark, and D. Lundén, “ICT sector electricity consumption and greenhouse gas emissions 2020 outcome,” Telecommunications Policy, vol. 48, no. 3, p. 102701, 2024, issn: 0308-5961. doi: 10.1016/j.telpol.2023.102701. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0308596123002124.

[11] C.-J. Wu, R. Raghavendra, U. Gupta, et al., “Sustainable AI: Environmental implications, challenges and opportunities,” in Proceedings of Machine Learning and Systems, D. Marculescu, Y. Chi, and C. Wu, Eds., vol. 4, 2022, pp. 795–813. [Online]. Available: https://proceedings.mlsys.org/paper_files/paper/2022/file/462211f67c7d858f663355eff93b745e-Paper.pdf.
[12] D. de Vries, “Investigation of gross die per wafer formulas,” IEEE Transactions on Semiconductor Manufacturing, vol. 18, no. 1, pp. 136–139, 2005. doi: 10.1109/TSM.2004.836656.

[13] B. Alcott, “Jevons’ paradox,” Ecological Economics, vol. 54, no. 1, pp. 9–21, 2005, issn: 0921-8009. doi: 10.1016/j.ecolecon.2005.03.020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0921800905001084.

[14] M. Garcia Bardon, P. Wuytens, L.-Å. Ragnarsson, et al., “DTCO including sustainability: Power-performance-area-cost-environmental score (PPACE) analysis for logic technologies,” in 2020 IEEE International Electron Devices Meeting (IEDM), 2020, pp. 41.4.1–41.4.4. doi: 10.1109/IEDM13553.2020.9372004.

[15] G. Ostrouchov, D. Maxwell, R. A. Ashraf, C. Engelmann, M. Shankar, and J. H. Rogers, “GPU lifetimes on Titan supercomputer: Survival analysis and reliability,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020, pp. 1–14. doi: 10.1109/SC41405.2020.00045.

[16] S. Gupta, “Reducing hardware-related interruptions in AI clusters: Strategies for resilient GPU infrastructure,” Journal of International Crisis and Risk Communication Research, vol. 8, pp. 44–53, 2025. [Online]. Available: http://proxy.lib.chalmers.se/login?url=https://www.proquest.com/scholarly-journals/reducing-hardware-related-interruptions-ai/docview/3257197972/se-2.

[17] R. Boëzennec, F. Dufossé, G. Pallez, and A. Tremodeux, “Improving Supercomputer Usage with Aging Awareness,” in Sustainable Supercomputing (Workshop of SC25), St. Louis, Missouri, United States, Nov. 2025. [Online]. Available: https://hal.science/hal-05109521.

[18] T. B. Hewage, S. Ilager, M. R. Read, and R. Buyya, “Aging-aware CPU core management for embodied carbon amortization in cloud LLM inference,” in Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems, ser. E-Energy ’25, New York, NY, USA: Association for Computing Machinery, 2025, pp. 43–55, isbn: 9798400711251. doi: 10.1145/3679240.3734608. [Online]. Available: https://doi.org/10.1145/3679240.3734608.

[19] GDB: The GNU Project Debugger, https://sourceware.org/gdb/, [Accessed 15-04-2024].

[20] T. Tsai, S. K. S. Hari, M. Sullivan, O. Villa, and S. W. Keckler, “NVBitFI: Dynamic fault injection for GPUs,” in 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2021, pp. 284–291. doi: 10.1109/DSN48987.2021.00041.
[21] O. Villa, M. Stephenson, D. Nellans, and S. W. Keckler, “NVBit: A dynamic binary instrumentation framework for NVIDIA GPUs,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52, Columbus, OH, USA: Association for Computing Machinery, 2019, pp. 372–383, isbn: 9781450369381. doi: 10.1145/3352460.3358307. [Online]. Available: https://doi.org/10.1145/3352460.3358307.

[22] S. K. S. Hari, T. Tsai, M. Stephenson, S. W. Keckler, and J. Emer, “SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation,” in 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2017, pp. 249–258. doi: 10.1109/ISPASS.2017.7975296.

[23] A. Mahmoud, N. Aggarwal, A. Nobbe, et al., “PyTorchFI: A runtime perturbation tool for DNNs,” in 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), 2020, pp. 25–31.

A Algorithms used

1 for i in flip_list:
2     bim[i] = (bim[i] + random.randrange(256)) % 256

Listing A.1: Algorithm for flipping bits in an image represented as a byte array

The image to be corrupted is first converted to RGB and then turned into a byte array bim. A list flip_list is then generated containing the indices in the byte array at which a flip should occur; the list is sorted so that the byte array is accessed in order. Finally, the byte at index i is perturbed with the formula on row 2, which adds a random offset modulo 256 and thereby simulates one or more flipped bits in that byte.

from struct import pack, unpack
import random

# Pack the selected output value as a 32-bit float, view it as four bytes,
# flip one randomly chosen bit, and unpack the result back into a float.
fs = pack('f', output[0, x, y, z].item())
bval = list(unpack('BBBB', fs))
q, r = divmod(random.randrange(32), 8)
bval[q] ^= 1 << r
fs = pack('BBBB', *bval)
fnew = unpack('f', fs)
a = fnew[0]

Listing A.2: Algorithm for flipping a bit in a Python 32-bit float, adapted from https://stackoverflow.com/a/34679225
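The following sketch shows one possible way to package the two listings above as self-contained helper functions. It is an illustrative sketch only, not the code used in this thesis: the use of Pillow (PIL) for image loading, the function names corrupt_image and flip_random_float_bit, and the parameter n_flips are assumptions introduced here for clarity.

import random
from struct import pack, unpack
from PIL import Image  # assumption: Pillow is used to load and convert images

def corrupt_image(path, n_flips):
    # Convert the image to RGB and obtain its raw bytes (the array bim),
    # then corrupt bytes at n_flips sorted random indices as in Listing A.1.
    bim = bytearray(Image.open(path).convert('RGB').tobytes())
    flip_list = sorted(random.sample(range(len(bim)), n_flips))
    for i in flip_list:
        bim[i] = (bim[i] + random.randrange(256)) % 256
    return bytes(bim)

def flip_random_float_bit(value):
    # Reinterpret a Python float as a 32-bit IEEE 754 value, flip one
    # randomly chosen bit, and return the resulting float (Listing A.2).
    bval = list(unpack('BBBB', pack('f', value)))
    q, r = divmod(random.randrange(32), 8)
    bval[q] ^= 1 << r
    return unpack('f', pack('BBBB', *bval))[0]

Wrapped this way, the bit-flip helper can be applied to any scalar taken from a layer output, for example the value output[0, x, y, z].item() perturbed in Listing A.2, making it straightforward to repeat the perturbation for different positions or layers.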