Chalmers Open Digital Repository

Welcome to Chalmers' open digital repository!

Here you will find:

  • Student theses published at the university, including bachelor's projects as well as degree projects at undergraduate and master's level
  • Digital special collections, such as the Chalmers modellkammare
  • Selected project reports
 

Units in Chalmers ODR

Select a unit to see all collections.

Recently added

Reproducible Performance Variability Mitigation of OpenMP and SYCL Applications
(2025) Persson, Christoffer; Prétot, Mathias
Performance variability caused by unpredictable system noise remains a persistent challenge in high-performance and parallel computing. This thesis presents a methodology for characterising such variability through reproducible noise injection, using three representative benchmarks implemented with OpenMP and SYCL. A custom noise injector was developed to capture real system traces, isolate average and outlier behaviours, and reinject the delta as controlled, reproducible noise. We evaluate and compare multiple mitigation strategies, such as thread pinning, use of housekeeping cores, and simultaneous multithreading (SMT) toggling, under both default and noise-injected conditions. Our experimental study spans three benchmarks (N-body, Babelstream, and MiniFE) executed on local Intel and AMD desktop processors, enabling a comprehensive analysis of mitigation effectiveness across platforms and workloads. Results indicate that while OpenMP consistently delivers higher raw performance, SYCL tends to be more resilient to noisy environments. The proposed noise injection framework facilitates more rigorous and repeatable assessment of parallel program behaviour under controlled perturbations. Although the effectiveness of mitigation strategies varies with workload characteristics, system configuration, and noise intensity, certain techniques, such as isolating housekeeping cores, show clear benefits, particularly in high-noise scenarios.
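The core of the abstract's approach is to diff a noisy trace against a quiet baseline and replay the difference as controlled delays. The sketch below illustrates that delta-and-reinject idea in Python; the trace format, function names, and the use of the median as the baseline are illustrative assumptions, not the thesis's actual implementation.

```python
# Illustrative sketch of delta-based noise reinjection (assumed design,
# not the thesis's noise injector).
import statistics
import time

def compute_noise_deltas(noisy_trace, baseline):
    """Per-iteration delay (seconds) of a noisy run over the quiet baseline."""
    return [max(0.0, t - baseline) for t in noisy_trace]

def reinject(deltas, work):
    """Re-run `work`, sleeping each recorded delta to reproduce the noise."""
    for d in deltas:
        work()
        time.sleep(d)

# Example: a captured trace with one outlier iteration.
trace = [0.010, 0.011, 0.050, 0.010]
base = statistics.median(trace)            # quiet baseline estimate
deltas = compute_noise_deltas(trace, base) # only the outlier carries a delay
```

Separating capture (`compute_noise_deltas`) from replay (`reinject`) is what makes the perturbation reproducible: the same delta list can be replayed under every mitigation strategy being compared.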
ILAN: The Interference- and Locality-Aware NUMA Scheduler
(2025) Carlsson, Axel; Mellberg, Edvin
Non-Uniform Memory Access (NUMA) systems are increasingly common as the go-to processor architecture for parallel computing within the field of High-Performance Computing (HPC). Similarly, OpenMP is the de facto standard runtime for enabling parallelism. However, the default OpenMP runtime does not account for interference or data locality, leading to performance degradation on NUMA systems, where these effects are magnified. To address these challenges, this thesis proposes ILAN, an interference- and data-locality-aware NUMA scheduler integrated into the LLVM OpenMP runtime, specifically targeting the taskloop construct. ILAN utilizes hardware topology information to enable a more structured task distribution strategy than the default OpenMP tasking scheduler, the work-stealing scheduler, yielding improved data locality. Furthermore, the ILAN scheduler utilizes moldability to incorporate interference awareness, dynamically reducing the number of OpenMP threads to mitigate the effects of interference while further improving data locality. Performance evaluation using the NAS Parallel Benchmarks, Matrix Multiplication, and LULESH on a multi-socket NUMA platform demonstrates an average speedup of 10%, with a maximum speedup of 46%, compared to the default OpenMP work-stealing scheduler.
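The locality idea behind a topology-aware taskloop scheduler can be sketched in a few lines: instead of letting idle threads steal arbitrary tasks, split the iteration space into contiguous blocks, one per NUMA node, so each node mostly touches its own memory. This is a minimal illustration of that partitioning, not ILAN's actual code.

```python
# Assumed sketch of structured, per-NUMA-node iteration partitioning
# (contrast with work stealing, where any thread may grab any task).
def block_partition(n_iters, n_nodes):
    """Contiguous iteration ranges [start, end), one per NUMA node."""
    base, rem = divmod(n_iters, n_nodes)
    ranges, start = [], 0
    for node in range(n_nodes):
        size = base + (1 if node < rem else 0)  # spread the remainder
        ranges.append((start, start + size))
        start += size
    return ranges

chunks = block_partition(1000, 3)  # e.g. a taskloop over a 3-node machine
```

Contiguous blocks keep each node's tasks on pages its own threads allocated (first-touch), which is the locality benefit the abstract attributes to the structured distribution strategy.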
Large Scale Efficient Data Readout for Vehicle Fleets
(2025) Johnsson, Simon
As vehicles become more technologically advanced, the data generated by a single vehicle reaches significant volumes. Diverse data has high potential in use cases such as machine learning, where it can provide insights into varied operating conditions. Currently, there is no clear solution for collecting this data, as vehicle systems are constrained in terms of compute, memory, storage, and bandwidth. This thesis investigates the problem of large-scale vehicle data readout and presents a solution that significantly increases readout capacity by leveraging lossless, streaming-based compression at low cost. Furthermore, it addresses the architecture needed to process the data at global scale and how best to integrate this efficiently with a massive number of vehicle systems. Lastly, a generalized model is formulated at the micro scale, establishing the compute and memory requirements on a single vehicle system based on the presented findings. At the macro scale, the infrastructure required to support the solution is discussed.
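The key constraint in the abstract is that a vehicle cannot buffer a full log before sending it, which is exactly what streaming compression avoids: chunks are compressed as they are produced, with bounded memory. The sketch below uses Python's `zlib` as a stand-in codec; the chunk format and codec choice are illustrative assumptions, not the thesis's design.

```python
# Hedged sketch of streaming, lossless compression of telemetry chunks
# (zlib stands in for whatever codec the thesis actually evaluates).
import zlib

def stream_compress(chunks, level=6):
    """Compress an iterable of byte chunks without buffering them all."""
    co = zlib.compressobj(level)
    out = b"".join(co.compress(c) for c in chunks)
    return out + co.flush()

# Repetitive telemetry compresses very well losslessly.
signal = [b"speed=50;" * 100, b"speed=51;" * 100]
packed = stream_compress(signal)
```

Because the compressor holds only its internal window, peak memory on the vehicle stays constant regardless of how long the recording runs, which matches the compute/memory constraints the abstract's micro-scale model is about.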
Preserving Semantics of Multi-Threaded Programs During Cross-ISA Dynamic Binary Translation
(2025) Jonsson, Martin; Vålvik, Valdemar
Dynamic Binary Translation (DBT) is a method used to emulate programs on platforms on which they cannot execute natively. In the past, DBTs either did not emulate multi-core programs or did not parallelize their execution. This is no longer the case, as modern processors are often multi-core, necessitating better scaling in DBTs. Renode [1] is one such DBT that is able to emulate multi-core programs using parallel execution. However, Renode — like many other DBTs — fails to correctly emulate the semantics of certain atomic instructions. In particular, emulation of the RISC-V instructions Load-Reserved (LR) and Store-Conditional (SC) is currently incorrect. These semantics are paramount for program correctness. In this thesis, we improve Renode's correctness by applying the Hash table-based Store-Test (HST) — a scheme proposed by Zhao et al. [2] — to correctly emulate LR/SC instructions. Using model checking, we find that implementing HST as described by Zhao et al. in Renode results in a race condition, and we show how to remediate it. Furthermore, we compare the performance of two HST implementations: one written directly in an intermediate representation (IR) similar to assembly, the other written in C using helper functions. Previous work suggests that IR is faster due to lower runtime overhead, which we show holds in this case: the IR implementation is 34% faster than the helper-based one in microbenchmarks and 6–18% faster in the PARSEC [3] benchmark suite. Our IR implementation of HST in Renode improves both correctness and scalability. We show that our implementation can boot Linux on an embedded platform with multi-core emulation enabled, which Renode in its current state cannot do due to correctness issues. Moreover, our implementation scales well where current Renode does not: in an 8-thread microbenchmark of LR/SC, our implementation is 15.6× faster than current Renode. We find that this scalability can be achieved with as little as 8 KiB of extra memory usage.
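The mechanism named in the abstract can be illustrated with a toy model: an LR records a reservation in a small table indexed by a hash of the address, and every store probes that table ("store-test"), invalidating a matching reservation so that a later SC to the address fails. This is a single-threaded conceptual model, not Renode's implementation or the exact scheme of Zhao et al.

```python
# Toy model of a hash-table-based store-test for LR/SC emulation
# (illustrative only; slot count and hashing are arbitrary choices).
TABLE_SIZE = 8
reservations = [None] * TABLE_SIZE      # slot -> currently reserved address

def lr(addr):
    """Load-Reserved: record a reservation for addr."""
    reservations[hash(addr) % TABLE_SIZE] = addr

def store(addr):
    """Plain store: probe the table and kill a matching reservation."""
    slot = hash(addr) % TABLE_SIZE
    if reservations[slot] == addr:
        reservations[slot] = None

def sc(addr):
    """Store-Conditional: succeeds only if the reservation survived."""
    slot = hash(addr) % TABLE_SIZE
    ok = reservations[slot] == addr
    reservations[slot] = None           # SC always clears the slot
    return ok
```

The small fixed table is also why the memory cost is bounded; the abstract's 8 KiB figure corresponds to whatever table size the real implementation chooses.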
Optimization of Deep Neural Networks for Efficient Resource Utilization
(2025) Sanjay, Namratha
Deep neural networks (DNNs) are widely used in computer vision tasks such as image classification and semantic segmentation, but their high computational and memory demands limit deployment on resource-constrained edge devices. This thesis explores quantization as a model compression technique to improve inference efficiency while minimizing accuracy loss. Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) were applied to MobileNetV2 and ResNet50 for classification on the Mini-ImageNet dataset, and to FCN-ResNet18 for segmentation on the Cityscapes dataset. Additionally, mixed-precision QAT was investigated using first-order gradient-based sensitivity analysis to assign per-layer bit-widths. Maintaining activation precision at or above 6 bits during mixed-precision QAT enabled substantial compression (up to 7.8×) while keeping accuracy degradation under 1%. ResNet50 and MobileNetV2 attained compression ratios of 6.3× and 5.2×, respectively. FCN-ResNet18 preserved 57.3% mIoU with 7.8× compression and under 1% accuracy drop compared to the FP32 baseline. Conversely, reducing activation precision to 4 bits led to notable performance degradation, especially in lightweight models and segmentation tasks. Experiments were conducted on an NVIDIA Tesla T4 GPU. The results demonstrate strong potential for deploying quantized DNNs on integer-based hardware such as mobile devices, embedded systems, and FPGAs.
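The basic operation behind both PTQ and QAT is mapping floating-point weights onto a small signed-integer grid and back. The minimal sketch below shows symmetric uniform quantization of one weight vector; it is an illustration of the general technique, not the thesis's code, and the per-layer bit-width would in practice come from the sensitivity analysis the abstract describes.

```python
# Minimal sketch of symmetric uniform quantization (illustrative only).
def quantize(weights, bits=8):
    """Map floats to signed `bits`-wide integers; return ints and the scale."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights from integers and the scale."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.02]
q, s = quantize(w, bits=8)       # integers in [-127, 127]
w_hat = dequantize(q, s)         # close to w; error bounded by scale / 2
```

Lowering `bits` shrinks storage linearly but coarsens the grid, which is the trade-off behind the abstract's observation that 4-bit activations degrade lightweight models noticeably while 6+ bits keep the loss under 1%.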