Examensarbeten för masterexamen // Master Theses
Browsing Examensarbeten för masterexamen // Master Theses by Program "High-performance computer systems (MPHPC), MSc"
Showing 1 - 20 of 29
- A Social-Aware Federated Real-Time Scheduling Algorithm for Unrelated Multiprocessor Platforms (2022)
  Wilkins, David; Hammargren, Oskar; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Jonsson, Jan; Pathan, Risat
  Real-time systems are commonly found in the modern world, ranging from aerospace control systems to health-care equipment. Real-time systems operate under strict timing constraints, meaning each program (i.e., task) must complete before a given deadline. Thus, a real-time scheduling algorithm needs to schedule each task such that all deadlines are guaranteed to be met. Due to the sophistication of many modern real-time applications, the workload of real-time tasks is ever increasing. This creates a demand for multiprocessor platforms that can distribute the workload among several processors. Furthermore, many multiprocessor platforms are heterogeneous, meaning they include processors of different types that offer different capabilities to different tasks. This allows hardware to be specialized for different types of tasks. An example of such a platform is ARM's big.LITTLE architecture, which combines high-performance processors with power-efficient ones. However, scheduling real-time tasks on multiprocessors is a difficult problem. One approach to this problem is federated scheduling, which divides tasks into two categories: light and heavy. Light tasks can meet their deadline using only one processor, while heavy tasks need more than one processor to meet their deadline. Thus, federated scheduling assigns a cluster of processors to each heavy task. The light tasks are then assigned to the remaining processors. This assignment problem is intractable, since every possible task-to-processor assignment needs to be considered in order to find the optimal solution. The current state-of-the-art in federated scheduling on heterogeneous platforms has a limitation: each task takes its preferred processors, disregarding whether those processors are critical to other tasks. We fill this gap by providing a social-aware processor assignment algorithm, which gives each processor to the task that needs it the most. Our social-aware processor assignment algorithm is empirically evaluated through simulation, and its performance is compared with the current state-of-the-art. The simulations show that our social-aware algorithm performs better in most cases.
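To make the heavy/light split and the "needs it most" idea concrete, here is a minimal Python sketch of a federated-style, contention-aware assignment. The task parameters, the per-processor speed table, and the need score are illustrative assumptions, not the thesis's actual algorithm.

```python
# Hypothetical sketch: processors go to the heavy task that "needs" them most,
# rather than each task grabbing its preferred processors first-come-first-served.

def classify(tasks):
    """Split tasks into heavy (utilization > one processor) and light."""
    heavy = [t for t in tasks if t["utilization"] > 1.0]
    light = [t for t in tasks if t["utilization"] <= 1.0]
    return heavy, light

def social_aware_assign(heavy, processors):
    """Greedy assignment: each processor goes to the heavy task with the
    highest remaining demand, weighted by how fast that processor runs it
    (the 'speed' table models an unrelated/heterogeneous platform)."""
    clusters = {t["name"]: [] for t in heavy}
    remaining = {t["name"]: t["utilization"] for t in heavy}
    leftover = []
    for cpu in processors:
        needy = [t for t in heavy if remaining[t["name"]] > 0]
        if not needy:
            leftover.append(cpu)          # leftover processors serve light tasks
            continue
        best = max(needy, key=lambda t: remaining[t["name"]] * t["speed"][cpu])
        clusters[best["name"]].append(cpu)
        remaining[best["name"]] -= best["speed"][cpu]
    return clusters, leftover

tasks = [
    {"name": "T1", "utilization": 2.4, "speed": {0: 1.0, 1: 0.5, 2: 1.2, 3: 0.8}},
    {"name": "T2", "utilization": 1.6, "speed": {0: 0.6, 1: 1.1, 2: 0.4, 3: 1.0}},
    {"name": "T3", "utilization": 0.3, "speed": {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}},
]
heavy, light = classify(tasks)
print(social_aware_assign(heavy, processors=[0, 1, 2, 3]))
```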
- A Solution for 3D Visualization on Soil Surface Using Stereo Camera (2023)
  Wang, Zilong; Chang, Qi; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Heyn, Hans-Martin; Cabrero-Daniel, Beatriz
  Ground surface monitoring using 3D visualization techniques has gained significant attention in recent years, particularly in the context of autonomous driving. This research presents an implementation of a cost-efficient 3D visualization solution for soil surface analysis, aiming to explore an alternative approach to Light Detection and Ranging (LiDAR) technology. The research investigates the feasibility of utilizing stereo cameras as an affordable option for generating 3D visualizations. A comparative study between various stereo vision 3D reconstruction methods is conducted to evaluate their performance. Different stereo-matching methods are employed to extract depth information from the captured stereo images, including a machine learning method and a traditional semi-global block matching method. The resulting depth maps are then projected into 3D space, enabling the generation of point cloud data for visualization purposes. The visualized 3D representation provides an enhanced understanding of the soil surface conditions and facilitates detailed analysis. To assess the effectiveness of the implemented visualization solution, several metrics are employed to compare the accuracy of the generated visualizations. Time measurement and Root Mean Square Error (RMSE) analysis serve as benchmarks for evaluating the performance and reliability of the proposed 3D visualization approach. The findings of this research demonstrate the potential of stereo cameras as a cost-effective alternative to LiDAR sensors for soil surface analysis. The presented 3D visualization solution may contribute to autonomous construction vehicles by providing an efficient and affordable function for monitoring soil surface conditions.
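The traditional branch of such a pipeline can be sketched with OpenCV's semi-global block matching followed by reprojection to a point cloud. The synthetic image pair and the identity reprojection matrix below are placeholders, not the thesis's data or calibration.

```python
# Sketch: semi-global block matching (SGBM) on a rectified stereo pair,
# then reprojection of the disparity map to a 3D point cloud.
import cv2
import numpy as np

rng = np.random.default_rng(0)
left = rng.integers(0, 255, size=(240, 320), dtype=np.uint8)
right = np.roll(left, -8, axis=1)        # fake rectified pair, ~8 px disparity

matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=64,                   # must be divisible by 16
    blockSize=5,
    P1=8 * 5 * 5,                        # smoothness penalties for small/large
    P2=32 * 5 * 5,                       # disparity changes between neighbors
)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # fixed-point output

# Q normally comes from stereo calibration (cv2.stereoRectify); an identity
# matrix here is only a stand-in to keep the sketch self-contained.
Q = np.eye(4, dtype=np.float32)
points_3d = cv2.reprojectImageTo3D(disparity, Q)   # H x W x 3 point cloud
valid = disparity > disparity.min()
print("valid 3D points:", int(valid.sum()))
```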
- Application of Bump-effects in a Postprocessing Step (2022)
  Tao, Anthony; Gideflod, Marcus; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Assarsson, Ulf; Sintorn, Erik
  With the rise of internet shopping, the importance of computer-generated imagery is rapidly increasing. Visualizations help companies sell their products, especially customizable ones such as bathroom interiors. Such applications must provide images with a degree of realism that accurately represents the real features of the product, but must also be able to produce them within an acceptable time frame. A common way of producing realistic visualizations of products is path tracing, but for interactive applications the technique is overly time-consuming. Still, it can function well as an offline method by pre-rendering images to be used on demand. However, with customizable products this is not a feasible solution, as the permutations would quickly lead to an unmanageable number of images. This thesis focuses on customizing bumps on the surface of walls and applying them as a post-process to avoid the permutation issue. The 3D scene is not available during run-time, meaning information has to be stored beforehand. The proposed method simply accepts a bump map and computes the new shading resulting from the new normals. The method only considers direct lighting where the camera position is static, but includes realistic soft shadows from static geometry. The resulting images' visual quality comes close to that of a ray-traced image, but without further optimizations the method does not meet the required render time for larger scenes. The reasons for this are discussed, and ideas for improvement are suggested.
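The core idea, re-shading from a bump map without access to the 3D scene, can be sketched as follows: derive perturbed normals from heightmap gradients and recompute direct Lambertian shading. The light direction, bump strength, and random heightmap are illustrative; the thesis's precomputed soft shadows are omitted.

```python
# Minimal sketch: bump-map normals from heightmap gradients, then direct
# diffuse re-shading. All parameters below are illustrative assumptions.
import numpy as np

def bump_normals(height, strength=1.0):
    """Tangent-space normals (z-up) from heightmap gradients."""
    dy, dx = np.gradient(height.astype(np.float32))
    n = np.stack(
        [-strength * dx, -strength * dy, np.ones_like(height, dtype=np.float32)],
        axis=-1,
    )
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def shade(normals, light_dir, albedo=0.8):
    """Direct Lambertian shading: albedo * max(0, n . l)."""
    l = np.asarray(light_dir, dtype=np.float32)
    l = l / np.linalg.norm(l)
    return albedo * np.clip(normals @ l, 0.0, None)

rng = np.random.default_rng(0)
height = rng.random((64, 64))             # stand-in for a real bump map
img = shade(bump_normals(height, strength=0.5), light_dir=(0.3, 0.3, 1.0))
print(img.shape, float(img.min()), float(img.max()))
```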
- Comparative Performance and Scalability Analysis of GPU-accelerated Database Operations (2023)
  Andersson, Carl; Nilsson, Jonathan; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Petersen Moura Trancoso, Pedro; Petersen Moura Trancoso, Pedro
  This Master's thesis investigates the performance dynamics of database operations - V-Search, Fuzzy Search, and Join - implemented on both Central Processing Units (CPU) and Graphics Processing Units (GPU). With the ever-increasing demand for efficient data processing, it has become crucial to understand and optimize the use of different hardware platforms for executing diverse database tasks. As such, this research sheds light on the performance of each type of processing unit when running the said operations. The study first details the design and implementation of each database operation on both CPU and GPU, taking into account the different architectural characteristics and processing capabilities of each unit. The specific operations were chosen due to their wide use in the field of data management and their different processing requirements, which allows for a comprehensive performance analysis. Next, a series of benchmark tests is conducted to evaluate the relative performance of the CPU and GPU implementations. Factors such as data size, data type, and transfer time, among others, are taken into account. The results show a detailed comparison of execution times between the two implementations, offering insights into the potential advantages and limitations of each. This work contributes to a better understanding of the trade-offs involved when choosing between CPU and GPU for database operations. We hope that our findings will inform future work on hardware-specific optimization for database systems, leading to more efficient and effective solutions for large-scale data processing tasks.
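To give a flavor of such CPU/GPU comparisons, here is a hypothetical micro-benchmark of one operation (a key join) written once against NumPy (CPU) and CuPy (GPU), which mirrors the NumPy API. The data sizes and the sort-plus-searchsorted join formulation are our assumptions, not the thesis's implementation.

```python
# Illustrative micro-comparison of a key join on CPU (NumPy) and, when
# available, GPU (CuPy). Timing a single run is crude but shows the pattern.
import time
import numpy as np

try:
    import cupy as cp
    backends = {"cpu": np, "gpu": cp}
except ImportError:
    backends = {"cpu": np}

def join(xp, left_keys, right_keys):
    """Return indices of left rows whose key also appears in the right table."""
    right_sorted = xp.sort(right_keys)
    pos = xp.searchsorted(right_sorted, left_keys)
    pos = xp.clip(pos, 0, right_sorted.size - 1)
    return xp.nonzero(right_sorted[pos] == left_keys)[0]

for name, xp in backends.items():
    left = xp.random.randint(0, 1_000_000, size=5_000_000)
    right = xp.random.randint(0, 1_000_000, size=1_000_000)
    t0 = time.perf_counter()
    matches = join(xp, left, right)
    if xp is not np:
        cp.cuda.Stream.null.synchronize()   # include actual GPU execution time
    print(name, int(matches.size), f"{time.perf_counter() - t0:.3f}s")
```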
- Cost Model for Scalable Containerized Relay Game Servers (2023)
  Randow, Sabine; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Petersen Moura Trancoso, Pedro
  Previous research has looked into how to dynamically scale containerized applications with consideration to clients' Quality of Experience (QoE), but there is a lack of knowledge on how relay servers, used for multiplayer games, scale. Through investigating the largest cloud platform providers, three key metrics were identified: network traffic, memory allocation and CPU utilization. These metrics were investigated depending on several parameters: the clients' perceived frequency of messages, the number of clients connected to a server, and the number of clients playing the same game. To do this, a client-simulator program was expanded to work with a pre-existing server developed by Opera Software. The server and client-simulators were used in different environments, both bare-metal machines and containers from Amazon Web Services. Upon analysis it was found that network traffic and memory allocation scale linearly, while CPU utilization can only be interpolated within the range used to train a third-degree regression model. The error of all models was fairly low, at a maximum of 2.658%.
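A minimal sketch of the kind of model described, a third-degree polynomial regression for CPU utilization that is only valid for interpolation within the training range. The synthetic data below is illustrative, not the thesis's measurements.

```python
# Sketch: fit CPU utilization vs. connected clients with a degree-3
# polynomial regression; predictions are trusted only inside the data range.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
clients = rng.integers(1, 200, size=300).reshape(-1, 1).astype(float)
cpu = 2 + 0.1 * clients[:, 0] + 1e-5 * clients[:, 0] ** 3   # assumed shape
cpu += rng.normal(0, 0.5, size=300)                         # measurement noise

model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(clients, cpu)

# Interpolation only: stay within the observed 1..200 client range.
query = np.array([[50.0], [150.0]])
print(model.predict(query))
```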
- Creating a Microbenchmark with wide coverage of Memory-boundedness (2022)
  Côté, Niklas; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Pericas, Miquel; Goel, Bhavishya
  The memory-boundedness of an application is defined as the degree to which the performance of the application depends on the size and performance of memory instead of the CPU. The degree of memory-boundedness of an application determines its speedup when the CPU frequency is increased: an application with no memory-boundedness will exhibit linear speedup with frequency increase, while an application with 100% memory-boundedness will exhibit no speedup at all. Dynamic voltage and frequency scaling (DVFS) is a power-saving technique which aims to save energy by dynamically reducing frequency during the memory-bound phases of the application. The DVFS decision making is based on prediction models which predict the appropriate voltage and frequency for the application phase. To increase the accuracy of prediction models, training data needs to be collected from applications which exhibit varying degrees of compute-bound and memory-bound behavior. A single microbenchmark which can simulate wide variations of memory-boundedness behaviour could reduce the time required to train the prediction models and improve prediction accuracy. This thesis analyzes different measurement methods for memory-boundedness and proposes a new formula based on L3 cache misses, which is believed to be a better fit for the definition of memory-boundedness. The testing of the formulas was done on a benchmark suite from NASA called the NAS Parallel Benchmarks (NPB), where the consistency and values of the formulas were evaluated. The results of measurements with the new L3 cache miss formula were then used to create a microbenchmark which can produce large variations of memory-boundedness.
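The abstract does not reproduce the thesis's exact formula, but one plausible L3-based form can make the idea concrete: estimate the fraction of cycles attributable to main-memory stalls from the L3 miss count and the memory latency. The non-overlapping-miss assumption below is ours, not the thesis's.

```python
# Hypothetical L3-based memory-boundedness metric (0..1). Assumes,
# optimistically, that every L3 miss stalls the core for the full memory
# latency and that misses do not overlap; the thesis's formula may differ.
def memory_boundedness(l3_misses, mem_latency_cycles, total_cycles):
    """Fraction of execution time attributable to main-memory stalls."""
    stall_cycles = l3_misses * mem_latency_cycles
    return min(1.0, stall_cycles / total_cycles)

# Example: 2M L3 misses, ~200 cycles to DRAM, over 1G cycles of execution.
print(memory_boundedness(2_000_000, 200, 1_000_000_000))   # -> 0.4
```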
- Data Prefetcher Based on a Temporal Convolutional Network (2022)
  LARSSON, MATTIAS; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Pericas, Miquel; Petersen Moura Trancoso, Pedro
  Cache memory serves a crucial role in alleviating the difference in speed between the computer's processor and main memory, which has become a growing problem over the years. However, the cache can only hide the whole memory access latency if the requested data is present in it, and only parts of the latency if the data is already on its way. For this reason, the technique called data prefetching has proven to be an effective way of increasing performance. This technique entails predicting which memory addresses will be accessed in the future and bringing the corresponding data to the cache ahead of time. This thesis explores the design of a data prefetcher based on a Temporal Convolutional Network (TCN), focusing on low storage overhead to make its size realistic for hardware implementation. In performance simulation tests performed on 15 memory-intensive benchmarks, the TCN prefetcher achieved an average speedup of 30.5% over a no-prefetching baseline, while adding only 14.4 KB of storage overhead. The results show that the TCN architecture can be a contender for future ML-based prefetchers and that it might work as a good substitute for larger multilayer perceptron (MLP) models. However, the results also suggest that the trade-offs necessary for a practical implementation size of a neural network prefetcher make it challenging to advance the average performance beyond rule-based offset prefetchers.
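A minimal causal TCN block of the kind such a prefetcher builds on can be sketched in PyTorch: dilated 1-D convolutions, left-padded so each prediction sees only past accesses. The layer sizes and the delta-classification head are illustrative assumptions, not the thesis's configuration.

```python
# Minimal causal TCN: dilated Conv1d layers with left-padding, predicting a
# class (e.g. an address-delta bucket) from the access history's last step.
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, ch_in, ch_out, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation    # left-pad to stay causal
        self.conv = nn.Conv1d(ch_in, ch_out, kernel_size, dilation=dilation)

    def forward(self, x):
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

class TinyTCN(nn.Module):
    def __init__(self, features=16, classes=64):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(features, 32, kernel_size=3, dilation=1), nn.ReLU(),
            CausalConv1d(32, 32, kernel_size=3, dilation=2), nn.ReLU(),
            CausalConv1d(32, 32, kernel_size=3, dilation=4), nn.ReLU(),
        )
        self.head = nn.Linear(32, classes)         # hypothetical delta classes

    def forward(self, x):                          # x: (batch, features, time)
        h = self.net(x)
        return self.head(h[:, :, -1])              # predict from last timestep

model = TinyTCN()
history = torch.randn(8, 16, 128)                  # embedded access history
print(model(history).shape)                        # -> torch.Size([8, 64])
```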
- Diffuse Global Illumination using Surfels (2022)
  Ekberg, Hampus; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Assarsson, Ulf; Sintorn, Erik
  Simulating global illumination is an important part of rendering realistic-looking scenes. Diffuse global illumination is a subset of global illumination which focuses on the diffuse reflections of light. Methods for solving diffuse global illumination usually require either pre-computation of the light or a large number of ray casts every frame. This thesis explores an alternative approach inspired by Global Illumination Based on Surfels to see if it is possible to reduce the number of rays cast each frame while maintaining a visual quality similar to previous methods, thus reducing the computation cost. This exploration was accomplished by implementing the alternative approach and comparing both performance and visual quality results to a pre-existing diffuse global illumination solution. The results show that it is possible to limit the number of rays while keeping a similar visual quality, but the implementation as described in the thesis has other computation bottlenecks that in many cases end up overriding the gains from reducing the number of rays.
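The ray-budget idea can be sketched as follows: each surfel caches a diffuse irradiance estimate and refines it with only a few new rays per frame via temporal blending. The surfel count, ray budget, blend factor, and the ray-tracing stub are illustrative assumptions.

```python
# Sketch: per-surfel irradiance cache refined with a small per-frame ray
# budget via an exponential moving average, instead of re-casting everything.
import numpy as np

rng = np.random.default_rng(0)
num_surfels, rays_per_surfel_per_frame = 1024, 4

irradiance = np.zeros(num_surfels)                 # cached per-surfel estimate

def trace_diffuse_rays(n_surfels, n_rays):
    """Stand-in for real ray casts returning incoming radiance samples."""
    return rng.random((n_surfels, n_rays))

for frame in range(60):
    samples = trace_diffuse_rays(num_surfels, rays_per_surfel_per_frame)
    frame_estimate = samples.mean(axis=1)
    alpha = 0.1                                    # temporal blend factor
    irradiance = (1 - alpha) * irradiance + alpha * frame_estimate

print(float(irradiance.mean()))                    # converges toward ~0.5
```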
- Enabling Moldability in OpenMP (2023)
  Sundqvist, Pontus; Sundqvist, Simon; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Pericàs, Miquel; Papadopoulou, Nikela
  OpenMP has long been a ubiquitous technology in High-Performance Computing (HPC), making parallel programs simple to reason about and portable to many different systems. When an OpenMP runtime decides which threads should run tasks, it often uses a simple work-stealing scheduler, as such schedulers evenly distribute tasks among cores. This is the method used by LLVM's OpenMP runtime. But today, HPC systems often consist of multiple sockets, each with many cores and non-uniform memory access (NUMA). This creates a complicated memory hierarchy which is not accounted for by simple work-stealing schedulers. Another feature not supported well by simple work-stealing schedulers is nested parallelism, where each task runs multiple threads in parallel. It is not clear how many threads each task should be allocated, i.e., the width of the task. If the width is too high, there will be over-subscription; if it is too low, there will be load imbalance. This can be solved by supporting moldable tasks, i.e., tasks whose width the scheduler decides. We extend LLVM's OpenMP runtime with support for moldable tasks scheduled using a locality-aware scheduler.
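A toy sketch of the moldability decision: grow each task's width while parallel efficiency stays acceptable under an Amdahl-style speedup model, trading over-subscription against load imbalance. The speedup model and efficiency threshold are assumptions for illustration, not the scheduler's actual policy.

```python
# Toy width selection for a moldable task under an assumed Amdahl model.
def speedup(width, parallel_fraction):
    """Amdahl-style speedup of one task when given 'width' threads."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / width)

def choose_width(free_cores, parallel_fraction, min_efficiency=0.7):
    """Largest width whose parallel efficiency stays above the threshold,
    so cores are not wasted on a task that cannot use them."""
    width = 1
    while width < free_cores:
        eff = speedup(width + 1, parallel_fraction) / (width + 1)
        if eff < min_efficiency:
            break
        width += 1
    return width

for pf in (0.5, 0.9, 0.99):            # how parallel the task's work is
    print(f"parallel fraction {pf}: width {choose_width(16, pf)}")
```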
- Energy-efficient OpenMP Programming (2022)
  Karlsson, Axel; Valter, Henrik; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Karlsson, Johan; Pericàs, Miquel
  OpenMP is the de facto API for parallel programming in HPC applications. These programs are often executed in data centers, where energy consumption is a major issue. Whereas previous work has focused almost entirely on performance, we here analyse aspects of OpenMP from an energy consumption perspective. This analysis is accomplished by executing novel microbenchmarks and common benchmark suites on data center nodes and measuring the energy consumption. Three main aspects are analysed: directive-generated loop tiling and unrolling, parallel for loops versus explicit tasking, and the policy for handling blocked threads. For loop tiling and unrolling, we find that tiling can yield significant energy savings for some, mostly unoptimised, programs, while directive-generated unrolling provides very minor improvements in the best case and severely degrades performance in the worst case. For the second aspect, we find that parallel for loops yield better results than explicit tasking loops in cases where both can be used; this becomes more prominent with more fine-grained workloads. For the third, we find that significant energy savings can be made by not descheduling waiting threads but instead having them spin, at the cost of higher power consumption. We also analyse how the choice of compiler affects the above questions by compiling programs with each of ICC, Clang and GCC, and find that while none of them is strictly better than the others, they can produce very different results for the same programs. As a final step, we combine all of our findings and suggest novel compiler directives as well as general recommendations on how to reduce energy consumption in OpenMP programs.
- Energy-Performance Balancing Task Scheduler for Asymmetric Platforms (2023)
  Andersson, Henrik; Wiede, Carl; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Pericàs, Miquel; Chen, Jing; Goel, Bhavishya
  Sustainability is a growing concern for society, and computer science is no exception. The power consumption of computers may be reduced by lowering the processor frequency as well as by meticulously limiting hardware resource usage. An optimally energy-efficient computation may, however, cause an impractically long execution time. Previous work has successfully provided a framework that minimizes energy consumption using task-based computation. One way to develop the framework concerns efforts to strike a balance between performance and energy efficiency, finding the optimal trade-off between increased execution time and reduced energy cost. An option to utilize such a trade-off could incentivize greater adoption of the aforementioned energy reduction techniques. This thesis presents various efforts to modify an existing energy-efficient task scheduling framework in order to balance energy efficiency and performance. The framework was further generalized and tested on multiple platforms for the sake of affirming its generic applicability. The Simics hardware simulator was assessed in hopes of enabling testing of the framework on a myriad of virtual platforms. The evaluation shows that the modified framework can successfully determine the task scheduling decisions that yield the optimal trade-off between performance and energy efficiency. After some additional modifications, the framework could seamlessly run on other platforms than the one it was designed for. Although the attempts to use the framework within the selected virtual environment were somewhat futile, promising directions for future research were discovered.
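The balancing idea can be sketched as picking, among candidate frequency/core settings, the one minimizing a weighted energy-delay metric. The settings, numbers, and metric below are illustrative assumptions, not the thesis's framework.

```python
# Toy sketch: select a scheduling setting by a weighted energy-delay product.
def energy_delay_score(time_s, energy_j, weight=1.0):
    """E * t^weight: weight 0 optimizes energy only; larger favors speed."""
    return energy_j * time_s ** weight

# (setting, execution time in s, energy in J), e.g. from profiling runs.
candidates = [
    ("4 big cores @ 2.0 GHz", 10.0, 120.0),
    ("4 big cores @ 1.2 GHz", 14.0, 80.0),
    ("4 LITTLE cores @ 1.4 GHz", 22.0, 60.0),
]

for weight in (0.0, 1.0, 2.0):
    best = min(candidates, key=lambda c: energy_delay_score(c[1], c[2], weight))
    print(f"weight={weight}: {best[0]}")
```

With weight 0 the power-efficient cores win; as the weight grows, the metric shifts toward the fast, energy-hungry setting, which is exactly the trade-off knob the abstract describes.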
- Evaluation of Acceleration Structures for Ray Casting in Physics Scenes (2023)
  Guo, Chenxu; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Sintorn, Erik; Sintorn, Erik
  Ray casting is a fundamental feature of game engines and has a wide range of applications in video game development. Because it is a performance-critical task, a significant number of studies have proposed various acceleration structures to improve the efficiency of ray casting. In this study, we collaborated with Massive Entertainment to investigate optimal acceleration structures for ray casting within their in-house game engine, Snowdrop. We implemented several promising acceleration structures and developed a testing framework to evaluate their performance. The acceleration structures we implemented include uniform grids (UG), hierarchical hash grids (HG), dynamic bounding volume hierarchies (DBVH) and linear bounding volume hierarchies (LBVH). To obtain representative results, we tested these algorithms on a set of uniform scenes generated by the Unity3D engine, as well as on irregular scenes exported by the Snowdrop engine. The test items included the build time of each acceleration structure, the update time, and the time taken to perform 1000 ray casts. The results were used as a basis for evaluating the performance of the different acceleration structures. Furthermore, to gain a deeper understanding of the reasons for the differences in performance of these acceleration structures, we also introduced a performance model to analyze the details of the execution of these structures. Finally, we found that HG and DBVH achieved the best balance of query speed and update speed among all the acceleration structures involved in the comparison.
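All the BVH variants above bottom out in the same primitive, the ray/AABB slab test, which a short NumPy sketch can make concrete (the boxes and ray below are random illustrative data):

```python
# Ray/AABB intersection by the slab method, vectorized over many boxes.
import numpy as np

def ray_aabb(origin, direction, box_min, box_max):
    """Boolean mask of boxes hit by the ray (direction components nonzero)."""
    inv_d = 1.0 / direction
    t1 = (box_min - origin) * inv_d
    t2 = (box_max - origin) * inv_d
    t_near = np.minimum(t1, t2).max(axis=1)   # latest entry across the slabs
    t_far = np.maximum(t1, t2).min(axis=1)    # earliest exit across the slabs
    return (t_near <= t_far) & (t_far >= 0.0)

rng = np.random.default_rng(0)
mins = rng.uniform(-10, 9, size=(1000, 3))
maxs = mins + rng.uniform(0.1, 1.0, size=(1000, 3))
hits = ray_aabb(np.zeros(3), np.array([1.0, 0.2, 0.1]), mins, maxs)
print("boxes hit:", int(hits.sum()))
```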
- Exploratory machine learning strategies for predicting thermal conductivity of materials from transient plane source measurement (2023)
  LEE, BITNOORI; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Petersen Moura Trancoso, Pedro; Cornelis Jacobus Bruinsma, Sebastianus
  This study introduces the application of machine learning to the Hot Disk Transient Plane Source (TPS) method, aimed at enhancing the precision and efficiency of thermal conductivity prediction. The research comprises two distinct parts: Part I addresses the prediction of thermal conductivity in low-density/high-insulation materials, and Part II addresses thermal conductivity measurement under high-temperature conditions with noise. Four prediction algorithms were systematically applied and assessed for their accuracy in predicting thermal conductivities. Experimental data obtained through the TPS method served as the basis for the machine learning training data, augmented with simulated data to compensate for insufficient data. The outcomes of this study provide a conclusive response to a critical research question: can machine learning accurately predict thermal conductivity from transient curves? In Part I, machine learning consistently and accurately predicts thermal conductivity for low-density/high-insulation materials devoid of CL values, underscoring its complementary utility. In Part II, machine learning demonstrates its proficiency in accurately predicting thermal conductivity, even from noisy transient curves at extreme temperatures. However, challenges stemming from insufficient data and the absence of reference points introduce variability in accuracy.
- Federated Scheduling of Mixed-Criticality Sporadic DAG Tasks on Uniform Multiprocessors (2022)
  Huang, Chengzi; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Jonsson, Jan; Pathan, Risat
  In designing real-time systems, there is an emerging trend of moving towards mixed-criticality (MC) systems, where functionalities with different degrees of importance (i.e., criticality) are implemented upon a shared platform, while the level of heterogeneity in modern multiprocessor systems is gradually increasing. This thesis develops algorithms to schedule and allocate implicit-deadline sporadic mixed-criticality DAGs upon uniform heterogeneous multiprocessors. A two-level scheduler is designed based on the federated scheduling paradigm. Tasks are categorized into heavy and light tasks according to their utilization. Each heavy task executes exclusively on a number of dedicated processors (a cluster), while light tasks are treated as sequential tasks and share the remaining processors. A work-conserving scheduler is used at the cluster level, and EDF-VD is used to schedule the light tasks. An upper bound on the response time of a heavy task under the work-conserving scheduler and a utilization bound for light tasks under EDF-VD are proposed to verify offline that the design constraints are met. Task allocation upon multiprocessors is known to be NP-hard. This thesis describes an approach to solving the task allocation problem using bin-packing heuristics and simulated annealing, in two stages. In the first stage, the light tasks are assigned to processors using partitioned scheduling: a group of bin-packing heuristics is considered, a metric called QoP is defined to compare the quality of partitioned scheduling under the different heuristics, and the allocation with the best QoP is used. In the second stage, simulated annealing is employed to find a feasible solution by gradually minimizing the total task lateness in the system. Traditional mixed-criticality systems suffer from an abrupt-service problem: service to low-criticality tasks may be abruptly degraded when the system switches to high-criticality mode. The elastic mixed-criticality task model is introduced to address this problem. This thesis also develops a schedulability test and discusses task allocation for elastic mixed-criticality tasks. An empirical evaluation is presented to show the effectiveness of our approach.
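For the light tasks, an EDF-VD style utilization test is used. Below is a sketch of the standard two-condition check for dual-criticality, implicit-deadline task sets on one processor; the thesis's actual bound for uniform multiprocessors may differ.

```python
# Sketch of the classic EDF-VD utilization test (dual criticality).
# LO-mode feasibility fixes the virtual-deadline factor x; the HI-mode
# condition is then checked explicitly.
def edf_vd_schedulable(tasks):
    """tasks: list of (criticality, u_lo, u_hi) with criticality 'LO'/'HI'."""
    u_lo_lo = sum(u_lo for c, u_lo, _ in tasks if c == "LO")
    u_hi_lo = sum(u_lo for c, u_lo, _ in tasks if c == "HI")
    u_hi_hi = sum(u_hi for c, _, u_hi in tasks if c == "HI")
    if u_lo_lo >= 1.0:
        return False
    x = u_hi_lo / (1.0 - u_lo_lo)        # smallest x satisfying the LO mode
    return x * u_lo_lo + u_hi_hi <= 1.0  # HI-mode condition

taskset = [("LO", 0.3, 0.3), ("HI", 0.2, 0.4), ("HI", 0.1, 0.2)]
print(edf_vd_schedulable(taskset))       # -> True for this example
```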
- Hardware Acceleration of Machine Learning (2023)
  Chen, Fangzhou; Sköld, William; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Petersen Moura Trancoso, Pedro; Petersen Moura Trancoso, Pedro
  The Transformer architecture has been widely used in various fields, as demonstrated by GPT-3, a large language model that shows impressive performance. However, achieving such excellent performance requires high computational capabilities. Therefore, improving the computational power of current machine learning systems is of great importance. This thesis aims to optimize and accelerate fine-tuning of Transformer-based models while taking into account several evaluation criteria, such as training time, energy consumption, cost, and hardware utilization. Additionally, a comparison is made between GPU training settings and specialized AI accelerators, such as TPU training settings. In our study, a high-performance kernel for the Adan optimizer was introduced, and the LightSeq library was applied to accelerate existing Transformer components. We also introduce mixed precision training into our workflow and compare all these optimization techniques step by step with baseline performance. In addition, our analysis includes distributed training with multiple GPUs, and a backpropagation time estimation algorithm is introduced. Next, Google's TPU accelerator is used to run our task, and its performance is compared to a similar GPU setup used in our study. Finally, the advantages and disadvantages of the different methods are systematically analyzed while training on V100, A100, A10 and T4 with different configurations. Meanwhile, the workflow between GPUs and TPUs is analyzed, illustrating the pros and cons of the different accelerators. Various weights for measuring optimization methods based on time, energy consumption, cost, and hardware utilization are proposed. Our analysis shows that optimal scores in all metrics can be achieved by implementing the optimized LightSeq model, kernel fusion for the Adan optimizer, and enabling mixed precision training. While training with TPU offers certain advantages, such as large batch sizes when loading training data, the ease of use, reliability, and software stability of GPU training surpass those of TPU training.
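One of the optimizations discussed, mixed precision training, can be shown in isolation with PyTorch's automatic mixed precision. The model and optimizer below are stand-ins (the thesis combines this with LightSeq kernels and a fused Adan optimizer):

```python
# Sketch: one AMP fine-tuning step. Falls back to plain fp32 on CPU.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 2).to(device)        # placeholder for a Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(32, 512, device=device)
y = torch.randint(0, 2, (32,), device=device)

for step in range(10):
    optimizer.zero_grad()
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()     # loss scaling avoids fp16 underflow
    scaler.step(optimizer)
    scaler.update()
print(float(loss))
```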
- Hardware BVH builder based on the PLOC++ algorithm (2023)
  Saberian, Keivan; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Sintorn, Erik; Assarsson, Ulf
  The demand for high-quality visual effects in 3D rendering for real-time applications is on the rise. To meet this demand, researchers have focused on integrating ray tracing support into graphics hardware. However, support for dynamic scenes still poses a significant challenge. This is due to the fact that the underlying spatial data structures, most commonly the bounding volume hierarchy, must in the worst case be rebuilt every frame. This thesis introduces a hardware accelerator for the construction of bounding volume hierarchies. The proposed hardware is based on the state-of-the-art PLOC++ algorithm and aims to address the memory-intensive construction through a bandwidth-economical approach in which most external memory traffic is converted into on-chip streaming traffic, similar to PLOCTree [Viitanen et al. 2018]. The proposed unit is on average 2.19 times faster in simulation, with a 3.94 times improvement in memory traffic.
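A greatly simplified, CPU-side sketch of the PLOC-style clustering the hardware accelerates: Morton-ordered clusters repeatedly merge with their nearest neighbor (by merged-AABB surface area) inside a small search window. The window size, the random data, and the omission of PLOC++'s refinements are our simplifications.

```python
# Simplified PLOC-style bottom-up clustering over Morton-ordered leaf AABBs.
import numpy as np

def aabb_area(lo, hi):
    d = hi - lo
    return 2.0 * (d[0] * d[1] + d[1] * d[2] + d[0] * d[2])

def ploc_pass(clusters, radius=4):
    """One iteration: find each cluster's cheapest merge partner in a window,
    then merge mutually-nearest pairs."""
    n = len(clusters)
    nn = []
    for i in range(n):
        best, best_cost = -1, np.inf
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            if i == j:
                continue
            lo = np.minimum(clusters[i][0], clusters[j][0])
            hi = np.maximum(clusters[i][1], clusters[j][1])
            cost = aabb_area(lo, hi)
            if cost < best_cost:
                best, best_cost = j, cost
        nn.append(best)
    merged, out = set(), []
    for i in range(n):
        if i in merged:
            continue
        j = nn[i]
        if j >= 0 and nn[j] == i and j not in merged:   # mutual nearest pair
            out.append((np.minimum(clusters[i][0], clusters[j][0]),
                        np.maximum(clusters[i][1], clusters[j][1])))
            merged |= {i, j}
        else:
            out.append(clusters[i])
    return out

rng = np.random.default_rng(0)
pts = rng.random((32, 3))
clusters = [(p, p + 0.01) for p in pts]   # leaf AABBs, assumed Morton-sorted
while len(clusters) > 1:
    new = ploc_pass(clusters)
    if len(new) == len(clusters):         # no merge possible in the window
        break
    clusters = new
print("root clusters:", len(clusters))
```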
- Hierarchical Reconstruction of Quadtree-Based Approximations of Incident Radiance (2023)
  Nilsson, Johannes; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Assarsson, Ulf; Sintorn, Erik
  This thesis presents a method for hierarchically reconstructing quadtree-based approximations of incident radiance for path guiding. To that end, Gaussian denoising is applied by mapping quadtree data to matrices, performing regular matrix convolutions, and then mapping the data back to the quadtree. The reconstructed guiding distributions are shown to outperform previous work in most cases, especially when the number of path samples is limited. Additionally, using a simple target distribution at the beginning of the learning process before switching to the full version is shown to speed up the learning. To limit the overhead of the reconstruction procedure, an automatic workload budgeting algorithm is presented. While the improved quality of the guiding distributions seems promising, the implemented reconstruction algorithm imposes a large enough overhead to nearly cancel out the benefits.
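The matrix round-trip at the heart of the method can be sketched for the simplest case of a uniform-depth quadtree: flatten the leaves to a matrix, apply a Gaussian convolution, and map back. The depth, sigma, and random radiance values are illustrative; the thesis handles adaptive trees hierarchically.

```python
# Sketch: quadtree leaves -> matrix -> Gaussian convolution -> leaves,
# assuming a uniform depth-4 tree (a 16x16 leaf grid) for simplicity.
import numpy as np
from scipy.ndimage import gaussian_filter

depth = 4
side = 2 ** depth                           # uniform quadtree: 16x16 leaves

rng = np.random.default_rng(0)
leaf_radiance = rng.random(side * side)     # noisy per-leaf radiance estimates

grid = leaf_radiance.reshape(side, side)    # map quadtree data to a matrix
denoised = gaussian_filter(grid, sigma=1.0) # regular matrix convolution

leaf_radiance_out = denoised.reshape(-1)    # map back to the quadtree
print(float(np.abs(grid - denoised).mean()))
```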
- High-speed Serial SpaceFibre Link Software Evaluation (2023)
  Mass, Jesper; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Jonsson, Jan; Waqar Azhar, Muhammad
  SpaceFibre is an emerging standard for onboard spacecraft communication. Cobham Gaisler has recently developed an IP core for communicating over a SpaceFibre link. However, no software driver API for using the IP is currently available. By using SpaceFibre, the speed of communication can be increased by up to 15 times compared to the previously used SpaceWire. This increased speed could enable more data to be sent from sensors to the onboard computer for faster processing. Currently, few other drivers for SpaceFibre exist and there is little benchmarking of how well it actually performs. The aim of this thesis was to design a software driver to benchmark the actual performance and validate the SpaceFibre IP developed at Cobham Gaisler. To accomplish this, an external tool developed by the creators of the SpaceFibre standard was used: the STAR Fire Mk3. With this tool, a test using the driver designed in this thesis was performed, where the STAR Fire Mk3 was used to measure the statistics and act as both recipient and transmitter of messages. As a result, it was found that the IP core reaches speeds of up to 1.91 Gbps for reception and 1.5 Gbps for transmission on a link running at effectively 2 Gbps. With this, the user can reach speeds of at least ten times that of SpaceWire, while retaining all the standardised quality of service the protocol provides.
- Hybrid Compression: Exploiting Model and Data Compression for Deep Neural Network Workloads (2022)
  Xie, Yunyao; Li, Naicheng; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Pericas, Miquel; Petersen Moura Trancoso, Pedro
  Nowadays, various kinds of deep neural networks (DNNs) show human-level capabilities in their domains. But these networks usually have millions or billions of parameters and significant computation costs. To bring artificial intelligence into people's daily lives, deploying DNNs efficiently on various hardware (e.g., resource-limited edge devices) has become a popular topic. In our thesis work, we explore model compression and data compression, and then combine them into a hybrid compression scheme to reduce the computation and memory costs of DNNs and speed up model inference. Hybrid compression gives compression ratios of 5.15 and 5.57, speedups of 2.66 and 3.93, and accuracy losses of 2.38% and 2.64%, when starting from pruning 30% of the floating point operations (FLOPs) of MobileNetV1 and ResNet50, respectively. Starting from models pruned by 50% of FLOPs, hybrid compression achieves compression ratios of 6.29 and 7.53 and speedups of 2.74 and 4.21, with accuracy drops of 7.71% and 9.27%, for MobileNetV1 and ResNet50. To verify the effectiveness of hybrid compression, we evaluate the inference speed of the compressed models on two edge devices, the Nvidia Jetson Nano and the Nvidia Jetson Xavier NX. We find that the gains of hybrid compression are hardware dependent, and the NX shows more impressive gains than the Nano. Finally, we give users recommendations on how to apply hybrid compression from different aspects.
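The two stages can be combined on a toy model with standard PyTorch utilities: L1 unstructured pruning followed by dynamic int8 quantization. This is a sketch of the general technique, not the thesis's FLOPs-based pruning of MobileNetV1/ResNet50.

```python
# Sketch: model compression (pruning) + data/weight compression (quantization).
import torch
import torch.nn.utils.prune as prune

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
)

# Stage 1: prune 30% of the smallest-magnitude weights in each Linear layer.
for module in model:
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")        # make the pruning permanent

# Stage 2: quantize the remaining weights to int8 for inference.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
print(quantized(torch.randn(1, 128)).shape)
```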
- Investigating Dynamic User-Level Scheduling to Improve AI-Based Intrusion Detection Systems on IoT (2022)
  Coban, Ali Zulfukar; Mirzai, Aria; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Petersen Moura Trancoso, Pedro; Almgren, Magnus
  Internet of Things devices, with their inherent convenience factor, have exploded in numbers during the latest decade, however at the cost of rising security concerns. This is largely due to their incapability of solving complex and computationally heavy numerical problems, especially when dealing with large data sets, a key component for computers in today's world for fending off attacks. The main contribution of this thesis is investigating how a dynamic user-level scheduler can improve the detection capabilities of AI-based intrusion detection systems and enable retraining of an AI algorithm on an IoT device. The models are assumed to be made of lightweight and data-driven machine learning algorithms, such as "PASAD", which we chose to utilize for this work. The scheduler was created after having initially developed a basic framework for allowing the PASAD models to detect attacks, denoted as our "baseline" system. The experiments that followed proved that the dynamic user-level scheduler provides several additional advantages compared to the baseline, mainly a substantial throughput increase which reduces the time until attacks are detected, a critical factor from a security perspective. Additionally, a model prioritization feature was built to allow the scheduler to allocate more processing resources towards nodes it suspects to be under attack. Both of these features play an important role in paving the way to having our IoT devices protected by more robust security schemes, even those devices considered too resource-limited today. With our scheduler implemented on an Nvidia Jetson Nano, it is possible to calculate approximately 57,000 anomaly scores per second, which are used in the attack monitoring process, for roughly 97 detection models while simultaneous retraining is taking place (results are for when PASAD is the utilized detection algorithm). Furthermore, with 75 PASAD models, the scheduler is able to reach approximately 1.46 times the performance of the baseline with retraining enabled, and approximately 2.15 times with retraining disabled.
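The prioritization idea can be sketched as weighted fair scheduling: each detection model runs next when its virtual time (runs divided by weight) is smallest, so a suspected node with a higher weight receives proportionally more anomaly-scoring slots. The names, weights, and slot count below are illustrative, not the thesis's scheduler.

```python
# Toy weighted fair scheduler for per-node anomaly-scoring work.
import heapq

def schedule(models, total_slots):
    """Distribute scoring slots in proportion to each model's weight."""
    weights = dict(models)
    runs = {name: 0 for name, _ in models}
    heap = [(0.0, name) for name, _ in models]
    heapq.heapify(heap)
    for _ in range(total_slots):
        vtime, name = heapq.heappop(heap)
        runs[name] += 1                    # one anomaly score computed here
        heapq.heappush(heap, (runs[name] / weights[name], name))
    return runs

# A node suspected of being under attack gets 4x the weight of the others.
models = [("node-A", 1.0), ("node-B", 1.0), ("node-C(suspected)", 4.0)]
print(schedule(models, total_slots=600))   # node-C gets ~2/3 of the slots
```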