- Leveraging Data Augmentation for Better Named Entity Recognition in Low-Resource Settings (2024). Björnerud, Philip; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Bernardy, Jean-Philippe; Dannélls, Dana; Kokkinakis, Dimitrios.
  This thesis investigates challenges in the field of Natural Language Processing (NLP), with a focus on Named Entity Recognition (NER), a subtask within NLP that involves classifying entities. Addressing the issue of data scarcity, which is particularly critical in non-English languages like Swedish, the study investigates various data augmentation methods by fine-tuning the transformer-based model KB-BERT. Low-resource settings are simulated, drawing inspiration from the work of X. Dai and H. Adel (2020), using three sets of training data containing 50, 150, and 500 instances respectively. The thesis also explores whether a newly developed state-of-the-art data augmentation method can outperform other data augmentation methods in enhancing an NLP model, centering on three methods: synonym replacement, mention replacement, and AugGPT, the last being a state-of-the-art method. The findings show that synonym replacement was the most effective data augmentation method across the low-resource settings, achieving the highest F1-score increase in all scenarios. AugGPT achieved the second highest average F1-score, while mention replacement achieved the lowest across the tested settings.
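As an illustration of one of the augmentation methods named in this abstract, mention replacement can be sketched on BIO-tagged data as follows. This is a minimal sketch, not the thesis implementation: the function names, the `p` parameter, and the data layout (a list of `(tokens, tags)` pairs) are assumptions made for the example.

```python
import random


def entity_spans(tags):
    """Yield (start, end, entity_type) for each B-/I- span in a BIO tag list."""
    i = 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            etype = tags[i][2:]
            j = i + 1
            while j < len(tags) and tags[j] == "I-" + etype:
                j += 1
            yield i, j, etype
            i = j
        else:
            i += 1


def collect_mentions(sentences):
    """Group every entity mention in the training data by entity type."""
    mentions = {}
    for tokens, tags in sentences:
        for start, end, etype in entity_spans(tags):
            mentions.setdefault(etype, []).append(tokens[start:end])
    return mentions


def mention_replacement(tokens, tags, mentions, rng, p=0.5):
    """Swap each mention, with probability p, for a random mention of the same type."""
    new_tokens, new_tags = [], []
    pos = 0
    for start, end, etype in entity_spans(tags):
        # Copy the non-entity tokens before this mention unchanged.
        new_tokens.extend(tokens[pos:start])
        new_tags.extend(tags[pos:start])
        span = tokens[start:end]
        candidates = mentions.get(etype, [])
        if candidates and rng.random() < p:
            span = rng.choice(candidates)
        # Re-tag the (possibly longer or shorter) replacement span.
        new_tokens.extend(span)
        new_tags.extend(["B-" + etype] + ["I-" + etype] * (len(span) - 1))
        pos = end
    new_tokens.extend(tokens[pos:])
    new_tags.extend(tags[pos:])
    return new_tokens, new_tags
```

Calling `mention_replacement(tokens, tags, collect_mentions(train), random.Random(0))` on each training sentence yields an additional, label-consistent sentence; the replacement probability and the number of augmented copies per sentence are tuning choices that the abstract does not specify.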
- Layout Syntax Support in the BNF Converter (2023). Burreau, Beata; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Ranta, Aarne; Abel, Andreas.
  Many programming languages, such as Haskell and Python, use layout as part of their syntax. We can expect future programming languages to also be layout-sensitive. Therefore, the toolchains for implementing programming languages must support layout-sensitive languages. This thesis presents a declarative approach to describing layout-sensitive languages and parsing programs written in them. We reserve the terminals newline, indent, and dedent for describing layout syntax in BNF grammar and provide an algorithm that represents the layout of a program with these terminals before parsing it. By verbalising layout syntax this way, mainstream parser generators and their parsing algorithms can be used. This approach is successfully implemented in the BNF Converter (BNFC), a tool that generates a compiler front-end from a context-free grammar in Labelled BNF (LBNF) form. With a special kind of LBNF rule, called a pragma, it is possible to declare global layout syntax rules, such as the offside rule, which affect the insertion of layout terminals by the aforementioned algorithm. Together, the reserved terminals and the pragmas can describe popular layout syntax. Furthermore, both purely layout-sensitive languages and those mixing layout-sensitive and layout-insensitive syntax are describable in LBNF.
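The idea of representing a program's layout with explicit newline, indent, and dedent terminals before parsing can be illustrated with the conventional indentation-stack technique. The sketch below is a generic version of that technique, not the algorithm from the thesis or BNFC's implementation; it assumes space-only indentation, whitespace-separated tokens, and the hypothetical function name `layout_tokens`.

```python
def layout_tokens(source):
    """Rewrite indentation as explicit NEWLINE / INDENT / DEDENT tokens."""
    tokens = []
    stack = [0]  # indentation columns of the currently open blocks
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no layout information
        col = len(line) - len(line.lstrip(" "))
        if col > stack[-1]:
            # Deeper than the enclosing block: open a new block.
            stack.append(col)
            tokens.append("INDENT")
        while col < stack[-1]:
            # Shallower: close blocks until the columns match.
            stack.pop()
            tokens.append("DEDENT")
        if col != stack[-1]:
            raise ValueError(f"inconsistent dedent: {line!r}")
        tokens.extend(stripped.split())
        tokens.append("NEWLINE")
    # Close every block still open at the end of the input.
    tokens.extend("DEDENT" for _ in stack[1:])
    return tokens
```

Once the layout is made explicit like this, an ordinary context-free grammar can refer to the three terminals directly, which is what allows standard parser generators to be reused.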
- Detecting Metastable States in Proteins using E(3) Equivariant VAMPnets (2023). Arnesen, Sara; Nordström, David; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Dubhashi, Devdatt; Olsson, Simon.
  As proteins fold, they encounter intermediary conformations, often denoted metastable states, that are vital to deciphering diseases related to malfunctions in conformational changes. To detect these metastable states, a deep learning framework using the variational approach for Markov processes (VAMP) has been proposed, dubbed VAMPnets. In this master’s thesis, we improve the training of VAMPnets through the use of E(3) equivariant neural networks. These networks incorporate the symmetries of Euclidean space, facilitating faster and more data-efficient learning. To study the effectiveness of these networks, we benchmark two different equivariant Transformer architectures and an equivariant convolutional network against both a simple and an invariant multilayer perceptron. The models are evaluated on molecular dynamics trajectories of alanine dipeptide and on protein folding datasets. The use of E(3) equivariant neural networks in training VAMPnets is shown to significantly improve the prediction accuracy on randomly downsampled data. Using only 1% of the dataset, the equivariant Transformer achieves almost twice the VAMP-2 score of the benchmarks. Furthermore, the model exhibits improved robustness: with only 20% of the data remaining, it scores on par with the complete dataset. On average, the model requires significantly fewer backward passes, converging more than twice as fast as the benchmark models, showing enhanced data efficiency. The results also highlight the significant computational burden that equivariant neural networks pose, especially for larger molecules, proving almost 1,000 times slower on the protein folding dataset. Finally, we propose a novel algorithm for detecting the number of metastable states of a molecule using the VAMP-2 score and provide estimates for the 12 proteins in the protein folding dataset.
- Optimization of Test Execution (2023). Brink, Erik; Risne, William; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Coquand, Thierry; Pope, Jeremy.
  Testing software is among the most fundamental practices of programmers. Though sometimes daunting to carry out, testing still plays an important role in assuring correct software in various forms. The possibly daunting part of testing is the time it takes to execute an entire test suite potentially containing millions of test cases. Such test suites might end up taking days to run, which might leave developers with idle hands. Various solutions have been proposed to optimize test suite execution in terms of time efficiency. The time from the start of the execution until receiving an error can be minimized by using test case prioritization. This could involve ordering the test cases in a test suite such that the test cases with a higher probability of failing (producing an error), based on modifications to a piece of software, come first in the order of execution. In this thesis, we implement test case prioritization using a Deep Neural Network that produces an order of test cases to be executed. We refer to this model as the Prioritized Order Model (POM). We also use test case selection, which involves taking a subset of a test suite based on some criterion. In this thesis, the criterion is based on time limitations on the execution of tests, and selection is done using an approach based on the Knapsack Problem. We found that POM performs well given a sufficient amount of data on test suite error reports and modified files in a software repository. POM is compared to different orderings in terms of time efficiency, and the comparison indicated superior performance by POM.
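The combination of time-limited test selection and prioritization described in this abstract can be illustrated with a classic 0/1 knapsack. This is a sketch under stated assumptions rather than the thesis's method: each test is a hypothetical `(name, duration, failure_probability)` tuple, durations are whole time units, and the failure probabilities are taken as given (in the thesis, the ordering comes from a neural network).

```python
def select_tests(tests, budget):
    """Pick the subset of tests with the highest total failure probability
    whose total duration fits in `budget` (0/1 knapsack, dynamic programming)."""
    n = len(tests)
    # table[i][t] = best total value using the first i tests within t time units
    table = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i, (_, duration, value) in enumerate(tests, start=1):
        for t in range(budget + 1):
            table[i][t] = table[i - 1][t]  # option 1: skip test i
            if duration <= t:  # option 2: run test i if it fits
                table[i][t] = max(table[i][t], table[i - 1][t - duration] + value)
    # Walk the table backwards to recover which tests were taken.
    chosen, t = [], budget
    for i in range(n, 0, -1):
        if table[i][t] != table[i - 1][t]:
            chosen.append(tests[i - 1])
            t -= tests[i - 1][1]
    chosen.reverse()
    return chosen


def prioritize(tests):
    """Order tests so that the ones most likely to fail run first."""
    return sorted(tests, key=lambda test: test[2], reverse=True)
```

Running `prioritize(select_tests(tests, budget))` first fits the suite into the available time and then orders the surviving tests so that a likely failure surfaces early.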
- INDAGO (2023). Arfvidsson Nilsson, Max; Backman, Pontus; Chalmers tekniska högskola / Institutionen för data och informationsteknik; Chalmers University of Technology / Department of Computer Science and Engineering; Feldt, Robert; Hassan, Ahmed.
  A seemingly never-ending issue with cryptocurrencies is their association with illegal activities. In 2020, it was estimated that roughly 3% of the transaction volume of Bitcoin consisted of transactions performed by known illicit actors. This is a problem for financial institutions wanting to integrate with cryptocurrencies, since they risk incurring large fines if they are found to be complicit in illegal activities. This thesis set out to provide insight into this issue by developing a tool capable of detecting illicit funds on the Ethereum blockchain. By utilising DAR clustering and four different blacklisting algorithms and running them on publicly available Ethereum transaction information, the tool was able to detect approximately 160 million possibly illicit Ethereum addresses at varying levels of suspicion. It was also able to detect 965,719 unique clusters, of which 238,536 contained illicit addresses. The blacklisting algorithms involved had previously been described in the literature, but this is, as far as we know, the first time concrete implementations have been created and tested on real data. The effectiveness of the algorithms was evaluated in isolation and in aggregate. All of the blacklisting algorithms were found to have possible use cases, though haircut and seniority showed the most potential for use in real-world scenarios, as they spread the funds in a desirable way while also having a runtime considerably lower than that of FIFO. DAR clustering in combination with at least one of the blacklisting algorithms also showed potential, as it was able to detect illicit addresses inside otherwise clean clusters. The findings of this thesis are limited to Ethereum, with only partial generalizability to other cryptocurrencies.
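To give a sense of what a blacklisting algorithm such as haircut does, the sketch below propagates taint proportionally through account balances. It follows the haircut policy as it is commonly described in the literature (every outgoing transfer carries the sender's current fraction of tainted funds) and is not the thesis's implementation; the function name, the `(sender, receiver, amount)` transaction layout, and the use of plain balances are assumptions for the example.

```python
def haircut_taint(initial_balances, blacklisted, transactions):
    """Return the tainted amount held by each address after all transactions.

    initial_balances: dict mapping address -> starting balance
    blacklisted:      set of addresses whose starting funds are fully tainted
    transactions:     chronological list of (sender, receiver, amount)
    """
    balance = dict(initial_balances)
    tainted = {
        addr: (bal if addr in blacklisted else 0.0) for addr, bal in balance.items()
    }
    for sender, receiver, amount in transactions:
        sender_balance = balance.get(sender, 0.0)
        if amount <= 0 or amount > sender_balance:
            raise ValueError(f"invalid transfer of {amount} from {sender}")
        # Haircut: the transfer carries the sender's current taint ratio.
        moved = amount * tainted.get(sender, 0.0) / sender_balance
        balance[sender] = sender_balance - amount
        tainted[sender] = tainted.get(sender, 0.0) - moved
        balance[receiver] = balance.get(receiver, 0.0) + amount
        tainted[receiver] = tainted.get(receiver, 0.0) + moved
    return tainted
```

Under this policy taint is diluted rather than duplicated: the total tainted amount in the system stays constant, and each address's level of suspicion can be read as the ratio of its tainted funds to its balance.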