Ideology and Power Identification in Parliamentary Debates

Master's thesis in Data Science and AI

Johan Jiremalm
Oscar Palmqvist

Department of Computer Science and Engineering
Division of Data Science and AI
Chalmers University of Technology and University of Gothenburg
Gothenburg, Sweden 2024
www.chalmers.se

Supervisor: Pablo Picazo-Sanchez, Computer Science and Engineering
Examiner: Moa Johansson, Computer Science and Engineering

Master's Thesis 2024
Department of Computer Science and Engineering
Division of Data Science and AI
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria

Abstract

Political debates are vital in shaping public opinion and influencing policy decisions. However, understanding the complex linguistic structures used by politicians, in order to ascertain their orientations and power dynamics, can be challenging. In this paper we explore Natural Language Processing techniques for identifying political orientation and power structures in parliamentary debates. We introduce a Located Missing Labels loss in order to train jointly to predict both power and ideology. Furthermore, our proposed method also trains to predict a third, synthetically generated polarity label. Finally, we combine this training method with pre-processing steps including back-translation and meta data inclusion.
Our results show that our method improves upon conventional fine-tuning. We take part in the Touché competition as part of CLEF 2024 and find that our method achieves the highest performance of all participants [1].

Keywords: Political Classification, NLP, LLM, MLML

Acknowledgments

We would like to acknowledge our supervisor, Pablo Picazo-Sanchez, for his continuous guidance, feedback, and assistance in this project. Furthermore, we want to express our gratitude for the computational resources provided by the Data Science and AI division at Chalmers University of Technology and University of Gothenburg. Finally, we would like to thank our examiner Moa Johansson for providing feedback on early and intermediary versions of this paper.

List of Acronyms

AI Artificial Intelligence
BERT Bidirectional Encoder Representations from Transformers
CE Cross Entropy
CLEF Conference and Labs of the Evaluation Forum
CNN Convolutional Neural Network
DeBERTa Decoding-Enhanced BERT with Disentangled Attention
LLM Large Language Model
LML Located Missing Labels
LoRA Low-Rank Adaptation
LR Logistic Regression
mBERT Multilingual BERT
MLM Masked Language Model
MLML Multi-label Learning with Missing Labels
NER Named-Entity Recognition
NLP Natural Language Processing
NSP Next Sentence Prediction
RF Random Forest
RNN Recurrent Neural Network
SA Sentiment Analysis
SVM Support Vector Machine

Contents

List of Acronyms ix
List of Figures xiii
List of Tables xv
1 Introduction 1
2 Background 3
2.1 Transformers 3
2.2 Modern LLMs 3
2.3 Sentiment Analysis 4
2.4 Efficient fine-tuning of large models 4
2.5 Ensemble Learning 5
2.6 In-context learning
5
3 Related Work 7
3.1 Model and training method 7
3.2 Domain-specific pre-training 8
3.3 Back-translation 8
3.4 Multi-label learning with missing labels 8
3.5 Political identification in NLP 9
3.6 Performance of models in similar contemporary competitions 10
4 Dataset 13
5 Method 19
5.1 Dataset and Preprocessing 19
5.1.1 Back-translation 20
5.1.2 Meta data inclusion 20
5.2 Combined training 20
5.2.1 Located missing labels loss function 21
5.3 Polarity label extension 21
5.4 Models 22
5.5 Hyperparameters 23
5.6 Ensemble modelling 24
5.7 Additional data extraction for test set 24
6 Results 27
6.1 Method components 27
6.2 Models 30
6.3 Translated vs. multilingual 31
6.4 Ensemble modelling 31
6.5 Test set results 31
7 Discussion 33
8 Ethics and Risk Analysis 37
9 Conclusion 39
A List of countries in the dataset I
B Polarity base prompt III
C Hyperparameters V
C.1 BERT method components
V
C.2 Model hyperparameters V
D Figures and illustrations VII
D.1 Multilingual model performance figures VII
D.2 Back-translation impact on specific parliaments VIII
D.3 Power dataset illustrations VIII

List of Figures

4.1 Distribution of the Opposition, Power, Left and Right-wing labels for each country in the orientation dataset. 14
4.2 Mean and standard deviation for text lengths per country in the orientation dataset. 15
4.3 Density distribution of text lengths for the 3 countries with the highest and lowest average text lengths in the orientation dataset. 15
4.4 Venn diagram where the red region represents the number of speakers that only appear in the orientation dataset (Orientation \ Power). The green region represents the number of speakers that only appear in the power dataset (Power \ Orientation). The centre region represents the number of speakers that appear in both datasets, in other words the intersection (Orientation ∩ Power). 16
5.1 Overarching method illustration. The method is divided into a data preprocessing stage and a model training and prediction stage. The test set is inserted as input, but it could potentially be any speech. 19
5.2 Visualised process for extracting the speaker for texts in the test data for the orientation task. The bottom flow chart then shows how we take the average prediction for all texts by speaker sp1. Note that we illustrate the predictions as ranging from 0 to 1 for simplicity whilst in actuality the logits were used. 25
6.1 Comparison of base BERT with different combinations of the components of our method over the training epochs.
CT stands for combined training, BT for back-translation, PLE for polarity label extension and MDI for meta data inclusion. 28
6.2 Comparison of RoBERTa trained for political orientation classification using conventional fine-tuning and our method. 30
D.1 Comparison of BERT and RoBERTa fine-tuned on the English translations vs. mBERT and XLM-RoBERTa fine-tuned on the original texts. VII
D.2 Difference in macro-average F1-score comparing combined-training BERT with and without back-translation on the orientation task. The countries are ordered by the number of speeches before back-translation, from left to right in increasing order. VIII
D.3 Mean and standard deviation for text lengths per country in the power dataset. IX
D.4 Density distribution of text lengths for the 3 countries with the highest and lowest average text lengths in the power dataset. IX

List of Tables

2.1 Example of zero- and one-shot classification. In zero-shot, the model is not given an exemplary stance classification before being asked to provide its own. The prompt column contains the prompts for the model and the output column, in blue, represents the next word as predicted by the model. 6
3.1 Results and techniques used in the EXIST 2021 competition [2] sorted in order of average ranking. RF stands for Random Forest and LR for Logistic Regression. 11
4.1 Distribution of the opposition and power labels in the power dataset as well as Left and Right labels in the orientation dataset. 14
6.1 Baseline macro-average F1-scores on our validation set for Orientation and Power tasks from the provided linear logistic regression model.
27
6.2 BERT component study showing how combinations of components in our method impacted the macro-average F1-score for the two tasks compared to conventional fine-tuning. CT stands for combined training, BT for back-translation, PLE for polarity label extension and MDI for meta data inclusion. 28
6.3 Highest attained macro-average F1-scores of our examined models. 30
6.4 Macro-average F1-scores of DeBERTa when evaluated using different sequence lengths. 31
6.5 Macro-average F1-scores on our base validation set of base models and the ensemble of them. 31
6.6 Macro-average F1-scores on test set of our final ensemble and the provided baseline model. 32
C.1 Non-default hyperparameter values for the BERT method components experiment. V
C.2 Best performing examined non-default hyperparameter values for various models. VI
C.3 Best performing examined non-default hyperparameter values for Gemma. VI

1 Introduction

Parliamentary debates play a vital role in political communication and society [3]. During these debates, representatives from diverse parties and ideologies share their opinions, arguments, and stances on issues impacting society. Making debates more accessible and easier to follow not only serves to inform people but also offers a basis for seeking further information and engaging in the democratic process [3]. The ability to detect and classify political motives from speech may also be utilised when the speaker is not forthcoming with their political agenda. Detecting hidden political motives in media such as news reporting and advertisements may benefit society by providing transparency [4].
The complex nature of politics makes these debates challenging to understand [5]. According to a recent survey, 65% of Americans say they always or often feel exhausted when thinking about politics [6]. In addition, 42% of adults in the U.S. reported having watched none, or very few, of the presidential debates in 2020. Moreover, analysing political speeches and making classifications can be challenging due to complex rhetorical strategies such as metaphors, parallelism, and suggesting answers [7]. Political context and the speaker's background also influence how messages are conveyed and interpreted.

The challenge of analysing and classifying political speech may be approached from the perspective of Natural Language Processing (NLP). NLP is a field in artificial intelligence that focuses on analysing, understanding, and processing natural language data using computers [8]. Sub-tasks within NLP include, among others, text summarisation, machine translation, and sentiment analysis. More recently, the field of NLP has surged in popularity with the development of chatbots such as ChatGPT. Besides massive Large Language Models (LLMs) such as GPT-4, there are also various other approaches to NLP, such as rule-based and probabilistic approaches [9]. The impressive performance of these LLMs can be applied to the complex political realm with great success [10].

The Conference and Labs of the Evaluation Forum (CLEF) [11] hosts an open competition in 2024 called Touché as part of one of its so-called labs [12]. The task, Ideology and Power Identification in Parliamentary Debates, is one of the four competitions that are part of Touché's presence at CLEF 2024 [13]. This document is an entry to that competition. It will, therefore, investigate which NLP tools are best suited for identifying ideology and power in parliamentary debate speeches.

Research goals.
The research goals¹ for this project are inspired by, and correspond to, the two sub-tasks of the Touché competition Ideology and Power Identification in Parliamentary Debates mentioned above [13].

RQ1 Investigate what the best methods and practices are for identifying the political orientation in a parliamentary speech.

RQ2 Investigate what the best methods and practices are for identifying whether a parliamentary speech is made by a speaker in opposition or in power.

Evaluation. The results of both research questions will be evaluated against a test set provided by Touché using a macro-averaged F1-score [14], as it is the performance metric of the Touché competition [13].

Paper structure. The rest of this document is organised as follows. In Chapter 2, we explore the foundational aspects of modern NLP. Chapter 3 investigates a narrower selection of related work that is closely connected to our problem and that influenced our method. We outline and describe the datasets in Chapter 4. We present our full proposed method in Chapter 5 and share the results of its application in Chapter 6. We discuss and explain these findings in Chapter 7. In Chapter 8 we provide a short ethical disclaimer and risk analysis before concluding the document in Chapter 9.

¹ Also referred to as sub-tasks.

2 Background

In the following, we explore the foundational aspects of NLP, discussing its evolution, common practices, and prevalent tools and models.

2.1 Transformers

In 2017, the field of Artificial Intelligence (AI) changed significantly with the introduction of transformers [15]. This innovation improved on previous methods like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) by offering parallelised training and an enhanced capability to capture long-range relationships.
The core of the transformer relies on attention, a function that gauges the relevance or similarity among different elements within a sequence of, for example, words, tokens, or pixels. The typical transformer architecture consists of an encoder and a decoder. However, variations exist where structures may include only encoders or only decoders.

2.2 Modern LLMs

One widely adopted pre-trained transformer model is Bidirectional Encoder Representations from Transformers (BERT), an encoder-only network featuring 12 transformer layers and approximately 110 million parameters [16]. The bidirectional encoder ensures that the model is able to consider the preceding and following text from a certain point simultaneously. This is utilised in the pre-training, which uses a Masked Language Model (MLM) objective. Concretely, this means that the model is trained to predict randomly masked words from a corpus. BERT is then fine-tuned for specific downstream tasks such as question answering and sentiment analysis. The context length of BERT, that is, the maximum number of tokens it can process at once, is 512 tokens.

Multilingual BERT (mBERT) shares the same architecture as BERT but is pre-trained on a multilingual corpus covering the 104 highest-resource languages on Wikipedia, in contrast to BERT, which is primarily trained on English data [17]. The tokenisation of mBERT is also designed to better represent multilingual data.

RoBERTa is a re-implementation of BERT [18]. Some of the main differences between BERT and RoBERTa concern the pre-training procedure: RoBERTa, for instance, uses dynamic masking and is trained on full sentences without the Next Sentence Prediction (NSP) loss. RoBERTa is also trained with larger batches and for a longer duration than BERT. It has the same context length of 512 tokens.

Llama-2, developed by Meta, uses the decoder from the traditional transformer architecture [19]. This is in contrast to BERT, which only uses the encoder.
There are multiple Llama-2 models, with sizes ranging from 7B to 70B parameters. They are all pre-trained on 2 trillion tokens from publicly available multilingual sources. The 34B and 70B models use Grouped-Query Attention during pre-training to improve inference scalability. Llama-2 has a context length of 4096 tokens, which is 8 times larger than the context length of BERT and RoBERTa.

Decoding-Enhanced BERT with Disentangled Attention (DeBERTa) is another LLM that improves on BERT and RoBERTa [20]. This is mainly achieved through the introduction of the disentangled attention mechanism and an enhanced mask decoder. There are multiple versions of DeBERTa, one of which is DeBERTaV3, which improves the efficiency of previous DeBERTa versions [21]. It does this by using ELECTRA-style pre-training with Gradient-Disentangled Embedding Sharing. There are also different variants of DeBERTaV3, one of which is DeBERTaV3-large, with 304M backbone parameters.

GPT-4, developed by OpenAI, uses a decoder-only architecture, unlike the BERT family [22]. Moreover, the model is multimodal, which means that it can process and produce both text and images. The parameters and weights are not publicly available, which makes it infeasible to use GPT-4 for this project.

2.3 Sentiment Analysis

The two sub-tasks covered in this project are similar to textual binary classification tasks, drawing parallels to a well-explored domain in NLP: Sentiment Analysis (SA). The task of SA is to identify the sentiment expressed in a text and then classify it. This task is generally split into three classification levels: document-level [23], sentence-level [24], and aspect-level [25]. There are many techniques for SA, such as lexicon-based approaches and machine learning approaches [9]. The sentiment of a text is also referred to as its polarity.
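To illustrate the lexicon-based approaches mentioned above, a minimal polarity scorer can be sketched as follows. The word lists here are toy examples, not a real sentiment lexicon:

```python
# Minimal lexicon-based polarity scorer (toy lexicon, for illustration only).
POSITIVE = {"congratulate", "good", "great", "support", "welcome"}
NEGATIVE = {"bad", "crisis", "cut", "oppose", "fail"}

def polarity(text: str) -> str:
    """Classify a text as Positive, Negative, or Neutral by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(polarity("I congratulate the hon. Gentleman on his great work"))  # Positive
```

Real lexicon-based systems additionally handle negation, intensifiers, and weighted lexicon entries, but the counting principle is the same.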
2.4 Efficient fine-tuning of large models

Full fine-tuning of pre-trained LLMs with billions of parameters is expensive and not feasible on most local machines [26]. To combat this, various techniques have been developed to reduce both the memory and the computing power required to fine-tune LLMs.

Low-Rank Adaptation (LoRA) is a technique where, instead of retraining all of the model's pre-trained weights, the weights are frozen and a trainable rank-decomposition matrix is injected into each layer of the transformer architecture [27]. This can significantly reduce both the number of trainable parameters and the memory requirements. LoRA is based upon the low-intrinsic-rank hypothesis, specifically that the matrix of weight updates can be effectively represented by a lower-dimensional representation. In essence, a small number of broad changes to the weights can represent the necessary updates instead of updating each weight individually.

Quantisation refers to the practice of reducing the computational cost and memory size of a model by representing weights using a smaller, less precise data type, such as a 16-bit float or an 8-bit integer [28]. Studies have also shown that specialised data types such as 16-bit brain floating point (bfloat16) [29] and 4-bit normal float (nf4) [26] yield increased performance for machine learning applications over regular floating-point types of equal or even greater size. QLoRA combines quantisation (Q) with LoRA, along with other innovations [26].

2.5 Ensemble Learning

Ensemble learning involves combining multiple individual models to obtain better generalisation performance [30]. There are a few different strategies for combining the models, including stacking, boosting, and bagging. The latter involves training multiple models independently on different subsets of the training data.
The outputs of the models are then usually combined via majority voting or by averaging the outputs. An example of a bagging ensemble model is the random forest classifier, which averages the predictions of multiple decision-tree classifiers. Ensemble modelling has previously shown great results in classification-task competitions [31].

2.6 In-context learning

In-context learning is the technique of having a language model predict based on a textual input consisting of any number of labelled and unlabelled examples, without any update to the model itself [32]. In-context learning is useful since it allows a model to be adapted to many different tasks without having to perform the oftentimes expensive process of retraining the model [33]. An example of zero-shot and one-shot classification, based upon in-context learning and later utilised in Section 5.3, is provided in Table 2.1.

Zero-shot
Prompt: Label the polarity of the following text.
        Text: The south-west was cut off from the UK [...]
        Polarity:
Output: Negative

One-shot
Prompt: Label the polarity of the following text.
        Text: I congratulate the hon. Gentleman on [...]
        Polarity: Positive
        Label the polarity of the following text.
        Text: The south-west was cut off from the UK [...]
        Polarity:
Output: Negative

Table 2.1: Example of zero- and one-shot classification. In zero-shot, the model is not given an exemplary stance classification before being asked to provide its own. The prompt column contains the prompts for the model and the output column, in blue, represents the next word as predicted by the model.

3 Related Work

Here, we discuss previously explored techniques used for political classification within NLP and their relevance to our project. We also examine a range of studies showcasing the performance of models such as RoBERTa and BERT in tasks such as multilingual political orientation classification and stance classification of political tweets.
Additionally, we explore domain-specific pre-training, back-translation, multi-label learning with missing labels, and other methodologies relevant to our research objectives.

3.1 Model and training method

Fine-tuned RoBERTa has demonstrated significant superiority over zero- and few-shot GPT-3 for multilingual political orientation classification [34]. For the task of implicit ideology prediction, fine-tuned RoBERTa has also been shown to outperform GPT-4, Llama-2-13B and Llama-2-70B using in-context learning, as well as Llama-2-70B using LoRA fine-tuning [35]. It is noteworthy that, in the same study, the only model which beat fine-tuned RoBERTa, for Named-Entity Recognition (NER), was the LoRA fine-tuned Llama-2-70B.

It has been shown that, even when only fine-tuning a classification head, BERT approaches the performance of few-shot GPT-3 models for stance classification of political tweets [36]. In a similar manner, a comparison of the performance of "Small Language Models" and modern LLMs for sentiment analysis tasks has been performed [37]. The study compares T5-large (770M parameters) to Flan-T5, Flan-UL2, text-davinci-003 and gpt-3.5-turbo (11B, 20B, 175B and undisclosed parameters, respectively) across 13 different sentiment analysis tasks and 26 datasets. For context, the base version of BERT has 110M parameters whilst the large version totals 340M parameters. The T5 model was trained on the entire training dataset whilst the other LLMs utilised zero- or few-shot classification. The authors divide their sentiment analysis tasks into three categories and find that the smaller, fully trained T5 model achieves the best results in all categories, outperforming zero-, one-, five- and ten-shot classification with the larger models. They conclude that whilst the larger LLMs perform adequately in simpler tasks, they are outperformed in complex tasks which require structured sentiment information or deeper understanding.
RoBERTa has been shown to be state-of-the-art for propaganda classification [38]. The same RoBERTa model outperforms fine-tuned versions of GPT-3 and multiple in-context learning versions of GPT-4 [39], yielding a micro-average F1-score of 63.4% compared to 58.11% for the best GPT model (base GPT-4). New, large decoder-only models have also been compared to encoder-only models for the tasks of intent classification and sentiment analysis [40]. The study reveals that, in general, encoder-only models provide superior performance, at a fraction of the computational demand, for natural language understanding tasks.

3.2 Domain-specific pre-training

Domain-specific pre-training refers to the process of training a model on domain-specific texts before fine-tuning for a specific task within that domain [41]. It has shown great utility for domains with abundant unlabelled text, such as the biomedical field [42].

For the task of multilingual political orientation classification, however, it was recently shown that domain-specific pre-training does not greatly impact results [34]. Moreover, after a threshold of approximately 10,000 sentences, general-domain pre-training appears sufficient, with no further benefit from additional domain-specific pre-training.

3.3 Back-translation

Back-translation involves translating a text into another language and then translating the result back into the original language [43]. In our case this technique is used to create artificial data that is similar to the original data, as further explained in Section 5.1.1. Back-translation has shown widespread utility for machine translation tasks [44]. Furthermore, using back-translation to artificially extend datasets has shown promising results for hate-speech detection tasks [45]. Back-translation for classification tasks has also proven particularly useful when there is little training data [46].
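The augmentation idea behind back-translation can be sketched as follows. Here `translate` stands in for a real machine-translation model; the word-level dictionaries are purely illustrative, and the deliberately asymmetric reverse mapping mimics how a round trip through another language produces a paraphrase rather than an exact copy:

```python
# Back-translation augmentation sketch. A real system would call an MT model;
# the toy tables below only illustrate the round trip.
TOY_EN_DE = {"the": "das", "parliament": "Parlament", "speech": "Rede"}
TOY_DE_EN = {"das": "the", "Parlament": "parliament", "Rede": "address"}

def translate(text: str, table: dict) -> str:
    """Toy word-level 'translation': map each word through the table."""
    return " ".join(table.get(w, w) for w in text.split())

def back_translate(text: str) -> str:
    """Translate to a pivot language and back, yielding a paraphrased copy."""
    pivot = translate(text, TOY_EN_DE)
    return translate(pivot, TOY_DE_EN)

augmented = back_translate("the parliament speech")  # "the parliament address"
```

The augmented copy keeps the original label, so each back-translated text effectively doubles as an extra training example.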
3.4 Multi-label learning with missing labels

Multi-label Learning with Missing Labels (MLML) has shown great utility for image classification tasks [47]. In these tasks, a "missing" label most often refers to a false negative, and the challenge is to differentiate between true negatives and false negatives caused by incomplete or faulty annotation. Many methods have been proposed to handle these missing labels [48, 49, 50]. As will be shown in Chapter 5, our two sub-tasks can be combined into a single multi-label classification problem with located missing labels. Traditional MLML methods are, however, not suited for our task, since the locations of our missing labels are known rather than hidden as false negatives. A study on MLML for image and facial-expression classification from 2014 shares our definition of MLML, where missing labels are located, but its technique is not appropriate for our project since it is tailored for label spaces magnitudes larger than ours [51].

Similar approaches that combine tasks have proven beneficial. For the task of peer-assessment evaluation, multi-task learning BERT has been shown to outperform its single-task counterparts [52]. For the multi-task learning BERT, three separate classification heads were added to the same base BERT model, and the loss for fine-tuning was the sum of the Cross Entropy (CE) loss from each classification task.

3.5 Political identification in NLP

Many approaches for analysing political textual data have been previously proposed. For instance, it has been shown that GPT-4 exhibits a deep understanding of, and an ability to quantify, broad political terms such as "ideology" and "power" [53]. However, it has also been shown that the GPT family of models, specifically GPT-3 and GPT-4, exhibits a liberal-leaning political bias [54, 55, 56].
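The located-missing-labels setting introduced in Section 3.4, and developed further in Chapter 5, can be sketched as a per-label cross-entropy that is simply masked out wherever a label is known to be absent. This is a simplified sketch under our own assumptions (flat lists, binary labels), not the exact thesis implementation:

```python
import math

def masked_bce(logits, labels, mask):
    """Binary cross-entropy averaged only over observed labels.

    logits, labels, mask are per-label lists for one example; mask is 1 where
    the label is observed and 0 where it is known to be missing.
    """
    total, n = 0.0, 0
    for z, y, m in zip(logits, labels, mask):
        if m == 0:
            continue  # located missing label: contributes no gradient
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        n += 1
    return total / n

# One sample with the orientation label observed and the power label missing.
loss = masked_bce(logits=[2.0, -1.0], labels=[1.0, 0.0], mask=[1, 0])
```

Because missing labels are located (their positions are known), no false-negative disambiguation is needed; the mask does all the work.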
NLP has already been used to process political texts. A minimum-cut classification framework combining Support Vector Machines (SVMs) and speaker agreement, whose goal was to classify speeches from the House of Representatives as being for or against a proposed piece of legislation, was proposed in [57]. In a similar manner, exploiting relationships between debate participants yielded a 70% accuracy on a test set of 10 debates comprising 860 speech segments [58].

Furthermore, the analysis of relative word frequencies as a method for examining political texts has been explored. Relative word frequencies have been utilised to extract policy positions by examining manifestos and legislative speeches of various British and Irish parties in the years 1991 and 1992 [59]. This technique was employed to infer policy stances not only from parties but also from individual politicians. Whilst the authors deemed their approach successful for both tasks, they did not provide a precise evaluation metric.

Attempts to classify ideology from political speech have also been made. SVMs have been used to classify the ideology of speakers in legislative speech records from the 101st to 108th Congresses of the US Senate [60]. This approach yielded a 92% accuracy score. It was also found that identifying which party a certain individual belonged to was easier than extrapolating their ideology; however, there is a high correlation between party membership and ideology. Building on these findings, similar strategies were used to identify party affiliation from speech [61]. This approach achieved an accuracy of up to 98% on debates in the Canadian House of Commons.

Sentiment analysis has also been used to identify trends in parliamentary speeches for parties in power and parties in opposition. For instance, using SA on the ParlSpeech
V2 dataset, it was found that political parties in power tend to speak with a more positive sentiment than parties in opposition [62]. Also, political parties that transition from opposition to power tend to exhibit a more positive sentiment in their speeches, and vice versa. This indicates that the correlation is not strictly a result of positive parties being more popular and therefore having an easier time getting into power.

3.6 Performance of models in similar contemporary competitions

In 2021, amongst other years, CLEF organised a competition called EXIST [2]. The two tasks in the competition were:

• Task 1: Identifying Sexist Content. In this task, the system is supposed to perform binary classification. It must determine whether a given text (tweet or gab) exhibits sexism, whether directly, by describing a sexist scenario, or by criticising sexist behaviour.

• Task 2: Categorising Sexist Content. Following the identification of sexist content, the subsequent task involves categorising the content based on the type of sexism present.

The results and techniques used in the two sub-tasks of the competition are compiled in Table 3.1. One takeaway from the approaches in the competition stems from the datasets containing both Spanish and English text. When participants used Beto (a Spanish version of BERT), it was exclusively used to analyse the Spanish texts, which means that it had to be combined with other models for English. The same is true for BERT: almost all participants that used BERT for the English texts also ended up using other models for the Spanish texts. Lastly, the most common and best-performing LLMs for handling multiple languages in this competition were mBERT and XLM-R.
Table 3.1 (columns: Team, Task 1 ranking, Task 2 ranking, average ranking, and markers for the techniques used among Bert, Beto, mBert, XLM-R, RoBERTa, RF, LR, SVM and fastText):

Ai-UPV 1 1 1 x x x
SINAI-TL 2 3 2,5 x x
AIT FHSTP 3 5 4 x
Multiaztertest 4 4 x x
LHZ 8 2 5 x
nlp uned team 5 9 7 x
QMUL-SDS 11 4 7,5 x
Alclatos 10 6 8 x x
ZK 9 8 8,5 x x x
GuillemGSubies 7 14 10,5 x x
IREL hatespeech group 14 7 10,5 x
Codec 12 12 x x
S_exist 12 13 12,5 x x
MiniTrue 13 13 x x x
UMUTeam 16 11 13,5 x x
Free 6 24 15 x x
ZZW 15 18 16,5 x
Zimtstern 17 16 16,5 x
LaSTUS 18 15 16,5 x
Recognai 26 10 18 x
Andrea Lisa 21 17 19 x
CIC 19 19 19 x x x
MessGroupELL 20 20 x x
MB-Courage 22 22 22 x
Nerin 24 20 22 x x
Soumya 23 23 23 x x
UNEDBiasTeam 25 21 23 x
BilaUnwanPk1 27 25 26 x
Almuoes3 27 27 x x x
ORDS_CLAN 29 26 27,5 x
Uja 28 28 x x

Table 3.1: Results and techniques used in the EXIST 2021 competition [2], sorted by average ranking. RF stands for Random Forest and LR for Logistic Regression.

4 Dataset

Touché provides a dataset [63] for our task which contains a selection of speeches from the ParlaMint corpora [64]. This dataset contains data from parliamentary speeches in multiple European parliaments; we include the countries covered in the dataset in Appendix A. More precisely, the dataset consists of two separate subsets, one for each sub-task. These subsets are further divided into multiple sub-subsets, each containing data from only one country. The organisers altered the dataset to provide less information than the original one, but they also include an automatic translation to English for most non-English texts. The provided training dataset consists of 6.5GB of text files [63], divided among the orientation and power datasets, and contains the following fields:
id is a unique (arbitrary) ID for each text.
speaker is a unique (arbitrary) ID for each speaker. There may be multiple speeches from the same speaker.
sex is the (binary/biological) sex of the speaker.
This information is collected from varying sources (typically data published by the respective parliament), and in some cases it may be unspecified or unknown.
text is the transcribed text of the parliamentary speech. Real examples may include line breaks and other special sequences, escaped or quoted.
text_en is an automatic English translation of the corresponding text. This field may be empty for speeches in English. There might be missing translations for a small number of non-English speeches.
label is the binary/numeric label. For political orientation, 0 is left and 1 is right. For power identification, 0 indicates coalition (or governing party) and 1 indicates opposition.

Data imbalance There is an uneven distribution of data, where some countries have more data than others. In addition, the label distribution within countries is also skewed. For instance, in Figure 4.1 we can see that Serbia has more data for speeches connected to right-wing parties, while for the power dataset Serbia has more speeches from speakers in power.

Figure 4.1: Distribution of the Opposition, Power, Left and Right-wing labels for each country in the orientation dataset.

Table 4.1 represents the overall label distribution of the dataset. The text lengths per country also vary and are displayed in Figures 4.2 and 4.3 for the orientation dataset. Shorter texts may contain less helpful information for the predictions and thus decrease performance. Moreover, models with a limited context length may not be able to capture all relevant information in the longer texts.
For instance, BERT has a context length of 512 tokens, which means that it cannot process the entirety of most texts at once.

Shared information Even though the datasets are split, 47.5% of the speakers that appear in one of the datasets appear in both, as shown in Figure 4.4.

Label | # of speeches | % of task data | % of all data
Left | 58,146 | 39.0 | 16.2
Right | 90,797 | 61.0 | 25.3
Power | 111,127 | 53.1 | 31.0
Opposition | 98,114 | 46.9 | 27.4

Table 4.1: Distribution of the Power and Opposition labels in the power dataset as well as the Left and Right labels in the orientation dataset.

Figure 4.2: Mean and standard deviation of text lengths per country in the orientation dataset.

Figure 4.3: Density distribution of text lengths for the 3 countries with the highest and the 3 countries with the lowest average text lengths in the orientation dataset.

Figure 4.4: Venn diagram where the red region represents the number of speakers that only appear in the orientation dataset (Orientation \ Power), the green region represents the number of speakers that only appear in the power dataset (Power \ Orientation), and the centre region represents the number of speakers that appear in both datasets, in other words the intersection (Orientation ∩ Power).

Moreover, 51.4% of speeches in the power dataset are made by a speaker who also appears in the orientation dataset.
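Overlap statistics of this kind reduce to set operations over the speaker field. The following is a minimal sketch with hypothetical toy records; the record and field names mirror the dataset description above, but the values are invented for illustration:

```python
# Sketch: speaker overlap between the two task datasets (toy records only;
# real data would be loaded from the provided dataset files).
orientation = [
    {"id": "or1", "speaker": "sp1"},
    {"id": "or2", "speaker": "sp2"},
    {"id": "or3", "speaker": "sp3"},
]
power = [
    {"id": "pw1", "speaker": "sp2"},
    {"id": "pw2", "speaker": "sp3"},
    {"id": "pw3", "speaker": "sp4"},
    {"id": "pw4", "speaker": "sp3"},
]

orient_speakers = {row["speaker"] for row in orientation}
power_speakers = {row["speaker"] for row in power}

# Speakers appearing in both datasets, as a share of all speakers.
shared = orient_speakers & power_speakers
share_of_union = len(shared) / len(orient_speakers | power_speakers)

# Share of power-dataset speeches whose speaker also appears in orientation.
shared_speeches = sum(row["speaker"] in orient_speakers for row in power)
share_of_power_speeches = shared_speeches / len(power)
```

On the toy records, two of four speakers are shared and three of four power speeches come from shared speakers; on the real data these quantities are the 47.5% and 51.4% figures quoted above.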
This amount equates to 72.2% of the total number of speeches in the orientation dataset.

Isolated speeches The dataset does not include the dates on which the parliamentary speeches occurred. This makes it infeasible to model changes in party positions over time or to fully encapsulate how politics shift. Moreover, simply connecting a speech to a certain ideology will not be as useful when it comes to predicting whether the speaker is currently in a governing position or in opposition, because a country with a left-leaning government one year could have a right-leaning government the next. Even though the speeches were originally part of debates and exchanges in parliament, the dataset contains no data for connecting multiple speeches to a single conversation or exchange. This prevents approaches that would model an entire debate and label the participants in the debate rather than the individual speeches themselves.

Privacy in the dataset The dataset uses arbitrary codes for people's names. This makes it difficult to check whether our results match up with real-world politicians. Additionally, the test set for the orientation task does not contain any speakers from the original orientation dataset. As a result, a solution which attempts to connect specific speeches to the correct political parties using the speaker's identity is infeasible.

Test set The test set for the orientation sub-task contains randomly sampled speeches by speakers who do not appear in the training set. The test set for the power sub-task does, to a large extent, contain speakers that also appear in the training set. However, speakers recurring in the test set will tend to have a different label distribution compared to the training set. For both sub-tasks, the test set contains approximately 2,000 speeches for each parliament, whose general label distributions resemble those in the training set.
The test data follows a similar structure to the training data, apart from the speaker_id and label fields being hidden.

5 Method

In the following, we describe how we processed and prepared the data as well as how we selected and trained models. An overarching view of our method is illustrated in Figure 5.1.

5.1 Dataset and Preprocessing

We decided not to extend the dataset using external sources. This was partly due to other parliamentary debates in our selection of countries either being unavailable or already included in the original ParlaMint corpora. It would also require considerable work to create properly labelled datasets in the same format as our base dataset. Finally, the amount of available data is of substantial size, and a larger amount would put further pressure on the need for computational resources.

We generated the training dataset for each task by sampling 70% of the provided datasets, leaving the remaining 30% for the validation set. Note that the 70/30 split is a commonly used rule-of-thumb which has shown some empirical optimality [65, 66]. We used this split for evaluating the individual and combined parts of our model. Once we had identified the most effective techniques, we switched to a different split for our final ensemble model, as explained in Section 5.6.

Figure 5.1: Overarching method illustration. The method is divided into a data preprocessing stage and a model training and prediction stage. The test set is inserted as input, but it could potentially be any speech.

Missing translations Firstly, some English translations were missing in the provided Finnish dataset.
Specifically, there were 271 translations missing in the training dataset and 88 in the test set. We machine-translated these texts ourselves using a Python package called mrTranslate.

5.1.1 Back-translation

To address the country distribution imbalance, we applied back-translation to the data from countries with fewer than 15,000 entries in the power and orientation datasets combined. We chose this threshold to strike a balance between increasing the number of speeches available for parliaments with less representation in the dataset and not increasing the training time excessively. The back-translation process involved translating the English text to the original language and back, as well as translating the original text to English and back to the original language. We also used mrTranslate for these translations and appended the resulting data to the dataset, keeping all other fields unchanged from the original entry. Sometimes, however, the translations would fail. In these cases, we manually translated the texts using Google Translate.

5.1.2 Meta data inclusion

By prepending each text with the corresponding country and gender of the speaker, e.g., "Germany, Female", models got access to the available contextual information not included in the speeches themselves. We hypothesise that since parliaments and the contexts of debates vary, so should their analyses. By giving the model access to all available metadata, i.e., all available context, we suspect that models might be able to better adapt their predictions. An example of adapting to such a context might be adjusting the prediction for someone advocating for a certain law depending on what the law currently is in the given country. There might also be useful information in the metadata itself, such as gender making a politician more likely to belong to a certain ideology in some countries.
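The prepending step itself is a simple string operation. A minimal sketch follows; the exact separator format and the encoding of the sex field are our own assumptions, not taken from the thesis:

```python
# Assumed encoding of the dataset's sex field; the real field values may differ.
SEX_NAMES = {"M": "Male", "F": "Female"}

def prepend_metadata(country: str, sex: str, text: str) -> str:
    """Prepend available metadata so the model input begins with context
    such as 'Germany, Female' before the speech itself."""
    return country + ", " + SEX_NAMES.get(sex, "Unknown") + ": " + text

example = prepend_metadata("Germany", "F", "Madam Speaker, I rise to ...")
```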
These examples are, of course, purely speculative, which is part of the reason why we chose to include all metadata instead of selecting fields based only on our own speculations.

5.2 Combined training

We merged the datasets into one combined dataset in which the label column was split into separate orientation and power labels. Despite being separate, both datasets contain shared elements, allowing some data in the power dataset to be carried over into the orientation dataset, as explained in Chapter 4. We first verified that each speaker in the orientation dataset consistently had the same label. Once a speaker was confirmed as having label y in the orientation dataset, all texts by that speaker in the power dataset were also classified as label y for orientation. When the datasets shared a text, the text from the orientation dataset was removed (since the text from the power dataset had already received the orientation label). This increased the number of orientation entries by 72.2%, which equates to 51.4% of the original power dataset.

5.2.1 Located missing labels loss function

Since the combined dataset has many missing labels, we needed to create a custom loss function. We calculated a filter tensor for each batch and label, marking rows with true labels as one and rows without true labels as zero. Then, we used this filter tensor as weights for the entries in the batch when computing the CE-loss. To account for the number of incomplete labels, we divided each label-specific loss by the sum of its filter tensor. By summing the losses for our two labels, we obtained a multi-label loss function which can account for Located Missing Labels (LML).

Let us consider this loss function for a single label, i.e., orientation. Formally, let C be the set of classes (i.e., left and right) and p_i the output predictions over all classes c ∈ C for entry i in the batch.
p_{c,i} is then the predicted probability of class c for entry i in the input batch, such that p_{c,i} ∈ [0, 1] and Σ_c p_{c,i} = 1 for all i. Also let y be the true labels, such that y_i ∈ {−1} ∪ C (where −1 corresponds to a missing label) represents the true label of entry i. Our custom LML-loss can then be expressed as Equation (5.1):

    LMLLoss(p, y) = (1 / Σ_i f_i) Σ_i f_i · CELoss(p_i, y_i),
    where f_i = 1 if y_i ∈ C, and f_i = 0 if y_i = −1.    (5.1)

As each label-specific loss is divided by the sum of its filter tensor, the resulting loss maintains a consistent size regardless of the number of samples with a true label in the batch. Consequently, the final summed loss represents the combined loss for each task, regardless of its prevalence in the batch.

5.3 Polarity label extension

If training to predict both the power and orientation labels at the same time yields increased performance, then training to predict a third label, which shares useful features with the first two, might yield further improvements. We chose polarity as the label to add to our dataset, i.e., whether a text carries a positive, negative or neutral sentiment. We chose polarity since it is an effective metric for identifying trends in parliamentary speeches, as explained in Section 3.5.

To obtain polarity labels for our dataset, we used version 0.2 of the instruction fine-tuned Mistral-7B, as available under mistralai/Mistral-7B-Instruct-v0.2 on Hugging Face. We chose this model since it outperforms other open-source LLMs of similar or larger size, such as the 7- and 13-billion parameter versions of Llama-2 [67]. We chose one example text for each polarity label and had GPT-3.5 explain why it assigned that label to that text. With these examples, we constructed our in-context learning prompt using 3-shot classification. The base textual prompt, which we then formatted into the instruction chat format of the model, can be found in Appendix B.
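To make Equation (5.1) concrete, the following is a minimal, torch-free sketch of the single-label LML loss; it operates on plain probability lists rather than batched logit tensors, and the joint two-label combination is shown as a simple sum, all for exposition only:

```python
import math

def lml_loss(probs, labels):
    """Located-Missing-Labels loss for one label (Equation 5.1).

    probs  : list of per-class probability lists, one per batch entry
    labels : list of true class indices, with -1 marking a missing label
    """
    filt = [0.0 if y == -1 else 1.0 for y in labels]   # filter tensor f_i
    n = sum(filt)
    if n == 0:
        return 0.0                                     # no true labels in batch
    # Cross-entropy for labelled rows only: -log p_{y_i, i}
    total = sum(f * -math.log(p[y])
                for f, p, y in zip(filt, probs, labels) if f)
    return total / n                                   # normalise by Σ_i f_i

def joint_loss(orient_p, orient_y, power_p, power_y):
    """Sum of the per-task LML losses (Section 5.3 additionally adds
    half of a polarity LML term)."""
    return lml_loss(orient_p, orient_y) + lml_loss(power_p, power_y)
```

Because each term is normalised by the number of labelled rows, a batch with a single orientation label contributes as much orientation loss as a fully labelled batch, which is exactly the consistency property discussed above.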
We double-quantised the Mistral model to a 4-bit normal float with a 16-bit float compute type, to fit the model in memory and for faster inference. To fit the entire base prompt along with the text to label, we defined a context length of 4,096. To generate from the model, we used sampling beam search with 3 beams, forcing the output to contain "Positive", "Negative" or "Neutral". Finally, we assigned the polarity label as 0, 1 or 2 depending on whether the first word of the output was "Negative", "Positive" or "Neutral" respectively, assigning −1 otherwise.

To perform the polarity classification, we added three output nodes to our classification head, corresponding to the three polarity classes. To account for holes in the polarity data caused by failed generations, we also used our LML-loss to calculate the loss from the polarity predictions. We only added half of the polarity LML-loss to the base LML-loss in order to prioritise our two core tasks.

5.4 Models

We restricted the models to those we could effectively train. Therefore, we excluded more extensive and capable models, such as the 70-billion parameter version of Llama-2 [19]. Furthermore, we were forced to limit hyperparameters, such as batch size and learning rate, to less-than-ideal values to comply with our limited computational resources. Our selection of models was also influenced by the notion that encoder-only models outperform modern large decoder models on similar tasks, at a lower computational demand [40].

We compared different modern transformer-based models to find which performed best for our task. Different models necessitated different hyperparameter values due to their differing sizes and designs. For all models, any implemented classification heads took the place of the last layer of the model as provided by its Hugging Face sequence classification implementation, leaving the method of pooling as implemented in the base model.
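The generation-to-label mapping from Section 5.3 (first word of the output decides the polarity, with −1 for failed generations) can be sketched as follows; the exact surface form of the model output, including trailing punctuation, is an assumption:

```python
# Polarity label encoding as stated in Section 5.3.
POLARITY = {"Negative": 0, "Positive": 1, "Neutral": 2}

def parse_polarity(generation: str) -> int:
    """Map the first word of a generation to a polarity label, returning -1
    for empty or unparseable outputs (later masked out by the LML loss)."""
    words = generation.strip().split()
    if not words:
        return -1
    return POLARITY.get(words[0].rstrip(".,:;"), -1)
```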
Finally, all models discussed below can be found on Hugging Face.

BERT and mBERT We evaluated the uncased version of BERT [16], available under bert-base-uncased, using the provided English translations. We chose this model since it is competent whilst being much smaller than other modern models (see Section 3.1) and since it had previously shown outstanding performance in similar competitions (see Section 3.6). We also evaluated the uncased version of multilingual BERT, available under bert-base-multilingual-uncased, on the speeches in their original languages. We chose this model because it is a multilingual version of BERT and because it demonstrated excellent results on multilingual tasks in similar competitions (see Section 3.6).

RoBERTa and XLM-RoBERTa We evaluated the large version of RoBERTa, available under FacebookAI/roberta-large, since it has been shown to be state-of-the-art for similar tasks (see Section 3.1). We also evaluated XLM-RoBERTa, available under FacebookAI/xlm-roberta-large, which is RoBERTa pre-trained for multilingual tasks [68].

DeBERTa V3 DeBERTa V3 is an improvement upon the original DeBERTa model [21]. The original DeBERTa model outperforms the large version of RoBERTa on a wide range of NLP tasks while using less training data [20]. The DeBERTa family of models also utilises disentangled attention and relative position embeddings, which allow it to process longer sequences than BERT and RoBERTa. Due to these factors, we chose to evaluate DeBERTa V3, as available under microsoft/deberta-v3-large.

Gemma The 7-billion parameter version of Gemma outperforms similar models of equal size such as Mistral-7B and Llama-2-7B [69]. Limited by our computational resources, we evaluated the smaller 2-billion parameter version of Gemma, as available under google/gemma-2b. This smaller version still necessitated techniques such as LoRA and double-quantising the model to a 4-bit normal float.
We applied LoRA to all matrices in the self-attention and MLP layers of Gemma.

5.5 Hyperparameters

Due to our limited computational resources, unfortunately, no experiments could be exhaustive. When we discovered that a certain hyperparameter value worked well, for instance using a warm-up period, we could not then afford to repeat the experiments for all previous models to include this choice. Efforts were instead directed towards balancing, for each model, the necessity to fit the data in a manageable time frame against the desire not to cause excessive unlearning in the base model. The main parameters which had to be adapted depending on the size of the model were the learning rate and the warm-up ratio. For instance, if a large model was trending downwards by the second epoch, then the learning rate might be lowered and/or a warm-up period added; in this way, experiments were exploratory.

We set specific hyperparameters to increase training speed and reduce memory consumption in order to accommodate larger models. For instance, all models except BERT and mBERT used 16-bit floating point mixed-precision training to accelerate the training process, whilst BERT and mBERT used regular full-precision training. Additionally, Gemma required LoRA and 4-bit quantisation to make fine-tuning feasible for us.

All models trained using a maximum sequence length of 512 tokens. Whilst we would have preferred to train with longer sequence lengths for the models which can handle them (DeBERTa-V3 and Gemma), this was not computationally feasible. However, to still utilise the longer context lengths DeBERTa-V3 can handle, it was re-evaluated using a maximum sequence length of 4,096 tokens after training was complete. In this way we could train and evaluate the models more efficiently with 512 tokens and then afterwards leverage longer context lengths for only the final version of the model.
We would have preferred to also re-evaluate Gemma using a longer context length; however, due to unforeseen limitations on computational resources, only DeBERTa could be re-evaluated with a longer sequence length.

5.6 Ensemble modelling

After identifying the best performing models and training methods using our validation set, we created new training and validation sets which we used to re-train the selected models for ensemble modelling. These new validation sets contained disjoint selections of 10% of the available data, with a minimum of 5 samples for each country and label. We chose this data split to increase the amount of data available to our models. By using bagging (letting each model in the ensemble have a separate validation set), each model could be monitored for over-fitting whilst the ensemble as a whole had still trained on the entire dataset. We speculate that, by using this approach, our ensemble will be able to leverage the entire training set: if a single model has not been able to learn something useful because the required data was in its validation set, the other models of the ensemble will have had access to that data. We also chose to decrease the ratio of the validation set, as we deemed the necessity of it being representative and reliable to be diminished once we had already determined and validated our method.

We created the ensemble by selecting the best performing multilingual and English-only models. We then trained two instances of the better performing of the two models, as well as one instance of the other model, using our newly created ensemble training and validation sets. The main reason for using two instances of the best model was to ensure that this model had the most influence over the final prediction. We decided to use one multilingual model to capture potentially new information not available in the English translations.
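The ensemble's prediction step (averaging the members' output logits, applying the sigmoid function and rounding) can be sketched as follows, assuming a single binary logit per model for illustration:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def ensemble_predict(per_model_logits):
    """Average one binary logit per ensemble member, then squash and
    threshold at 0.5 to obtain a hard 0/1 class prediction."""
    mean_logit = sum(per_model_logits) / len(per_model_logits)
    return round(sigmoid(mean_logit))

# Hypothetical logits from, e.g., two DeBERTa-V3 instances and one XLM-RoBERTa.
pred = ensemble_predict([2.1, 1.4, -0.3])
```

Averaging in logit space rather than averaging hard votes lets a very confident member outweigh two mildly opposed ones.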
For a given prediction, we ran each of these models and averaged their output logits before applying the sigmoid function and rounding to receive a final prediction.

5.7 Additional data extraction for test set

Roughly 23% of the texts appearing in the provided test data for the orientation task also appear in the training data for the power task. Using this information, we can extract the speaker ID for the overlapping texts. Then, since the orientation label is always the same for each speaker, we can use these additional speeches to influence our predictions on the test set. For a given speech in the orientation test set, we averaged the logits of the examined model on that speech with the logits of our best performing model on all other speeches by the same speaker. In other words, predictions on the test data were averaged with those produced by our best model on speeches by the same speaker. This way, if a text in the test data lacks clear ideological signals, we can instead rely on other texts by that speaker to make our prediction. The process is visualised in Figure 5.2.

Figure 5.2: Visualised process for extracting the speaker for texts in the test data for the orientation task. The bottom flow chart then shows how we take the average prediction over all texts by speaker sp1. Note that we illustrate the predictions as ranging from 0 to 1 for simplicity, whilst in actuality the logits were used.

6 Results

In the following chapter, we present the results of applying our method to the provided dataset and the evaluated models.
We fine-tuned the models using the transformers library for Python, as provided by Hugging Face, running on an NVIDIA V100 with 32GB of memory. Unless otherwise mentioned, we set the hyperparameter values to the defaults provided by the library.

Baseline The competition organisers provided a baseline in the form of a simple linear logistic regression model. When we fitted this model to our training set and then applied it to our validation set, we achieved the macro-average F1-scores shown in Table 6.1:

Sub-task | Macro-average F1-score
Orientation | 0.6755
Power | 0.7149

Table 6.1: Baseline macro-average F1-scores on our validation set for the orientation and power tasks from the provided linear logistic regression model.

6.1 Method components

To understand the effects of each component of our method, we fine-tuned BERT multiple times with different combinations of the components in our base method. These components, as presented in Chapter 5, were combined training, back-translation, polarity label extension and meta data inclusion. All runs used the same hyperparameters, which can be found in Appendix C.1. Results are illustrated over the training epochs in Figure 6.1 and summarised in Table 6.2. The results as a whole show that the combined training beats both the conventionally trained BERT and the baseline. The other components each individually improve the performance of the combined training further. Finally, all components together yielded the best performance.

Figure 6.1: Comparison of base BERT with different combinations of the components of our method over the training epochs, for (a) the orientation task and (b) the power task. CT stands for combined training, BT for back-translation, PLE for polarity label extension and MDI for meta data inclusion.
Model | Orientation | Power
baseline | 0.6755 | 0.7149
BERT | 0.7596 | 0.7865
+CT | 0.8271 | 0.8041
+CT+BT | 0.8333 | 0.8059
+CT+PLE | 0.8317 | 0.8101
+CT+MDI | 0.8377 | 0.8111
+(all) | 0.8493 | 0.8152

Table 6.2: BERT component study showing how combinations of the components in our method impacted the macro-average F1-score for the two tasks, compared to conventional fine-tuning. CT stands for combined training, BT for back-translation, PLE for polarity label extension and MDI for meta data inclusion.

Combined training (CT) Training to predict both labels at once using our LML-loss showed significant improvements in comparison to training for only a single label at a time.

Back-translation (BT) Our results indicate that back-translation yielded an improvement in the orientation task over all epochs, whilst only yielding a non-marginal improvement in the first epochs of the power task when training for both tasks using CT. To further investigate whether back-translation helped improve the performance for countries with less data, we also visualised the results for each parliament individually. The results can be found in Appendix D.2 and show that, on average, parliaments with back-translation saw a significant improvement whilst the remaining parliaments did not.

Polarity label extension (PLE) Extending the combined training by adding a third label yielded an increase in performance over all epochs.

Meta data inclusion (MDI) Including the available meta data by prepending it to each text resulted in an improvement from the second epoch onwards for both tasks. However, for the first epoch it caused a decrease in performance on the power task whilst not impacting the orientation task.

Method components conclusions The examination of the components in our method indicates that all components of our method are beneficial for BERT. This is especially clear due to the combination of two factors.
The first factor is that the conventionally trained model had seemingly started to stagnate or over-fit, whilst our proposed method was still improving throughout all training epochs. The second factor is that our method exceeds conventional fine-tuning already by the first epoch. In combination, we may then reason that our method provides an intrinsic advantage, since it both converges faster (by the second factor) and seems to cause less over-fitting or unlearning (by the first factor). These factors also suggest that we actually make better update steps, rather than merely smaller (by the second factor) or larger (by the first factor) ones.

In order to validate that the improvement in performance on the orientation task from our method is not only due to the increased amount of data, we conventionally fine-tuned RoBERTa on the full set of available orientation data. We compare this to RoBERTa fine-tuned using the same hyperparameters, as found in Appendix C.2, but using our full method. Results are illustrated in Figure 6.2 and show that, even when using the same training data, our method exceeds conventional fine-tuning over all epochs for orientation classification. As discussed in Section 3.1, fine-tuned RoBERTa has been shown to outperform very capable models and to be state-of-the-art for similar tasks. It is therefore very encouraging to note that our method managed to significantly improve upon the performance of fine-tuned RoBERTa for political orientation classification.

Figure 6.2: Comparison of RoBERTa trained for political orientation classification using conventional fine-tuning and our method.

6.2 Models

The highest attained scores resulting from the application of our method to the various models are shown in Table 6.3.
Corresponding hyperparameters can be found in Appendix C.2. The results indicate that DeBERTa-V3 was the best performing model, with XLM-RoBERTa being the best performing multilingual model. Gemma, which trained using LoRA and quantisation, manages to exceed the performance of BERT and mBERT but falls short of the other models.

Model | Language* | Orientation | Power
BERT | Translation | 0.8493 | 0.8152
mBERT | Original | 0.8251 | 0.7941
RoBERTa | Translation | 0.8729 | 0.8440
XLM-RoBERTa | Original | 0.8621 | 0.8379
DeBERTa-V3 | Translation | 0.8870 | 0.8630
Gemma | Translation | 0.8541 | 0.8358

*Translation corresponds to training on automatic translations to English instead of the original language.

Table 6.3: Highest attained macro-average F1-scores of our examined models.

To investigate the impact of re-evaluating DeBERTa-V3 using a longer sequence length, we also evaluated different sequence lengths. The results are shown in Table 6.4 and indicate that there was a significant improvement in performance from increasing the sequence length initially, but that these increases diminish. The improvement from going from 512 to 1,024 tokens was noticeable (+0.0059 and +0.0084), whilst the improvement from going from 2,048 to 4,096 tokens was minor (+0.0001 and +0.0003). These findings are not surprising: after all, successive increases in maximum sequence length add fewer and fewer tokens, since more speeches become fully covered.

Sequence length | Orientation | Power
512 | 0.8788 | 0.8526
1024 | 0.8847 | 0.8610
2048 | 0.8869 | 0.8627
4096 | 0.8870 | 0.8630

Table 6.4: Macro-average F1-scores of DeBERTa when evaluated using different sequence lengths.

6.3 Translated vs. multilingual

To investigate whether models pre-trained for multilingual tasks outperform their mainly English-comprehending base models, we compared BERT with mBERT as well as RoBERTa with XLM-RoBERTa. Each pair used the same hyperparameters internally (see Appendix C.2).
The multilingual models processed the original texts whilst their counterparts processed the automatic translations. Results indicate that the multilingual models lag behind by a consistent amount. The macro-average F1-scores over the training epochs can be found in Appendix D.1.

6.4 Ensemble modelling

In order to validate that ensemble modelling was a beneficial approach, we selected DeBERTa-V3 and XLM-RoBERTa, fine-tuned on our base training set. The output logits of these models on the base validation set were then averaged to create validation predictions. The macro-average F1-scores of the yielded predictions, as seen in Table 6.5, show that the predictions of our best performing stand-alone model could be improved by also considering the outputs of our best performing multilingual model.

Model          Orientation   Power
XLM-RoBERTa    0.8621        0.8379
DeBERTa-V3     0.8870        0.8630
Ensemble       0.8972        0.8697

Table 6.5: Macro-average F1-scores on our base validation set of the base models and their ensemble.

6.5 Test set results

Baseline The macro-average F1-scores attained by the baseline model on the competition test set are shown in Table 6.6. This baseline model was fitted to the entirety of the originally provided datasets. When comparing this baseline to the baseline on our validation set, we see a decrease in macro-average F1-score of 0.1152 and 0.0748 for the orientation and power tasks respectively. This indicates that the test set is much more challenging, which is not surprising due to the nature of its construction. For the orientation task the test set contains speakers that do not appear in the training set, and for the power task it contains speakers who appear with a different role than they do in the training data. The test set also does not share the same distributions in parliament representation; for further details see Chapter 4.
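For reference, the macro-average F1-score used in all of these comparisons is the unweighted mean of the per-class F1-scores, so rare classes weigh as much as common ones. A minimal sketch with toy labels (illustrative only, not our actual evaluation code):

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: unweighted mean of per-class F1-scores."""
    per_class = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class.append(f1)
    return sum(per_class) / len(per_class)

# Toy binary power labels: 0 = opposition, 1 = governing coalition.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(round(macro_f1(y_true, y_pred, labels=[0, 1]), 4))  # 0.5833
```

This corresponds to what scikit-learn computes with `f1_score(y_true, y_pred, average="macro")`.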
Final ensemble model Our final ensemble consisted of two DeBERTa-V3 models and one XLM-RoBERTa model, fine-tuned using disjoint selected validation sets, as detailed in Section 5.6. We have made these models available on Hugging Face under oscpalML/DeBERTa-political-classification, oscpalML/DeBERTa-political-classification-alternative and oscpalML/XLM-RoBERTa-political-classification. Our final ensemble, averaging the logits of these models, yields macro-average F1-scores as seen in Table 6.6.

Model      Orientation   Power
Baseline   0.5603        0.6401
Ensemble   0.7945        0.8271

Table 6.6: Macro-average F1-scores on the test set of our final ensemble and the provided baseline model.

Additional data extraction The additional data extraction improved the performance on the orientation task. Our ensemble, without considering the other available speeches by a speaker, yielded a macro-average F1-score for the orientation task of 0.7854, whilst utilising the other speeches increased the score to 0.7945.

7 Discussion

In the following chapter, we discuss and reason about the effectiveness of the proposed solution. We further discuss the task itself and the limitations of our project.

Method effectiveness We show that our method improves performance for BERT and RoBERTa. Whether or not these findings translate to other models and tasks is, of course, a pertinent question. The limitations of this project, imposed upon us by our limited computational resources, prevent us from examining this question fully. However, due to the relatively similar architectures of our examined models and the nature of our method, we hypothesise that the benefits of our method do translate. This is because our method does not closely depend on the internals of a model, instead aiming to provide a more representative loss function and better input data. We leave more extensive empirical confirmation for future research.
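The logit averaging behind both the validation ensemble of Section 6.4 and the final ensemble reduces to a single averaging step per prediction. A minimal sketch with hypothetical logits (member names and numbers are illustrative, not our actual model outputs):

```python
import numpy as np

def ensemble_predict(*member_logits: np.ndarray) -> np.ndarray:
    """Average the raw output logits of the member models and take
    the argmax over classes as the ensemble prediction."""
    return np.mean(member_logits, axis=0).argmax(axis=-1)

# Hypothetical logits for 3 speeches over 2 classes from two members.
deberta = np.array([[2.1, -0.3], [0.2, 0.5], [-1.0, 1.4]])
xlm_roberta = np.array([[1.7, 0.1], [0.9, 0.3], [-0.2, 0.6]])
print(ensemble_predict(deberta, xlm_roberta))  # [0 0 1]
```

Averaging raw logits rather than hard labels lets a confident member outvote an uncertain one, which is why the ensemble can exceed its best stand-alone member.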
Just as our method likely translates well to other models, it might transfer well to other tasks. Our method is not reliant on the specifics of ideology and power identification, and can therefore potentially be applied to other similar tasks. It is, however, likely important that the combined tasks share useful features: combined training and the further synthetic label extension rely on the existence of cross-task useful features, and combining very unrelated tasks might therefore not yield the same benefits. Our utilisation of back-translation can potentially see broader application, since it is essentially just a method of language-based data augmentation. Meta data inclusion might be very task and problem dependent. One can imagine that, when classifying political tweets, including the year might be beneficial, whilst for other tasks meta data might in many cases be either unavailable or irrelevant.

Multi-task training The combined training yields an increase in performance for the orientation task, which is not surprising since we extract additional orientation labels and therefore provide the model with more data. On the other hand, the fact that there is also a significant improvement for the power task is very intriguing. We speculate that this improvement is due to our LML-loss providing a more representative loss, which incentivises extracting features that are useful for both tasks and discourages over-fitting. This seems to be supported by the additional increase in performance provided by also adding the prediction of a polarity label. This increase in performance is even more impressive when considering that the polarity label was synthetically generated and likely to add at least some amount of noise. Since polarity likely shares some important similarities with features useful for our tasks, as detailed in Section 3.5, we speculate that our LML-loss was improved so as to further incentivise cross-task useful features.
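The core idea of training jointly despite missing labels can be illustrated with a simplified masked multi-task loss: cross-entropy is computed only for the label heads where a label actually exists, so a speech carrying only an orientation label still contributes a training signal. This is an illustrative sketch in the spirit of our LML-loss, not our actual training code:

```python
import math

def masked_multitask_loss(task_logits, task_labels):
    """Average cross-entropy over the tasks whose label is present;
    tasks with a missing label (None) contribute no loss signal.
    task_logits: one logit list per task head.
    task_labels: a class index per task, or None where missing."""
    total, n_present = 0.0, 0
    for logits, label in zip(task_logits, task_labels):
        if label is None:  # missing label: skip this head entirely
            continue
        z = [math.exp(v) for v in logits]
        total += -math.log(z[label] / sum(z))  # softmax cross-entropy
        n_present += 1
    return total / max(n_present, 1)

# A speech with an orientation label (class 0) but no power label.
loss = masked_multitask_loss([[2.0, 0.5], [0.1, 0.3]], [0, None])
print(round(loss, 4))  # 0.2014
```

Because the missing head is skipped rather than padded with a dummy target, the shared encoder receives gradients only from labels that really exist, which is the "more representative loss" referred to above.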
Data preprocessing The observed impact of prepending available meta data to each speech, as shown in Fig. 6.1, is reasonable. We suspect that the prepended sentence is very different from the pre-training material of the base model, since it is simply two words and does not follow the form of a regular phrase or sentence. This disruption, we speculate, might essentially confuse the model until it is able to learn it in later epochs. Once the model has understood how to interact with the prepended sentence, however, it is able to leverage it into making better predictions. It would be interesting for future research to compare the difference between adding new tokens representing the meta data and prepending the meta data in English, as we did. It might be the case that base models are able to leverage prior understanding of countries, be it their general political environment or some other aspect. On the other hand, it might also be the case that prior bias hurts the model's ability to predict accurately and fairly.

Nature of tasks On the validation set, it is interesting to note that the linear baseline performed better on the power task than the orientation task, given that the models utilising our method show the opposite behaviour. We may also note that, whilst the difference is small, the power task benefited more from a longer sequence length. It is therefore not entirely unreasonable to suggest that the power task might rely more on specific words and phrases, as the linear baseline does. In other words, it might be the case that specific words are more important for predicting political power, whilst how one speaks in general is more indicative for predicting political orientation.

Difference between validation and test scores Another notable aspect is the discrepancy between the achieved scores on the validation and test sets.
Without using additional data extraction, our top-performing single model achieved a macro-average F1-score on the validation set that was 10.1 percentage points higher than that of our best ensemble model on the test set. In contrast, the gap for the power task was just 3.6 percentage points. We attribute the majority of this gap to two factors: the difference in the distribution of the amount of parliamentary data and the nature of the test sets' construction. Since all parliaments have the same amount of data in the test sets whilst having greatly differing amounts of data in the training and validation sets, it is not surprising that the achieved scores differ. In the likely case that the models perform better on parliaments with more data, the test set represents an increased presence of the harder-to-predict parliaments and a decreased presence of the easier ones. Regardless of this factor, the difference in distribution itself likely also introduces a challenging condition. In other words, the drift in label and parliament distributions is likely detrimental, even if the nature of that drift was not suspected to be particularly damaging. This is because the model likely to some extent relies on statistical trends, such as favouring the more common label when a speech is ambiguous. The nature of the test sets' construction may also account for the difference in discrepancy between our two tasks. The power test set largely contains the same speakers as our training data, just with a different power label. Given that we saw a relatively small decrease in performance, it seems that our models have been able to avoid over-fitting to a specific speaker's power label. This was likely aided by speakers exhibiting multiple power labels in the training data. This behaviour is not shared with the orientation task.
Since the orientation test set largely consists of speakers who do not appear in the training data, the gap in performance may indicate that our models have over-fit to specific speakers. An additional factor is that the new speakers may cover topics our models have previously not encountered. If a speaker mainly advances a political idea whose ideological connotations the model has not previously learnt, then the classification likely becomes much more challenging. It would therefore be interesting to investigate whether these previously unencountered speakers are contemporary with, and cover the same topics as, the speakers in the training set.

Weak ideological signals Some parliamentary speeches might not indicate strong political beliefs. They could solely cover practical proceedings, not expressing any opinions or making any arguments. These types of speeches are likely more challenging to classify, especially if the provided dataset does not exhibit clear rhetorical or linguistic differences for labels within a given class. This likely introduces an upper limit on the performance of any model on this task with similar data.

Limitations As previously discussed, our limited access to computational resources determined which methods and models we could examine. This prevented us from examining large models such as the 70-billion-parameter version of Llama-3 or even the 7-billion-parameter version of Gemma. Not only were we limited in the selection of models, but also in the number of experiments we could perform. More time and computational resources would have allowed us to attempt more techniques and to search further for optimal hyperparameters. Techniques that could not be examined include using different learning rates for different layers and balancing the loss function. We were also limited by the data we had access to. In real life, these speeches are not stand-alone but most often parts of exchanges and debates.
The problem of weak ideological signals could likely be mitigated by considering all the speeches a speaker makes in an exchange together. By representing speeches as parts of a larger debate, a model could base its prediction not only on all of the speaker's speeches, but also on the speeches made by the other participants in the debate.

8 Ethics and Risk Analysis

In creating tools for political identification, there is potential for misuse. Methods for automatically identifying the ideologies of individuals based on their speech could become a tool for political targeting in the wrong hands. However, actors that have the resources and means to spy on people to such an extent as to require tools like these most likely already possess them. By using open-source models and being transparent about the capabilities of modern machine learning, we aim to raise awareness of what is possible with modern techniques. The names of the speakers in the dataset are encoded, which means that it is challenging to find information about a specific individual from the dataset. However, parliamentary debates are usually public information, which means that our dataset is unlikely to leak any sensitive information anyway. Furthermore, the original dataset developed by ParlaMint [64] is a collaboration with the governments and is intended for projects such as this one. Even though transparency and explainability may be limited in our model, this limitation is acceptable for our specific context. The model is intended for research purposes and as a submission to a competition, not directly for real-world application. In this competition-focused scenario, the reduced transparency and explainability are deemed acceptable and do not pose a significant issue. As previously stated though, the capabilities of the presented models and method could also have real-world use and provide useful information to citizens.
In this case, transparency and explainability are essential, as providing a false or misleading sentiment might instead spread more confusion than before. This could then lead to citizens drawing conclusions based on faulty premises. We try to mitigate this risk by clearly outlining the data on which our models were trained as well as the methods that were used to train them.

9 Conclusion

In this study, we proposed a method for improved fine-tuning of LLMs for ideology and power identification. Our research questions were as stated below.

RQ1 Investigate what the best methods and practices are for identifying the political orientation in a parliamentary speech.

RQ2 Investigate what the best methods and practices are for identifying whether a parliamentary speech is made by a speaker in opposition or in power.

In answering our research questions, we found that modern LLMs are an effective approach for identifying both ideology and power in parliamentary debates. We further found that ideology and power likely share useful features and that fine-tuning to predict them jointly therefore yields improved performance for both tasks. This improvement also extends to fine-tuning to predict synthetic labels, in our case polarity. We also note that performance can be improved by making the context of the speech, as available through meta data, available to the models. Furthermore, back-translation can be utilised to boost performance on countries with a smaller presence in a given dataset. Finally, we found that English models predicting on automatic translations tend to outperform multilingual models predicting on the original languages, but that an ensemble of both types of models is the best approach. Our approach obtained the number one ranking for the task of Ideology and Power Identification in Parliamentary Debates as part of the Touché lab at CLEF 2024 [1].

Bibliography

[1] J. Kiesel, Ç. Çöltekin, M. Heinrich, M. Fröbe, M.
Alshomary, B. De Longueville, T. Erjavec, N. Handke, M. Kopp, N. Ljubešić, K. Meden, N. Mirzakhmedova, V. Morkevičius, T. Reitis-Münstermann, M. Scharfbillig, N. Stefanovitch, H. Wachsmuth, M. Potthast, and B. Stein, “Overview of Touché 2024: Argumentation Systems,” in Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), ser. Lecture Notes in Computer Science, L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, and N. Ferro, Eds. Berlin Heidelberg New York: Springer, Sep.

[2] U. N. Group, “Exist: sexism identification in social networks,” 2021, accessed on January 28, 2024. [Online]. Available: http://nlp.uned.es/exist2021/

[3] S. Coleman, “Meaningful political debate in the age of the soundbite,” in Televised election debates: International perspectives. Springer, 2000, pp. 1–24.

[4] H. Wasmuth and E. Nitecki, “(Un)intended consequences in current ECEC policies: Revealing and examining hidden agendas,” Policy Futures in Education, vol. 18, no. 6, pp. 686–699, 2020.

[5] M. J. Hinich and M. C. Munger, Analytical Politics. Cambridge University Press, 1997.

[6] Pew Research Center, “Americans’ dismal views of the nation’s politics,” https://www.pewresearch.org/politics/2023/09/19/americans-dismal-views-of-the-nations-politics/, 2023, accessed on November 29, 2023.

[7] M. K. David, “Language, power and manipulation: The use of rhetoric in maintaining political influence,” Frontiers of Language and Teaching, vol. 5, no. 1, pp. 164–170, 2014.

[8] Ö. Sahin and Ö. Sahin, “A gentle introduction to ML and NLP,” Develop Intelligent iOS Apps with Swift: Understand Texts, Classify Sentiments, and Autodetect Answers in Text Using NLP, pp. 1–15, 2021.
[9] W. Medhat, A. Hassan, and H. Korashy, “Sentiment analysis algorithms and applications: A survey,” Ain Shams Engineering Journal, vol. 5, no. 4, pp. 1093–1113, 2014.

[10] P. Törnberg, “Chatgpt-4 outperforms experts and crowd workers in annotating political twitter messages with zero-shot learning,” arXiv preprint arXiv:2304.06588, 2023.

[11] Université Grenoble Alpes, “CLEF 2024 - conference and labs of the evaluation forum,” https://clef2024.imag.fr/, 2024, accessed on January 17, 2024.

[12] Webis Group, “Touche,” https://touche.webis.de/, accessed on January 17, 2024.

[13] ——, “Ideology and Power Identification in Parliamentary Debates 2024,” https://touche.webis.de/clef24/touche24-web/ideology-and-power-identification-in-parliamentary-debates.html, accessed on January 17, 2024.

[14] L. Derczynski, “Complementarity, F-score, and NLP Evaluation,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 2016, pp. 261–266.

[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.

[16] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.

[17] T. Pires, E. Schlinger, and D. Garrette, “How multilingual is multilingual bert?” arXiv preprint arXiv:1906.01502, 2019.

[18] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.

[19] H. Touvron, L. Martin, K. Stone, P. Albert, A.
Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.

[20] P. He, X. Liu, J. Gao, and W. Chen, “Deberta: Decoding-enhanced bert with disentangled attention,” arXiv preprint arXiv:2006.03654, 2020.

[21] P. He, J. Gao, and W. Chen, “Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing,” arXiv preprint arXiv:2111.09543, 2021.

[22] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.

[23] S. Behdenna, F. Barigou, and G. Belalem, “Document level sentiment analysis: a survey,” EAI Endorsed Transactions on Context-aware Systems and Applications, vol. 4, no. 13, pp. e2–e2, 2018.

[24] A. Meena and T. V. Prabhakar, “Sentence level sentiment analysis in the presence of conjuncts using linguistic analysis,” in Advances in Information Retrieval: 29th European Conference on IR Research, ECIR 2007, Rome, Italy, April 2-5, 2007. Proceedings 29. Springer, 2007, pp. 573–580.

[25] K. Schouten and F. Frasincar, “Survey on aspect-level sentiment analysis,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 3, pp. 813–830, 2015.

[26] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” arXiv preprint arXiv:2305.14314, 2023.

[27] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
[28] Hugging Face, “Quantization,” https://huggingface.co/docs/optimum/concept_guides/quantization, accessed on January 19, 2024.

[29] D. Kalamkar, D. Mudigere, N. Mellempudi, D. Das, K. Banerjee, S. Avancha, D. T. Vooturi, N. Jammalamadaka, J. Huang, H. Yuen et al., “A study of bfloat16 for deep learning training,” arXiv preprint arXiv:1905.12322, 2019.

[30] M. A. Ganaie, M. Hu, A. K. Malik, M. Tanveer, and P. N. Suganthan, “Ensemble deep learning: A review,” Engineering Applications of Artificial Intelligence, vol. 115, p. 105151, 2022.

[31] A. F. M. de Paula, G. Rizzi, E. Fersini, and D. Spina, “Ai-upv at exist 2023–sexism characterization using large language models under the learning with disagreements regime,” arXiv preprint arXiv:2307.03385, 2023.

[32] O. Rubin, J. Herzig, and J. Berant, “Learning to retrieve prompts for in-context learning,” arXiv preprint arXiv:2112.08633, 2021.

[33] C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen et al., “In-context learning and induction heads,” arXiv preprint arXiv:2209.11895, 2022.

[34] M. Bosley, M. Jacobs-Harukawa, H. Licht, and A. Hoyle, “Do we still need bert in the age of gpt? comparing the benefits of domain-adaptation and in-context-learning approaches to using llms for political science research,” 2023.

[35] H. Yu, Z. Yang, K. Pelrine, J. F. Godbout, and R. Rabbany, “Open, closed, or small language models for text classification?” arXiv preprint arXiv:2308.10092, 2023.

[36] Y. Chae and T. Davidson, “Large language models for text classification: From zero-shot learning to fine-tuning,” Open Science Foundation, 2023.

[37] W. Zhang, Y. Deng, B. Liu, S. J. Pan, and L. Bing, “Sentiment analysis in the era of large language models: A reality check,” arXiv preprint arXiv:2305.15005, 2023.

[38] M.
Abdullah, O. Altiti, and R. Obiedat, “Detecting propaganda techniques in english news articles using pre-trained transformers,” in 2022 13th International Conference on Information and Communication Systems (ICICS). IEEE, 2022, pp. 301–308.

[39] K. Sprenkamp, D. G. Jones, and L. Zavolokina, “Large language models for propaganda detection,” arXiv preprint arXiv:2310.06422, 2023.

[40] A. Benayas, M. A. Sicilia, and M. Mora-Cantallops, “A comparative analysis of encoder only and decoder only models in intent classification and sentiment analysis: Navigating the trade-offs in model size and performance,” 2024.

[41] C. Sung, T. Dhamecha, S. Saha, T. Ma, V. Reddy, and R. Arora, “Pre-training bert on domain resources for short answer grading,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 6071–6075.

[42] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon, “Domain-specific language model pretraining for biomedical natural language processing,” ACM Transactions on Computing for Healthcare (HEALTH), vol. 3, no. 1, pp. 1–23, 2021.

[43] Smartling, “What is back translation and why is it important?” 2023. [Online]. Available: https://www.smartling.com/resources/101/what-is-back-translation-and-why-is-it-important/

[44] S. Edunov, M. Ott, M. Ranzato, and M. Auli, “On the evaluation of machine translation systems trained with back-translation,” arXiv preprint arXiv:1908.05204, 2019.

[45] D. R. Beddiar, M. S. Jahan, and M. Oussalah, “Data expansion using back translation and paraphrasing for hate speech detection,” Online Social Networks and Media, vol. 24, p. 100153, 2021.

[46] S. Shleifer, “Low resource text classification with ulmfit and backtranslation,” arXiv preprint arXiv:1903.09244, 2019.
[47] Y. Yu, Z. Zhou, X. Zheng, J. Gou, W. Ou, and F. Yuan, “Enhancing label correlations in multi-label classification through global-local label specific feature learning to fill missing labels,” Computers and Electrical Engineering, vol. 113, p. 109037, 2024.

[48] Y. Zhang, Y. Cheng, X. Huang, F. Wen, R. Feng, Y. Li, and Y. Guo, “Simple and robust loss design for multi-label learning with missing labels,” arXiv preprint arXiv:2112.07368, 2021.

[49] X. Zhang, R. Abdelfattah, Y. Song, and X. Wang, “An effective approach for multi-label classification with missing labels,” in 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys). IEEE, 2022, pp. 1713–1720.

[50] Z. Ma and S. Chen, “Expand globally, shrink locally: Discriminant multi-label learning with missing labels,” Pattern Recognition, vol. 111, p. 107675, 2021.

[51] B. Wu, Z. Liu, S. Wang, B.-G. Hu, and Q. Ji, “Multi-label learning with missing labels,” in 2014 22nd International Conference on Pattern Recognition. IEEE, 2014, pp. 1964–1968.

[52] Q. Jia, J. Cui, Y. Xiao, C. Liu, P. Rashid, and E. F. Gehringer, “All-in-one: Multi-task learning bert models for evaluating peer assessments,” arXiv preprint arXiv:2110.03895, 2021.

[53] S. O’Hagan and A. Schein, “Measurement in the age of llms: An application to ideological scaling,” arXiv preprint arXiv:2312.09203, 2023.

[54] S. Santurkar, E. Durmus, F. Ladhak, C. Lee, P. Liang, and T. Hashimoto, “Whose opinions do language models reflect?” arXiv preprint arXiv:2303.17548, 2023.

[55] F. Motoki, V. Pinho Neto, and V.
Rodrigues, “More human than human: Measuring chatgpt political bias,” Available at SSRN 4372349, 2023.

[56] J. L. Martin, “The ethico-political universe of chatgpt,” Journal of Social Computing, vol. 4, no. 1, pp. 1–11, 2023.

[57] M. Thomas, B. Pang, and L. Lee, “Get out the vote: Determining support or opposition from congressional floor-debate transcripts,” arXiv preprint cs/0607062, 2006.

[58] R. Malouf and T. Mullen, “Graph-based user classification for informal online political discourse,” in Proceedings of the 1st Workshop on Information Credibility on the Web, 2007.

[59] M. Laver, K. Benoit, and J. Garry, “Extracting policy positions from political texts using words as data,” American Political Science Review, vol. 97, no. 2, pp. 311–331, 2003.

[60] D. Diermeier, J.-F. Godbout, B. Yu, and S. Kaufmann, “Language and ideology in congress,” British Journal of Political Science, vol. 42, no. 1, pp. 31–55, 2012.

[61] Y. Riabinin, “Computational identification of ideology in text: A study of canadian parliamentary debates,” MSc paper, Department of Computer Science, University of Toronto, 2009.

[62] J. Wäckerle, “Data set description for chapter 9: The parliamentary speech dataset (parlspeech) dataset,” 2020.

[63] Ç. Çöltekin, M. Kopp, V. Morkevičius, N. Ljubešić, K. Meden, and T. Erjavec, “Training data for the shared task ideology and power identification in parliamentary debates,” https://doi.org/10.5281/zenodo.10450641, 2024.

[64] CLARIN ERIC, “Parlamint: Harmonised parliamentary corpora,” 2021, accessed on November 24, 2023. [Online]. Available: https://www.clarin.eu/parlamint

[65] K. K. Dobbin and R. M. Simon, “Optimally splitting cases for training and testing high dimensional classifiers,” BMC Medical Genomics, vol. 4, no. 1, pp. 1–8, 2011.

[66] Q. H. Nguyen, H.-B. Ly, L. S. Ho, N. Al-Ansari, H. V. Le, V. Q. Tran, I. Prakash, and B. T.
Pham, “Influence of data splitting on performance of machine learning models in prediction of shear strength of soil,” Mathematical Problems in Engineering, vol. 2021, pp. 1–15, 2021.

[67] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., “Mistral 7b,” arXiv preprint arXiv:2310.06825, 2023.

[68] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” arXiv preprint arXiv:1911.02116, 2019.

[69] G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love et al., “Gemma: Open models based on gemini research and technology,” arXiv preprint arXiv:2403.08295, 2024.

A List of countries in the dataset

• Austria (at)
• Bosnia and Herzegovina (ba)
• Belgium (be)
• Czechia (cz)
• Denmark (dk)
• Estonia (ee) [only political orientation]
• Spain (es)
• Catalonia (es-ct)
• Galicia (es-ga)
• Basque Country (es-pv) [only power]
• Finland (fi)
• France (fr)
• Great Britain (gb)
• Greece (gr)
• Croatia (hr)
• Hungary (hu)
• Iceland (is) [only political orientation]
• Italy (it)
• Latvia (lv)
• The Netherlands (nl)
• Norway (no) [only political orientation]
• Poland (pl)
• Portugal (pt)
• Serbia (rs)
• Sweden (se) [only political orientation]
• Slovenia (si)
• Turkey (tr)
• Ukraine (ua)

B Polarity base prompt

Label the polarity of the following text, similarly to the provided examples. Your answer needs to start with “positive”, “negative” or “neutral”, followed by a short justification for your answer. It is important that you only assign a positive or negative label if you are sure of your answer. Here is your first example.
Text: The south-west was cut off from the UK last winter and Network Rail performed miracles in getting that line back up and running. I therefore find it extraordinary that reasons such as the weather have been used to excuse the chaos and incompetence of this debacle, particularly out of King’s Cross. Why did the Secretary of State feel that it was not necessary for Ministers to ask for a basic reassurance that an overrun on any of the big programmes could be managed? Why were contingency plans not in place, and why was the rail regulator warning not adhered to?

Negative. The text expresses frustration and criticism towards the handling of infrastructure issues, particularly the failure to address problems with the rail system despite previous incidents. It highlights perceived incompetence and lack of planning, suggesting a negative sentiment towards the situation.

Here is your second example.

Text: We are committed to ensuring that claimants receive high-quality, objective, fair and accurate assessments. The Department monitors assessment quality through independent audit. Assessments deemed unacceptable are returned to the provider for reworking. A range of measures, including provider improvement plans, address performance falling below expected standards.

I do agree with the hon. Lady, which is why we have been trying to work more strategically with Motability, thrashing through the issues I am very aware of on appeals and on matters such as when an individual leaves the country. We are looking to reduce the amount of time that appeals take and at what we can do with the running of the scheme so that the precise scenario she outlines does not happen.

Neutral. The text describes the commitment to ensuring quality assessments for claimants and outlines measures taken to monitor and address assessment quality. Additionally, it mentions efforts to work with Motability to improve processes and reduce appeal times. The tone is informative and focused on addressing issues, without expressing overt positivity or negativity.

Here is your third example.

Text: I congratulate the hon. Gentleman on bringing this much needed debate to the Floor of the House. Will he join me in paying tribute to local MND associations across the United Kingdom for the invaluable support they provide? I know of the excellent work of my local Leicestershire and Rutland association, having heard at first hand from a constituent and friend of mine, Ruth Morrison, about her tragic personal experience. The support that is availabl