Ideology and Power Identification in Parliamentary Debates

Master's thesis in Data Science and AI

Johan Jiremalm
Oscar Palmqvist

Department of Computer Science and Engineering
Division of Data Science and AI
Chalmers University of Technology and University of Gothenburg
Gothenburg, Sweden 2024
www.chalmers.se

Supervisor: Pablo Picazo-Sanchez, Computer Science and Engineering
Examiner: Moa Johansson, Computer Science and Engineering

Master's Thesis 2024
Department of Computer Science and Engineering
Division of Data Science and AI
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria

Abstract

Political debates are vital in shaping public opinion and influencing policy decisions. However, understanding the complex linguistic structures used by politicians, in order to ascertain their orientations and power dynamics, can be challenging. In this paper we explore Natural Language Processing techniques for identifying political orientation and power structures in parliamentary debates. We introduce a Located Missing Labels loss in order to train jointly to predict both power and ideology. Furthermore, our proposed method also trains to predict a third, synthetically generated polarity label. Finally, we combine this training method with pre-processing steps including back-translation and meta data inclusion.
Our results show that our method improves upon conventional fine-tuning. We take part in the Touché competition as part of CLEF 2024 and find that our method achieves the highest performance of all participants [1].

Keywords: Political Classification, NLP, LLM, MLML

Acknowledgments

We would like to acknowledge our supervisor, Pablo Picazo-Sanchez, for his continuous guidance, feedback, and assistance in this project. Furthermore, we want to express our gratitude for the computational resources provided by the Data Science and AI division at Chalmers University of Technology and University of Gothenburg. Finally, we would like to thank our examiner Moa Johansson for providing feedback on early and intermediary versions of this paper.

List of Acronyms

AI Artificial Intelligence
BERT Bidirectional Encoder Representations from Transformers
CE Cross Entropy
CLEF Conference and Labs of the Evaluation Forum
CNN Convolutional Neural Network
DeBERTa Decoding-Enhanced BERT with Disentangled Attention
LLM Large Language Model
LML Located Missing Labels
LoRA Low-Rank Adaptation
LR Logistic Regression
mBERT Multilingual BERT
MLM Masked Language Model
MLML Multi-label Learning with Missing Labels
NER Named-Entity Recognition
NLP Natural Language Processing
NSP Next Sentence Prediction
RF Random Forest
RNN Recurrent Neural Network
SA Sentiment Analysis
SVM Support Vector Machine

Contents

List of Acronyms ix
List of Figures xiii
List of Tables xv
1 Introduction 1
2 Background 3
2.1 Transformers 3
2.2 Modern LLMs 3
2.3 Sentiment Analysis 4
2.4 Efficient fine-tuning of large models 4
2.5 Ensemble Learning 5
2.6 In-context learning
5
3 Related Work 7
3.1 Model and training method 7
3.2 Domain-specific pre-training 8
3.3 Back-translation 8
3.4 Multi-label learning with missing labels 8
3.5 Political identification in NLP 9
3.6 Performance of models in similar contemporary competitions 10
4 Dataset 13
5 Method 19
5.1 Dataset and Preprocessing 19
5.1.1 Back-translation 20
5.1.2 Meta data inclusion 20
5.2 Combined training 20
5.2.1 Located missing labels loss function 21
5.3 Polarity label extension 21
5.4 Models 22
5.5 Hyperparameters 23
5.6 Ensemble modelling 24
5.7 Additional data extraction for test set 24
6 Results 27
6.1 Method components 27
6.2 Models 30
6.3 Translated vs. multilingual 31
6.4 Ensemble modelling 31
6.5 Test set results 31
7 Discussion 33
8 Ethics and Risk Analysis 37
9 Conclusion 39
A List of countries in the dataset I
B Polarity base prompt III
C Hyperparameters V
C.1 BERT method components
V
C.2 Model hyperparameters V
D Figures and illustrations VII
D.1 Multilingual model performance figures VII
D.2 Back-translation impact on specific parliaments VIII
D.3 Power dataset illustrations VIII

List of Figures

4.1 Distribution of the Opposition, Power, Left and Right-wing labels for each country in the orientation dataset. 14
4.2 Mean and standard deviation for text lengths per country in the orientation dataset. 15
4.3 Density distribution of text lengths for the 3 countries with the highest and lowest average text lengths in the orientation dataset. 15
4.4 Venn diagram where the red region represents the number of speakers that only appear in the orientation dataset (Orientation \ Power). The green region represents the number of speakers that only appear in the power dataset (Power \ Orientation). The centre region represents the number of speakers that appear in both datasets, in other words the intersection (Orientation ∩ Power). 16
5.1 Overarching method illustration. The method is divided into a data preprocessing stage and a model training and prediction stage. The test set is inserted as input, but it could potentially be any speech. 19
5.2 Visualised process for extracting the speaker for texts in the test data for the orientation task. The bottom flow chart then shows how we take the average prediction for all texts by speaker sp1. Note that we illustrate the predictions as ranging from 0 to 1 for simplicity whilst in actuality the logits were used. 25
6.1 Comparison of base BERT with different combinations of the components of our method over the training epochs.
CT stands for combined training, BT for back-translation, PLE for polarity label extension and MDI for meta data inclusion. 28
6.2 Comparison of RoBERTa trained for political orientation classification using conventional fine-tuning and our method. 30
D.1 Comparison of BERT and RoBERTa fine-tuned on the English translations vs. mBERT and XLM-RoBERTa fine-tuned on the original texts. VII
D.2 Difference in macro-average F1-score comparing combined-training BERT with and without back-translation on the orientation task. The countries are ordered by the number of speeches before back-translation, from left to right in increasing order. VIII
D.3 Mean and standard deviation for text lengths per country in the power dataset. IX
D.4 Density distribution of text lengths for the 3 countries with the highest and lowest average text lengths in the power dataset. IX

List of Tables

2.1 Example of zero- and one-shot classification. In zero-shot, the model is not given an exemplary stance classification before being asked to provide its own. The prompt column contains the prompts for the model and the output column, in blue, represents the next word as predicted by the model. 6
3.1 Results and techniques used in the EXIST 2021 competition [2] sorted in order of average ranking. RF stands for Random Forest and LR for Logistic Regression. 11
4.1 Distribution of the opposition and power labels in the power dataset as well as Left and Right labels in the orientation dataset. 14
6.1 Baseline macro-average F1-scores on our validation set for Orientation and Power tasks from the provided linear logistic regression model.
27
6.2 BERT component study showing how combinations of components in our method impacted the macro-average F1-score for the two tasks compared to conventional fine-tuning. CT stands for combined training, BT for back-translation, PLE for polarity label extension and MDI for meta data inclusion. 28
6.3 Highest attained macro-average F1-scores of our examined models. 30
6.4 Macro-average F1-scores of DeBERTa when evaluated using different sequence lengths. 31
6.5 Macro-average F1-scores on our base validation set of base models and the ensemble of them. 31
6.6 Macro-average F1-scores on test set of our final ensemble and the provided baseline model. 32
C.1 Non-default hyperparameter values for the BERT method components experiment. V
C.2 Best performing examined non-default hyperparameter values for various models. VI
C.3 Best performing examined non-default hyperparameter values for Gemma. VI

1 Introduction

Parliamentary debates play a vital role in political communication and society [3]. During these debates, representatives from diverse parties and ideologies share their opinions, arguments, and stances on issues impacting society. Making debates more accessible and easier to follow not only serves to inform people but also offers a basis for seeking further information and engaging in the democratic process [3]. The ability to detect and classify political motives from speech may also be utilised when the speaker is not forthcoming with their political agenda. Detecting hidden political motives in media such as news reporting and advertisements may benefit society by providing transparency [4].
The complex nature of politics makes these debates challenging to understand [5]. According to a recent survey, 65% of Americans say they always or often feel exhausted when thinking about politics [6]. In addition, 42% of adults in the U.S. reported having watched none, or very few, of the presidential debates in 2020. Moreover, analysing political speeches and making classifications can be challenging due to complex rhetorical strategies such as metaphors, parallelism, and suggesting answers [7]. Political context and the speaker's background also influence how messages are conveyed and interpreted.

The challenge of analysing and classifying political speech may be approached from the perspective of Natural Language Processing (NLP). NLP is a field in artificial intelligence that focuses on analysing, understanding, and processing natural language data using computers [8]. Sub-tasks within NLP include, among others, text summarisation, machine translation, and sentiment analysis. More recently, the field of NLP has surged in popularity with the development of chatbots such as ChatGPT. Besides massive Large Language Models (LLMs) such as GPT-4, there are also various other approaches to NLP, such as rule-based and probabilistic approaches [9]. The impressive performance of these LLMs can be applied to the complex political realm with great success [10].

The Conference and Labs of the Evaluation Forum (CLEF) [11] hosts an open competition in 2024 called Touché as part of one of its so-called labs [12]. The task, Ideology and Power Identification in Parliamentary Debates, is one of the four competitions that are part of Touché's presence at CLEF 2024 [13]. This document is an entry to that competition. It will, therefore, investigate which NLP tools are best suited for identifying ideology and power in parliamentary debate speeches.

Research goals.
The research goals¹ for this project are inspired by, and correspond to, the two sub-tasks of the Touché competition Ideology and Power Identification in Parliamentary Debates mentioned above [13].

RQ1 Investigate what the best methods and practices are for identifying the political orientation in a parliamentary speech.

RQ2 Investigate what the best methods and practices are for identifying whether a parliamentary speech is made by a speaker in opposition or in power.

Evaluation. The results of both research questions will be evaluated against a test set provided by Touché using a macro-averaged F1-score [14], as it is the performance metric of the Touché competition [13].

Paper structure. The rest of this document is organised as follows. In Chapter 2, we explore the foundational aspects of modern NLP. Chapter 3 investigates a narrower selection of related work that is closely connected to our problem and that influenced our method. We outline and describe the datasets in Chapter 4. We present our full proposed method in Chapter 5 and share the results of its application in Chapter 6. We discuss and explain these findings in Chapter 7. In Chapter 8 we provide a short ethical disclaimer and risk analysis before concluding the document in Chapter 9.

¹ Also referred to as sub-tasks.

2 Background

In the following, we explore the foundational aspects of NLP, discussing its evolution, common practices, and prevalent tools and models.

2.1 Transformers

In 2017, the field of Artificial Intelligence (AI) changed significantly with the introduction of transformers [15]. This innovation improved on previous methods like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) by offering parallelised training and an enhanced capability to capture long-range relationships.
The core of the transformer relies on attention, a function that gauges the relevance or similarity among different elements within a sequence of, for example, words, tokens, or pixels. The typical transformer architecture consists of an encoder and a decoder. However, variations exist where structures may include only encoders or only decoders.

2.2 Modern LLMs

One widely adopted pre-trained transformer model is Bidirectional Encoder Representations from Transformers (BERT), an encoder-only network featuring 12 transformer layers and approximately 110 million parameters [16]. The bidirectional encoder ensures that the model is able to consider the preceding and following text from a certain point simultaneously. This is utilised in the pre-training, which uses a Masked Language Model (MLM) objective. Concretely, this means that the model is trained to predict randomly masked words from a corpus. BERT is then fine-tuned for specific downstream tasks such as question answering and sentiment analysis. The context length of BERT, that is, the maximum number of tokens it can process at once, is 512 tokens.

Multilingual BERT (mBERT) shares the same architecture as BERT but is pre-trained on a multilingual corpus covering the 104 highest-resource languages on Wikipedia, in contrast to BERT, which is primarily trained on English data [17]. The tokenisation of mBERT is also designed to better represent multilingual data.

RoBERTa is a re-implementation of BERT [18]. Some of the main differences between BERT and RoBERTa concern the pre-training procedure: RoBERTa, for instance, uses dynamic masking and is trained on full sentences without the Next Sentence Prediction (NSP) loss. RoBERTa is also trained with larger batches and for a longer duration than BERT. It has the same context length of 512 tokens.

Llama-2, developed by Meta, uses the decoder from the traditional transformer architecture [19]. This is in contrast to BERT, which only uses the encoder.
There are multiple Llama-2 models, with sizes ranging from 7B to 70B parameters. They are all pre-trained on 2 trillion tokens from publicly available multilingual sources. The 34B and 70B models use Grouped-Query Attention during pre-training to improve inference scalability. Llama-2 has a context length of 4096 tokens, which is 8 times larger than the context length of BERT and RoBERTa.

Decoding-Enhanced BERT with Disentangled Attention (DeBERTa) is another LLM that improves on BERT and RoBERTa [20]. This is mainly achieved through the introduction of the disentangled attention mechanism and an enhanced mask decoder. There are multiple versions of DeBERTa, one of which is DeBERTaV3, which improves the efficiency of previous DeBERTa versions [21]. It does this by using ELECTRA-style pre-training with Gradient-Disentangled Embedding Sharing. There are also different variants of DeBERTaV3, one of which is DeBERTaV3-large, with 304M backbone parameters.

GPT-4, developed by OpenAI, uses a decoder-only architecture, unlike the BERT family [22]. Moreover, the model is multimodal, which means that it can process and produce both text and images. The parameters and weights are not publicly available, which makes it infeasible to use GPT-4 for this project.

2.3 Sentiment Analysis

The two sub-tasks covered in this project are similar to textual binary classification tasks, drawing parallels to a well-explored domain in NLP: Sentiment Analysis (SA). The task of SA is to identify the sentiment expressed in a text and then classify it. This task is generally split into three classification levels: document-level [23], sentence-level [24], and aspect-level [25]. There are many techniques for SA, such as lexicon-based approaches and machine learning approaches [9]. The sentiment of a text is also referred to as its polarity.
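To illustrate the lexicon-based approaches mentioned above, a minimal polarity scorer can be sketched as follows. The word lists here are toy examples, not a real sentiment lexicon:

```python
# Minimal lexicon-based polarity scorer (toy lexicon, for illustration only).
POSITIVE = {"congratulate", "good", "great", "support", "welcome"}
NEGATIVE = {"bad", "crisis", "cut", "oppose", "fail"}

def polarity(text: str) -> str:
    """Classify a text as Positive, Negative, or Neutral by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(polarity("I congratulate the hon. Gentleman on his great work"))  # Positive
```

Real lexicon-based systems additionally handle negation, intensifiers, and weighted lexicon entries, but the counting principle is the same.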
2.4 Efficient fine-tuning of large models

Full fine-tuning of pre-trained LLMs with billions of parameters is expensive and not feasible on most local machines [26]. To combat this, various techniques have been developed to reduce both the memory and the computing power required to fine-tune LLMs.

Low-Rank Adaptation (LoRA) is a technique where, instead of retraining all of the model's pre-trained weights, the weights are frozen and a trainable rank-decomposition matrix is injected into each layer of the transformer architecture [27]. This can significantly reduce both the number of trainable parameters and the memory requirements. LoRA is based upon the low-intrinsic-rank hypothesis, specifically that the matrix of weight updates can be effectively represented by a lower-dimensional representation. In essence, a small number of broad changes to the weights can represent the necessary updates instead of updating each weight individually.

Quantisation refers to the practice of reducing the computational cost and memory size of a model by representing weights using a smaller, less precise data type, such as a 16-bit float or an 8-bit integer [28]. Studies have also shown that specialised data types such as 16-bit brain floating point (bfloat16) [29] and 4-bit normal float (nf4) [26] yield increased performance for machine learning applications over regular floating-point types of equal or even greater size. QLoRA combines quantisation (Q) with LoRA, along with other innovations [26].

2.5 Ensemble Learning

Ensemble learning involves combining multiple individual models to obtain better generalisation performance [30]. There are a few different strategies for combining the models, including stacking, boosting, and bagging. The latter involves training multiple models independently on different subsets of the training data.
The outputs of the models are then usually combined via majority voting or by averaging the outputs. An example of a bagging ensemble model is the random forest classifier, which averages the predictions of multiple decision-tree classifiers. Ensemble modelling has previously shown great results in classification-task competitions [31].

2.6 In-context learning

In-context learning is the technique of having a language model predict based on a textual input consisting of any number of labelled and unlabelled examples, without any update to the model itself [32]. In-context learning is useful since it allows a model to be adapted to many different tasks without having to perform the oftentimes expensive process of retraining the model [33]. An example of zero-shot and one-shot classification, based upon in-context learning and later utilised in Section 5.3, is provided in Table 2.1.

Zero-shot
Prompt: Label the polarity of the following text.
        Text: The south-west was cut off from the UK [...]
        Polarity:
Output: Negative

One-shot
Prompt: Label the polarity of the following text.
        Text: I congratulate the hon. Gentleman on [...]
        Polarity: Positive
        Label the polarity of the following text.
        Text: The south-west was cut off from the UK [...]
        Polarity:
Output: Negative

Table 2.1: Example of zero- and one-shot classification. In zero-shot, the model is not given an exemplary stance classification before being asked to provide its own. The prompt column contains the prompts for the model and the output column, in blue, represents the next word as predicted by the model.

3 Related Work

Here, we discuss previously explored techniques used for political classification within NLP and their relevance to our project. We also examine a range of studies showcasing the performance of models such as RoBERTa and BERT in tasks such as multilingual political orientation classification and stance classification of political tweets.
Additionally, we explore domain-specific pre-training, back-translation, multi-label learning with missing labels, and other methodologies relevant to our research objectives.

3.1 Model and training method

Fine-tuned RoBERTa has demonstrated significant superiority over zero- and few-shot GPT-3 for multilingual political orientation classification [34]. For the task of implicit ideology prediction, fine-tuned RoBERTa has also been shown to outperform GPT-4, Llama-2-13B and Llama-2-70B using in-context learning, as well as Llama-2-70B using LoRA fine-tuning [35]. It is noteworthy that, in the same study, the only model which beat fine-tuned RoBERTa, for Named-Entity Recognition (NER), was the LoRA fine-tuned Llama-2-70B.

It has been shown that, even when only fine-tuning a classification head, BERT approaches the performance of few-shot GPT-3 models for stance classification of political tweets [36]. In a similar manner, a comparison of the performance of "Small Language Models" and modern LLMs for sentiment analysis tasks has been performed [37]. The study compares T5-large (770M parameters) to Flan-T5, Flan-UL2, text-davinci-003 and gpt-3.5-turbo (11B, 20B, 175B and undisclosed parameters, respectively) across 13 different sentiment analysis tasks and 26 datasets. For context, the base version of BERT has 110M parameters whilst the large version totals 340M parameters. The T5 model was trained on the entire training dataset whilst the other LLMs utilised zero- or few-shot classification. The authors divide their sentiment analysis tasks into three categories and find that the smaller, fully trained T5 model achieves the best results in all categories, outperforming zero-, one-, five- and ten-shot classification with the larger models. They conclude that whilst the larger LLMs perform adequately in simpler tasks, they are outperformed in complex tasks which require structured sentiment information or deeper understanding.
RoBERTa has been shown to be state-of-the-art for propaganda classification [38]. The same RoBERTa model outperforms fine-tuned versions of GPT-3 and multiple in-context learning versions of GPT-4 [39], yielding a micro-average F1-score of 63.4% compared to 58.11% for the best GPT model (base GPT-4). New, large decoder-only models have also been compared to encoder-only models for the tasks of intent classification and sentiment analysis [40]. The study reveals that, in general, encoder-only models provide superior performance, at a fraction of the computational demand, for natural language understanding tasks.

3.2 Domain-specific pre-training

Domain-specific pre-training refers to the process of training a model on domain-specific texts before fine-tuning for a specific task within that domain [41]. It has shown great utility for domains with abundant unlabelled text, such as the biomedical field [42].

For the task of multilingual political orientation classification, however, it was recently shown that domain-specific pre-training does not greatly impact results [34]. Moreover, after a threshold of approximately 10,000 sentences, general-domain pre-training appears sufficient, with no further benefit from additional domain-specific pre-training.

3.3 Back-translation

Back-translation involves translating a text into another language and then translating the result back into the original language [43]. In our case this technique is used to create artificial data that is similar to the original data, as further explained in Section 5.1.1. Back-translation has shown widespread utility for machine translation tasks [44]. Furthermore, using back-translation to artificially extend datasets has shown promising results for hate-speech detection tasks [45]. Back-translation for classification tasks has also proven particularly useful when there is little training data [46].
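The augmentation idea behind back-translation can be sketched as follows. Here `translate` stands in for a real machine-translation model; the word-level dictionaries are purely illustrative, and the deliberately asymmetric reverse mapping mimics how a round trip through another language produces a paraphrase rather than an exact copy:

```python
# Back-translation augmentation sketch. A real system would call an MT model;
# the toy tables below only illustrate the round trip.
TOY_EN_DE = {"the": "das", "parliament": "Parlament", "speech": "Rede"}
TOY_DE_EN = {"das": "the", "Parlament": "parliament", "Rede": "address"}

def translate(text: str, table: dict) -> str:
    """Toy word-level 'translation': map each word through the table."""
    return " ".join(table.get(w, w) for w in text.split())

def back_translate(text: str) -> str:
    """Translate to a pivot language and back, yielding a paraphrased copy."""
    pivot = translate(text, TOY_EN_DE)
    return translate(pivot, TOY_DE_EN)

augmented = back_translate("the parliament speech")  # "the parliament address"
```

The augmented copy keeps the original label, so each back-translated text effectively doubles as an extra training example.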
3.4 Multi-label learning with missing labels

Multi-label Learning with Missing Labels (MLML) has shown great utility for image classification tasks [47]. In these tasks, a "missing" label most often refers to a false negative, and the challenge is to differentiate between true negatives and false negatives caused by incomplete or faulty annotation. Many methods have been proposed to handle these missing labels [48, 49, 50]. As will be shown in Chapter 5, our two sub-tasks can be combined into a single multi-label classification problem with located missing labels. Traditional MLML methods are, however, not suited for our task, since the locations of our missing labels are known rather than hidden as false negatives. A study on MLML for image and facial-expression classification from 2014 shares our definition of MLML, where missing labels are located, but its technique is not appropriate for our project since it is tailored for label spaces magnitudes larger than ours [51].

Similar approaches that combine tasks have proven beneficial. For the task of peer-assessment evaluation, multi-task learning BERT has been shown to outperform its single-task counterparts [52]. For the multi-task learning BERT, three separate classification heads were added to the same base BERT model, and the loss for fine-tuning was the sum of the Cross Entropy (CE) loss from each classification task.

3.5 Political identification in NLP

Many approaches for analysing political textual data have been previously proposed. For instance, it has been shown that GPT-4 exhibits a deep understanding of, and an ability to quantify, broad political terms such as "ideology" and "power" [53]. However, it has also been shown that the GPT family of models, specifically GPT-3 and GPT-4, exhibits a liberal-leaning political bias [54, 55, 56].
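The located-missing-labels setting introduced in Section 3.4, and developed further in Chapter 5, can be sketched as a per-label cross-entropy that is simply masked out wherever a label is known to be absent. This is a simplified sketch under our own assumptions (flat lists, binary labels), not the exact thesis implementation:

```python
import math

def masked_bce(logits, labels, mask):
    """Binary cross-entropy averaged only over observed labels.

    logits, labels, mask are per-label lists for one example; mask is 1 where
    the label is observed and 0 where it is known to be missing.
    """
    total, n = 0.0, 0
    for z, y, m in zip(logits, labels, mask):
        if m == 0:
            continue  # located missing label: contributes no gradient
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        n += 1
    return total / n

# One sample with the orientation label observed and the power label missing.
loss = masked_bce(logits=[2.0, -1.0], labels=[1.0, 0.0], mask=[1, 0])
```

Because missing labels are located (their positions are known), no false-negative disambiguation is needed; the mask does all the work.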
NLP has already been used to process political texts. A minimum-cut classification framework combining Support Vector Machines (SVMs) and speaker agreement, whose goal was to classify speeches from the House of Representatives as being for or against a proposed piece of legislation, was proposed in [57]. In a similar manner, exploiting relationships between debate participants yielded a 70% accuracy on a test set of 10 debates comprising 860 speech segments [58].

Furthermore, the analysis of relative word frequencies as a method for examining political texts has been explored. Relative word frequencies have been utilised to extract policy positions by examining manifestos and legislative speeches of various British and Irish parties in the years 1991 and 1992 [59]. This technique was employed to infer policy stances not only from parties but also from individual politicians. Whilst the authors deemed their approach successful for both tasks, they did not provide a precise evaluation metric.

Attempts to classify ideology from political speech have also been made. SVMs have been used to classify the ideology of speakers in legislative speech records from the 101st to 108th Congresses of the US Senate [60]. This approach yielded a 92% accuracy score. It was also found that identifying which party a certain individual belonged to was easier than extrapolating their ideology; however, there is a high correlation between party membership and ideology. Building on these findings, similar strategies were used to identify party affiliation from speech [61]. This approach achieved an accuracy of up to 98% on debates in the Canadian House of Commons.

Sentiment analysis has also been used to identify trends in parliamentary speeches for parties in power and parties in opposition. For instance, using SA on the ParlSpeech
V2 dataset, it was found that political parties in power tend to speak with a more positive sentiment than parties in opposition [62]. Also, political parties that transition from opposition to power tend to exhibit a more positive sentiment in their speeches, and vice versa. This indicates that the correlation is not strictly a result of positive parties being more popular and therefore having an easier time getting into power.

3.6 Performance of models in similar contemporary competitions

In 2021, amongst other years, CLEF organised a competition called EXIST [2]. The two tasks in the competition were:

• Task 1: Identifying Sexist Content. In this task, the system is supposed to perform binary classification. It must determine whether a given text (tweet or gab) exhibits sexism, whether directly, by describing a sexist scenario, or by criticising sexist behaviour.

• Task 2: Categorising Sexist Content. Following the identification of sexist content, the subsequent task involves categorising the content based on the type of sexism present.

The results and techniques used in the two sub-tasks of the competition are compiled in Table 3.1. One takeaway from the approaches in the competition stems from the datasets containing both Spanish and English text. When participants used Beto (a Spanish version of BERT), it was exclusively used to analyse the Spanish texts, which means that it had to be combined with other models for English. The same is true for BERT: almost all participants that used BERT for the English texts also ended up using other models for the Spanish texts. Lastly, the most common and best-performing LLMs for handling multiple languages in this competition were mBERT and XLM-R.
Table 3.1 (columns: Team, Task 1 ranking, Task 2 ranking, average ranking, and markers for the techniques used among Bert, Beto, mBert, XLM-R, RoBERTa, RF, LR, SVM and fastText):

Ai-UPV 1 1 1 x x x
SINAI-TL 2 3 2,5 x x
AIT FHSTP 3 5 4 x
Multiaztertest 4 4 x x
LHZ 8 2 5 x
nlp uned team 5 9 7 x
QMUL-SDS 11 4 7,5 x
Alclatos 10 6 8 x x
ZK 9 8 8,5 x x x
GuillemGSubies 7 14 10,5 x x
IREL hatespeech group 14 7 10,5 x
Codec 12 12 x x
S_exist 12 13 12,5 x x
MiniTrue 13 13 x x x
UMUTeam 16 11 13,5 x x
Free 6 24 15 x x
ZZW 15 18 16,5 x
Zimtstern 17 16 16,5 x
LaSTUS 18 15 16,5 x
Recognai 26 10 18 x
Andrea Lisa 21 17 19 x
CIC 19 19 19 x x x
MessGroupELL 20 20 x x
MB-Courage 22 22 22 x
Nerin 24 20 22 x x
Soumya 23 23 23 x x
UNEDBiasTeam 25 21 23 x
BilaUnwanPk1 27 25 26 x
Almuoes3 27 27 x x x
ORDS_CLAN 29 26 27,5 x
Uja 28 28 x x

Table 3.1: Results and techniques used in the EXIST 2021 competition [2], sorted by average ranking. RF stands for Random Forest and LR for Logistic Regression.

4 Dataset

Touché provides a dataset [63] for our task which contains a selection of speeches from the ParlaMint corpora [64]. This dataset contains data from parliamentary speeches in multiple European parliaments; we include the countries covered in the dataset in Appendix A. More precisely, the dataset consists of two separate subsets, one for each sub-task. These subsets are further divided into multiple sub-subsets, each containing data from only one country. The organisers altered the dataset to provide less information than the original one, but they also include an automatic translation to English for most non-English texts. The provided training dataset consists of 6.5GB of text files [63], divided among the orientation and power datasets, and contains the following fields:
id is a unique (arbitrary) ID for each text.
speaker is a unique (arbitrary) ID for each speaker. There may be multiple speeches from the same speaker.
sex is the (binary/biological) sex of the speaker.
This information is collected from varying sources (typically data published by the respective parliament), and in some cases it may be unspecified or unknown.
text is the transcribed text of the parliamentary speech. Real examples may include line breaks and other special sequences, escaped or quoted.
text_en is an automatic English translation of the corresponding text. This field may be empty for speeches in English. There might be missing translations for a small number of non-English speeches.
label is the binary/numeric label. For political orientation, 0 is left and 1 is right. For power identification, 0 indicates coalition (or governing party) and 1 indicates opposition.

Data imbalance There is an uneven distribution of data, where some countries have more data than others. In addition, the label distribution within countries is also skewed. For instance, in Figure 4.1 we can see that Serbia has more data for speeches connected to right-wing parties, while for the power dataset Serbia has more speeches from speakers in power.

Figure 4.1: Distribution of the Opposition, Power, Left and Right-wing labels for each country in the orientation dataset.

Table 4.1 represents the overall label distribution of the dataset. The text lengths per country also vary and are displayed in Figures 4.2 and 4.3 for the orientation dataset. Shorter texts may contain less helpful information for the predictions and thus decrease performance. Moreover, models with a limited context length may not be able to capture all relevant information in the longer texts.
For instance, BERT has a context length of 512 tokens, which means that it cannot process the entirety of most texts at once.

Shared information Even though the datasets are split, 47.5% of the speakers that appear in one of the datasets appear in both, as shown in Figure 4.4.

Label | # of speeches | % of task data | % of all data
Left | 58,146 | 39.0 | 16.2
Right | 90,797 | 61.0 | 25.3
Power | 111,127 | 53.1 | 31.0
Opposition | 98,114 | 46.9 | 27.4

Table 4.1: Distribution of the Power and Opposition labels in the power dataset as well as the Left and Right labels in the orientation dataset.

Figure 4.2: Mean and standard deviation of text lengths per country in the orientation dataset.

Figure 4.3: Density distribution of text lengths for the 3 countries with the highest and the 3 countries with the lowest average text lengths in the orientation dataset.

Figure 4.4: Venn diagram where the red region represents the number of speakers that only appear in the orientation dataset (Orientation \ Power), the green region represents the number of speakers that only appear in the power dataset (Power \ Orientation), and the centre region represents the number of speakers that appear in both datasets, in other words the intersection (Orientation ∩ Power).

Moreover, 51.4% of speeches in the power dataset are made by a speaker who also appears in the orientation dataset.
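Overlap statistics of this kind reduce to set operations over the speaker field. The following is a minimal sketch with hypothetical toy records; the record and field names mirror the dataset description above, but the values are invented for illustration:

```python
# Sketch: speaker overlap between the two task datasets (toy records only;
# real data would be loaded from the provided dataset files).
orientation = [
    {"id": "or1", "speaker": "sp1"},
    {"id": "or2", "speaker": "sp2"},
    {"id": "or3", "speaker": "sp3"},
]
power = [
    {"id": "pw1", "speaker": "sp2"},
    {"id": "pw2", "speaker": "sp3"},
    {"id": "pw3", "speaker": "sp4"},
    {"id": "pw4", "speaker": "sp3"},
]

orient_speakers = {row["speaker"] for row in orientation}
power_speakers = {row["speaker"] for row in power}

# Speakers appearing in both datasets, as a share of all speakers.
shared = orient_speakers & power_speakers
share_of_union = len(shared) / len(orient_speakers | power_speakers)

# Share of power-dataset speeches whose speaker also appears in orientation.
shared_speeches = sum(row["speaker"] in orient_speakers for row in power)
share_of_power_speeches = shared_speeches / len(power)
```

On the toy records, two of four speakers are shared and three of four power speeches come from shared speakers; on the real data these quantities are the 47.5% and 51.4% figures quoted above.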
This amount equates to 72.2% of the total number of speeches in the orientation dataset.

Isolated speeches The dataset does not include the dates on which the parliamentary speeches occurred. This makes it infeasible to model changes in party positions over time or to fully encapsulate how politics shift. Moreover, simply connecting a speech to a certain ideology will not be as useful when it comes to predicting whether the speaker is currently in a governing position or in opposition, because a country with a left-leaning government one year could have a right-leaning government the next. Even though the speeches were originally part of debates and exchanges in parliament, the dataset contains no data for connecting multiple speeches to a single conversation or exchange. This prevents approaches that would model an entire debate and label the participants in the debate rather than the individual speeches themselves.

Privacy in the dataset The dataset uses arbitrary codes for people's names. This makes it difficult to check whether our results match up with real-world politicians. Additionally, the test set for the orientation task does not contain any speakers from the original orientation dataset. As a result, a solution which attempts to connect specific speeches to the correct political parties using the speaker's identity is infeasible.

Test set The test set for the orientation sub-task contains randomly sampled speeches by speakers who do not appear in the training set. The test set for the power sub-task does, to a large extent, contain speakers that also appear in the training set. However, speakers recurring in the test set will tend to have a different label distribution compared to the training set. For both sub-tasks, the test set contains approximately 2,000 speeches for each parliament, whose general label distributions resemble those in the training set.
The test data follows a similar structure to the training data, apart from the speaker_id and label fields being hidden.

5 Method

In the following, we describe how we processed and prepared the data as well as how we selected and trained models. An overarching view of our method is illustrated in Figure 5.1.

5.1 Dataset and Preprocessing

We decided not to extend the dataset using external sources. This was partly due to other parliamentary debates in our selection of countries either being unavailable or already included in the original ParlaMint corpora. It would also require considerable work to create properly labelled datasets in the same format as our base dataset. Finally, the amount of available data is of substantial size, and a larger amount would put further pressure on the need for computational resources.

We generated the training dataset for each task by sampling 70% of the provided datasets, leaving the remaining 30% for the validation set. Note that the 70/30 split is a commonly used rule-of-thumb which has shown some empirical optimality [65, 66]. We used this split for evaluating the individual and combined parts of our model. Once we had identified the most effective techniques, we switched to a different split for our final ensemble model, as explained in Section 5.6.

Figure 5.1: Overarching method illustration. The method is divided into a data preprocessing stage and a model training and prediction stage. The test set is inserted as input, but it could potentially be any speech.

Missing translations Firstly, some English translations were missing in the provided Finnish dataset.
Specifically, there were 271 translations missing in the training dataset and 88 in the test set. We machine-translated these texts ourselves using a Python package called mrTranslate.

5.1.1 Back-translation

To address the country distribution imbalance, we applied back-translation to the data from countries with fewer than 15,000 entries in the power and orientation datasets combined. We chose this threshold to strike a balance between increasing the number of speeches available for parliaments with less representation in the dataset and not increasing the training time excessively. The back-translation process involved translating the English text to the original language and back, as well as translating the original text to English and back to the original language. We also used mrTranslate for these translations and appended the resulting data to the dataset, keeping all other fields unchanged from the original entry. Sometimes, however, the translations would fail. In these cases, we manually translated the texts using Google Translate.

5.1.2 Meta data inclusion

By prepending each text with the corresponding country and gender of the speaker, e.g., "Germany, Female", models got access to the available contextual information not included in the speeches themselves. We hypothesise that since parliaments and the contexts of debates vary, so should their analyses. By giving the model access to all available metadata, i.e., all available context, we suspect that models might be able to better adapt their predictions. An example of adapting to such a context might be adjusting the prediction for someone advocating for a certain law depending on what the law currently is in the given country. There might also be useful information in the metadata itself, such as gender making a politician more likely to belong to a certain ideology in some countries.
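The prepending step itself is a simple string operation. A minimal sketch follows; the exact separator format and the encoding of the sex field are our own assumptions, not taken from the thesis:

```python
# Assumed encoding of the dataset's sex field; the real field values may differ.
SEX_NAMES = {"M": "Male", "F": "Female"}

def prepend_metadata(country: str, sex: str, text: str) -> str:
    """Prepend available metadata so the model input begins with context
    such as 'Germany, Female' before the speech itself."""
    return country + ", " + SEX_NAMES.get(sex, "Unknown") + ": " + text

example = prepend_metadata("Germany", "F", "Madam Speaker, I rise to ...")
```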
These examples are, of course, purely speculative, which is part of the reason why we chose to include all metadata instead of selecting fields based only on our own speculations.

5.2 Combined training

We merged the datasets into one combined dataset in which the label column was split into separate orientation and power labels. Despite being separate, both datasets contain shared elements, allowing some data in the power dataset to be carried over into the orientation dataset, as explained in Chapter 4. We first verified that each speaker in the orientation dataset consistently had the same label. Once a speaker was confirmed as having label y in the orientation dataset, all texts by that speaker in the power dataset were also classified as label y for orientation. When the datasets shared a text, the text from the orientation dataset was removed (since the text from the power dataset had already received the orientation label). This increased the number of orientation entries by 72.2%, which equates to 51.4% of the original power dataset.

5.2.1 Located missing labels loss function

Since the combined dataset has many missing labels, we needed to create a custom loss function. We calculated a filter tensor for each batch and label, marking rows with true labels as one and rows without true labels as zero. Then, we used this filter tensor as weights for the entries in the batch when computing the CE-loss. To account for the number of incomplete labels, we divided each label-specific loss by the sum of its filter tensor. By summing the losses for our two labels, we obtained a multi-label loss function which can account for Located Missing Labels (LML).

Let us consider this loss function for a single label, i.e., orientation. Formally, let C be the set of classes (i.e., left and right) and p_i the output predictions over all classes c ∈ C for entry i in the batch.
p_{c,i} is then the predicted probability of class c for entry i in the input batch, such that p_{c,i} ∈ [0, 1] and Σ_c p_{c,i} = 1 for all i. Also let y be the true labels, such that y_i ∈ {−1} ∪ C (where −1 corresponds to a missing label) represents the true label of entry i. Our custom LML-loss can then be expressed as Equation (5.1):

    LMLLoss(p, y) = (1 / Σ_i f_i) Σ_i f_i · CELoss(p_i, y_i),
    where f_i = 1 if y_i ∈ C, and f_i = 0 if y_i = −1.    (5.1)

As each label-specific loss is divided by the sum of its filter tensor, the resulting loss maintains a consistent size regardless of the number of samples with a true label in the batch. Consequently, the final summed loss represents the combined loss for each task, regardless of its prevalence in the batch.

5.3 Polarity label extension

If training to predict both the power and orientation labels at the same time yields increased performance, then training to predict a third label, which shares useful features with the first two, might yield further improvements. We chose polarity as the label to add to our dataset, i.e., whether a text carries a positive, negative or neutral sentiment. We chose polarity since it is an effective metric for identifying trends in parliamentary speeches, as explained in Section 3.5.

To obtain polarity labels for our dataset, we used version 0.2 of the instruction fine-tuned Mistral-7B, as available under mistralai/Mistral-7B-Instruct-v0.2 on Hugging Face. We chose this model since it outperforms other open-source LLMs of similar or larger size, such as the 7- and 13-billion parameter versions of Llama-2 [67]. We chose one example text for each polarity label and had GPT-3.5 explain why it assigned that label to that text. With these examples, we constructed our in-context learning prompt using 3-shot classification. The base textual prompt, which we then formatted into the instruction chat format of the model, can be found in Appendix B.
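To make Equation (5.1) concrete, the following is a minimal, torch-free sketch of the single-label LML loss; it operates on plain probability lists rather than batched logit tensors, and the joint two-label combination is shown as a simple sum, all for exposition only:

```python
import math

def lml_loss(probs, labels):
    """Located-Missing-Labels loss for one label (Equation 5.1).

    probs  : list of per-class probability lists, one per batch entry
    labels : list of true class indices, with -1 marking a missing label
    """
    filt = [0.0 if y == -1 else 1.0 for y in labels]   # filter tensor f_i
    n = sum(filt)
    if n == 0:
        return 0.0                                     # no true labels in batch
    # Cross-entropy for labelled rows only: -log p_{y_i, i}
    total = sum(f * -math.log(p[y])
                for f, p, y in zip(filt, probs, labels) if f)
    return total / n                                   # normalise by Σ_i f_i

def joint_loss(orient_p, orient_y, power_p, power_y):
    """Sum of the per-task LML losses (Section 5.3 additionally adds
    half of a polarity LML term)."""
    return lml_loss(orient_p, orient_y) + lml_loss(power_p, power_y)
```

Because each term is normalised by the number of labelled rows, a batch with a single orientation label contributes as much orientation loss as a fully labelled batch, which is exactly the consistency property discussed above.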
We double-quantised the Mistral model to a 4-bit normal float with a 16-bit float compute type, to fit the model in memory and for faster inference. To fit the entire base prompt along with the text to label, we defined a context length of 4,096. To generate from the model, we used sampling beam search with 3 beams, forcing the output to contain "Positive", "Negative" or "Neutral". Finally, we assigned the polarity label as 0, 1 or 2 depending on whether the first word of the output was "Negative", "Positive" or "Neutral" respectively, assigning −1 otherwise.

To perform the polarity classification, we added three output nodes to our classification head, corresponding to the three polarity classes. To account for holes in the polarity data caused by failed generations, we also used our LML-loss to calculate the loss from the polarity predictions. We only added half of the polarity LML-loss to the base LML-loss in order to prioritise our two core tasks.

5.4 Models

We restricted the models to those we could effectively train. Therefore, we excluded more extensive and capable models, such as the 70-billion parameter version of Llama-2 [19]. Furthermore, we were forced to limit hyperparameters, such as batch size and learning rate, to less-than-ideal values to comply with our limited computational resources. Our selection of models was also influenced by the notion that encoder-only models outperform modern large decoder models on similar tasks, at a lower computational demand [40].

We compared different modern transformer-based models to find which performed best for our task. Different models necessitated different hyperparameter values due to their differing sizes and designs. For all models, any implemented classification heads took the place of the last layer of the model as provided by its Hugging Face sequence classification implementation, leaving the method of pooling as implemented in the base model.
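The generation-to-label mapping from Section 5.3 (first word of the output decides the polarity, with −1 for failed generations) can be sketched as follows; the exact surface form of the model output, including trailing punctuation, is an assumption:

```python
# Polarity label encoding as stated in Section 5.3.
POLARITY = {"Negative": 0, "Positive": 1, "Neutral": 2}

def parse_polarity(generation: str) -> int:
    """Map the first word of a generation to a polarity label, returning -1
    for empty or unparseable outputs (later masked out by the LML loss)."""
    words = generation.strip().split()
    if not words:
        return -1
    return POLARITY.get(words[0].rstrip(".,:;"), -1)
```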
Finally, all models discussed below can be found on Hugging Face.

BERT and mBERT We evaluated the uncased version of BERT [16], available under bert-base-uncased, using the provided English translations. We chose this model since it is competent whilst being much smaller than other modern models (see Section 3.1) and since it had previously shown outstanding performance in similar competitions (see Section 3.6). We also evaluated the uncased version of multilingual BERT, available under bert-base-multilingual-uncased, on the speeches in their original languages. We chose this model because it is a multilingual version of BERT and because it demonstrated excellent results on multilingual tasks in similar competitions (see Section 3.6).

RoBERTa and XLM-RoBERTa We evaluated the large version of RoBERTa, available under FacebookAI/roberta-large, since it has been shown to be state-of-the-art for similar tasks (see Section 3.1). We also evaluated XLM-RoBERTa, available under FacebookAI/xlm-roberta-large, which is RoBERTa pre-trained for multilingual tasks [68].

DeBERTa V3 DeBERTa V3 is an improvement upon the original DeBERTa model [21]. The original DeBERTa model outperforms the large version of RoBERTa on a wide range of NLP tasks while using less training data [20]. The DeBERTa family of models also utilises disentangled attention and relative position embeddings, which allow it to process longer sequences than BERT and RoBERTa. Due to these factors, we chose to evaluate DeBERTa V3, as available under microsoft/deberta-v3-large.

Gemma The 7-billion parameter version of Gemma outperforms similar models of equal size such as Mistral-7B and Llama-2-7B [69]. Limited by our computational resources, we evaluated the smaller 2-billion parameter version of Gemma, as available under google/gemma-2b. This smaller version still necessitated techniques such as LoRA and double-quantising the model to a 4-bit normal float.
We applied LoRA to all matrices in the self-attention and MLP layers of Gemma.

5.5 Hyperparameters

Due to our limited computational resources, unfortunately, no experiments could be exhaustive. When we discovered that a certain hyperparameter value worked well, for instance using a warm-up period, we could not then afford to repeat the experiments for all previous models to include this choice. Efforts were instead directed towards balancing, for each model, the necessity to fit the data in a manageable time frame against the desire not to cause excessive unlearning in the base model. The main parameters which had to be adapted depending on the size of the model were the learning rate and the warm-up ratio. For instance, if a large model was trending downwards by the second epoch, then the learning rate might be lowered and/or a warm-up period added; in this way, experiments were exploratory.

We set specific hyperparameters to increase training speed and reduce memory consumption in order to accommodate larger models. For instance, all models except BERT and mBERT used 16-bit floating point mixed-precision training to accelerate the training process, whilst BERT and mBERT used regular full-precision training. Additionally, Gemma required LoRA and 4-bit quantisation to make fine-tuning feasible for us.

All models trained using a maximum sequence length of 512 tokens. Whilst we would have preferred to train with longer sequence lengths for the models which can handle them (DeBERTa-V3 and Gemma), this was not computationally feasible. However, to still utilise the longer context lengths DeBERTa-V3 can handle, it was re-evaluated using a maximum sequence length of 4,096 tokens after training was complete. In this way we could train and evaluate the models more efficiently with 512 tokens and then afterwards leverage longer context lengths for only the final version of the model.
We would have preferred to also re-evaluate Gemma using a longer context length; however, due to unforeseen limitations on computational resources, only DeBERTa could be re-evaluated with a longer sequence length.

5.6 Ensemble modelling

After identifying the best performing models and training methods using our validation set, we created new training and validation sets which we used to re-train the selected models for ensemble modelling. These new validation sets contained disjoint selections of 10% of the available data, with a minimum of 5 samples for each country and label. We chose this data split to increase the amount of data available to our models. By using bagging (letting each model in the ensemble have a separate validation set), each model could be monitored for over-fitting whilst the ensemble as a whole had still trained on the entire dataset. We speculate that, by using this approach, our ensemble will be able to leverage the entire training set: if a single model has not been able to learn something useful because the required data was in its validation set, the other models of the ensemble will have had access to that data. We also chose to decrease the ratio of the validation set, as we deemed the necessity of it being representative and reliable to be diminished once we had already determined and validated our method.

We created the ensemble by selecting the best performing multilingual and English-only models. We then trained two instances of the better performing of the two models, as well as one instance of the other model, using our newly created ensemble training and validation sets. The main reason for using two instances of the best model was to ensure that this model had the most influence over the final prediction. We decided to use one multilingual model to capture potentially new information not available in the English translations.
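The ensemble's prediction step (averaging the members' output logits, applying the sigmoid function and rounding) can be sketched as follows, assuming a single binary logit per model for illustration:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def ensemble_predict(per_model_logits):
    """Average one binary logit per ensemble member, then squash and
    threshold at 0.5 to obtain a hard 0/1 class prediction."""
    mean_logit = sum(per_model_logits) / len(per_model_logits)
    return round(sigmoid(mean_logit))

# Hypothetical logits from, e.g., two DeBERTa-V3 instances and one XLM-RoBERTa.
pred = ensemble_predict([2.1, 1.4, -0.3])
```

Averaging in logit space rather than averaging hard votes lets a very confident member outweigh two mildly opposed ones.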
For a given prediction, we ran each of these models and averaged their output logits before applying the sigmoid function and rounding to receive a final prediction.

5.7 Additional data extraction for test set

Roughly 23% of the texts appearing in the provided test data for the orientation task also appear in the training data for the power task. Using this information, we can extract the speaker ID for the overlapping texts. Then, since the orientation label is always the same for each speaker, we can use these additional speeches to influence our predictions on the test set. For a given speech in the orientation test set, we averaged the logits of the examined model on that speech with the logits of our best performing model on all other speeches by the same speaker. In other words, predictions on the test data were averaged with those produced by our best model on speeches by the same speaker. This way, if a text in the test data lacks clear ideological signals, we can instead rely on other texts by that speaker to make our prediction. The process is visualised in Figure 5.2.

Figure 5.2: Visualised process for extracting the speaker for texts in the test data for the orientation task. The bottom flow chart then shows how we take the average prediction over all texts by speaker sp1. Note that we illustrate the predictions as ranging from 0 to 1 for simplicity, whilst in actuality the logits were used.

6 Results

In the following chapter, we present the results of applying our method to the provided dataset and the evaluated models.
We fine-tuned the models using the transformers library for Python, as provided by Hugging Face, running on an NVIDIA V100 with 32GB of memory. Unless otherwise mentioned, we set the hyperparameter values to the defaults provided by the library.

Baseline The competition organisers provided a baseline in the form of a simple linear logistic regression model. When we fitted this model to our training set and then applied it to our validation set, we achieved the macro-average F1-scores shown in Table 6.1:

Sub-task | Macro-average F1-score
Orientation | 0.6755
Power | 0.7149

Table 6.1: Baseline macro-average F1-scores on our validation set for the orientation and power tasks from the provided linear logistic regression model.

6.1 Method components

To understand the effects of each component of our method, we fine-tuned BERT multiple times with different combinations of the components in our base method. These components, as presented in Chapter 5, were combined training, back-translation, polarity label extension and meta data inclusion. All runs used the same hyperparameters, which can be found in Appendix C.1. Results are illustrated over the training epochs in Figure 6.1 and summarised in Table 6.2. The results as a whole show that the combined training beats both the conventionally trained BERT and the baseline. The other components each individually improve the performance of the combined training further. Finally, all components together yielded the best performance.

Figure 6.1: Comparison of base BERT with different combinations of the components of our method over the training epochs, for (a) the orientation task and (b) the power task. CT stands for combined training, BT for back-translation, PLE for polarity label extension and MDI for meta data inclusion.
Model | Orientation | Power
baseline | 0.6755 | 0.7149
BERT | 0.7596 | 0.7865
+CT | 0.8271 | 0.8041
+CT+BT | 0.8333 | 0.8059
+CT+PLE | 0.8317 | 0.8101
+CT+MDI | 0.8377 | 0.8111
+(all) | 0.8493 | 0.8152

Table 6.2: BERT component study showing how combinations of the components in our method impacted the macro-average F1-score for the two tasks, compared to conventional fine-tuning. CT stands for combined training, BT for back-translation, PLE for polarity label extension and MDI for meta data inclusion.

Combined training (CT) Training to predict both labels at once using our LML-loss showed significant improvements in comparison to training for only a single label at a time.

Back-translation (BT) Our results indicate that back-translation yielded an improvement in the orientation task over all epochs, whilst only yielding a non-marginal improvement in the first epochs of the power task when training for both tasks using CT. To further investigate whether back-translation helped improve the performance for countries with less data, we also visualised the results for each parliament individually. The results can be found in Appendix D.2 and show that, on average, parliaments with back-translation saw a significant improvement whilst the remaining parliaments did not.

Polarity label extension (PLE) Extending the combined training by adding a third label yielded an increase in performance over all epochs.

Meta data inclusion (MDI) Including the available meta data by prepending it to each text resulted in an improvement from the second epoch onwards for both tasks. However, for the first epoch it caused a decrease in performance on the power task whilst not impacting the orientation task.

Method components conclusions The examination of the components in our method indicates that all components of our method are beneficial for BERT. This is especially clear due to the combination of two factors.
The first factor is that the conventionally trained model had seemingly started to stagnate or over-fit, whilst our proposed method was still improving throughout all training epochs. The second factor is that our method exceeds conventional fine-tuning already by the first epoch. In combination, we may then reason that our method provides an intrinsic advantage, since it both converges faster (by the second factor) and seems to cause less over-fitting or unlearning (by the first factor). These factors also suggest that we actually make better update steps, rather than merely smaller (by the second factor) or larger (by the first factor) ones.

In order to validate that the improvement in performance on the orientation task from our method is not only due to the increased amount of data, we conventionally fine-tuned RoBERTa on the full set of available orientation data. We compare this to RoBERTa fine-tuned using the same hyperparameters, as found in Appendix C.2, but using our full method. Results are illustrated in Figure 6.2 and show that, even when using the same training data, our method exceeds conventional fine-tuning over all epochs for orientation classification. As discussed in Section 3.1, fine-tuned RoBERTa has been shown to outperform very capable models and to be state-of-the-art for similar tasks. It is therefore very encouraging to note that our method managed to significantly improve upon the performance of fine-tuned RoBERTa for political orientation classification.

Figure 6.2: Comparison of RoBERTa trained for political orientation classification using conventional fine-tuning and our method.

6.2 Models

The highest attained scores resulting from the application of our method to the various models are shown in Table 6.3.
Corresponding hyperparameters can be found in Appendix C.2. The results indicate that DeBERTa-V3 was the best performing model, with XLM-RoBERTa being the best performing multilingual model. Gemma, which trained using LoRA and quantisation, manages to exceed the performance of BERT and mBERT but falls short of the other models.

Model | Language* | Orientation | Power
BERT | Translation | 0.8493 | 0.8152
mBERT | Original | 0.8251 | 0.7941
RoBERTa | Translation | 0.8729 | 0.8440
XLM-RoBERTa | Original | 0.8621 | 0.8379
DeBERTa-V3 | Translation | 0.8870 | 0.8630
Gemma | Translation | 0.8541 | 0.8358

*Translation corresponds to training on automatic translations to English instead of the original language.

Table 6.3: Highest attained macro-average F1-scores of our examined models.

To investigate the impact of re-evaluating DeBERTa-V3 using a longer sequence length, we also evaluated different sequence lengths. The results are shown in Table 6.4 and indicate that there was a significant improvement in performance from increasing the sequence length initially, but that these increases diminish. The improvement from going from 512 to 1,024 tokens was noticeable (+0.0059 and +0.0084), whilst the improvement from going from 2,048 to 4,096 tokens was minor (+0.0001 and +0.0003). These findings are not surprising: after all, successive increases in maximum sequence length add fewer and fewer tokens, since more speeches become fully covered.

Sequence length | Orientation | Power
512 | 0.8788 | 0.8526
1024 | 0.8847 | 0.8610
2048 | 0.8869 | 0.8627
4096 | 0.8870 | 0.8630

Table 6.4: Macro-average F1-scores of DeBERTa when evaluated using different sequence lengths.

6.3 Translated vs. multilingual

To investigate whether models pre-trained for multilingual tasks outperform their mainly English-comprehending base models, we compared BERT with mBERT as well as RoBERTa with XLM-RoBERTa. Each pair used the same hyperparameters internally (see Appendix C.2).
The multilingual models processed the original texts whilst their counterparts processed the automatic translations. Results indicate that the multilingual models lag behind by a consistent amount. The macro-average F1-scores over the training epochs can be found in Appendix D.1.

6.4 Ensemble modelling

In order to validate that ensemble modelling was a beneficial approach, we selected DeBERTa-V3 and XLM-RoBERTa, fine-tuned on our base training set. The output logits of these models on the base validation set were then averaged to create validation predictions. The macro-average F1-scores of the yielded predictions, as seen in Table 6.5, show that the predictions of our best performing stand-alone model could be improved by also considering the outputs of our best performing multilingual model.

Model          Orientation   Power
XLM-RoBERTa    0.8621        0.8379
DeBERTa-V3     0.8870        0.8630
Ensemble       0.8972        0.8697

Table 6.5: Macro-average F1-scores on our base validation set of the base models and their ensemble.

6.5 Test set results

Baseline The macro-average F1-scores attained by the baseline model on the competition test set are shown in Table 6.6. This baseline model was fitted to the entirety of the originally provided datasets. When comparing this baseline to the baseline on our validation set, we see a decrease in macro-average F1-score of 0.1152 and 0.0748 for the orientation and power tasks respectively. This indicates that the test set is much more challenging, which is not surprising due to the nature of its construction. For the orientation task the test set contains speakers that do not appear in the training set, and for the power task it contains speakers who appear with a different role than they do in the training data. The test set also does not share the same distributions in parliament representation; for further details see Chapter 4.
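For reference, the macro-average F1-score used in all of these comparisons is the unweighted mean of the per-class F1-scores, so rare classes weigh as much as common ones. A minimal sketch with toy labels (illustrative only, not our actual evaluation code):

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: unweighted mean of per-class F1-scores."""
    per_class = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class.append(f1)
    return sum(per_class) / len(per_class)

# Toy binary power labels: 0 = opposition, 1 = governing coalition.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(round(macro_f1(y_true, y_pred, labels=[0, 1]), 4))  # 0.5833
```

This corresponds to what scikit-learn computes with `f1_score(y_true, y_pred, average="macro")`.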
Final ensemble model Our final ensemble consisted of two DeBERTa-V3 models and one XLM-RoBERTa model, fine-tuned using disjoint selected validation sets, as detailed in Section 5.6. We have made these models available on Hugging Face under oscpalML/DeBERTa-political-classification, oscpalML/DeBERTa-political-classification-alternative and oscpalML/XLM-RoBERTa-political-classification. Our final ensemble, averaging the logits of these models, yields macro-average F1-scores as seen in Table 6.6.

Model      Orientation   Power
Baseline   0.5603        0.6401
Ensemble   0.7945        0.8271

Table 6.6: Macro-average F1-scores on the test set of our final ensemble and the provided baseline model.

Additional data extraction The additional data extraction improved the performance on the orientation task. Our ensemble, without considering the other available speeches by a speaker, yielded a macro-average F1-score for the orientation task of 0.7854, whilst utilising the other speeches increased the score to 0.7945.

7 Discussion

In the following chapter, we discuss and reason about the effectiveness of the proposed solution. We further discuss the task itself and the limitations of our project.

Method effectiveness We show that our method improves performance for BERT and RoBERTa. Whether or not these findings translate to other models and tasks is, of course, a pertinent question. The limitations of this project, imposed upon us by our limited computational resources, prevent us from examining this question fully. However, due to the relatively similar architectures of our examined models and the nature of our method, we hypothesise that the benefits of our method do translate. This is because our method does not closely depend on the internals of a model, instead aiming to provide a more representative loss function and better input data. We leave more extensive empirical confirmation for future research.
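The logit averaging behind both the validation ensemble of Section 6.4 and the final ensemble reduces to a single averaging step per prediction. A minimal sketch with hypothetical logits (member names and numbers are illustrative, not our actual model outputs):

```python
import numpy as np

def ensemble_predict(*member_logits: np.ndarray) -> np.ndarray:
    """Average the raw output logits of the member models and take
    the argmax over classes as the ensemble prediction."""
    return np.mean(member_logits, axis=0).argmax(axis=-1)

# Hypothetical logits for 3 speeches over 2 classes from two members.
deberta = np.array([[2.1, -0.3], [0.2, 0.5], [-1.0, 1.4]])
xlm_roberta = np.array([[1.7, 0.1], [0.9, 0.3], [-0.2, 0.6]])
print(ensemble_predict(deberta, xlm_roberta))  # [0 0 1]
```

Averaging raw logits rather than hard labels lets a confident member outvote an uncertain one, which is why the ensemble can exceed its best stand-alone member.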
Just as our method likely translates well to other models, it might transfer well to other tasks. Our method is not reliant on the specifics of ideology and power identification, and can therefore potentially be applied to other similar tasks. It is, however, likely important that the combined tasks share useful features: combined training and the further synthetic label extension rely on the existence of cross-task useful features, and combining very unrelated tasks might therefore not yield the same benefits. Our utilisation of back-translation can potentially see broader application, since it is essentially just a method of language-based data augmentation. Meta data inclusion might be very task and problem dependent. One can imagine that, when classifying political tweets, including the year might be beneficial, whilst for other tasks meta data might in many cases be either unavailable or irrelevant.

Multi-task training The combined training yields an increase in performance for the orientation task, which is not surprising since we extract additional orientation labels and therefore provide the model with more data. On the other hand, the fact that there is also a significant improvement for the power task is very intriguing. We speculate that this improvement is due to our LML-loss providing a more representative loss, which incentivises extracting features that are useful for both tasks and discourages over-fitting. This seems to be supported by the additional increase in performance provided by also adding the prediction of a polarity label. This increase in performance is even more impressive when considering that the polarity label was synthetically generated and likely to add at least some amount of noise. Since polarity likely shares some important similarities with features useful for our tasks, as detailed in Section 3.5, we speculate that our LML-loss was improved so as to further incentivise cross-task useful features.
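The core idea of training jointly despite missing labels can be illustrated with a simplified masked multi-task loss: cross-entropy is computed only for the label heads where a label actually exists, so a speech carrying only an orientation label still contributes a training signal. This is an illustrative sketch in the spirit of our LML-loss, not our actual training code:

```python
import math

def masked_multitask_loss(task_logits, task_labels):
    """Average cross-entropy over the tasks whose label is present;
    tasks with a missing label (None) contribute no loss signal.
    task_logits: one logit list per task head.
    task_labels: a class index per task, or None where missing."""
    total, n_present = 0.0, 0
    for logits, label in zip(task_logits, task_labels):
        if label is None:  # missing label: skip this head entirely
            continue
        z = [math.exp(v) for v in logits]
        total += -math.log(z[label] / sum(z))  # softmax cross-entropy
        n_present += 1
    return total / max(n_present, 1)

# A speech with an orientation label (class 0) but no power label.
loss = masked_multitask_loss([[2.0, 0.5], [0.1, 0.3]], [0, None])
print(round(loss, 4))  # 0.2014
```

Because the missing head is skipped rather than padded with a dummy target, the shared encoder receives gradients only from labels that really exist, which is the "more representative loss" referred to above.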
Data preprocessing The observed impact of prepending available meta data to each speech, as shown in Fig. 6.1, is reasonable. We suspect that the prepended sentence is very different from the pre-training material of the base model, since it is simply two words and does not follow the form of a regular phrase or sentence. This disruption, we speculate, might essentially confuse the model until it is able to learn it in later epochs. Once the model has understood how to interact with the prepended sentence, however, it is able to leverage it into making better predictions. It would be interesting for future research to compare the difference between adding new tokens representing the meta data and prepending the meta data in English, as we did. It might be the case that base models are able to leverage prior understanding of countries, be it their general political environment or some other aspect. On the other hand, it might also be the case that prior bias hurts the model's ability to predict accurately and fairly.

Nature of tasks On the validation set, it is interesting to note that the linear baseline performed better on the power task than the orientation task, given that the models utilising our method show the opposite behaviour. We may also note that, whilst the difference is small, the power task benefited more from a longer sequence length. It is therefore not entirely unreasonable to suggest that the power task might rely more on specific words and phrases, as the linear baseline does. In other words, it might be the case that specific words are more important for predicting political power, whilst how one speaks in general is more indicative for predicting political orientation.

Difference between validation and test scores Another notable aspect is the discrepancy between the achieved scores on the validation and test sets.
Without using additional data extraction, our top-performing single model achieved a macro-average F1-score on the validation set that was 10.1 percentage points higher than that of our best ensemble model on the test set. In contrast, the gap for the power task was just 3.6 percentage points. We attribute the majority of this gap to two factors: the difference in the distribution of the amount of parliamentary data and the nature of the test sets' construction. Since all parliaments have the same amount of data in the test sets whilst having greatly differing amounts of data in the training and validation sets, it is not surprising that the achieved scores differ. In the likely case that the models perform better on parliaments with more data, the test set represents an increased presence of the harder-to-predict parliaments and a decreased presence of the easier ones. Regardless of this factor, the difference in distribution itself likely also introduces a challenging condition. In other words, the drift in label and parliament distributions is likely detrimental, even if the nature of that drift was not suspected to be particularly damaging. This is because the model likely to some extent relies on statistical trends, such as favouring the more common label when a speech is ambiguous. The nature of the test sets' construction may also account for the difference in discrepancy between our two tasks. The power test set largely contains the same speakers as our training data, just with a different power label. Given that we saw a relatively small decrease in performance, it seems that our models have been able to avoid over-fitting to a specific speaker's power label. This was likely aided by speakers exhibiting multiple power labels in the training data. This behaviour is not shared with the orientation task.
Since the orientation test set largely consists of speakers who do not appear in the training data, the gap in performance may indicate that our models have over-fit to specific speakers. An additional factor is that the new speakers may cover topics our models have previously not encountered. If a speaker mainly advances a political idea whose ideological connotations the model has not previously learnt, then the classification likely becomes much more challenging. It would therefore be interesting to investigate whether these previously unencountered speakers are contemporary with, and cover the same topics as, the speakers in the training set.

Weak ideological signals Some parliamentary speeches might not indicate strong political beliefs. They could solely cover practical proceedings, not expressing any opinions or making any arguments. These types of speeches are likely more challenging to classify, especially if the provided dataset does not exhibit clear rhetorical or linguistic differences for labels within a given class. This likely introduces an upper limit on the performance of any model on this task with similar data.

Limitations As previously discussed, our limited access to computational resources determined which methods and models we could examine. This prevented us from examining large models such as the 70-billion-parameter version of Llama-3 or even the 7-billion-parameter version of Gemma. Not only were we limited in the selection of models, but also in the number of experiments we could perform. More time and computational resources would have allowed us to attempt more techniques and to search further for optimal hyperparameters. Techniques that could not be examined include using different learning rates for different layers and balancing the loss function. We were also limited by the data we had access to. In real life, these speeches are not stand-alone but most often parts of exchanges and debates.
The problem of weak ideological signals could likely be mitigated by considering all the speeches a speaker makes in an exchange together. By representing speeches as parts of a larger debate, a model could base its prediction not only on all of the speaker's speeches, but also on the speeches made by the other participants in the debate.

8 Ethics and Risk Analysis

In creating tools for political identification, there is potential for misuse. Methods for automatically identifying the ideologies of individuals based on their speech could become a tool for political targeting in the wrong hands. However, actors that have the resources and means to spy on people to such an extent as to require tools like these most likely already possess them. By using open-source models and being transparent about the capabilities of modern machine learning, we aim to raise awareness of what is possible with modern techniques. The names of the speakers in the dataset are encoded, which means that it is challenging to find information about a specific individual from the dataset. However, parliamentary debates are usually public information, which means that our dataset is unlikely to leak any sensitive information anyway. Furthermore, the original dataset developed by ParlaMint [64] is a collaboration with the governments and is intended for projects such as this one. Even though transparency and explainability may be limited in our model, this limitation is acceptable for our specific context. The model is intended for research purposes and as a submission to a competition, not directly for real-world application. In this competition-focused scenario, the reduced transparency and explainability are deemed acceptable and do not pose a significant issue. As previously stated though, the capabilities of the presented models and method could also have real-world use and provide useful information to citizens.
In this case, transparency and explainability are essential, as providing a false or misleading sentiment might instead spread more confusion than before. This could then lead to citizens drawing conclusions based on faulty premises. We try to mitigate this risk by clearly outlining the data on which our models were trained as well as the methods that were used to train them.

9 Conclusion

In this study, we proposed a method for improved fine-tuning of LLMs for ideology and power identification. Our research questions were as stated below.

RQ1 Investigate what the best methods and practices are for identifying the political orientation in a parliamentary speech.

RQ2 Investigate what the best methods and practices are for identifying whether a parliamentary speech is made by a speaker in opposition or in power.

In answering our research questions, we found that modern LLMs are an effective approach for identifying both ideology and power in parliamentary debates. We further found that ideology and power likely share useful features and that fine-tuning to predict them jointly therefore yields improved performance for both tasks. This improvement also extends to fine-tuning to predict synthetic labels, in our case polarity. We also note that performance can be improved by making the context of the speech, as available through meta data, available to the models. Furthermore, back-translation can be utilised to boost performance on countries with a smaller presence in a given dataset. Finally, we found that English models predicting on automatic translations tend to outperform multilingual models predicting on the original languages, but that an ensemble of both types of models is the best approach. Our approach obtained the number one ranking for the task of Ideology and Power Identification in Parliamentary Debates as part of the Touché lab at CLEF 2024 [1].

Bibliography

[1] J. Kiesel, Ç. Çöltekin, M. Heinrich, M. Fröbe, M.
Alshomary, B. De Longueville, T. Erjavec, N. Handke, M. Kopp, N. Ljubešić, K. Meden, N. Mirzakhmedova, V. Morkevičius, T. Reitis-Münstermann, M. Scharfbillig, N. Stefanovitch, H. Wachsmuth, M. Potthast, and B. Stein, “Overview of Touché 2024: Argumentation Systems,” in Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), ser. Lecture Notes in Computer Science, L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, and N. Ferro, Eds. Berlin Heidelberg New York: Springer, Sep.

[2] U. N. Group, “Exist: sexism identification in social networks,” 2021, accessed on January 28, 2024. [Online]. Available: http://nlp.uned.es/exist2021/

[3] S. Coleman, “Meaningful political debate in the age of the soundbite,” in Televised election debates: International perspectives. Springer, 2000, pp. 1–24.

[4] H. Wasmuth and E. Nitecki, “(Un)intended consequences in current ECEC policies: Revealing and examining hidden agendas,” Policy Futures in Education, vol. 18, no. 6, pp. 686–699, 2020.

[5] M. J. Hinich and M. C. Munger, Analytical Politics. Cambridge University Press, 1997.

[6] Pew Research Center, “Americans’ dismal views of the nation’s politics,” https://www.pewresearch.org/politics/2023/09/19/americans-dismal-views-of-the-nations-politics/, 2023, accessed on November 29, 2023.

[7] M. K. David, “Language, power and manipulation: The use of rhetoric in maintaining political influence,” Frontiers of Language and Teaching, vol. 5, no. 1, pp. 164–170, 2014.

[8] Ö. Sahin and Ö. Sahin, “A gentle introduction to ML and NLP,” Develop Intelligent iOS Apps with Swift: Understand Texts, Classify Sentiments, and Autodetect Answers in Text Using NLP, pp. 1–15, 2021.
[9] W. Medhat, A. Hassan, and H. Korashy, “Sentiment analysis algorithms and applications: A survey,” Ain Shams Engineering Journal, vol. 5, no. 4, pp. 1093–1113, 2014.

[10] P. Törnberg, “Chatgpt-4 outperforms experts and crowd workers in annotating political twitter messages with zero-shot learning,” arXiv preprint arXiv:2304.06588, 2023.

[11] Université Grenoble Alpes, “CLEF 2024 - conference and labs of the evaluation forum,” https://clef2024.imag.fr/, 2024, accessed on January 17, 2024.

[12] Webis Group, “Touche,” https://touche.webis.de/, accessed on January 17, 2024.

[13] ——, “Ideology and Power Identification in Parliamentary Debates 2024,” https://touche.webis.de/clef24/touche24-web/ideology-and-power-identification-in-parliamentary-debates.html, accessed on January 17, 2024.

[14] L. Derczynski, “Complementarity, F-score, and NLP Evaluation,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 2016, pp. 261–266.

[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.

[16] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.

[17] T. Pires, E. Schlinger, and D. Garrette, “How multilingual is multilingual bert?” arXiv preprint arXiv:1906.01502, 2019.

[18] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.

[19] H. Touvron, L. Martin, K. Stone, P. Albert, A.
Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.

[20] P. He, X. Liu, J. Gao, and W. Chen, “Deberta: Decoding-enhanced bert with disentangled attention,” arXiv preprint arXiv:2006.03654, 2020.

[21] P. He, J. Gao, and W. Chen, “Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing,” arXiv preprint arXiv:2111.09543, 2021.

[22] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.

[23] S. Behdenna, F. Barigou, and G. Belalem, “Document level sentiment analysis: a survey,” EAI Endorsed Transactions on Context-aware Systems and Applications, vol. 4, no. 13, pp. e2–e2, 2018.

[24] A. Meena and T. V. Prabhakar, “Sentence level sentiment analysis in the presence of conjuncts using linguistic analysis,” in Advances in Information Retrieval: 29th European Conference on IR Research, ECIR 2007, Rome, Italy, April 2-5, 2007. Proceedings 29. Springer, 2007, pp. 573–580.

[25] K. Schouten and F. Frasincar, “Survey on aspect-level sentiment analysis,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 3, pp. 813–830, 2015.

[26] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” arXiv preprint arXiv:2305.14314, 2023.

[27] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
[28] Hugging Face, “Quantization,” https://huggingface.co/docs/optimum/concept_guides/quantization, accessed on January 19, 2024.

[29] D. Kalamkar, D. Mudigere, N. Mellempudi, D. Das, K. Banerjee, S. Avancha, D. T. Vooturi, N. Jammalamadaka, J. Huang, H. Yuen et al., “A study of bfloat16 for deep learning training,” arXiv preprint arXiv:1905.12322, 2019.

[30] M. A. Ganaie, M. Hu, A. K. Malik, M. Tanveer, and P. N. Suganthan, “Ensemble deep learning: A review,” Engineering Applications of Artificial Intelligence, vol. 115, p. 105151, 2022.

[31] A. F. M. de Paula, G. Rizzi, E. Fersini, and D. Spina, “Ai-upv at exist 2023–sexism characterization using large language models under the learning with disagreements regime,” arXiv preprint arXiv:2307.03385, 2023.

[32] O. Rubin, J. Herzig, and J. Berant, “Learning to retrieve prompts for in-context learning,” arXiv preprint arXiv:2112.08633, 2021.

[33] C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen et al., “In-context learning and induction heads,” arXiv preprint arXiv:2209.11895, 2022.

[34] M. Bosley, M. Jacobs-Harukawa, H. Licht, and A. Hoyle, “Do we still need bert in the age of gpt? comparing the benefits of domain-adaptation and in-context-learning approaches to using llms for political science research,” 2023.

[35] H. Yu, Z. Yang, K. Pelrine, J. F. Godbout, and R. Rabbany, “Open, closed, or small language models for text classification?” arXiv preprint arXiv:2308.10092, 2023.

[36] Y. Chae and T. Davidson, “Large language models for text classification: From zero-shot learning to fine-tuning,” Open Science Foundation, 2023.

[37] W. Zhang, Y. Deng, B. Liu, S. J. Pan, and L. Bing, “Sentiment analysis in the era of large language models: A reality check,” arXiv preprint arXiv:2305.15005, 2023.

[38] M.
Abdullah, O. Altiti, and R. Obiedat, “Detecting propaganda techniques in english news articles using pre-trained transformers,” in 2022 13th International Conference on Information and Communication Systems (ICICS). IEEE, 2022, pp. 301–308.

[39] K. Sprenkamp, D. G. Jones, and L. Zavolokina, “Large language models for propaganda detection,” arXiv preprint arXiv:2310.06422, 2023.

[40] A. Benayas, M. A. Sicilia, and M. Mora-Cantallops, “A comparative analysis of encoder only and decoder only models in intent classification and sentiment analysis: Navigating the trade-offs in model size and performance,” 2024.

[41] C. Sung, T. Dhamecha, S. Saha, T. Ma, V. Reddy, and R. Arora, “Pre-training bert on domain resources for short answer grading,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 6071–6075.

[42] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon, “Domain-specific language model pretraining for biomedical natural language processing,” ACM Transactions on Computing for Healthcare (HEALTH), vol. 3, no. 1, pp. 1–23, 2021.

[43] Smartling, “What is back translation and why is it important?” 2023. [Online]. Available: https://www.smartling.com/resources/101/what-is-back-translation-and-why-is-it-important/

[44] S. Edunov, M. Ott, M. Ranzato, and M. Auli, “On the evaluation of machine translation systems trained with back-translation,” arXiv preprint arXiv:1908.05204, 2019.

[45] D. R. Beddiar, M. S. Jahan, and M. Oussalah, “Data expansion using back translation and paraphrasing for hate speech detection,” Online Social Networks and Media, vol. 24, p. 100153, 2021.

[46] S. Shleifer, “Low resource text classification with ulmfit and backtranslation,” arXiv preprint arXiv:1903.09244, 2019.
[47] Y. Yu, Z. Zhou, X. Zheng, J. Gou, W. Ou, and F. Yuan, “Enhancing label correlations in multi-label classification through global-local label specific feature learning to fill missing labels,” Computers and Electrical Engineering, vol. 113, p. 109037, 2024.

[48] Y. Zhang, Y. Cheng, X. Huang, F. Wen, R. Feng, Y. Li, and Y. Guo, “Simple and robust loss design for multi-label learning with missing labels,” arXiv preprint arXiv:2112.07368, 2021.

[49] X. Zhang, R. Abdelfattah, Y. Song, and X. Wang, “An effective approach for multi-label classification with missing labels,” in 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys). IEEE, 2022, pp. 1713–1720.

[50] Z. Ma and S. Chen, “Expand globally, shrink locally: Discriminant multi-label learning with missing labels,” Pattern Recognition, vol. 111, p. 107675, 2021.

[51] B. Wu, Z. Liu, S. Wang, B.-G. Hu, and Q. Ji, “Multi-label learning with missing labels,” in 2014 22nd International Conference on Pattern Recognition. IEEE, 2014, pp. 1964–1968.

[52] Q. Jia, J. Cui, Y. Xiao, C. Liu, P. Rashid, and E. F. Gehringer, “All-in-one: Multi-task learning bert models for evaluating peer assessments,” arXiv preprint arXiv:2110.03895, 2021.

[53] S. O’Hagan and A. Schein, “Measurement in the age of llms: An application to ideological scaling,” arXiv preprint arXiv:2312.09203, 2023.

[54] S. Santurkar, E. Durmus, F. Ladhak, C. Lee, P. Liang, and T. Hashimoto, “Whose opinions do language models reflect?” arXiv preprint arXiv:2303.17548, 2023.

[55] F. Motoki, V. Pinho Neto, and V.
Rodrigues, “More human than human: Measuring chatgpt political bias,” Available at SSRN 4372349, 2023.

[56] J. L. Martin, “The ethico-political universe of chatgpt,” Journal of Social Computing, vol. 4, no. 1, pp. 1–11, 2023.

[57] M. Thomas, B. Pang, and L. Lee, “Get out the vote: Determining support or opposition from congressional floor-debate transcripts,” arXiv preprint cs/0607062, 2006.

[58] R. Malouf and T. Mullen, “Graph-based user classification for informal online political discourse,” in Proceedings of the 1st Workshop on Information Credibility on the Web, 2007.

[59] M. Laver, K. Benoit, and J. Garry, “Extracting policy positions from political texts using words as data,” American Political Science Review, vol. 97, no. 2, pp. 311–331, 2003.

[60] D. Diermeier, J.-F. Godbout, B. Yu, and S. Kaufmann, “Language and ideology in congress,” British Journal of Political Science, vol. 42, no. 1, pp. 31–55, 2012.

[61] Y. Riabinin, “Computational identification of ideology in text: A study of canadian parliamentary debates,” MSc paper, Department of Computer Science, University of Toronto, 2009.

[62] J. Wäckerle, “Data set description for chapter 9: The parliamentary speech dataset (parlspeech) dataset,” 2020.

[63] Ç. Çöltekin, M. Kopp, V. Morkevičius, N. Ljubešić, K. Meden, and T. Erjavec, “Training data for the shared task ideology and power identification in parliamentary debates,” https://doi.org/10.5281/zenodo.10450641, 2024.

[64] CLARIN ERIC, “Parlamint: Harmonised parliamentary corpora,” 2021, accessed on November 24, 2023. [Online]. Available: https://www.clarin.eu/parlamint

[65] K. K. Dobbin and R. M. Simon, “Optimally splitting cases for training and testing high dimensional classifiers,” BMC Medical Genomics, vol. 4, no. 1, pp. 1–8, 2011.

[66] Q. H. Nguyen, H.-B. Ly, L. S. Ho, N. Al-Ansari, H. V. Le, V. Q. Tran, I. Prakash, and B. T.
Pham, “Influence of data splitting on performance of machine learning models in prediction of shear strength of soil,” Mathematical Problems in Engineering, vol. 2021, pp. 1–15, 2021.

[67] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., “Mistral 7b,” arXiv preprint arXiv:2310.06825, 2023.

[68] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” arXiv preprint arXiv:1911.02116, 2019.

[69] G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love et al., “Gemma: Open models based on gemini research and technology,” arXiv preprint arXiv:2403.08295, 2024.

A List of countries in the dataset

• Austria (at)
• Bosnia and Herzegovina (ba)
• Belgium (be)
• Czechia (cz)
• Denmark (dk)
• Estonia (ee) [only political orientation]
• Spain (es)
• Catalonia (es-ct)
• Galicia (es-ga)
• Basque Country (es-pv) [only power]
• Finland (fi)
• France (fr)
• Great Britain (gb)
• Greece (gr)
• Croatia (hr)
• Hungary (hu)
• Iceland (is) [only political orientation]
• Italy (it)
• Latvia (lv)
• The Netherlands (nl)
• Norway (no) [only political orientation]
• Poland (pl)
• Portugal (pt)
• Serbia (rs)
• Sweden (se) [only political orientation]
• Slovenia (si)
• Turkey (tr)
• Ukraine (ua)

B Polarity base prompt

Label the polarity of the following text, similarly to the provided examples. Your answer needs to start with “positive”, “negative” or “neutral”, followed by a short justification for your answer. It is important that you only assign a positive or negative label if you are sure of your answer. Here is your first example.
Text: The south-west was cut off from the UK last winter and Network Rail performed miracles in getting that line back up and running. I therefore find it extraordinary that reasons such as the weather have been used to excuse the chaos and incompetence of this debacle, particularly out of King’s Cross. Why did the Secretary of State feel that it was not necessary for Ministers to ask for a basic reassurance that an overrun on any of the big programmes could be managed? Why were contingency plans not in place, and why was the rail regulator warning not adhered to?

Negative. The text expresses frustration and criticism towards the handling of infrastructure issues, particularly the failure to address problems with the rail system despite previous incidents. It highlights perceived incompetence and lack of planning, suggesting a negative sentiment towards the situation.

Here is your second example.

Text: We are committed to ensuring that claimants receive high-quality, objective, fair and accurate assessments. The Department monitors assessment quality through independent audit. Assessments deemed unacceptable are returned to the provider for reworking. A range of measures, including provider improvement plans, address performance falling below expected standards.

I do agree with the hon. Lady, which is why we have been trying to work more strategically with Motability, thrashing through the issues I am very aware of on appeals and on matters such as when an individual leaves the country. We are looking to reduce the amount of time that appeals take and at what we can do with the running of the scheme so that the precise scenario she outlines does not happen.

Neutral. The text describes the commitment to ensuring quality assessments for claimants and outlines measures taken to monitor and address assessment quality. Additionally, it mentions efforts to work with Motability to improve processes and reduce appeal times. The tone is informative and focused on addressing issues, without expressing overt positivity or negativity.

Here is your third example.

Text: I congratulate the hon. Gentleman on bringing this much needed debate to the Floor of the House. Will he join me in paying tribute to local MND associations across the United Kingdom for the invaluable support they provide? I know of the excellent work of my local Leicestershire and Rutland association, having heard at first hand from a constituent and friend of mine, Ruth Morrison, about her tragic personal experience. The support that is availabl