Causal Models Applied to Studies within the Mining Software Repository Domain
Type
Master's Thesis
Program
Software engineering and technology (MPSOF), MSc
Published
2024
Authors
LEVINSSON, AMANDA
FRANSSON, LINNÉA
Abstract
Context: Research in the mining software repository domain commonly relies on
observational data, since software repositories are a rich source of such data.
At the same time, causal reasoning is rarely incorporated in Software Engineering
(SE) research, even though statistical analyses are often conducted.
Objective: To analyse the practical implications of applying causal models to studies
from the Mining Software Repositories (MSR) conference. Specifically, the aim is to
examine whether researchers have inadvertently included variables (colliders)
in their analyses that biased their results.
Method: Computer simulation was used as the research methodology. This included
the steps of (1) identifying a paper with colliders by sampling from the MSR
conference and constructing Directed Acyclic Graphs (DAGs), (2) a theoretical computer
simulation of an SE scenario to demonstrate collider effects, and (3) computer simulations
using synthetic data generated on the basis of the identified research paper. In addition,
an analysis was conducted using the original data from the chosen paper.
Results: A lack of transparency was identified in the research investigated:
variable selection processes and underlying assumptions were not completely
clear. Three papers were investigated in the first step of constructing DAGs. Subsequently,
colliders were identified in the paper of Nagy and Abdalkareem [46]. Simulations
revealed that excluding the collider variables improved the sought-after
effect sizes. However, no practical implications could be determined. A replication
package is available.
Conclusion: A lack of transparency hindered the construction of DAGs, because the
authors' assumptions had to be interpreted, and it indicates a threat to
advancements in research. Incorporating causality and DAGs would increase
transparency and could therefore, in the long run, result in
more robust advancements in research. Additionally, DAGs are recommended as
tools to mitigate the risk of accidentally conditioning on colliders.
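
The collider effect referred to in the abstract can be illustrated with a minimal sketch. This is not the thesis's simulation or its replication package. The variable names, coefficients, sample size, and the function names `ols_coefficients` and `simulate_collider` are all assumptions chosen for illustration. Two variables x and y are generated independently (the true effect of x on y is zero by assumption), and a third variable c is caused by both, making it a collider. Regressing y on x alone should recover an effect near zero, while adding c to the regression induces a spurious association.

```python
import numpy as np


def ols_coefficients(predictors, outcome):
    """Ordinary least squares; returns coefficients with the intercept first."""
    design = np.column_stack([np.ones(len(outcome)), *predictors])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef


def simulate_collider(n=50_000, true_effect=0.0, seed=0):
    """Hypothetical scenario: x -> c <- y, with an assumed true effect of x on y.

    Returns the estimated coefficient of x on y
    (a) without the collider c and (b) with c included as a covariate.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                        # exposure (illustrative)
    y = true_effect * x + rng.normal(size=n)      # outcome
    c = x + y + rng.normal(size=n)                # collider: caused by both x and y

    naive = ols_coefficients([x], y)[1]           # y ~ x
    adjusted = ols_coefficients([x, c], y)[1]     # y ~ x + c (conditions on the collider)
    return naive, adjusted


if __name__ == "__main__":
    naive, adjusted = simulate_collider()
    print(f"x coefficient without collider: {naive:+.3f}")
    print(f"x coefficient with collider:    {adjusted:+.3f}")
```

Under these assumed unit-variance settings, the estimate without c stays close to the true value of zero, whereas the estimate with c drifts towards -0.5, even though x has no effect on y. Drawing the DAG first makes it visible that c sits at the meeting point of two arrows and should not be included as a covariate.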
Subject/keywords
Empirical Software Engineering, Colliders, Directed Acyclic Graphs, DAGs, Mining Software Repository Research, Causal Inference, Bayesian Statistics