Tackling Missing Values in Mass Spectrometry-based Proteomics Data
Type
Master's thesis
Abstract
In the development of therapeutics, analysis of differentially abundant proteins (DAPs)
using mass spectrometry (MS) is essential. However, MS-based data suffers from high
rates of missing values that severely complicate downstream analyses. Various imputation
methods have been proposed to deal with the missing data, but there is no standard protocol
for selecting a method. Here we comprehensively evaluated common methods in order to
develop a best practice for imputation that informs downstream statistical analyses of MS
proteomics data. We compared the performance of five imputation methods when
applied to values missing completely at random and missing not at random, introduced
into data from the Cancer Cell Line Encyclopedia and into data simulated from a multivariate
mixed-effects model, respectively. Performance was measured as the true positive rate (TPR)
and false positive rate (FPR) of detected DAPs (adjusted p-value threshold of 0.05, estimated
log2 fold-change threshold of 1, and an accuracy metric [&] 103). The FPR was below 5% for all methods under all conditions
tested. When less than 10% of the data were missing, imputation did not increase the TPR
compared to removing missing values. For 30% missingness, irrespective of data or
missingness type, the TPR was below 80%, and for 50% missingness the TPR was 25-75%,
depending on the imputation method. Since the FPR was controlled, no artefacts were
introduced by any method under any of the conditions tested. For large proportions of missingness
(50%), we recommend Principal Component Analysis imputation if the
sample size is large (n ≥ 50). With small sample sizes (n = 10) or small proportions of
missingness (10%), imputation is advised against.
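
The evaluation summarised above rests on two ingredients: removing values from complete data under a known missingness mechanism, and scoring the proteins detected afterwards against the known truth. The following Python sketch illustrates both under stated assumptions and is not the thesis's own code. The function names are hypothetical, values missing completely at random are drawn uniformly over all entries, and values missing not at random are approximated by removing the lowest intensities (simple left-censoring), which may differ from the mechanism used in the thesis.

```python
import numpy as np


def mask_mcar(x, frac, rng):
    """Set a fraction `frac` of entries to NaN, chosen uniformly at random."""
    out = np.array(x, dtype=float)  # np.array copies, so the input is untouched
    n_missing = int(round(frac * out.size))
    idx = rng.choice(out.size, size=n_missing, replace=False)
    out.flat[idx] = np.nan
    return out


def mask_mnar(x, frac):
    """Set the lowest `frac` of entries to NaN (assumed left-censoring)."""
    out = np.array(x, dtype=float)
    n_missing = int(round(frac * out.size))
    idx = np.argsort(out, axis=None)[:n_missing]  # flat indices of the smallest values
    out.flat[idx] = np.nan
    return out


def tpr_fpr(detected, truth):
    """True and false positive rates of detected proteins against known truth."""
    detected = np.asarray(detected, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tpr = np.sum(detected & truth) / np.sum(truth)
    fpr = np.sum(detected & ~truth) / np.sum(~truth)
    return tpr, fpr


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical log2 intensities: 100 proteins measured in 10 samples.
    intensities = rng.normal(loc=20.0, scale=2.0, size=(100, 10))
    mcar = mask_mcar(intensities, 0.3, rng)
    mnar = mask_mnar(intensities, 0.3)
    print("MCAR missing fraction:", np.isnan(mcar).mean())
    print("MNAR missing fraction:", np.isnan(mnar).mean())
    print("TPR, FPR:", tpr_fpr([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]))
```

In a full evaluation, the masked matrix would be imputed, differential abundance would be tested, and `tpr_fpr` would compare the detected proteins against those known to differ in the complete or simulated data.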
Subject/keywords
imputation, missing data, mass spectrometry, multivariate mixed-effects models, differential abundance, proteomics