Co-clustering of Tensor Data Using Sparse Tensor Factorisation

Type
Master's thesis
Program
Engineering mathematics and computational science (MPENM), MSc
Published
2020
Author
Tabakovic, Selma
Abstract
With the ever-increasing amounts of data generated from new sources and scientific methods, e.g. high-throughput genome sequencing in bioinformatics, powerful tools for exploratory data analysis are required. One such tool is clustering, i.e. grouping together coherent observations in data, which is important for organising vast numbers of observations into a more manageable format for further analysis. However, this task faces new challenges as tensor data, i.e. multidimensional data, has become common in many applications. For tensor data, an approach called co-clustering has recently attracted particular research attention. Co-clustering means that the clustering is performed on all of the tensor dimensions simultaneously, which enables the detection of joint data expressions that only occur under special circumstances. In this thesis, two methods for co-clustering of tensor data using sparse CP decompositions are proposed. The motivation for using a tensor factorisation with enforced sparsity is that it can extract the most relevant data from the tensor whilst reducing noise. The first method, called sCP-S, uses the sign pattern of the vectors obtained from a sparse CP decomposition to determine the clustering. The second method, named sCP-HC, instead applies hierarchical clustering to the sparse CP decomposition vectors. The two methods were compared on simulated data, and the more flexible sCP-HC was tested thoroughly on more advanced simulated data sets. The types of predefined co-clusters that can be detected, and the stability of co-cluster detection under perturbations of the input data, were both investigated before applying the sCP-HC to real data. These evaluations were performed through computer simulations, along with an application to a real genomic tensor data set.
The simulation results show that the sCP-HC has the potential to detect several types of additive coherent co-clusters. Additionally, the stability simulations show that the sCP-HC is quite consistent in its co-clustering, even in the presence of considerable noise. Applying the sCP-HC to real genomic data yielded several interesting co-clusters, which can be used for further analysis. This work therefore concludes that the sCP-HC is a useful tool for detecting coherent co-clusters in tensor data, and for exploratory data analysis.
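To make the sign-pattern idea concrete, the following is a minimal sketch and not the thesis implementation. It assumes a 3-way tensor and a single rank-one component, and it obtains sparsity by soft-thresholding inside alternating power-style updates, which is one common way to compute a sparse CP component; the abstract does not state which algorithm the thesis uses. The function names, the relative threshold `lam` and the planted test block are all hypothetical choices made for illustration.

```python
import numpy as np


def soft_threshold(v, t):
    """Shrink every entry of v towards zero by t; entries below t become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def sparse_update(v, lam):
    """Threshold v relative to its largest entry, then rescale to unit norm."""
    v = soft_threshold(v, lam * np.abs(v).max())
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def sparse_rank1_cp(X, lam=0.3, n_iter=25):
    """Rank-one sparse CP sketch for a 3-way array X (illustrative only).

    lam in [0, 1) is an assumed relative threshold, not a thesis parameter.
    """
    # Start from the leading left singular vector of each mode unfolding.
    factors = []
    for mode in range(3):
        unfolding = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
        u, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(u[:, 0])
    a, b, c = factors
    # Alternate over the modes: contract X with the other two vectors,
    # then sparsify and normalise the result.
    for _ in range(n_iter):
        a = sparse_update(np.einsum("ijk,j,k->i", X, b, c), lam)
        b = sparse_update(np.einsum("ijk,i,k->j", X, a, c), lam)
        c = sparse_update(np.einsum("ijk,i,j->k", X, a, b), lam)
    return a, b, c


# Simulated tensor: weak noise plus one planted block (assumed test data).
rng = np.random.default_rng(0)
X = 0.1 * rng.standard_normal((10, 8, 6))
X[:4, :3, :2] += 5.0

a, b, c = sparse_rank1_cp(X)

# Sign pattern per mode: 0 means "outside the co-cluster", while +1 and -1
# split the selected indices into groups.
labels = [np.sign(f).astype(int) for f in (a, b, c)]
support = [np.flatnonzero(l).tolist() for l in labels]
print(support)
```

In this sketch the indices with a non-zero sign in every mode together delimit one co-cluster. A hierarchical variant in the spirit of the sCP-HC would replace the sign step with agglomerative clustering of the factor vector entries.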
Subject/keywords
Co-clustering, Tensor decomposition, CANDECOMP/PARAFAC decomposition, Sparsity, Agglomerative hierarchical clustering