Co-clustering of Tensor Data Using Sparse Tensor Factorisation
Published
Author
Type
Master's thesis
Abstract
With the ever-increasing amounts of data generated from new sources and scientific
methods, e.g. high-throughput genome sequencing methods in bioinformatics, powerful
tools for exploratory data analysis are required. One such tool is clustering, i.e.
grouping together coherent observations in data, which is important for categorising
vast amounts of observations into a more manageable format for further analysis.
However, this task is subject to new challenges as tensor data, i.e. multidimensional
data, has become a frequent occurrence in many applications. For tensor data, a
clustering approach called co-clustering has recently attracted particular research
attention. Co-clustering means that the clustering is performed on all of the tensor
dimensions simultaneously, which enables the detection of joint data expressions
that only occur under special circumstances.
In this thesis, two methods for co-clustering of tensor data using sparse CP decompositions
are proposed. The motivation behind using a tensor factorisation with
enforced sparsity is that it can enable the extraction of the most relevant data from
the tensor, whilst reducing noise. The first method, called the sCP-S, considers the
sign pattern in the vectors, obtained from a sparse CP decomposition, to determine
the clustering. The second method instead uses hierarchical clustering on the sparse
CP decomposition vectors, and is named sCP-HC. The two methods were compared
on simulated data and the more flexible sCP-HC was tested thoroughly on more
advanced simulated data sets. The types of predefined co-clusters that can be detected,
and the stability of co-cluster detection under perturbations of the input
data, were both investigated prior to applying the sCP-HC on real data. These
evaluations were performed through computer simulations on synthetic data sets,
followed by an application to a real genomic tensor data set.
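As a rough illustration of the two ideas above, the following sketch computes a rank-1 sparse decomposition of a small 3-way tensor and then groups the entries of one factor vector in two ways: by sign, and by agglomerative hierarchical clustering. It is not the thesis's algorithm. The alternating update with relative soft thresholding, the function names (soft_threshold, unit, sparse_rank1_cp), the threshold fraction and the toy tensor are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def soft_threshold(v, frac):
    """Shrink entries toward zero; entries below frac * max|v| become exactly zero."""
    thr = frac * np.abs(v).max()
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)


def unit(v):
    """Scale to unit Euclidean norm (an all-zero vector is returned unchanged)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def sparse_rank1_cp(X, frac=0.3, n_iter=50):
    """Assumed scheme: alternating rank-1 updates, each followed by soft thresholding."""
    a, b, c = (np.ones(n) / np.sqrt(n) for n in X.shape)
    for _ in range(n_iter):
        a = unit(soft_threshold(np.einsum("ijk,j,k->i", X, b, c), frac))
        b = unit(soft_threshold(np.einsum("ijk,i,k->j", X, a, c), frac))
        c = unit(soft_threshold(np.einsum("ijk,i,j->k", X, a, b), frac))
    return a, b, c


# Toy tensor (hypothetical data): one additive block plus weak Gaussian noise.
rng = np.random.default_rng(0)
X = 0.1 * rng.standard_normal((12, 10, 8))
X[:4, :5, :3] += 3.0

a, b, c = sparse_rank1_cp(X)

# Sign-based grouping: +1 / -1 entries belong to the component, 0 entries do not.
sign_labels_a = np.sign(a).astype(int)

# Hierarchical grouping: average-linkage clustering of the entries of a, cut into two groups.
hc_labels_a = fcluster(linkage(a[:, None], method="average"), t=2, criterion="maxclust")
```

In this sketch the nonzero entries of a, b and c mark the indices of the planted block in each tensor dimension, so the three vectors together describe one co-cluster. The hierarchical variant does not rely on exact zeros, which is one way to read the description of sCP-HC as the more flexible method.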
The simulation results show that the sCP-HC has the potential
to detect several types of additive coherent co-clusters. Additionally, the stability
simulations show that the sCP-HC is quite consistent in its co-clustering, even in the
presence of considerable noise. Applying the sCP-HC to real genomic data yielded
several interesting co-clusters, which can be used for further analysis. As
such, this work concludes that the sCP-HC is a useful tool for detecting coherent
co-clusters in tensor data, and for exploratory data analysis.
Subject/keywords
Co-clustering, Tensor decomposition, CANDECOMP/PARAFAC decomposition, Sparsity, Agglomerative hierarchical clustering