Clustering software projects at large-scale using time-series

Master's thesis

Type: Master's thesis
Title: Clustering software projects at large-scale using time-series
Authors: Jungen, Heiko Joshua
Pickerill, Peter
Abstract:
Context: Within the field of Mining Software Repositories, numerous methods are employed to filter datasets in order to avoid analysing low-quality projects, such as backups and student projects. Since the rise of GitHub, the world's largest repository hosting site, large-scale analysis has become more common. However, filtering methods have not kept pace with this growth, and researchers often rely on "quick and dirty" techniques to curate datasets.
Objective: The objective of this thesis is to develop a method that clusters large quantities of software projects within a limited time frame. This explores whether a fully automated method can identify high-quality repositories at large scale in the absence of an established ground truth. At the same time, the hardware requirements and time limitations of existing approaches should be reduced to lower the barrier for researchers.
Method: This thesis follows the design science methodology. The proposed method, PHANTOM, extracts five measures from Git logs. Each measure is transformed into a time-series, which is represented as a feature vector for clustering with the k-means algorithm.
Results: Using the ground truth from a previous study, PHANTOM was shown to be competitive with supervised approaches while reducing hardware requirements by two orders of magnitude. The ground truth was rediscovered by several k-means models, with some achieving up to 87% precision or 94% recall; the highest Matthews correlation coefficient (MCC) was 0.65. Applied to over 1.77 million repositories obtained from GitHub, the method found that 38% of them are "well-engineered". The method also shows that cloning repositories is a viable alternative to the GitHub API and GHTorrent for collecting metadata.
Conclusions: It is possible to use a fully automated, unsupervised approach to identify high-quality projects. PHANTOM's reference implementation, COYOTE, downloaded the metadata of 1,786,601 GitHub repositories in 21.5 days, over 33% faster than a similar study that used a computer cluster. PHANTOM is flexible and can be improved further, but it already shows excellent results: it filters repositories accurately with low hardware requirements and rediscovered an established ground truth. In future work, cluster analysis is needed to identify the characteristics that impact repository quality.
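The pipeline described in the abstract (per-project measures → time-series → fixed-length feature vectors → k-means) can be sketched in a few lines. This is an illustrative reduction, not PHANTOM or its reference implementation COYOTE: the single "weekly commit count" measure, the bin-averaging feature construction, the farthest-first initialisation, and all names here are assumptions for the sketch only.

```python
# Illustrative sketch of a PHANTOM-style pipeline (assumptions, not the
# thesis's method): turn one per-project measure (hypothetical weekly
# commit counts) into a fixed-length feature vector, then cluster with
# a minimal k-means.

def to_feature_vector(series, length=8):
    """Resample a variable-length time-series into `length` bin averages."""
    step = len(series) / length
    return [
        sum(series[int(i * step):int((i + 1) * step)]) /
        max(1, int((i + 1) * step) - int(i * step))
        for i in range(length)
    ]

def _dist2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k=2, iters=20):
    """Minimal k-means with deterministic farthest-first initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all chosen centroids.
        centroids.append(max(vectors,
                             key=lambda v: min(_dist2(v, c) for c in centroids)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[min(range(k), key=lambda j: _dist2(v, centroids[j]))].append(v)
        # Recompute each centroid as the mean of its cluster (keep old if empty).
        centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return [min(range(k), key=lambda j: _dist2(v, centroids[j])) for v in vectors]

# Two steadily-active projects vs. two "burst then abandoned" projects
# (made-up weekly commit counts).
projects = [
    [5, 6, 4, 7, 5, 6, 5, 4, 6, 5, 7, 6, 5, 6, 4, 5],
    [4, 5, 6, 5, 4, 6, 5, 5, 4, 6, 5, 4, 6, 5, 5, 4],
    [9, 8, 7, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [8, 9, 6, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
labels = kmeans([to_feature_vector(s) for s in projects], k=2)
```

On this toy data the two activity patterns land in different clusters; the thesis works from five measures per repository and evaluates the resulting models against an established ground truth, which this sketch does not attempt.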
Keywords: Computer and Information Science
Issue Date: 2018
Publisher: Chalmers University of Technology / Department of Computer Science and Engineering
Collection: Master Theses
