Clustering software projects at large-scale using time-series

dc.contributor.author: Jungen, Heiko Joshua
dc.contributor.author: Pickerill, Peter
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data- och informationsteknik (Chalmers) [sv]
dc.contributor.department: Chalmers University of Technology / Department of Computer Science and Engineering (Chalmers) [en]
dc.description.abstract:
Context: Within the field of Mining Software Repositories, numerous methods are used to filter datasets so that low-quality projects, such as backups and student projects, are not analysed. Since the rise of GitHub, the world’s largest repository hosting site, large-scale analysis has become more common. However, filtering methods have not kept up with this growth, and researchers often rely on “quick and dirty” techniques to curate datasets.
Objective: The objective of this thesis is to develop a method that clusters large quantities of software projects in a limited time frame. It explores whether a fully automated method can identify high-quality repositories at large scale and in the absence of an established ground-truth. At the same time, the hardware requirements and time limitations of existing approaches should be reduced to lower the barrier for researchers.
Method: This thesis follows the design science methodology. The proposed method, PHANTOM, extracts five measures from Git logs. Each measure is transformed into a time-series, which is represented as a feature vector for clustering with a k-means algorithm.
Results: Using the ground-truth from a previous study, PHANTOM was shown to be competitive with supervised approaches while reducing the hardware requirements by two orders of magnitude. The ground-truth was rediscovered by several k-means models, with some models achieving up to 87% precision or 94% recall. The highest Matthews correlation coefficient (MCC) was 0.65. When the method was later applied to over 1.77 million repositories obtained from GitHub, it found that 38% of them are “well-engineered”. The method also shows that cloning repositories is a viable alternative to the GitHub API and GHTorrent for collecting metadata.
Conclusions: It is possible to use a fully automated, unsupervised approach to identify high-quality projects. PHANTOM’s reference implementation, called COYOTE, downloaded the metadata of 1,786,601 GitHub repositories in 21.5 days, which is over 33% faster than a similar study using a computer cluster. PHANTOM is flexible and can be improved further, but it already shows excellent results. The method filters repositories very accurately with low hardware requirements and was able to rediscover an established ground-truth. In future work, cluster analysis is needed to identify the characteristics that impact repository quality.
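The pipeline described in the abstract (Git log measure, then time-series, then feature vector, then k-means, evaluated with MCC) can be illustrated with a minimal sketch. This is not PHANTOM or COYOTE code. The abstract does not list the five measures or the features, so the sketch assumes a single measure (commits per week) and four hypothetical features, and the function names `weekly_series`, `features`, `kmeans` and `mcc` are invented for illustration.

```python
import random
from collections import Counter
from math import sqrt

WEEK = 7 * 24 * 3600  # seconds per week


def weekly_series(timestamps):
    """Turn Unix commit timestamps into commits-per-week counts."""
    if not timestamps:
        return []
    start = min(timestamps)
    counts = Counter((t - start) // WEEK for t in timestamps)
    return [counts.get(w, 0) for w in range(max(counts) + 1)]


def features(series):
    """Hypothetical features: length, peak, mean, share of active weeks."""
    n = len(series)
    if n == 0:
        return [0.0, 0.0, 0.0, 0.0]
    active = sum(1 for v in series if v > 0)
    return [float(n), float(max(series)), sum(series) / n, active / n]


def dist2(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k=2, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k random points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

In practice the timestamps could come from a cloned repository's Git log, and the feature columns would usually be scaled before clustering so that no single feature dominates the distance. Because k-means is unsupervised, cluster labels are arbitrary and must be mapped to classes before precision, recall or MCC can be computed against a ground-truth.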
dc.subject: Data- och informationsvetenskap
dc.subject: Computer and Information Science
dc.title: Clustering software projects at large-scale using time-series
dc.type.degree: Examensarbete för masterexamen [sv]
dc.type.degree: Master Thesis [en]
local.programme: Software engineering and technology (MPSOF), MSc