Clustering software projects at large-scale using time-series

Jungen, Heiko Joshua; Pickerill, Peter

dc.contributor.author	Jungen, Heiko Joshua
dc.contributor.author	Pickerill, Peter
dc.contributor.department	Chalmers tekniska högskola / Institutionen för data- och informationsteknik (Chalmers)	sv
dc.contributor.department	Chalmers University of Technology / Department of Computer Science and Engineering (Chalmers)	en
dc.date.accessioned	2019-07-03T14:49:51Z
dc.date.available	2019-07-03T14:49:51Z
dc.date.issued	2018
dc.description.abstract	Context: Within the field of Mining Software Repositories, numerous methods are used to filter datasets so that low-quality projects, such as backups and student projects, are not analysed. Since the rise of GitHub, the world’s largest repository hosting site, large-scale analysis has become more common. However, filtering methods have not kept up with this growth, and researchers often rely on “quick and dirty” techniques to curate datasets. Objective: The objective of this thesis is to develop a method that clusters large quantities of software projects in a limited time frame. It explores whether a fully automated method can identify high-quality repositories at large scale and in the absence of an established ground truth. At the same time, the hardware requirements and time limitations of existing approaches should be reduced to lower the barrier for researchers. Method: This thesis follows the design science methodology. The proposed method, PHANTOM, extracts five measures from Git logs. Each measure is transformed into a time-series, which is represented as a feature vector for clustering using a k-means algorithm. Results: Using the ground truth from a previous study, PHANTOM was shown to be competitive with supervised approaches while reducing the hardware requirements by two orders of magnitude. The ground truth was rediscovered by several k-means models, with some models achieving up to 87% precision or 94% recall. The highest Matthews correlation coefficient (MCC) was 0.65. The method was later applied to over 1.77 million repositories obtained from GitHub, of which it classified 38% as “well-engineered”. The work also shows that cloning repositories is a viable alternative to the GitHub API and GHTorrent for collecting metadata. Conclusions: It is possible to use a fully automated, unsupervised approach to identify high-quality projects.
PHANTOM’s reference implementation, called COYOTE, downloaded the metadata of 1,786,601 GitHub repositories in 21.5 days, which is over 33% faster than a similar study using a computer cluster. PHANTOM is flexible and can be improved further, but it already shows excellent results. The method is able to filter repositories very accurately with low hardware requirements and was able to rediscover an established ground truth. In future work, cluster analysis is needed to identify the characteristics that impact repository quality.
dc.identifier.uri	https://hdl.handle.net/20.500.12380/255670
dc.language.iso	eng
dc.setspec.uppsok	Technology
dc.subject	Data- och informationsvetenskap
dc.subject	Computer and Information Science
dc.title	Clustering software projects at large-scale using time-series
dc.type.degree	Examensarbete för masterexamen	sv
dc.type.degree	Master Thesis	en
dc.type.uppsok	H
local.programme	Software engineering and technology (MPSOF), MSc
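
The abstract describes turning measures from Git logs into time-series and then into feature vectors for k-means. The sketch below illustrates that general idea only; it is not PHANTOM or COYOTE. It assumes commit times are available as Unix timestamps (for example from `git log --format=%at`), and the function names, the weekly bucketing, and the four features are hypothetical choices, since the abstract does not name the five measures or the features used.

```python
from collections import Counter
from statistics import mean

WEEK_SECONDS = 7 * 24 * 3600


def weekly_series(timestamps):
    """Count events per 7-day window, starting at the earliest event.

    Windows with no events are filled with zero so the result is a
    regularly spaced time-series. (Hypothetical bucketing choice.)
    """
    if not timestamps:
        return []
    start = min(timestamps)
    counts = Counter(int((t - start) // WEEK_SECONDS) for t in timestamps)
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


def feature_vector(series):
    """Summarise a time-series as a fixed-length vector.

    The four features are illustrative assumptions, not the thesis's set.
    """
    if not series:
        return [0.0, 0.0, 0.0, 0.0]
    return [
        float(len(series)),                               # duration in weeks
        float(max(series)),                               # peak weekly activity
        float(mean(series)),                              # average weekly activity
        sum(1 for v in series if v == 0) / len(series),   # share of idle weeks
    ]


# Four commits: two in the first week, one in the second, one in the fourth.
commits = [0, 100, WEEK_SECONDS, 3 * WEEK_SECONDS]
series = weekly_series(commits)
vector = feature_vector(series)
```

Vectors of this kind, one per repository, could then be passed to any k-means implementation; fixed-length vectors are what make repositories with histories of different lengths comparable.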

Original bundle

Name: 255670.pdf
Size: 1.53 MB
Format: Adobe Portable Document Format
Description: Full text

Collections

Master's theses