Searching for Android security defects with the help of an NLP machine learning model and existing vulnerability data
Type: Master's thesis
Abstract
Background
Over the years, as modern code review became a common software engineering
practice, the primary benefits of the review process shifted away from finding
defects and toward knowledge sharing and communication. As such, there is
demand for new tooling that finds defects and integrates into the modern code
review process. One such tool is the Automatic Code Review Assistant (ACoRA),
which, by training on previously performed code reviews, can automatically
find defects in new code. Although ACoRA could in theory be trained to locate any
type of software defect, this study limits its scope to security vulnerabilities.
Aim
ACoRA trains on previously performed code reviews, specifically those that
showcase occurrences of programming defects. This study aims to design
and evaluate a new artifact, dubbed SeCoRA, intended to facilitate the
acquisition of such code reviews by making use of a database of known
security vulnerabilities.
Method
The study follows the design science methodology. Using an unsupervised machine
learning model, SeCoRA can compare two fragments of code against each other
and express how similar they are. Based on this ability, the Common Vulnerabilities
and Exposures (CVE) database can be used to discover code reviews which contain
vulnerable code. The assumption is that if a code review contains code similar
to an existing vulnerability, that code is also potentially defective. SeCoRA is built
specifically to gather such code reviews from the Android Open Source Project.
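The abstract does not specify how SeCoRA represents or scores code fragments. As a minimal illustration of the general idea, code fragments can be tokenized into frequency vectors and compared with cosine similarity, flagging a fragment when it is close to any known vulnerable snippet. All function names and the threshold below are hypothetical, not taken from the thesis:

```python
import math
import re
from collections import Counter


def tokenize(code: str) -> Counter:
    # Split a code fragment into identifiers and punctuation tokens.
    return Counter(re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_]", code))


def cosine_similarity(a: Counter, b: Counter) -> float:
    # Cosine similarity between two token-frequency vectors (0.0 to 1.0).
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_to_known_vulnerability(fragment: str,
                                   cve_snippets: list[str],
                                   threshold: float = 0.8) -> bool:
    # Flag a review fragment as potentially defective if it resembles
    # any snippet from the vulnerability database.
    tokens = tokenize(fragment)
    return any(cosine_similarity(tokens, tokenize(s)) >= threshold
               for s in cve_snippets)
```

A real system would use a learned embedding rather than raw token counts, but the filtering step works the same way: score each review fragment against the vulnerability corpus and keep those above a similarity threshold.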
Results
SeCoRA was first evaluated on its ability to distinguish between code fragments
in general, using both lines and blocks of code. The larger code fragments did
not improve the comparison, and hence lines of code were used to filter a set
of 1194 code reviews for similar code. This approach identified 11 code reviews
containing potential security defects, but these did not adhere to the
classification of the original vulnerable code.
Conclusions
Although SeCoRA was able to distinguish between different lines of code, the tool
is not accurate enough to find security-related code reviews. The results are
therefore negative, as the tool does not solve the problem of acquiring the data
necessary to train ACoRA. As part of the final discussion, the authors present a
project post-mortem and lay the groundwork for possible future work.