Evaluating Different Approaches for Predicting Task Execution Time: A Case Study in a Distributed Production Environment

dc.contributor.author: Carlsson, Jesper
dc.contributor.author: Forsström, Erik
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data och informationsteknik
dc.contributor.examiner: Dubhashi, Devdatt
dc.contributor.supervisor: Gulisano, Vincenzo
dc.date.accessioned: 2019-10-03T13:36:46Z
dc.date.available: 2019-10-03T13:36:46Z
dc.date.issued: 2019
dc.date.submitted: 2019
dc.description.abstract: This project evaluates several machine learning models for predicting the processing times required in a complex distributed system that analyzes large amounts of data. The system is owned and developed by the company Recorded Future (RF), which specializes in sifting through vast amounts of textual data acquired from a variety of online sources in search of threat intelligence it can provide to its clients. The input is pre-processed in several stages before it arrives at the main analyser process, where natural language processing and other tools are used to perform a threat analysis of the text. Our primary goal is to determine the time needed to analyse one of these texts. The ability to predict the time required to process a given input can be used to design scheduling algorithms in cloud computing environments [1]. It is of particular interest to Recorded Future because the company uses the size of its message queues, which can grow large when processing takes too long, to decide when to start additional servers. As servers take some time to come online, starting them proactively based on the estimated queue size, which can be inferred from the required processing time and the available computing resources, can alleviate bottlenecks and other performance issues. RF has defined a maximum error within one order of magnitude of the actual time as acceptable for the purposes of workload estimation. To accomplish this goal, we have developed, trained, and tested several prediction models based on neural networks. Each network considers a different set of input features that may affect the processing time: information extracted from the input data, system performance at the time of analysis, total server workload in terms of input processed in parallel, and past processing times.
For evaluation, each model's error is compared with the prediction error of two naive algorithms that predict the mean and the median of the task execution times in each data set. Our results show that all but one of the prediction models achieve a lower error than the naive approaches, and that all models stay within the maximum error specified by RF. There is a trade-off between how feasible it would be to implement and use a model in the real system and the accuracy it achieves. The model that considers system performance achieves an error that is half that of the model based purely on input information. Considering the total workload of each server reduces the error by only a negligible amount compared to the first model, and using previous task execution times gives fluctuating results, indicating that it is not a suitable basis for prediction in this system.
dc.identifier.coursecode: DATX05
dc.identifier.uri: https://hdl.handle.net/20.500.12380/300389
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: distributed system
dc.subject: processing time prediction
dc.subject: EC2
dc.subject: task execution time
dc.subject: neural network
dc.subject: time series prediction
dc.title: Evaluating Different Approaches for Predicting Task Execution Time: A Case Study in a Distributed Production Environment
dc.type.degree: Master's thesis (Examensarbete för masterexamen)
dc.type.uppsok: H
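
The abstract describes comparing each model against two naive predictors (the mean and the median of the task execution times) and checking the error against a one-order-of-magnitude limit. The following is a minimal sketch of how such a comparison could look, not the thesis authors' implementation. The function names, the sample timings, and the choice to measure error as the absolute base-10 logarithm of the ratio between predicted and actual time are all assumptions made for illustration.

```python
import math
import statistics


def log10_error(predicted, actual):
    """Absolute error in orders of magnitude (assumed metric, not from the thesis)."""
    return abs(math.log10(predicted / actual))


def naive_baselines(train_times):
    """Constant predictions of the two naive approaches: mean and median."""
    return {
        "mean": statistics.mean(train_times),
        "median": statistics.median(train_times),
    }


def evaluate_constant(prediction, test_times, max_orders=1.0):
    """Score one constant prediction against a list of observed times."""
    errors = [log10_error(prediction, t) for t in test_times]
    return {
        "mean_error": statistics.mean(errors),
        "worst_error": max(errors),
        # True when every prediction is within `max_orders` orders of magnitude.
        "within_limit": max(errors) <= max_orders,
    }


# Hypothetical execution times in seconds; one slow outlier skews the mean.
train = [0.5, 1.0, 2.0, 4.0, 100.0]
test = [1.0, 10.0]

baselines = naive_baselines(train)
results = {name: evaluate_constant(p, test) for name, p in baselines.items()}

for name, scores in results.items():
    print(name, scores)
```

With these made-up timings the outlier pulls the mean predictor outside the one-order-of-magnitude limit on the shortest task, while the median predictor stays inside it. This illustrates why both baselines are worth reporting when execution times are skewed.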

Download

Original bundle

Name: CSE 19-87 Forsström Carlsson.pdf
Size: 31.14 MB
Format: Adobe Portable Document Format
