Evaluating Different Approaches for Predicting Task Execution Time: A Case Study in a Distributed Production Environment

Carlsson, Jesper; Forsström, Erik

Download

CSE 19-87 Forsström Carlsson.pdf (31.14 MB)

Published

2019

Authors

Carlsson, Jesper

Forsström, Erik

Type

Master's thesis

Abstract

This project evaluates several machine learning models for predicting the processing times in a complex distributed system that analyses large amounts of data. Specifically, the system is owned and developed by the company Recorded Future (RF), which specializes in sifting through vast amounts of textual data acquired from a variety of online sources in search of threat intelligence it can provide to its clients. The input is pre-processed in several stages before it arrives at the main analyser process, where natural language processing and other tools are used to perform a threat analysis of the text. Our primary goal is to determine the time needed to analyse one of these texts. The ability to predict the time required to process a given set of input can be used to design scheduling algorithms in cloud computing environments [1]. It is of particular interest to Recorded Future, which uses the size of its message queues, which may grow large when processing takes too long, to decide when to start additional servers. As servers take some time to come online, starting them proactively based on the estimated queue size, which can be inferred from the required processing time and the available computing resources, can alleviate bottlenecks and other performance issues. RF has defined a maximum error within one order of magnitude of the actual time as acceptable for the purposes of workload estimation. To accomplish this goal, we have developed, trained, and tested several prediction models based on neural networks. Each network considers a different set of input features that may affect the processing time: information extracted from the input data, system performance at the time of analysis, total server workload in terms of input processed in parallel, and past processing times.
For evaluation, each model's error is compared with the prediction error of two naive algorithms that predict the mean and the median of the task execution times in each data set. Our results show that all but one of the prediction models achieve a lower error than the naive approach, and that all models stay within the maximum error specified by RF. There is a trade-off between the achieved accuracy and how feasible it would be to implement and use a model in the real system. The model that considers system performance achieves an error that is half that of the model based purely on input information. Considering the total workload of each server reduces the error by a negligible amount compared to the first model, and using previous task execution times provides fluctuating results, indicating that it is not a suitable model for prediction in this system.
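The baseline comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the error metric, so mean absolute error is assumed here, the order-of-magnitude check is assumed to mean a ratio of at most 10 between predicted and actual time, and the sample execution times are invented for illustration.

```python
import math
from statistics import mean, median


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted times.

    Assumption: the abstract does not state which error metric was used.
    """
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def within_one_order_of_magnitude(actual, predicted):
    """True if predicted and actual differ by at most a factor of 10.

    Assumption: this is one possible reading of RF's acceptance criterion.
    Both values must be positive.
    """
    return abs(math.log10(predicted / actual)) <= 1.0


# Hypothetical task execution times in seconds (not from the thesis).
times = [0.8, 1.2, 1.0, 0.9, 15.0]

# Naive baselines: always predict the mean or the median of the data set.
mean_prediction = mean(times)
median_prediction = median(times)

mean_mae = mean_absolute_error(times, [mean_prediction] * len(times))
median_mae = mean_absolute_error(times, [median_prediction] * len(times))

# Fraction of tasks for which each baseline meets the assumed criterion.
mean_ok = sum(within_one_order_of_magnitude(t, mean_prediction) for t in times) / len(times)
median_ok = sum(within_one_order_of_magnitude(t, median_prediction) for t in times) / len(times)

print(f"mean baseline:   MAE={mean_mae:.3f}, within criterion={mean_ok:.0%}")
print(f"median baseline: MAE={median_mae:.3f}, within criterion={median_ok:.0%}")
```

A trained model would be scored with the same two functions, so its error can be compared directly with both baselines.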

Subject/keywords

distributed system, processing time prediction, EC2, task execution time, neural network, time series prediction

URI

https://hdl.handle.net/20.500.12380/300389

Collections

Master's theses
