Exploring Optimized CPU-Inference for Latency-Critical Machine Learning Tasks
Type
Master's thesis
Abstract
In recent years, machine learning has become increasingly prevalent across a wide range of applications spanning multiple industries. For some of these applications, low latency is critical, which may limit the types of hardware that can be used. Graphics Processing Units (GPUs) have long been the go-to hardware for machine learning tasks, often outperforming alternatives like Central Processing Units (CPUs), but they are not practical in all situations. We explore CPUs, leveraging modern optimization techniques like pruning and quantization, as a competitive alternative to GPUs with comparable predictive performance. This thesis compares the two hardware types on a real-time, latency-critical vision task. On the GPU side, TensorRT in combination with quantization is used to achieve state-of-the-art inference performance on the hardware. On the CPU side, the model is optimized with SparseML to introduce unstructured sparsity and quantization, and the resulting model is then executed by the DeepSparse runtime engine. Our findings show that the CPU approach can outperform the GPU hardware in certain situations. This suggests that CPU hardware could potentially be used in applications previously limited to GPUs.
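
The abstract names a concrete CPU-side workflow: sparsify and quantize with SparseML, then run with DeepSparse. As a minimal sketch of that inference step (not the thesis's actual benchmark code), and assuming the optimized model has been exported to a hypothetical ONNX file model.onnx, Neural Magic's DeepSparse Python API can execute it on CPU roughly as follows.

    # Minimal sketch (not from the thesis): run a pruned + quantized ONNX
    # model on CPU with Neural Magic's DeepSparse engine.
    from deepsparse import compile_model
    from deepsparse.utils import generate_random_inputs

    onnx_filepath = "model.onnx"  # hypothetical path; assumed exported after SparseML optimization
    batch_size = 1                # single-sample inference, as in latency-critical serving

    # Compile the model for the local CPU. DeepSparse exploits the
    # unstructured sparsity and quantization introduced during optimization.
    engine = compile_model(onnx_filepath, batch_size=batch_size)

    # Run a forward pass on inputs shaped to match the model's input spec.
    inputs = generate_random_inputs(onnx_filepath, batch_size)
    outputs = engine.run(inputs)

Because DeepSparse consumes standard ONNX, the same exported model can also be benchmarked against other runtimes for a like-for-like comparison.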
Subject/keywords
machine learning, neural network, model compression, pruning, quantization, optimization, CPU, GPU, Neural Magic, NVIDIA