Optimizing Log Data Storage and Query Performance: A Study on Industrial Applications
Type
Master's Thesis
Abstract
This thesis presents an in-depth analysis of storage efficiency optimization and query performance improvement in highly scalable log management systems. As log volumes grow exponentially, optimizing both storage and retrieval mechanisms has become increasingly important for keeping such systems scalable and responsive.
To that end, we systematically applied and evaluated a range of optimization strategies. In particular, the work investigated state-of-the-art compression schemes, applied selective partitioning schemes that account for the inherent data distributions, and layered additional encodings on top, such as delta encoding, dictionary encoding, and run-length encoding. We also evaluated Bloom filters on selected attributes to enable more effective query pruning, balancing performance gains against metadata costs. All experiments were run on a Spark-based experimental framework, using industry-grade log data in the Parquet columnar format as input to mimic real-world operating scenarios.
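To make the three encodings named above concrete, the following is a minimal pure-Python sketch, not the thesis implementation and not the way Parquet lays these encodings out on disk. The function names and the toy timestamp and log-level columns are assumptions chosen for illustration.

```python
from itertools import groupby


def delta_encode(values):
    """Keep the first value, then store each difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum over the deltas."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def dictionary_encode(values):
    """Map each distinct value to a small integer code, in first-seen order."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return list(codes), encoded


def run_length_encode(values):
    """Collapse each run of equal neighbours into a (value, run_length) pair."""
    return [(v, len(list(group))) for v, group in groupby(values)]


# Hypothetical log columns: epoch-second timestamps and log levels.
timestamps = [1700000000, 1700000001, 1700000001, 1700000003]
levels = ["INFO", "INFO", "INFO", "WARN", "INFO", "INFO"]

deltas = delta_encode(timestamps)             # [1700000000, 1, 0, 2]
dictionary, codes = dictionary_encode(levels)  # ['INFO', 'WARN'], [0, 0, 0, 1, 0, 0]
runs = run_length_encode(codes)               # [(0, 3), (1, 1), (0, 2)]
```

Sorted timestamps shrink to small deltas, repetitive string columns shrink to short integer codes, and runs of identical codes shrink further under run-length encoding, which is why these encodings tend to suit log data before a general-purpose compressor is applied.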
Experimental findings confirm the effectiveness of the evaluated methods. Among the compression techniques, Zstandard (Zstd) consistently performed best, achieving high compression ratios together with fast decompression. Schema flattening, combined with specialized encoding patterns, further improved compressibility by exploiting structural redundancies in log data. Partitioning significantly reduced query response time by limiting the amount of data scanned. Additionally, Bloom filters provided significant query speedups, especially for selective queries, with almost no additional storage overhead.
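The pruning effect of Bloom filters on selective queries can be illustrated with a small sketch. This is a generic textbook Bloom filter in pure Python, written under stated assumptions: it is not the filter layout Parquet stores on disk, and the class, the `files_to_scan` helper, the file names, and the request IDs are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Textbook Bloom filter: k hash positions over a fixed-size bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting SHA-256 with a seed (an assumption;
        # production filters typically use faster non-cryptographic hashes).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def files_to_scan(filters, wanted):
    """Return only the files whose filter cannot rule out the wanted value."""
    return [name for name, bf in filters.items() if bf.might_contain(wanted)]


# Hypothetical data files, each holding a set of request IDs.
files = {
    "part-0000.parquet": ["req-101", "req-102", "req-103"],
    "part-0001.parquet": ["req-201", "req-202"],
}

filters = {}
for name, request_ids in files.items():
    bf = BloomFilter()
    for request_id in request_ids:
        bf.add(request_id)
    filters[name] = bf

candidates = files_to_scan(filters, "req-202")
```

A filter never reports a stored value as absent, so a file that holds the requested ID is always kept. Files that do not hold it are skipped unless a false positive occurs, and the false-positive rate is set by the number of bits and hash functions, which is the metadata cost that has to be weighed against the query speedup.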
In summary, this work shows that a strategically combined set of compression, encoding, partitioning, and indexing methods can substantially improve both storage efficiency and query performance in log-centric data stores. The results offer practical guidance on closing the gap between academic optimization methods and their use in contemporary industrial big data environments.
Subject/Keywords
Log Analysis, Compression, Partitioning, Encoding, Storage, Bloom Filter, Query Performance
