Optimizing Log Data Storage and Query Performance: A Study on Industrial Applications

dc.contributor.author: He, Jialong
dc.contributor.author: Genc, Ediz
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data och informationsteknik [sv]
dc.contributor.department: Chalmers University of Technology / Department of Computer Science and Engineering [en]
dc.contributor.examiner: Gomes de Oliveira Neto, Francisco
dc.contributor.supervisor: Strüber, Daniel
dc.date.accessioned: 2025-10-07T09:37:48Z
dc.date.issued: 2025
dc.date.submitted:
dc.description.abstract: This thesis presents an in-depth analysis of storage efficiency and query performance optimization in highly scalable log management systems. As log volumes grow exponentially, optimizing storage and retrieval mechanisms has become increasingly important for keeping systems scalable and operations responsive. To that end, we systematically applied and assessed a variety of optimization strategies. In particular, the work investigated state-of-the-art compression schemes, applied selective partitioning schemes that take inherent data distributions into account, and combined outer-layer encodings such as delta encoding, dictionary encoding, and run-length encoding. We also evaluated Bloom filters on chosen attributes to enable more effective query pruning, balancing performance improvements against metadata costs. All tests were performed using a Spark-based experimental framework, with industry-grade log data in Parquet columnar format as input to mimic real-world operating scenarios. The experimental findings confirm the effectiveness of these methods. Among compression techniques, Zstandard (Zstd) consistently performed best, achieving high compression ratios coupled with fast decompression. Schema flattening, together with specialized encoding patterns, further improved compressibility by exploiting structural redundancies in log data. Partitioning significantly reduced query response time by limiting the amount of data scanned. Additionally, Bloom filters offered significant query speedups, especially for selective queries, with almost no added storage overhead. In summary, this work shows that a strategically combined set of compression, encoding, partitioning, and indexing methods can substantially improve both storage efficiency and query performance in log-centric data stores. The results provide practical guidance on closing the gap between academic optimization methods and their use in contemporary industrial big data environments.
dc.identifier.coursecode: DATX05
dc.identifier.uri: http://hdl.handle.net/20.500.12380/310603
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: Log Analysis
dc.subject: Compression
dc.subject: Partitioning
dc.subject: Encoding
dc.subject: Storage
dc.subject: Bloom Filter
dc.subject: Query Performance
dc.title: Optimizing Log Data Storage and Query Performance: A Study on Industrial Applications
dc.type.degree: Examensarbete för masterexamen [sv]
dc.type.degree: Master's Thesis [en]
dc.type.uppsok: H
local.programme: Software engineering and technology (MPSOF), MSc
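
The Bloom filter pruning mentioned in the abstract can be illustrated with a small sketch. This is a pure-Python toy, not the thesis's Spark/Parquet setup: the `BloomFilter` class, the `device_id` attribute, the filter size, and the sample rows are all assumptions made for the example. It keeps one filter per group of rows and skips any group whose filter reports the queried value as definitely absent.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: false positives possible, false negatives not."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, value: str):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def build_filters(row_groups):
    """Build one filter per row group over the hypothetical 'device_id' attribute."""
    filters = []
    for rows in row_groups:
        bf = BloomFilter()
        for row in rows:
            bf.add(row["device_id"])
        filters.append(bf)
    return filters


def query(row_groups, filters, device_id):
    """Return matching rows and the number of row groups actually scanned."""
    matches, scanned = [], 0
    for rows, bf in zip(row_groups, filters):
        if not bf.might_contain(device_id):
            continue  # definitely absent: skip this group without reading it
        scanned += 1
        matches.extend(r for r in rows if r["device_id"] == device_id)
    return matches, scanned


# Made-up sample data standing in for three row groups of log records.
row_groups = [
    [{"device_id": "dev-1", "msg": "boot"}, {"device_id": "dev-2", "msg": "ok"}],
    [{"device_id": "dev-3", "msg": "warn"}, {"device_id": "dev-4", "msg": "ok"}],
    [{"device_id": "dev-5", "msg": "error"}, {"device_id": "dev-1", "msg": "halt"}],
]
filters = build_filters(row_groups)
matches, scanned = query(row_groups, filters, "dev-1")
print(len(matches), "matches;", scanned, "of", len(row_groups), "row groups scanned")
```

The trade-off noted in the abstract shows up in `num_bits`: a larger filter lowers the false-positive rate and so prunes more groups, at the cost of more metadata stored per group.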

Download

Original bundle

Name: CSE 25-38 JH EG.pdf
Size: 2.04 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.35 KB
Format: Item-specific license agreed upon to submission