Retrieval-Augmented Generation for Sustainable Material Data Handling in Automotive Value Chain

Type

Master's Thesis

Abstract

Applying large language models (LLMs) to industrial material data workflows has the potential to improve efficiency. However, conventional LLMs are prone to hallucination, depend on proprietary training data, and are costly to update. This thesis explores Retrieval-Augmented Generation (RAG) as an alternative approach, in which an LLM generates responses grounded in a restricted, domain-specific corpus of documents and databases and cites its sources. The study is carried out in an industrial setting with two main data domains: an internal SQL materials database with tabular material properties, and a corpus of unstructured textual documents, including supplier documents, corporate standards, and environmental product declarations. A RAG system is developed that (1) indexes both textual and tabular data, (2) retrieves relevant chunks via dense vector search, and (3) generates source-grounded responses. The work investigates whether a RAG model that explicitly integrates both domains can outperform a baseline tuned for unstructured text, and which tabular serialization format yields the most semantically informative embeddings for pretrained embedding models. To this end, we first constructed LLM-based pipelines to generate document- and table-based test sets with ground-truth chunk annotations, and implemented a modular RAG pipeline with separate indices for textual and tabular data. We then experimented with multiple retrieval strategies, ranging from concatenating the retrieval results to using cross-encoders to weight them. In addition, several fusion strategies were tested to evaluate whether they improve retrieval accuracy across domains. The experiments compare nine tabular serialization strategies, study performance as a function of index size, chunk size, and top-k, and evaluate different fusion modes and embedding models.
The evaluation metrics used are Hit Rate, Recall, Precision, F1-score, and Mean Reciprocal Rank. Results show that enriched serialization, which converts tabular rows into natural-language statements, yields stronger tabular retrieval performance than a standard key-value-based format, without degrading performance on document retrieval. Larger chunk sizes and higher top-k values systematically improve retrieval metrics, highlighting both the difficulty of relying solely on similarity search and the benefits of cross-encoder reranking on larger candidate sets. A domain-aware weighted fusion retriever further improves overall retrieval performance over the optimized baseline with only moderate computational overhead. These findings demonstrate that semantically rich tabular representations and domain-aware fusion can enhance RAG performance on heterogeneous industrial material data.
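
As an illustration of the evaluation metrics named above, the following is a minimal sketch of how Hit Rate, Precision, Recall, F1-score, and Mean Reciprocal Rank are commonly computed from ranked retrieval results and ground-truth chunk annotations. It uses the standard textbook definitions; the thesis may define or aggregate them differently. The function names, chunk IDs, and example queries are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def retrieval_metrics(retrieved: Sequence[str], relevant: Set[str]) -> Dict[str, float]:
    """Score one query.

    `retrieved` is the ranked list of top-k chunk IDs returned by the retriever,
    `relevant` is the set of ground-truth chunk IDs (both hypothetical here).
    """
    # Count distinct relevant chunks that appear anywhere in the top-k list.
    n_hits = len(set(retrieved) & relevant)
    precision = n_hits / len(retrieved) if retrieved else 0.0
    recall = n_hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Reciprocal rank of the first relevant chunk (0 if none was retrieved).
    reciprocal_rank = 0.0
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            reciprocal_rank = 1.0 / rank
            break

    return {
        "hit": 1.0 if n_hits else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "rr": reciprocal_rank,
    }


def mean_metrics(per_query: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over all queries (mean of "hit" is Hit Rate, mean of "rr" is MRR)."""
    return {key: sum(m[key] for m in per_query) / len(per_query) for key in per_query[0]}


# Two made-up queries: the first finds both relevant chunks, the second finds none.
q1 = retrieval_metrics(["c3", "c1", "c9", "c4"], {"c1", "c4"})
q2 = retrieval_metrics(["t7", "t2"], {"t5"})
summary = mean_metrics([q1, q2])
print(summary)
```

In this sketch, raising top-k can only keep or increase Recall and Hit Rate for a fixed ranking, since a longer list never loses a relevant chunk, while Precision may fall as more non-relevant chunks are included.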

Subject/keywords

Sustainable Material Selection, Retrieval-Augmented Generation (RAG), Applied Artificial Intelligence, Information Retrieval Systems, Large Language Models, Multidomain RAG
