Retrieval-Augmented Generation for Sustainable Material Data Handling in Automotive Value Chain
Published
Authors
Type
Master's Thesis
Model builders
Journal title
ISSN
Volume title
Publisher
Abstract
Applying large language models (LLMs) to industrial material data workflows has the
potential to improve efficiency. However, conventional LLMs are limited by hallucinations,
depend on proprietary training data, and are costly to update. This thesis explores
Retrieval-Augmented Generation (RAG) as an alternative approach, in which an LLM
generates responses grounded in a restricted, domain-specific corpus of documents and
databases and provides source citations for them. The study is carried out in an industrial
setting with two main data domains: an internal SQL materials database with tabular
material properties, and a corpus of unstructured textual documents, including supplier
documents, corporate standards, and environmental product declarations. A RAG system
is developed that (1) indexes both textual and tabular data, (2) retrieves relevant chunks
via dense vector search, and (3) generates source-grounded responses.
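
As a rough illustration of step (2), the sketch below ranks indexed chunks by cosine similarity to a query vector and returns the top-k chunk ids. The chunk ids, the three-dimensional vectors, and the function names are made up for this example; a real system would obtain the vectors from a pretrained embedding model and would typically use a vector index rather than a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, index, k):
    """Return the ids of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Hypothetical toy index: chunk id -> embedding (values invented for illustration).
index = {
    "doc:standard#3": [0.9, 0.1, 0.0],
    "doc:epd#1": [0.1, 0.9, 0.2],
    "table:materials#row42": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]
results = top_k(query, index, k=2)
print(results)
```

Keeping the chunk id alongside each vector is what allows the generated answer to cite its sources.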
The work investigates whether a RAG model that explicitly integrates both domains can
outperform a baseline tuned for unstructured text and explores which tabular serialization
format yields the most semantically informative embeddings for pretrained embedding
models. To achieve this, we first constructed LLM-based pipelines to generate document-
and table-based test sets with ground-truth chunk annotations, and implemented a modular
RAG pipeline with separate indices for textual and tabular data. Then, we experimented
with multiple retrieval strategies, ranging from concatenating the retrieval results to
using cross-encoders to weigh them. In addition, several fusion strategies were tested to
evaluate whether they could improve retrieval accuracy when operating across different
domains. The experiments compare nine tabular serialization strategies,
study performance as a function of index size, chunk size, and top-k, and evaluate
different fusion modes and embedding models. The evaluation metrics are Hit Rate,
Recall, Precision, F1-score, and Mean Reciprocal Rank.
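
The metrics named above can be computed from a ranked list of retrieved chunk ids and a set of ground-truth chunk ids. The sketch below uses the common textbook definitions at a fixed top-k; the exact definitions used in the thesis may differ, and the function names and chunk ids are assumptions made for this example.

```python
def query_metrics(retrieved, relevant):
    """Metrics for one query; `retrieved` is a ranked list of unique chunk ids."""
    relevant = set(relevant)
    hits = [cid for cid in retrieved if cid in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    # Reciprocal rank of the first relevant chunk (0 if none was retrieved).
    rr = 0.0
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            rr = 1.0 / rank
            break
    return {
        "hit": 1.0 if hits else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "rr": rr,
    }


def mean_metrics(per_query):
    """Average each metric over queries; the mean of `rr` is the MRR."""
    n = len(per_query)
    return {name: sum(m[name] for m in per_query) / n for name in per_query[0]}


# Two hypothetical queries with invented chunk ids.
runs = [
    query_metrics(["c3", "c1", "c7", "c2"], {"c1", "c2"}),
    query_metrics(["c9", "c8"], {"c4"}),
]
summary = mean_metrics(runs)
print(summary)
```

Averaging `hit` over queries gives the Hit Rate, and averaging `rr` gives the Mean Reciprocal Rank.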
Results show that enriched serialization, which converts tabular rows into natural-language
statements, yields stronger tabular retrieval performance than a standard key-value-based
format, without degrading performance on document retrieval. Larger chunk sizes and
higher top-k values systematically improve retrieval metrics, highlighting both the difficulty
of relying solely on similarity search and the benefits of cross-encoder reranking on larger
candidate sets. A domain-aware weighted fusion retriever further improves overall retrieval
performance over the optimized baseline with only moderate computational overhead. These
findings demonstrate that semantically rich tabular representations and domain-aware
fusion can enhance RAG performance on heterogeneous industrial material data.
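
To make the two main ideas concrete, the sketch below contrasts a key-value serialization of a table row with a natural-language one, and shows one simple form of weighted fusion across a text index and a table index. Everything here is an assumption made for illustration: the row, its column names, the sentence template, the min-max normalization, and the weights are invented, and the thesis's actual enriched format and fusion retriever may work differently.

```python
def serialize_key_value(row):
    """Key-value style serialization: 'column: value' pairs."""
    return "; ".join(f"{key}: {value}" for key, value in row.items())


def serialize_sentence(row, subject_key="material"):
    """Natural-language style serialization built from a simple template."""
    subject = row[subject_key]
    parts = [
        f"{key.replace('_', ' ')} is {value}"
        for key, value in row.items()
        if key != subject_key
    ]
    return f"For the material {subject}, " + ", ".join(parts) + "."


def normalize(hits):
    """Min-max normalize (chunk_id, score) pairs so domains are comparable."""
    if not hits:
        return {}
    scores = [score for _, score in hits]
    low, span = min(scores), max(scores) - min(scores)
    return {cid: (score - low) / span if span else 1.0 for cid, score in hits}


def weighted_fusion(text_hits, table_hits, w_text=0.5, w_table=0.5, k=3):
    """Merge two ranked lists using per-domain weights; return top-k ids."""
    fused = {}
    for cid, score in normalize(text_hits).items():
        fused[cid] = w_text * score
    for cid, score in normalize(table_hits).items():
        fused[cid] = fused.get(cid, 0.0) + w_table * score
    return sorted(fused, key=fused.get, reverse=True)[:k]


# Hypothetical table row and retrieval scores (all values invented).
row = {"material": "Example Polymer A", "density_g_cm3": 1.2, "recycled_content_pct": 30}
kv_text = serialize_key_value(row)
nl_text = serialize_sentence(row)

text_hits = [("t1", 0.9), ("t2", 0.5)]
table_hits = [("r1", 0.8), ("r2", 0.2)]
fused_ids = weighted_fusion(text_hits, table_hits, w_text=0.4, w_table=0.6, k=2)
print(kv_text, nl_text, fused_ids, sep="\n")
```

In this toy setup the table domain is weighted more heavily, so the best table row outranks the best text chunk; in practice the weights would be chosen per query or tuned on a test set.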
Description
Subject/keywords
Sustainable Material Selection, Retrieval-Augmented Generation (RAG), Applied Artificial Intelligence, Information Retrieval Systems, Large Language Models, Multidomain RAG
