Automatic Engineering Report Generation Using Multi-Agent Systems and Large Language Models

Type

Master's Thesis

Abstract

This thesis investigates the development and evaluation of a Multi-Agent System designed to automatically generate technical reports from unstructured truck testing data, using Large Language Models (LLMs) and a Retrieval-Augmented Generation (RAG) based framework. The study explores how various architectural configurations and prompting strategies, such as Chain-of-Thought reasoning, one-shot prompting, and partitioned RAG artifacts, affect the factual accuracy, structural quality, and practical utility of the generated reports in an engineering context. Performance is assessed through a comparative analysis of three reports: one generated by the Multi-Agent System, one produced by a single-agent baseline, and one written by a human. Evaluations involve both human evaluators and LLM-based scoring, supplemented by a RAGAS pipeline for measuring content reliability and relevance. Human evaluations indicate that the Multi-Agent System performs well in terms of structural coherence and linguistic fluency but lacks the analytical depth and data-interpretation accuracy of human-authored reports. Despite being outperformed by human-written reports on most human evaluation metrics, the Multi-Agent System demonstrates promising capabilities: it generates fluent, well-organized text and can identify and categorize key technical events. However, it also exhibits significant limitations, particularly in domain-specific reasoning and deeper factual understanding, both critical to technical report writing. Finally, the study reveals key challenges in evaluation itself. NLP tasks are inherently subjective and difficult to evaluate in the absence of a ground truth, and discrepancies between human and LLM-based assessments suggest that each favors different qualities: LLMs emphasize coherence and alignment with retrieved artifacts, whereas human evaluators prioritize domain relevance, contextual understanding, and deeper analysis. This divergence underscores the subjective and multifaceted nature of language evaluation, especially in highly technical domains.
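
The abstract names several concrete techniques (partitioned RAG retrieval, one-shot prompting, Chain-of-Thought cues, cooperating agents) without showing how they fit together. The sketch below is purely illustrative and is not the thesis's implementation: every identifier is hypothetical, the LLM backend is stubbed as a plain callable, and retrieval is reduced to token overlap as a stand-in for embedding search.

```python
# A minimal sketch (not the thesis code) of a two-agent, RAG-based report
# pipeline: retrieval scoped to one artifact partition, an analyst agent whose
# prompt carries a one-shot example and a Chain-of-Thought cue, and a writer
# agent that only sees the analyst's findings. All names are hypothetical.

from dataclasses import dataclass
from typing import Callable

# The LLM backend is abstracted as a callable: prompt in, completion out.
LLM = Callable[[str], str]

@dataclass
class Artifact:
    partition: str   # e.g. "brake_tests", "engine_logs"
    text: str

def retrieve(query: str, artifacts: list[Artifact], partition: str, k: int = 3) -> list[str]:
    """Naive partitioned retrieval: keep only the requested partition, then
    rank by crude token overlap with the query (a stand-in for embeddings)."""
    pool = [a for a in artifacts if a.partition == partition]
    q = set(query.lower().split())
    ranked = sorted(pool, key=lambda a: -len(q & set(a.text.lower().split())))
    return [a.text for a in ranked[:k]]

ONE_SHOT_EXAMPLE = (
    "Example finding:\n"
    "Event: Brake pressure dropped below threshold at 14:02.\n"
    "Category: Safety-critical anomaly.\n"
)

def analyst_agent(llm: LLM, query: str, context: list[str]) -> str:
    """Analysis agent: one-shot example plus an explicit Chain-of-Thought cue."""
    prompt = (
        "You are a truck-testing analyst.\n"
        f"{ONE_SHOT_EXAMPLE}\n"
        "Context:\n" + "\n".join(context) + "\n\n"
        f"Task: {query}\n"
        "Think step by step, then list each key event with a category."
    )
    return llm(prompt)

def writer_agent(llm: LLM, analysis: str) -> str:
    """Writer agent: turns the analyst's findings into report prose."""
    return llm(f"Write a concise report section from these findings:\n{analysis}")

if __name__ == "__main__":
    fake_llm: LLM = lambda prompt: f"[LLM completion for {len(prompt)} chars of prompt]"
    docs = [Artifact("brake_tests", "Brake pressure log: drop at 14:02 on route B."),
            Artifact("engine_logs", "Engine temperature stable across all runs.")]
    ctx = retrieve("brake pressure anomalies", docs, partition="brake_tests")
    print(writer_agent(fake_llm, analyst_agent(fake_llm, "Summarize brake anomalies", ctx)))
```

The point the sketch makes explicit is the division of labor the abstract describes: retrieval is confined to one artifact partition so each agent reasons over a narrow slice of the test data, and the writer never touches raw logs, only the analyst's structured findings.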
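
The abstract also leans on a RAGAS pipeline for content reliability. The toy function below is not the ragas library's API; it only mimics the shape of a faithfulness-style metric (decompose the answer into claims, check each claim against the retrieved context), with sentence splitting and lexical overlap standing in for the LLM judge a real RAGAS run would use.

```python
# A toy illustration (not the RAGAS library) of a faithfulness-style check:
# split the generated answer into claims and test whether each claim has
# enough support in the retrieved context. The 0.5 threshold is arbitrary.

def faithfulness_score(answer: str, contexts: list[str]) -> float:
    """Fraction of answer 'claims' (sentences) with enough lexical support
    in the retrieved contexts. A crude proxy for an LLM-judged metric."""
    context_tokens = set(" ".join(contexts).lower().split())
    claims = [s.strip() for s in answer.split(".") if s.strip()]
    if not claims:
        return 0.0
    supported = 0
    for claim in claims:
        tokens = set(claim.lower().split())
        overlap = len(tokens & context_tokens) / max(len(tokens), 1)
        supported += overlap >= 0.5  # claim counts as supported
    return supported / len(claims)

print(faithfulness_score(
    "Brake pressure dropped at 14:02. The moon is made of cheese.",
    ["Log excerpt: brake pressure dropped below threshold at 14:02."],
))  # 0.5: one supported claim out of two
```

In RAGAS proper, both the claim decomposition and the support check are delegated to an LLM judge, which is what lets the metric catch paraphrased but unsupported statements that lexical overlap would miss.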

Subject / keywords

LLMs, Agents, Multi-Agent System, AI, Generative AI, Prompting Techniques, RAG, NLP Evaluation, Report Generation.
