Data Access Control for RAG-Based Chatbots
Published
Author
Type
Master's Thesis
Abstract
The increasing demand for chatbots that access enterprise data has created a need for both secure data access control and high-quality responses. This project therefore implements data access control mechanisms, together with fine-tuning techniques, to develop a chatbot that meets both requirements. Three approaches were explored: two using an agentic Retrieval-Augmented Generation (RAG) architecture with pre-trained Large Language Models (LLMs), one with and one without fine-tuning, and a third using a standalone fine-tuned LLM. The RAG architecture with fine-tuning also employed a response filter, while the standalone LLM was fine-tuned on data that incorporated access control restrictions to prevent information leakage. Performance was assessed by the semantic and linguistic correctness of the responses and by the amount of information leaked beyond a user's access rights. A combination of the pre-trained LLMs Mistral-NeMo-Instruct-2407 and Qwen2.5-32B-Instruct, applied in the agentic RAG setup, yielded the best-performing chatbot, with no data leakage and high-quality responses. Fine-tuning LLMs was found to introduce potential data leakage risks, even when access restrictions are integrated into the training process. Therefore, to guarantee the protection of confidential information, it is advised to use pre-trained LLMs in a RAG setup with access control.
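As a rough illustration of the general idea of access control in a RAG setup, the sketch below filters documents by the user's roles before any ranking takes place, so restricted text never reaches the prompt that would be sent to an LLM. It is a toy example under stated assumptions and not the implementation used in the thesis: the `Document` class, the `allowed_roles` label, the `retrieve` and `build_prompt` functions, the word-overlap scoring, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_roles: frozenset  # hypothetical access-control label per document


def retrieve(query, documents, user_roles, top_k=2):
    """Return up to top_k documents the user may read, ranked by word overlap."""
    query_words = set(query.lower().split())
    # Enforce access control BEFORE ranking, so restricted documents
    # are never candidates for the prompt.
    permitted = [d for d in documents if d.allowed_roles & user_roles]
    scored = []
    for d in permitted:
        overlap = len(query_words & set(d.text.lower().split()))
        if overlap > 0:
            scored.append((overlap, d))
    # Sort on the score only; Document objects are not compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]


def build_prompt(query, context_docs):
    """Assemble the text that would be passed to an LLM (no model is called here)."""
    context = "\n".join(f"- {d.text}" for d in context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical sample data for illustration only.
DOCS = [
    Document("d1", "The cafeteria opens at 8 am", frozenset({"employee", "hr"})),
    Document("d2", "Salary bands for engineers are confidential", frozenset({"hr"})),
]

if __name__ == "__main__":
    question = "salary bands engineers"
    for roles in ({"employee"}, {"hr"}):
        docs = retrieve(question, DOCS, roles)
        print(roles, [d.doc_id for d in docs])
        print(build_prompt(question, docs))
```

Filtering before ranking, rather than filtering the generated answer afterwards, means the model in this sketch never sees text the user is not allowed to read. A response filter such as the one mentioned in the abstract would be an additional, separate check on the output.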
Description
Subject/keywords
Data access control, RAG, chatbot, data security, response quality, fine-tuning, LLM, NLP.
