Breaking the Token Barrier: Leveraging Retrieval Methods to Facilitate Text-to-SQL on Extensive Tabular Data

Bergström, Victor

Breaking the Token Barrier: Leveraging Retrieval Methods to Facilitate Text-to-SQL on Extensive Tabular Data

dc.contributor.author	Bergström, Victor
dc.contributor.department	Chalmers tekniska högskola / Institutionen för data och informationsteknik	sv
dc.contributor.department	Chalmers University of Technology / Department of Computer Science and Engineering	en
dc.contributor.examiner	Granath, Mats
dc.contributor.supervisor	Staron, Miroslaw
dc.date.accessioned	2024-10-16T13:57:44Z
dc.date.available	2024-10-16T13:57:44Z
dc.date.issued	2024
dc.date.submitted
dc.description.abstract	Technical analysis of systems is extensively utilized in the industrial sector, particularly in automotive development. This analysis is often conducted on logs generated during a system’s runtime, which typically contain sensor values, timestamps, and events. These logs represent the primary means for analyzing a system’s behavior and identifying failures. But, depending on their size and structure, finding the correct data in these tables can be difficult. At Zenseact, their large table with sensor data remains largely unused by developers due to the challenges associated with acquiring sufficient knowledge to leverage it effectively. Full utilization of this data could speed up development and allow for more robust software to be developed by enabling an easier way to find the cause of errors. This thesis suggests leveraging LLMs to perform text-to-SQL to interface with this table. Here, a developer could use natural language and the LLM would generate the SQL code and query the database. But current text-to-SQL methods are naive, they assume the table’s schemas are small enough to fit inside the LLM’s token limit which is not always the case, especially for Zenseact’s table with 5677 columns. This thesis tackles the overarching problem by crafting an artifact that employs retrieval methods to fetch only the related columns of the schema given a user’s natural language query. If this works, it also has the added benefit of simplifying the problem for the LLM where the columns it is presented with are fewer and relevant to the query. The performance is measured in Errorless Code ratio. This ratio measures how often the generated SQL queries can run, without errors, on the database. This does not take the correctness of the code into account. The results were measured using 0, 1, and 2 shot and came out to 35%, 34%, and 39% respectively. To improve performance, the problem was divided by letting the LLM in two separate steps, first select columns and then generate the SQL code. This approach improved the EC to 57%. The retrieval was benchmarked separately using 18 tests. Default retrieval, where the column is embedded and stored in a vector database yielded a result of 4 out of18 on the benchmark. To improve retrieval performance the columns were simplified using a term frequency removal algorithm. This improved the result to 6 out of 18 correct on the retrieval benchmark. The artifact presents reasonable approaches for text-to-SQL. Retrieval is a wellestablished method in research, and the 2-shot and refined column selection both improve a text-to-SQL system’s performance. However, using systems like this in real scenarios introduces a big hurdle for the models used. It’s almost impossible for the model to have learned enough in its initial training to understand the companyspecific information. The methods presented for text-to-SQL on extensive database tables could be viable, but further improvements to the AI models are required until they can be used in a production environment.
dc.identifier.coursecode	DATX05
dc.identifier.uri	http://hdl.handle.net/20.500.12380/308921
dc.language.iso	eng
dc.setspec.uppsok	Technology
dc.subject	Computer
dc.subject	Science
dc.subject	Computer science
dc.subject	Engineering
dc.subject	Project
dc.subject	Thesis
dc.title	Breaking the Token Barrier: Leveraging Retrieval Methods to Facilitate Text-to-SQL on Extensive Tabular Data
dc.type.degree	Examensarbete för masterexamen	sv
dc.type.degree	Master's Thesis	en
dc.type.uppsok	H
local.programme	Complex adaptive systems (MPCAS), MSc

Ladda ner

Original bundle

Visar 1 - 1 av 1

Namn:: CSE 24-23 VB.pdf
Storlek:: 3.11 MB
Format:: Adobe Portable Document Format

Ladda ner

License bundle

Visar 1 - 1 av 1

Namn:: license.txt
Storlek:: 2.35 KB
Format:: Item-specific license agreed upon to submission
Beskrivning:

Ladda ner

Samlingar

Examensarbeten för masterexamen