Breaking the Token Barrier: Leveraging Retrieval Methods to Facilitate Text-to-SQL on Extensive Tabular Data

dc.contributor.authorBergström, Victor
dc.contributor.departmentChalmers tekniska högskola / Institutionen för data och informationstekniksv
dc.contributor.departmentChalmers University of Technology / Department of Computer Science and Engineeringen
dc.contributor.examinerGranath, Mats
dc.contributor.supervisorStaron, Miroslaw
dc.date.accessioned2024-10-16T13:57:44Z
dc.date.available2024-10-16T13:57:44Z
dc.date.issued2024
dc.date.submitted
dc.description.abstractTechnical analysis of systems is extensively utilized in the industrial sector, particularly in automotive development. This analysis is often conducted on logs generated during a system’s runtime, which typically contain sensor values, timestamps, and events. These logs represent the primary means for analyzing a system’s behavior and identifying failures. But, depending on their size and structure, finding the correct data in these tables can be difficult. At Zenseact, their large table with sensor data remains largely unused by developers due to the challenges associated with acquiring sufficient knowledge to leverage it effectively. Full utilization of this data could speed up development and allow for more robust software to be developed by enabling an easier way to find the cause of errors. This thesis suggests leveraging LLMs to perform text-to-SQL to interface with this table. Here, a developer could use natural language and the LLM would generate the SQL code and query the database. But current text-to-SQL methods are naive, they assume the table’s schemas are small enough to fit inside the LLM’s token limit which is not always the case, especially for Zenseact’s table with 5677 columns. This thesis tackles the overarching problem by crafting an artifact that employs retrieval methods to fetch only the related columns of the schema given a user’s natural language query. If this works, it also has the added benefit of simplifying the problem for the LLM where the columns it is presented with are fewer and relevant to the query. The performance is measured in Errorless Code ratio. This ratio measures how often the generated SQL queries can run, without errors, on the database. This does not take the correctness of the code into account. The results were measured using 0, 1, and 2 shot and came out to 35%, 34%, and 39% respectively. To improve performance, the problem was divided by letting the LLM in two separate steps, first select columns and then generate the SQL code. This approach improved the EC to 57%. The retrieval was benchmarked separately using 18 tests. Default retrieval, where the column is embedded and stored in a vector database yielded a result of 4 out of18 on the benchmark. To improve retrieval performance the columns were simplified using a term frequency removal algorithm. This improved the result to 6 out of 18 correct on the retrieval benchmark. The artifact presents reasonable approaches for text-to-SQL. Retrieval is a wellestablished method in research, and the 2-shot and refined column selection both improve a text-to-SQL system’s performance. However, using systems like this in real scenarios introduces a big hurdle for the models used. It’s almost impossible for the model to have learned enough in its initial training to understand the companyspecific information. The methods presented for text-to-SQL on extensive database tables could be viable, but further improvements to the AI models are required until they can be used in a production environment.
dc.identifier.coursecodeDATX05
dc.identifier.urihttp://hdl.handle.net/20.500.12380/308921
dc.language.isoeng
dc.setspec.uppsokTechnology
dc.subjectComputer
dc.subjectScience
dc.subjectComputer science
dc.subjectEngineering
dc.subjectProject
dc.subjectThesis
dc.titleBreaking the Token Barrier: Leveraging Retrieval Methods to Facilitate Text-to-SQL on Extensive Tabular Data
dc.type.degreeExamensarbete för masterexamensv
dc.type.degreeMaster's Thesisen
dc.type.uppsokH
local.programmeComplex adaptive systems (MPCAS), MSc

Ladda ner

Original bundle

Visar 1 - 1 av 1
Hämtar...
Bild (thumbnail)
Namn:
CSE 24-23 VB.pdf
Storlek:
3.11 MB
Format:
Adobe Portable Document Format

License bundle

Visar 1 - 1 av 1
Hämtar...
Bild (thumbnail)
Namn:
license.txt
Storlek:
2.35 KB
Format:
Item-specific license agreed upon to submission
Beskrivning: