Breaking the Token Barrier: Leveraging Retrieval Methods to Facilitate Text-to-SQL on Extensive Tabular Data

Publicerad

Typ

Examensarbete för masterexamen
Master's Thesis

Modellbyggare

Tidskriftstitel

ISSN

Volymtitel

Utgivare

Sammanfattning

Technical analysis of systems is extensively utilized in the industrial sector, particularly in automotive development. This analysis is often conducted on logs generated during a system’s runtime, which typically contain sensor values, timestamps, and events. These logs represent the primary means for analyzing a system’s behavior and identifying failures. But, depending on their size and structure, finding the correct data in these tables can be difficult. At Zenseact, their large table with sensor data remains largely unused by developers due to the challenges associated with acquiring sufficient knowledge to leverage it effectively. Full utilization of this data could speed up development and allow for more robust software to be developed by enabling an easier way to find the cause of errors. This thesis suggests leveraging LLMs to perform text-to-SQL to interface with this table. Here, a developer could use natural language and the LLM would generate the SQL code and query the database. But current text-to-SQL methods are naive, they assume the table’s schemas are small enough to fit inside the LLM’s token limit which is not always the case, especially for Zenseact’s table with 5677 columns. This thesis tackles the overarching problem by crafting an artifact that employs retrieval methods to fetch only the related columns of the schema given a user’s natural language query. If this works, it also has the added benefit of simplifying the problem for the LLM where the columns it is presented with are fewer and relevant to the query. The performance is measured in Errorless Code ratio. This ratio measures how often the generated SQL queries can run, without errors, on the database. This does not take the correctness of the code into account. The results were measured using 0, 1, and 2 shot and came out to 35%, 34%, and 39% respectively. To improve performance, the problem was divided by letting the LLM in two separate steps, first select columns and then generate the SQL code. This approach improved the EC to 57%. The retrieval was benchmarked separately using 18 tests. Default retrieval, where the column is embedded and stored in a vector database yielded a result of 4 out of18 on the benchmark. To improve retrieval performance the columns were simplified using a term frequency removal algorithm. This improved the result to 6 out of 18 correct on the retrieval benchmark. The artifact presents reasonable approaches for text-to-SQL. Retrieval is a wellestablished method in research, and the 2-shot and refined column selection both improve a text-to-SQL system’s performance. However, using systems like this in real scenarios introduces a big hurdle for the models used. It’s almost impossible for the model to have learned enough in its initial training to understand the companyspecific information. The methods presented for text-to-SQL on extensive database tables could be viable, but further improvements to the AI models are required until they can be used in a production environment.

Beskrivning

Ämne/nyckelord

Computer, Science, Computer science, Engineering, Project, Thesis

Citation

Arkitekt (konstruktör)

Geografisk plats

Byggnad (typ)

Byggår

Modelltyp

Skala

Teknik / material

Index

item.page.endorsement

item.page.review

item.page.supplemented

item.page.referenced