TURSpider: A Turkish Text-to-SQL Dataset and LLM-Based Study

Cited by: 0

Authors:

Kanburoglu, Ali Bugra [1]

Tek, Faik Boray [2]

Affiliations:

[1] Isik Univ, Dept Comp Engn, TR-34980 Istanbul, Turkiye

[2] Istanbul Tech Univ, Dept Artificial Intelligence & Data Engn, TR-34467 Istanbul, Turkiye

Source:

IEEE ACCESS | 2024, Vol. 12

Keywords:

Training; Structured Query Language; Accuracy; Error analysis; Benchmark testing; Cognition; Encoding; Text-to-SQL; LLM; large language models; Turkish; dataset; TURSpider

DOI:

10.1109/ACCESS.2024.3498841

Chinese Library Classification (CLC):

TP [Automation technology, computer technology]

Subject classification code:

0812

Abstract:

This paper introduces TURSpider, a novel Turkish Text-to-SQL dataset developed through human translation of the widely used Spider dataset, aimed at addressing the current lack of complex, cross-domain SQL datasets for the Turkish language. TURSpider incorporates a wide range of query difficulties, including nested queries, to create a comprehensive benchmark for Turkish Text-to-SQL tasks. The dataset enables cross-language comparison and significantly enhances the training and evaluation of large language models (LLMs) in generating SQL queries from Turkish natural language inputs. We fine-tuned several Turkish-supported LLMs on TURSpider and evaluated their performance in comparison to state-of-the-art models like GPT-3.5 Turbo and GPT-4. Our results show that fine-tuned Turkish LLMs demonstrate competitive performance, with one model even surpassing GPT-based models on execution accuracy. We also apply the Chain-of-Feedback (CoF) methodology to further improve model performance, demonstrating its effectiveness across multiple LLMs. This work provides a valuable resource for Turkish NLP and addresses specific challenges in developing accurate Text-to-SQL models for low-resource languages.

Pages: 169379-169387

Page count: 9