TURSpider: A Turkish Text-to-SQL Dataset and LLM-Based Study

Cited by: 0

Authors:

Kanburoglu, Ali Bugra [1]

Tek, Faik Boray [2]

Affiliations:

[1] Isik Univ, Dept Comp Engn, TR-34980 Istanbul, Turkiye

[2] Istanbul Tech Univ, Dept Artificial Intelligence & Data Engn, TR-34467 Istanbul, Turkiye

Source:

IEEE ACCESS | 2024 / Volume 12

Keywords:

Training; Structured Query Language; Accuracy; Error analysis; Benchmark testing; Cognition; Encoding; Text-to-SQL; LLM; large language models; Turkish; dataset; TURSpider

DOI:

10.1109/ACCESS.2024.3498841

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

This paper introduces TURSpider, a novel Turkish Text-to-SQL dataset developed through human translation of the widely used Spider dataset, aimed at addressing the current lack of complex, cross-domain SQL datasets for the Turkish language. TURSpider incorporates a wide range of query difficulties, including nested queries, to create a comprehensive benchmark for Turkish Text-to-SQL tasks. The dataset enables cross-language comparison and significantly enhances the training and evaluation of large language models (LLMs) in generating SQL queries from Turkish natural language inputs. We fine-tuned several Turkish-supported LLMs on TURSpider and evaluated their performance in comparison to state-of-the-art models like GPT-3.5 Turbo and GPT-4. Our results show that fine-tuned Turkish LLMs demonstrate competitive performance, with one model even surpassing GPT-based models on execution accuracy. We also apply the Chain-of-Feedback (CoF) methodology to further improve model performance, demonstrating its effectiveness across multiple LLMs. This work provides a valuable resource for Turkish NLP and addresses specific challenges in developing accurate Text-to-SQL models for low-resource languages.


Pages: 169379-169387

Page count: 9

