TURSpider: A Turkish Text-to-SQL Dataset and LLM-Based Study

Cited by: 0
Authors
Kanburoglu, Ali Bugra [1]
Tek, Faik Boray [2]
Affiliations
[1] Isik Univ, Dept Comp Engn, TR-34980 Istanbul, Turkiye
[2] Istanbul Tech Univ, Dept Artificial Intelligence & Data Engn, TR-34467 Istanbul, Turkiye
Source
IEEE Access | 2024 / Vol. 12
Keywords
Training; Structured Query Language; Accuracy; Error analysis; Benchmark testing; Cognition; Encoding; Text-to-SQL; LLM; large language models; Turkish; dataset; TURSpider
DOI
10.1109/ACCESS.2024.3498841
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper introduces TURSpider, a novel Turkish Text-to-SQL dataset developed through human translation of the widely used Spider dataset, aimed at addressing the current lack of complex, cross-domain SQL datasets for the Turkish language. TURSpider incorporates a wide range of query difficulties, including nested queries, to create a comprehensive benchmark for Turkish Text-to-SQL tasks. The dataset enables cross-language comparison and significantly enhances the training and evaluation of large language models (LLMs) in generating SQL queries from Turkish natural language inputs. We fine-tuned several Turkish-supported LLMs on TURSpider and evaluated their performance in comparison to state-of-the-art models like GPT-3.5 Turbo and GPT-4. Our results show that fine-tuned Turkish LLMs demonstrate competitive performance, with one model even surpassing GPT-based models on execution accuracy. We also apply the Chain-of-Feedback (CoF) methodology to further improve model performance, demonstrating its effectiveness across multiple LLMs. This work provides a valuable resource for Turkish NLP and addresses specific challenges in developing accurate Text-to-SQL models for low-resource languages.
Pages: 169379-169387
Number of pages: 9