TURSpider: A Turkish Text-to-SQL Dataset and LLM-Based Study

Authors
Kanburoglu, Ali Bugra [1]
Tek, Faik Boray [2]
Affiliations
[1] Isik Univ, Dept Comp Engn, TR-34980 Istanbul, Turkiye
[2] Istanbul Tech Univ, Dept Artificial Intelligence & Data Engn, TR-34467 Istanbul, Turkiye
Source
IEEE ACCESS | 2024 / Volume 12
Keywords
Training; Structured Query Language; Accuracy; Error analysis; Benchmark testing; Cognition; Encoding; Text-to-SQL; LLM; large language models; Turkish; dataset; TURSpider;
DOI
10.1109/ACCESS.2024.3498841
Chinese Library Classification Number
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
This paper introduces TURSpider, a novel Turkish Text-to-SQL dataset developed through human translation of the widely used Spider dataset, aimed at addressing the current lack of complex, cross-domain SQL datasets for the Turkish language. TURSpider incorporates a wide range of query difficulties, including nested queries, to create a comprehensive benchmark for Turkish Text-to-SQL tasks. The dataset enables cross-language comparison and significantly enhances the training and evaluation of large language models (LLMs) in generating SQL queries from Turkish natural language inputs. We fine-tuned several Turkish-supported LLMs on TURSpider and evaluated their performance in comparison to state-of-the-art models like GPT-3.5 Turbo and GPT-4. Our results show that fine-tuned Turkish LLMs demonstrate competitive performance, with one model even surpassing GPT-based models on execution accuracy. We also apply the Chain-of-Feedback (CoF) methodology to further improve model performance, demonstrating its effectiveness across multiple LLMs. This work provides a valuable resource for Turkish NLP and addresses specific challenges in developing accurate Text-to-SQL models for low-resource languages.
Pages: 169379-169387
Page count: 9