TURSpider: A Turkish Text-to-SQL Dataset and LLM-Based Study

Authors
Kanburoglu, Ali Bugra [1]
Tek, Faik Boray [2]
Affiliations
[1] Isik Univ, Dept Comp Engn, TR-34980 Istanbul, Turkiye
[2] Istanbul Tech Univ, Dept Artificial Intelligence & Data Engn, TR-34467 Istanbul, Turkiye
Source
IEEE ACCESS | 2024 / Vol. 12
Keywords
Training; Structured Query Language; Accuracy; Error analysis; Benchmark testing; Cognition; Encoding; Text-to-SQL; LLM; large language models; Turkish; dataset; TURSpider;
DOI
10.1109/ACCESS.2024.3498841
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
This paper introduces TURSpider, a novel Turkish Text-to-SQL dataset developed through human translation of the widely used Spider dataset, aimed at addressing the current lack of complex, cross-domain SQL datasets for the Turkish language. TURSpider incorporates a wide range of query difficulties, including nested queries, to create a comprehensive benchmark for Turkish Text-to-SQL tasks. The dataset enables cross-language comparison and significantly enhances the training and evaluation of large language models (LLMs) in generating SQL queries from Turkish natural language inputs. We fine-tuned several Turkish-supported LLMs on TURSpider and evaluated their performance in comparison to state-of-the-art models like GPT-3.5 Turbo and GPT-4. Our results show that fine-tuned Turkish LLMs demonstrate competitive performance, with one model even surpassing GPT-based models on execution accuracy. We also apply the Chain-of-Feedback (CoF) methodology to further improve model performance, demonstrating its effectiveness across multiple LLMs. This work provides a valuable resource for Turkish NLP and addresses specific challenges in developing accurate Text-to-SQL models for low-resource languages.
Pages: 169379-169387
Page count: 9