Data Augmentation with Hierarchical SQL-to-Question Generation for Cross-domain Text-to-SQL Parsing

Cited by: 0
Authors
Wu, Kun [1, 2]
Wang, Lijie [2]
Li, Zhenghua [1]
Zhang, Ao [2]
Xiao, Xinyan [2]
Wu, Hua [2]
Zhang, Min [1]
Wang, Haifeng [2]
Affiliations
[1] Soochow Univ, Sch Comp Sci & Technol, Inst Artificial Intelligence, Suzhou, Peoples R China
[2] Baidu Inc, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data augmentation has attracted a lot of research attention in the deep learning era for its ability to alleviate data sparseness. The lack of labeled data for unseen evaluation databases is precisely the major challenge in cross-domain text-to-SQL parsing. Previous work either requires human intervention to guarantee the quality of generated data or fails to handle complex SQL queries. This paper presents a simple yet effective data augmentation framework. First, given a database, we automatically produce a large number of SQL queries based on an abstract syntax tree grammar. For better distribution matching, we require that at least 80% of SQL patterns in the training data are covered by generated queries. Second, we propose a hierarchical SQL-to-question generation model to obtain high-quality natural language questions, which is the major contribution of this work. Finally, we design a simple sampling strategy that can greatly improve training efficiency given large amounts of generated data. Experiments on three cross-domain datasets, i.e., WikiSQL and Spider in English, and DuSQL in Chinese, show that our proposed data augmentation framework can consistently improve performance over strong baselines, and that the hierarchical generation component is the key to the improvement.
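The 80% pattern-coverage requirement in the abstract can be illustrated with a small sketch. The abstract does not define "SQL pattern" or say how coverage is counted, so everything below is an assumption made for illustration: a pattern is taken to be the query with identifiers replaced by `ID` and literals by `VALUE`, coverage is counted over distinct training patterns, and the names `sql_pattern` and `pattern_coverage` are hypothetical. It is not the authors' implementation.

```python
import re

# Assumption: a small keyword list is enough for this illustration.
SQL_KEYWORDS = {
    "select", "from", "where", "group", "by", "order", "having", "limit",
    "join", "on", "as", "and", "or", "not", "in", "like", "between",
    "distinct", "asc", "desc", "count", "sum", "avg", "min", "max",
    "union", "intersect", "except",
}

# Quoted strings, numbers, (optionally dotted) names, or one punctuation char.
TOKEN_RE = re.compile(
    r"'[^']*'|\"[^\"]*\"|\d+(?:\.\d+)?|\w+(?:\.\w+)?|[^\s\w]"
)


def sql_pattern(sql: str) -> str:
    """Abstract a query: keep keywords and punctuation, mask names and literals."""
    out = []
    for tok in TOKEN_RE.findall(sql):
        low = tok.lower()
        if low in SQL_KEYWORDS:
            out.append(low)
        elif tok[0] in "'\"" or tok[0].isdigit():
            out.append("VALUE")  # string or numeric literal
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")     # table or column name
        else:
            out.append(tok)      # operator or punctuation
    return " ".join(out)


def pattern_coverage(train_sqls, generated_sqls) -> float:
    """Fraction of distinct training patterns that also occur in generated queries."""
    train_patterns = {sql_pattern(q) for q in train_sqls}
    generated_patterns = {sql_pattern(q) for q in generated_sqls}
    if not train_patterns:
        return 0.0
    return len(train_patterns & generated_patterns) / len(train_patterns)


if __name__ == "__main__":
    train = [
        "SELECT name FROM singer WHERE age > 30",
        "SELECT count(*) FROM singer",
        "SELECT name FROM singer ORDER BY age DESC LIMIT 1",
    ]
    generated = [
        "SELECT title FROM movie WHERE year > 1999",
        "SELECT count(*) FROM movie",
    ]
    coverage = pattern_coverage(train, generated)
    # Under the stated assumption, generation would continue until coverage >= 0.8.
    print(f"coverage = {coverage:.2f}, meets 80% threshold: {coverage >= 0.8}")
```

In this toy example two of the three training patterns are matched, so the generated set would fall short of the threshold and more queries would need to be produced.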
Pages: 8974-8983
Number of pages: 10
Related papers
50 in total
  • [1] Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing
    Lin, Xi Victoria
    Socher, Richard
    Xiong, Caiming
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 4870 - 4888
  • [2] Semantic Decomposition of Question and SQL for Text-to-SQL Parsing
    Eyal, Ben
    Bachar, Amir
    Haroche, Ophir
    Mahabi, Moran
    Elhadad, Michael
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 13629 - 13645
  • [3] Selective Demonstrations for Cross-domain Text-to-SQL
    Chang, Shuaichen
    Fosler-Lussier, Eric
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 14174 - 14189
  • [4] A Review of Cross-Domain Text-to-SQL Models
    Gan, Yujian
    Purver, Matthew
    Woodward, John R.
    AACL-IJCNLP 2020: THE 1ST CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 10TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING: PROCEEDINGS OF THE STUDENT RESEARCH WORKSHOP, 2020, : 101 - 108
  • [5] PHOTON: A Robust Cross-Domain Text-to-SQL System
    Zeng, Jichuan
    Lin, Xi Victoria
    Xiong, Caiming
    Socher, Richard
    Lyu, Michael R.
    King, Irwin
    Hoi, Steven C. H.
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020): SYSTEM DEMONSTRATIONS, 2020, : 204 - 214
  • [6] Evaluating Cross-Domain Text-to-SQL Models and Benchmarks
    Pourreza, Mohammadreza
    Rafiei, Davood
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 1601 - 1611
  • [7] Thai Question Text-To-SQL Parsing Using Transformer
    Tungruethaipak, Natthawat
    Prom-on, Santitham
    2024 21ST INTERNATIONAL JOINT CONFERENCE ON COMPUTER SCIENCE AND SOFTWARE ENGINEERING, JCSSE 2024, 2024, : 631 - 637
  • [8] Exploring Underexplored Limitations of Cross-Domain Text-to-SQL Generalization
    Gan, Yujian
    Chen, Xinyun
    Purver, Matthew
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 8926 - 8931
  • [9] Clause-Wise and Recursive Decoding for Complex and Cross-Domain Text-to-SQL Generation
    Lee, Dongjun
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 6045 - 6051
  • [10] Decoupling SQL query hardness parsing for text-to-SQL
    Yi, Jiawen
    Chen, Guo
    Zhou, Xiaojun
    Neurocomputing, 621