DIR: A Large-Scale Dialogue Rewrite Dataset for Cross-Domain Conversational Text-to-SQL

Cited by: 2
Authors
Li, Jieyu [1]
Chen, Zhi [1]
Chen, Lu [1]
Zhu, Zichen [1]
Li, Hanqi [1]
Cao, Ruisheng [1]
Yu, Kai [1]
Affiliations
[1] Shanghai Jiao Tong Univ, AI Inst, Dept Comp Sci & Engn, X LANCE Lab,MoE Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2023 / Vol. 13 / Issue 04
Keywords
dialogue rewrite; conversational text-to-SQL; two-stage framework
DOI
10.3390/app13042262
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Semantic co-reference and ellipsis often cause information deficiency when natural language utterances in a multi-turn dialogue are parsed into SQL (i.e., the conversational text-to-SQL task). Dividing the dialogue understanding task into dialogue utterance rewriting followed by language understanding is a feasible way to tackle this problem. To this end, we present a two-stage framework for conversational text-to-SQL tasks. To build an effective rewriting model for the first stage, we provide a large-scale dialogue rewrite dataset (DIR), extended from two cross-domain conversational text-to-SQL datasets, SParC and CoSQL. The dataset contains 5908 dialogues covering 160 domains. It is therefore not only a resource for conversational text-to-SQL tasks, but also a valuable corpus for dialogue rewrite research. In experiments, we validate the effectiveness of our annotations with a popular text-to-SQL parser, RAT-SQL. The results show QEM accuracy improvements of 11.81 and 27.17 on SParC and CoSQL, respectively, when the problem of semantically incomplete representations is eliminated by directly parsing the gold rewritten utterances. Evaluating the two-stage framework with different rewrite models shows that the effectiveness of the rewrite model is important and still needs improvement. Additionally, as a new benchmark for the dialogue rewrite task, we report the performance of different baselines for related studies. Our dataset will be publicly available once this paper is accepted.
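The two-stage idea in the abstract (first rewrite each context-dependent utterance into a standalone one, then parse it as single-turn text-to-SQL) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the example dialogue, and the lookup-table "models" below are all hypothetical stand-ins for a trained rewrite model and a parser such as RAT-SQL.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions, not from the paper):
# a rewriter maps (dialogue history, current utterance) -> standalone utterance,
# a parser maps a standalone utterance -> SQL string.
Rewriter = Callable[[List[str], str], str]
Parser = Callable[[str], str]


def two_stage_parse(history: List[str], utterance: str,
                    rewriter: Rewriter, parser: Parser) -> str:
    """Stage 1: resolve co-reference/ellipsis. Stage 2: single-turn parsing."""
    standalone = rewriter(history, utterance)
    return parser(standalone)


def run_dialogue(utterances: List[str],
                 rewriter: Rewriter, parser: Parser) -> List[str]:
    """Parse every turn of a dialogue, feeding earlier turns as history."""
    history: List[str] = []
    sqls: List[str] = []
    for utterance in utterances:
        sqls.append(two_stage_parse(history, utterance, rewriter, parser))
        history.append(utterance)
    return sqls


# Toy stand-ins: lookup tables in place of learned models (illustration only).
TOY_REWRITES: Dict[str, str] = {
    "Only those older than 30.": "Show all singers older than 30.",
}
TOY_SQL: Dict[str, str] = {
    "Show all singers.": "SELECT * FROM singer",
    "Show all singers older than 30.": "SELECT * FROM singer WHERE age > 30",
}


def toy_rewriter(history: List[str], utterance: str) -> str:
    # The first turn has no context to resolve, so it passes through unchanged.
    if not history:
        return utterance
    return TOY_REWRITES.get(utterance, utterance)


def toy_parser(utterance: str) -> str:
    return TOY_SQL[utterance]


DIALOGUE = ["Show all singers.", "Only those older than 30."]
RESULT = run_dialogue(DIALOGUE, toy_rewriter, toy_parser)
```

Because the parser only ever sees standalone utterances, any single-turn text-to-SQL parser can in principle be plugged into the second stage, which is why the abstract stresses the rewrite model as the component that still needs improvement.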
Pages: 13
Related papers
50 in total
  • [1] Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task
    Yu, Tao
    Zhang, Rui
    Yang, Kai
    Yasunaga, Michihiro
    Wang, Dongxu
    Li, Zifan
    Ma, James
    Li, Irene
    Yao, Qingning
    Roman, Shanelle
    Zhang, Zilin
    Radev, Dragomir R.
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 3911 - 3921
  • [2] DuSQL: A Large-Scale and Pragmatic Chinese Text-to-SQL Dataset
    Wang, Lijie
    Zhang, Ao
    Wu, Kun
    Sun, Ke
    Li, Zhenghua
    Wu, Hua
    Zhang, Min
    Wang, Haifeng
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 6923 - 6935
  • [3] Selective Demonstrations for Cross-domain Text-to-SQL
    Chang, Shuaichen
    Fosler-Lussier, Eric
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 14174 - 14189
  • [4] A Review of Cross-Domain Text-to-SQL Models
    Gan, Yujian
    Purver, Matthew
    Woodward, John R.
    AACL-IJCNLP 2020: THE 1ST CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 10TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING: PROCEEDINGS OF THE STUDENT RESEARCH WORKSHOP, 2020, : 101 - 108
  • [5] CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset
    Zhang, Hanchong
    Li, Jieyu
    Chen, Lu
    Cao, Ruisheng
    Zhang, Yunyan
    Huang, Yu
    Zheng, Yefeng
    Yu, Kai
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 6970 - 6983
  • [6] Evaluating Cross-Domain Text-to-SQL Models and Benchmarks
    Pourreza, Mohammadreza
    Rafiei, Davood
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 1601 - 1611
  • [7] PHOTON: A Robust Cross-Domain Text-to-SQL System
    Zeng, Jichuan
    Lin, Xi Victoria
    Xiong, Caiming
    Socher, Richard
    Lyu, Michael R.
    King, Irwin
    Hoi, Steven C. H.
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020): SYSTEM DEMONSTRATIONS, 2020, : 204 - 214
  • [8] CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases
    Yu, Tao
    Zhang, Rui
    Er, He Yang
    Li, Suyi
    Xue, Eric
    Pang, Bo
    Lin, Xi Victoria
    Tan, Yi Chern
    Shi, Tianze
    Li, Zihan
    Jiang, Youxuan
    Yasunaga, Michihiro
    Shim, Sungrok
    Chen, Tao
    Fabbri, Alexander
    Li, Zifan
    Chen, Luyao
    Zhang, Yuwen
    Dixit, Shreya
    Zhang, Vincent
    Xiong, Caiming
    Socher, Richard
    Lasecki, Walter S.
    Radev, Dragomir
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 1962 - 1979
  • [9] Exploring Underexplored Limitations of Cross-Domain Text-to-SQL Generalization
    Gan, Yujian
    Chen, Xinyun
    Purver, Matthew
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 8926 - 8931
  • [10] CrossWOZ: A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue Dataset
    Zhu, Qi
    Huang, Kaili
    Zhang, Zheng
    Zhu, Xiaoyan
    Huang, Minlie
    TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2020, 8 (08) : 281 - 295