PKU Paraphrase Bank: A Sentence-Level Paraphrase Corpus for Chinese

Cited by: 3
Authors
Zhang, Bowei [1, 2, 4]
Sun, Weiwei [1, 2, 3]
Wan, Xiaojun [1, 2]
Guo, Zongming [1]
Affiliations
[1] Peking Univ, Inst Comp Sci & Technol, Beijing, Peoples R China
[2] Peking Univ, MOE Key Lab Computat Linguist, Beijing, Peoples R China
[3] Peking Univ, Ctr Chinese Linguist, Beijing, Peoples R China
[4] Peking Univ, Ctr Data Sci, Beijing, Peoples R China
Funding
National Natural Science Foundation of China; National Key R&D Program of China
Keywords
Paraphrase; Paraphrase extraction; Sentence embedding; Sentence similarity
DOI
10.1007/978-3-030-32233-5_63
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
One of the main challenges in paraphrase research is the lack of large-scale, high-quality corpora, a problem that is particularly serious for languages other than English. In this paper, we present a simple and effective unsupervised learning model that automatically extracts high-quality sentence-level paraphrases from multiple Chinese translations of the same source texts. Applying this model, we obtain a large-scale paraphrase corpus containing 509,832 pairs of paraphrased sentences. The quality of the corpus is examined manually. The model is language-independent, so paraphrase corpora for other languages can be built in the same way.
Pages: 814-826
Page count: 13