PKU Paraphrase Bank: A Sentence-Level Paraphrase Corpus for Chinese

Cited by: 3
Authors
Zhang, Bowei [1, 2, 4]
Sun, Weiwei [1, 2, 3]
Wan, Xiaojun [1, 2]
Guo, Zongming [1]
Affiliations
[1] Peking Univ, Inst Comp Sci & Technol, Beijing, Peoples R China
[2] Peking Univ, MOE Key Lab Computat Linguist, Beijing, Peoples R China
[3] Peking Univ, Ctr Chinese Linguist, Beijing, Peoples R China
[4] Peking Univ, Ctr Data Sci, Beijing, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China
Keywords
Paraphrase; Paraphrase extraction; Sentence embedding; Sentence similarity
DOI
10.1007/978-3-030-32233-5_63
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
One of the main challenges in paraphrase research is the lack of large-scale, high-quality corpora, a problem that is particularly serious for languages other than English. In this paper, we present a simple and effective unsupervised learning model that automatically extracts high-quality sentence-level paraphrases from multiple Chinese translations of the same source texts. By applying this model, we obtain a large-scale paraphrase corpus containing 509,832 pairs of paraphrased sentences. The quality of the corpus is manually examined. The model is language-independent, so paraphrase corpora for other languages can be built in the same way.
Pages: 814-826
Number of pages: 13