PKU Paraphrase Bank: A Sentence-Level Paraphrase Corpus for Chinese

Cited by: 3
Authors
Zhang, Bowei [1 ,2 ,4 ]
Sun, Weiwei [1 ,2 ,3 ]
Wan, Xiaojun [1 ,2 ]
Guo, Zongming [1 ]
Affiliations
[1] Peking Univ, Inst Comp Sci & Technol, Beijing, Peoples R China
[2] Peking Univ, MOE Key Lab Computat Linguist, Beijing, Peoples R China
[3] Peking Univ, Ctr Chinese Linguist, Beijing, Peoples R China
[4] Peking Univ, Ctr Data Sci, Beijing, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
Paraphrase; Paraphrase extraction; Sentence embedding; Sentence similarity;
DOI
10.1007/978-3-030-32233-5_63
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
One of the main challenges in paraphrase research is the lack of large-scale, high-quality corpora, a problem that is particularly serious for languages other than English. In this paper, we present a simple and effective unsupervised learning model that automatically extracts high-quality sentence-level paraphrases from multiple Chinese translations of the same source texts. Applying this model, we obtain a large-scale paraphrase corpus containing 509,832 pairs of paraphrased sentences. The quality of the corpus is manually examined. The model is language-independent, so paraphrase corpora for other languages can be built in the same way.
Pages: 814 - 826
Page count: 13
Related papers
  • [1] Monolingual Paraphrase Detection Corpus for Low Resource Pashto Language at Sentence Level
    Ali, Iqra
    Kamigaito, Hidetaka
    Watanabe, Taro
    2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, 2024, : 11574 - 11581
  • [2] Urdu Short Paraphrase Detection at Sentence Level
    Hafeez, Hamza
    Muneer, Iqra
    Sharjeel, Muhammad
    Ashraf, Muhammad Adnan
    Nawab, Rao Muhammad Adeel
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (04)
  • [3] Construction of a Russian Paraphrase Corpus: Unsupervised Paraphrase Extraction
    Pronoza, Ekaterina
    Yagunova, Elena
    Pronoza, Anton
    INFORMATION RETRIEVAL, (RUSSIR 2015), 2016, 573 : 146 - 157
  • [4] Turkish Paraphrase Corpus
    Demir, Seniz
    El-Kahlout, Ilknur Durgar
    Unal, Erdem
    Kaya, Hamza
    LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2012, : 4087 - 4091
  • [5] Sentence Level Paraphrase Recognition Based on Different Characteristics Combination
    Zhang, Maoyuan
    Zhang, Hong
    Wu, Deyu
    Pan, Xiaohang
    CHINESE COMPUTATIONAL LINGUISTICS AND NATURAL LANGUAGE PROCESSING BASED ON NATURALLY ANNOTATED BIG DATA, CCL 2014, 2014, 8801 : 279 - 289
  • [6] ETPC - A Paraphrase Identification Corpus Annotated with Extended Paraphrase Typology and Negation
    Kovatchev, Venelin
    Antonia Marti, M.
    Salamo, Maria
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 1384 - 1392
  • [7] CUED REPRODUCTION AND PARAPHRASE OF A SIMPLE SENTENCE
    ITOH, Y
    KOYAZU, T
    JAPANESE JOURNAL OF PSYCHOLOGY, 1981, 52 (03): : 159 - 165
  • [8] Towards Document-Level Paraphrase Generation with Sentence Rewriting and Reordering
    Lin, Zhe
    Cai, Yitao
    Wan, Xiaojun
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 1033 - 1044
  • [9] MULTIDIMENSIONAL ANALYSIS OF A PARAPHRASE CORPUS WITH POUVOIR
    CARON, J
    LEGOFF, M
    CARONPARGUE, J
    LANGUE FRANCAISE, 1989, (84): : 117 - 128
  • [10] UPPC - Urdu Paraphrase Plagiarism Corpus
    Sharjeel, Muhammad
    Rayson, Paul
    Nawab, Rao Muhammad Adeel
    LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2016, : 1832 - 1836