Dataset construction method of cross-lingual summarization based on filtering and text augmentation

Cited by: 0
Authors
Pan, Hangyu [1]
Xi, Yaoyi [1]
Wang, Ling [1]
Nan, Yu [1]
Su, Zhizhong [1]
Cao, Rong [1]
Affiliations
[1] State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, China
Funding
National Social Science Fund of China
Keywords
Large dataset; Quality control; Semantics
DOI
10.7717/PEERJ-CS.1299
Abstract
Existing cross-lingual summarization (CLS) datasets suffer from inconsistent sample quality and small scale. To address these problems, we propose a method that jointly supervises quality and scale when building CLS datasets. For quality supervision, the method adopts a multi-strategy filtering algorithm that removes low-quality monolingual summarization (MS) samples at both the character level and the semantic level, thereby improving the quality of the MS dataset. For scale supervision, the method adopts a text augmentation algorithm based on a pretrained model to increase the size of CLS datasets while maintaining quality. We used this method to build an English-Chinese CLS dataset and evaluated it with a data quality evaluation framework. The evaluation results show that the dataset is of good quality and large size. These outcomes suggest that the proposed method may improve both quality and scale, yielding a high-quality, large-scale CLS dataset at a lower cost. © 2023 Pan et al.
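The abstract does not specify the individual filtering strategies, so the following is only a minimal sketch of what a two-stage filter over (document, summary) pairs could look like. Everything here is an assumption made for illustration: the function names, the thresholds, the length-based character check, and the token-overlap score, which is a crude stand-in for a real semantic similarity measure and not the authors' algorithm.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def passes_character_filter(document, summary, max_ratio=0.5, min_summary_tokens=3):
    """Hypothetical character-level check: the summary must not be trivially
    short and must be clearly shorter than the document."""
    doc_len = len(tokens(document))
    sum_len = len(tokens(summary))
    if doc_len == 0 or sum_len < min_summary_tokens:
        return False
    return sum_len / doc_len <= max_ratio


def overlap_score(document, summary):
    """Fraction of summary tokens that also occur in the document.
    Assumed proxy for semantic relatedness, not a real semantic model."""
    doc_vocab = set(tokens(document))
    sum_tokens = tokens(summary)
    if not sum_tokens:
        return 0.0
    return sum(t in doc_vocab for t in sum_tokens) / len(sum_tokens)


def filter_samples(samples, min_overlap=0.5):
    """Keep only pairs that pass both the character check and the overlap check."""
    kept = []
    for document, summary in samples:
        if (passes_character_filter(document, summary)
                and overlap_score(document, summary) >= min_overlap):
            kept.append((document, summary))
    return kept


news = ("The city council approved a new budget for public transport "
        "on Monday after a long debate.")
samples = [
    (news, "Council approved new transport budget."),   # plausible pair
    (news, "Click here"),                               # too short
    ("Short text.", "Short text repeated again and again and again."),  # longer than source
    ("Researchers released a large dataset of news articles for training "
     "summarization models.", "Weather will be sunny."),  # unrelated content
]
kept = filter_samples(samples)
```

In this toy run only the first pair survives; the other three are rejected for a too-short summary, a summary longer than its source, and unrelated content, respectively. A production pipeline would likely replace `overlap_score` with embedding-based similarity from a pretrained model.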
Related papers
50 results in total
  • [1] Dataset construction method of cross-lingual summarization based on filtering and text augmentation
    Pan, Hangyu
    Xi, Yaoyi
    Wang, Ling
    Nan, Yu
    Su, Zhizhong
    Cao, Rong
    [J]. PEERJ COMPUTER SCIENCE, 2023, 9
  • [2] CATAMARAN: A Cross-lingual Long Text Abstractive Summarization Dataset
    Chen, Zheng
    Lin, Hongyu
    [J]. LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 6932 - 6937
  • [3] Cross-lingual text filtering based on text concepts and kNN
    Li, SZ
    Su, WF
    Li, TQ
    Chen, HW
    [J]. PACLIC 17: Language, Information and Computation, Proceedings, 2003, : 166 - 173
  • [4] Cross-Lingual Speech-to-Text Summarization
    Pontes, Elvys Linhares
    Gonzalez-Gallardo, Carlos-Emiliano
    Torres-Moreno, Juan-Manuel
    Huet, Stephane
    [J]. MULTIMEDIA AND NETWORK INFORMATION SYSTEMS, 2019, 833 : 385 - 395
  • [5] A Cross-Lingual Summarization method based on cross-lingual Fact-relationship Graph Generation
    Zhang, Yongbing
    Gao, Shengxiang
    Huang, Yuxin
    Tan, Kaiwen
    Yu, Zhengtao
    [J]. PATTERN RECOGNITION, 2024, 146
  • [6] Cross-Lingual Korean Speech-to-Text Summarization
    Yoon, HyoJeon
    Dinh Tuyen Hoang
    Ngoc Thanh Nguyen
    Hwang, Dosam
    [J]. INTELLIGENT INFORMATION AND DATABASE SYSTEMS, ACIIDS 2019, PT I, 2019, 11431 : 198 - 206
  • [7] Cross-lingual Cross-temporal Summarization: Dataset, Models, Evaluation
    Zhang, Ran
    Ouni, Jihed
    Eger, Steffen
    [J]. COMPUTATIONAL LINGUISTICS, 2024, 50 (03) : 1001 - 1047
  • [8] WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization
    Ladhak, Faisal
    Durmus, Esin
    Cardie, Claire
    McKeown, Kathleen
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 4034 - 4048
  • [9] Cross-lingual timeline summarization
    Cagliero, Luca
    La Quatra, Moreno
    Garza, Paolo
    Baralis, Elena
    [J]. 2021 IEEE FOURTH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND KNOWLEDGE ENGINEERING (AIKE 2021), 2021, : 45 - 53
  • [10] A Survey on Cross-Lingual Summarization
    Wang, Jiaan
    Meng, Fandong
    Zheng, Duo
    Liang, Yunlong
    Li, Zhixu
    Qu, Jianfeng
    Zhou, Jie
    [J]. TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2022, 10 : 1304 - 1323