MassiveSumm: a very large-scale, very multilingual, newswire summarisation dataset

Authors
Varab, Daniel [1]
Schluter, Natalie [1]
Affiliation
[1] IT Univ Copenhagen, Copenhagen, Denmark
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Current research in automatic summarisation is unapologetically Anglo-centred, a persistent state of affairs that predates neural network approaches. High-quality automatic summarisation datasets are notoriously expensive to create, which poses a challenge for any language. However, with the digitalisation, archiving, and social media advertising of newswire articles, recent work has shown that, with careful application of methodology, large-scale datasets can now simply be gathered instead of written. In this paper, we present a large-scale multilingual summarisation dataset containing 28.8 million articles in 92 languages and more than 35 writing scripts. This is both the largest and most inclusive existing automatic summarisation dataset and one of the largest and most inclusive datasets ever published for any NLP task. We present the first investigation into the efficacy of resource building from news platforms in the low-resource language setting. Finally, we provide some first insights into how low-resource language settings affect the performance of state-of-the-art automatic summarisation systems.
Pages: 10150-10161
Page count: 12