MassiveSumm: a very large-scale, very multilingual, newswire summarisation dataset

Authors
Varab, Daniel [1]
Schluter, Natalie [1]
Affiliations
[1] IT Univ Copenhagen, Copenhagen, Denmark
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Current research in automatic summarisation is unapologetically Anglo-centered, a persistent state of affairs that also predates neural net approaches. High-quality automatic summarisation datasets are notoriously expensive to create, posing a challenge for any language. However, with digitalisation, archiving, and social media advertising of newswire articles, recent work has shown how, with careful application of methodology, large-scale datasets can now simply be gathered instead of written. In this paper, we present a large-scale multilingual summarisation dataset containing 28.8 million articles in 92 languages and more than 35 writing scripts. This is both the largest, most inclusive existing automatic summarisation dataset and one of the largest, most inclusive datasets ever published for any NLP task. We present the first investigation of the efficacy of resource building from news platforms in the low-resource language setting. Finally, we provide some first insight into how low-resource language settings impact state-of-the-art automatic summarisation system performance.
Pages: 10150-10161
Page count: 12