MassiveSumm: a very large-scale, very multilingual, newswire summarisation dataset

Cited by: 0

Authors:

Varab, Daniel [1]

Schluter, Natalie [1]

Affiliations:

[1] IT Univ Copenhagen, Copenhagen, Denmark

Source:

2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021) | 2021

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Current research in automatic summarisation is unapologetically anglo-centered, a persistent state-of-affairs, which also predates neural net approaches. High-quality automatic summarisation datasets are notoriously expensive to create, posing a challenge for any language. However, with digitalisation, archiving, and social media advertising of newswire articles, recent work has shown how, with careful methodology application, large-scale datasets can now be simply gathered instead of written. In this paper, we present a large-scale multilingual summarisation dataset containing articles in 92 languages, spread across 28.8 million articles, in more than 35 writing scripts. This is both the largest, most inclusive, existing automatic summarisation dataset, as well as one of the largest, most inclusive, ever published datasets for any NLP task. We present the first investigation on the efficacy of resource building from news platforms in the low-resource language setting. Finally, we provide some first insight on how low-resource language settings impact state-of-the-art automatic summarisation system performance.

Pages: 10150-10161

Number of pages: 12
