DANEWSROOM: A Large-scale Danish Summarisation Dataset

Authors
Varab, Daniel [1]
Schluter, Natalie [1]
Affiliations
[1] IT Univ Copenhagen, Copenhagen, Denmark
Keywords
automatic text summarisation; data collection; Danish corpus
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Dataset development for automatic summarisation systems is notoriously English-oriented. In this paper we present the first large-scale non-English dataset specifically curated for automatic summarisation. The document-summary pairs are news articles and manually written summaries in the Danish language. No previous work has established a Danish summarisation dataset, nor has any work been published on the automatic summarisation of Danish. We therefore provide the first automatic summarisation dataset for the Danish language (large-scale or otherwise). To support the comparison of future automatic summarisation systems for Danish, we report the performance on this dataset of strong, well-established unsupervised baseline systems, together with an oracle extractive summariser; this is the first account of automatic summarisation system performance for Danish. Finally, we make all code for automatically acquiring the data freely available and make explicit how this technology can easily be adapted to acquire automatic summarisation datasets for further languages.
Pages: 6731-6739
Number of pages: 9