DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus

Cited by: 0
Authors
Bruemmer, Martin [1]
Dojchinovski, Milan [1, 2]
Hellmann, Sebastian [1]
Affiliations
[1] Univ Leipzig, InfAI, AKSW, Leipzig, Germany
[2] Czech Tech Univ, FIT, Web Intelligence Res Grp, Prague, Czech Republic
Funding
EU Horizon 2020; EU Seventh Framework Programme
Keywords
training; dbpedia; corpus; named entity recognition; named entity linking; nlp
DOI
Not available
Chinese Library Classification number
H [Language and Writing]
Discipline classification code
05
Abstract
The ever-increasing importance of machine learning in Natural Language Processing is accompanied by an equally increasing need for large-scale training and evaluation corpora. Owing to its size, openness, and relative quality, Wikipedia has already served as a source of such data, but only on a limited scale. This paper introduces the DBpedia Abstract Corpus, a large-scale, open corpus of annotated Wikipedia texts in six languages, featuring over 11 million texts and over 97 million entity links. The paper describes the properties of the Wikipedia texts, the corpus creation process, the corpus format, and use cases of interest such as Named Entity Linking training and evaluation.
Pages: 3339-3343
Page count: 5