DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus

Cited by: 0
Authors
Bruemmer, Martin [1]
Dojchinovski, Milan [1, 2]
Hellmann, Sebastian [1]
Affiliations
[1] Univ Leipzig, InfAI, AKSW, Leipzig, Germany
[2] Czech Tech Univ, FIT, Web Intelligence Res Grp, Prague, Czech Republic
Funding
EU Horizon 2020; EU Seventh Framework Programme (FP7)
Keywords
training; dbpedia; corpus; named entity recognition; named entity linking; nlp;
DOI
Not available
Chinese Library Classification number
H [Language and Writing]
Discipline classification code
05
Abstract
The ever-increasing importance of machine learning in Natural Language Processing is accompanied by an equally increasing need for large-scale training and evaluation corpora. Owing to its size, openness and relative quality, Wikipedia has already served as a source of such data, but only on a limited scale. This paper introduces the DBpedia Abstract Corpus, a large-scale, open corpus of annotated Wikipedia texts in six languages, featuring over 11 million texts and over 97 million entity links. The paper describes the properties of the Wikipedia texts, the corpus creation process, the corpus format, and use cases such as Named Entity Linking training and evaluation.
Pages: 3339-3343
Page count: 5