DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus

Cited by: 0
Authors
Bruemmer, Martin [1]
Dojchinovski, Milan [1, 2]
Hellmann, Sebastian [1]
Affiliations
[1] Univ Leipzig, InfAI, AKSW, Leipzig, Germany
[2] Czech Tech Univ, FIT, Web Intelligence Res Grp, Prague, Czech Republic
Funding
EU Horizon 2020; EU Seventh Framework Programme
Keywords
training; dbpedia; corpus; named entity recognition; named entity linking; nlp
DOI
Not available
Chinese Library Classification number
H [Language and Writing]
Subject classification code
05
Abstract
The ever-increasing importance of machine learning in Natural Language Processing is accompanied by an equally increasing need for large-scale training and evaluation corpora. Due to its size, openness, and relative quality, Wikipedia has already been a source of such data, but on a limited scale. This paper introduces the DBpedia Abstract Corpus, a large-scale, open corpus of annotated Wikipedia texts in six languages, featuring over 11 million texts and over 97 million entity links. The paper describes the properties of the Wikipedia texts, the corpus creation process, the corpus format, and interesting use cases such as Named Entity Linking training and evaluation.
Pages: 3339 - 3343
Page count: 5
Related papers
50 in total
  • [21] Empowering OCL research: a large-scale corpus of open-source data from GitHub
    Josh G. M. Mengerink
    Jeroen Noten
    Alexander Serebrenik
    Empirical Software Engineering, 2019, 24 : 1574 - 1609
  • [22] LARGE-SCALE PERTURBATIONS IN THE OPEN UNIVERSE
    LYTH, DH
    WOSZCZYNA, A
    PHYSICAL REVIEW D, 1995, 52 (06) : 3338 - 3357
  • [23] FbMultiLingMisinfo: Challenging Large-Scale Multilingual Benchmark for Misinformation Detection
    Barnabo, Giorgio
    Siciliano, Federico
    Castillo, Carlos
    Leonardi, Stefano
    Nakov, Preslav
    Martino, Giovanni Da San
    Silvestri, Fabrizio
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022
  • [24] On the Multilingual Capabilities of Very Large-Scale English Language Models
    Armengol-Estape, Jordi
    de Gibert Bonet, Ona
    Melero, Maite
    LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 3056 - 3068
  • [25] SiDi KWS: A Large-Scale Multilingual Dataset for Keyword Spotting
    Meneses, Michel
    Holanda, Rafael
    Peres, Luis
    Rocha, Gabriela
    INTERSPEECH 2022, 2022, : 4616 - 4620
  • [26] Dialect-to-Standard Normalization: A Large-Scale Multilingual Evaluation
    Kuparinen, Olli
    Miletic, Aleksandra
    Scherrer, Yves
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 13814 - 13828
  • [27] THE DESIGN OF LARGE-SCALE TRAINING GAMES
    HARTLEY, DA
    RITCHIE, GN
    FITZSIMONS, EA
    SIMULATION & GAMING, 1981, 12 (02) : 141 - 152
  • [28] MINION: a Large-Scale and Diverse Dataset for Multilingual Event Detection
    Ben Veyseh, Amir Pouran
    Minh Van Nguyen
    Dernoncourt, Franck
    Thien Huu Nguyen
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 2286 - 2299
  • [29] KNOW: Developing large-scale multilingual technologies for language understanding
    Agirre, Eneko
    Castellon, Irene
    Padro, Lluis
    Climent, Salvador
    Rigau, German
    Alonso, Laura
    Cuadros, Montse
    Coll-Florit, Marta
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2009, (43) : 377 - 378
  • [30] Evaluating large-scale training simulations
    Simpson, H
    Oser, RL
    MILITARY PSYCHOLOGY, 2003, 15 (01) : 25 - 40