DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus

Authors
Bruemmer, Martin [1]
Dojchinovski, Milan [1, 2]
Hellmann, Sebastian [1]
Affiliations
[1] Univ Leipzig, InfAI, AKSW, Leipzig, Germany
[2] Czech Tech Univ, FIT, Web Intelligence Res Grp, Prague, Czech Republic
Funding
EU Horizon 2020; EU Seventh Framework Programme
Keywords
training; DBpedia; corpus; named entity recognition; named entity linking; NLP
DOI
Not available
Chinese Library Classification number
H [Language and writing]
Discipline classification code
05
Abstract
The ever-increasing importance of machine learning in Natural Language Processing is accompanied by an equally increasing need for large-scale training and evaluation corpora. Owing to its size, openness, and relative quality, Wikipedia has already served as a source of such data, but only on a limited scale. This paper introduces the DBpedia Abstract Corpus, a large-scale, open corpus of annotated Wikipedia texts in six languages, featuring over 11 million texts and over 97 million entity links. The paper describes the properties of the Wikipedia texts, the corpus creation process, its format, and use cases such as training and evaluating Named Entity Linking systems.
Pages: 3339-3343
Page count: 5