DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus

Cited by: 0

Authors:

Bruemmer, Martin [1]

Dojchinovski, Milan [1,2]

Hellmann, Sebastian [1]

Affiliations:

[1] Univ Leipzig, InfAI, AKSW, Leipzig, Germany

[2] Czech Tech Univ, FIT, Web Intelligence Res Grp, Prague, Czech Republic

Source:

LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2016

Funding:

EU Horizon 2020; EU Seventh Framework Programme (FP7);

Keywords:

training; dbpedia; corpus; named entity recognition; named entity linking; nlp;

DOI:

Not available

Chinese Library Classification (CLC):

H [Language, Linguistics];

Discipline classification code:

05;

Abstract:

The ever-increasing importance of machine learning in Natural Language Processing is accompanied by an equally increasing need for large-scale training and evaluation corpora. Owing to its size, openness, and relative quality, Wikipedia has already served as a source of such data, but only on a limited scale. This paper introduces the DBpedia Abstract Corpus, a large-scale, open corpus of annotated Wikipedia texts in six languages, featuring over 11 million texts and over 97 million entity links. It describes the properties of the Wikipedia texts, the corpus creation process, the corpus format, and use cases such as Named Entity Linking training and evaluation.

Pages: 3339-3343

Page count: 5
