A Dataset for Multilingual Epidemiological Event Extraction

Cited by: 0

Authors:

Mutuvi, Stephen [1,2]

Doucet, Antoine [1]

Lejeune, Gael [3]

Odeo, Moses [2]

Affiliations:

[1] Univ La Rochelle, L3i Lab, La Rochelle, France

[2] Multimedia Univ Kenya, Nairobi, Kenya

[3] Sorbonne Univ Paris, Paris, France

Source:

PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020) | 2020

Funding:

EU Horizon 2020;

Keywords:

Epidemiology; corpus creation; event extraction; classification; multilingual NLP;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Subject classification code:

081203 ; 0835 ;

Abstract:

This paper proposes a corpus for the development and evaluation of tools and techniques for identifying emerging infectious disease threats in online news text. The corpus can be used not only for information extraction but also for other natural language processing (NLP) tasks such as text classification. We make use of articles published on the Program for Monitoring Emerging Diseases (PROMED) platform, which provides current information about outbreaks of infectious diseases globally. Among the key pieces of information present in the articles is the uniform resource locator (URL) of the online news source where each outbreak was originally reported. We detail the procedure followed to build the dataset, which includes leveraging the source URLs to retrieve the news reports and subsequently pre-processing the retrieved documents. We also report experimental results of event extraction on the dataset using the Data Analysis for Information Extraction in any Language (DANIEL) system. DANIEL is a multilingual news surveillance system that leverages two attributes characteristic of news reporting to extract events: repetition and saliency. The system has wide geographical and language coverage, including low-resource languages. In addition, we compare different classification approaches in terms of their ability to differentiate between the epidemic-related and unrelated news articles that constitute the corpus.
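The URL-harvesting step mentioned in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' procedure: the regular expression, the function name `extract_source_urls`, and the sample post text are all hypothetical, and the abstract does not specify how the URLs were actually parsed.

```python
import re

# Hypothetical pattern (an assumption, not from the paper): match http(s) URLs
# up to whitespace or a common closing delimiter such as '>' or ')'.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_source_urls(post_text):
    """Return the unique URLs found in post_text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(post_text):
        # Drop sentence punctuation that often trails a URL in running text.
        url = match.group(0).rstrip(".,;:")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Made-up text that only loosely imitates an outbreak report citing its source.
sample_post = (
    "Source: Example News [edited]\n"
    "<https://news.example.org/outbreak-report>\n"
    "See also https://news.example.org/outbreak-report, and "
    "http://health.example.com/update.html."
)
source_urls = extract_source_urls(sample_post)
```

The resulting list would then be passed to whatever retrieval and pre-processing steps are used to fetch and clean the original news reports.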

Pages: 4139-4144

Page count: 6
