A Dataset for Multilingual Epidemiological Event Extraction

Cited by: 0
Authors
Mutuvi, Stephen [1, 2]
Doucet, Antoine [1 ]
Lejeune, Gael [3 ]
Odeo, Moses [2 ]
Affiliations
[1] Univ La Rochelle, L3i Lab, La Rochelle, France
[2] Multimedia Univ Kenya, Nairobi, Kenya
[3] Sorbonne Univ Paris, Paris, France
Funding
EU Horizon 2020
Keywords
Epidemiology; corpus creation; event extraction; classification; multilingual NLP
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
This paper proposes a corpus for the development and evaluation of tools and techniques for identifying emerging infectious disease threats in online news text. The corpus can be used not only for information extraction but also for other natural language processing (NLP) tasks such as text classification. We make use of articles published on the Program for Monitoring Emerging Diseases (PROMED) platform, which provides current information about outbreaks of infectious diseases globally. Among the key pieces of information present in the articles is the uniform resource locator (URL) of the online news source where each outbreak was originally reported. We detail the procedure followed to build the dataset, which includes leveraging the source URLs to retrieve the news reports and subsequently pre-processing the retrieved documents. We also report experimental results of event extraction on the dataset using the Data Analysis for Information Extraction in any Language (DANIEL) system. DANIEL is a multilingual news surveillance system that extracts events by leveraging two attributes characteristic of news reporting: repetition and saliency. The system has wide geographical and language coverage, including low-resource languages. In addition, we compare different classification approaches in terms of their ability to differentiate between the epidemic-related and unrelated news articles that constitute the corpus.
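As a rough illustration of the first step of the dataset-building procedure described in the abstract (collecting the source URLs cited in a post before retrieving the news reports), the sketch below pulls distinct links out of a block of text. It is not the authors' pipeline: the function name `extract_source_urls`, the URL pattern, and the sample text are all hypothetical, and it assumes the source links appear as plain http(s) URLs in the post body.

```python
import re
from typing import List

# Assumption: source links appear as plain http(s) URLs in the post body.
# The pattern stops at whitespace and at common closing delimiters.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_source_urls(post_text: str) -> List[str]:
    """Return the distinct URLs found in a post, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.findall(post_text):
        # Drop sentence punctuation that ended up glued to the link.
        url = match.rstrip(".,;:")
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    # Hypothetical post text, for illustration only.
    sample = (
        "Source: https://example.org/news/1. "
        "Communicated by <https://example.org/news/2>; "
        "see also https://example.org/news/1"
    )
    for url in extract_source_urls(sample):
        print(url)
```

Each returned URL would then be fetched and the retrieved page pre-processed; those steps depend on the individual news sites and are left out here.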
Pages: 4139-4144
Number of pages: 6
Related papers
50 in total
  • [31] JukeBox: A Multilingual Singer Recognition Dataset
    Chowdhury, Anurag
    Cozzo, Austin
    Ross, Arun
    [J]. INTERSPEECH 2020, 2020, : 2267 - 2271
  • [32] Exploiting Multilingual Grammars and Machine Learning Techniques to Build an Event Extraction System for Portuguese
    Zavarella, Vanni
    Tanev, Hristo
    Linge, Jens
    Piskorski, Jakub
    Atkinson, Martin
    Steinberger, Ralf
    [J]. COMPUTATIONAL PROCESSING OF THE PORTUGUESE LANGUAGE, PROCEEDINGS, 2010, 6001 : 21 - 24
  • [33] Introducing ICBe: an event extraction dataset from narratives about international crises
    Douglass, Rex W.
    Scherer, Thomas Leo
    Gannon, J. Andres
    Gartzke, Erik
    Lindsay, Jon
    Carcelli, Shannon
    Wilkenfeld, Jonathan
    Quinn, David M.
    Aiken, Catherine
    Navarro, Jose Miguel Cabezas
    Lund, Neil
    Murauskaite, Egle
    Partridge, Diana
    [J]. POLITICAL SCIENCE RESEARCH AND METHODS, 2024,
  • [34] Chinese Document-Level Emergency Event Extraction Dataset and Corresponding Methods
    Chu, Kongbin
    Yang, Wenzhong
    Wei, Fuyuan
    Shi, Jiangtao
    [J]. APPLIED SCIENCES-BASEL, 2023, 13 (12):
  • [35] IREE: A Fine-Grained Dataset for Chinese Event Extraction in Investment Research
    Ren, Junxiang
    Wang, Sibo
    Song, Ruilin
    Wu, Yuejiao
    Gao, Yizhou
    An, Borong
    Cheng, Zhen
    Xu, Guoqiang
    [J]. KNOWLEDGE GRAPH AND SEMANTIC COMPUTING: KNOWLEDGE GRAPH EMPOWERS THE DIGITAL ECONOMY, CCKS 2022, 2022, 1669 : 205 - 210
  • [36] A Multilingual Handwritten Character Dataset: T-H-E Dataset
    Bartos, Gaye Ediboglu
    Hoscan, Yasar
    Kauer, Andras
    Hajnal, Eva
    [J]. ACTA POLYTECHNICA HUNGARICA, 2020, 17 (09) : 141 - 160
  • [37] An Annotated Multilingual Dataset to Study Modality in the Gospels
    Bermudez-Sabel, Helena
    Dell'Oro, Francesca
    [J]. DIGITAL HUMANITIES QUARTERLY, 2024, 18 (01): : 1 - 16
  • [38] XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning
    Ponti, Edoardo M.
    Glavas, Goran
    Majewska, Olga
    Liu, Qianchu
    Vulic, Ivan
    Korhonen, Anna
    [J]. PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 2362 - 2376
  • [39] A Multilingual Evaluation Dataset for Monolingual Word Sense Alignment
    Ahmadi, Sina
    McCrae, John P.
    Nimb, Sanni
    Khan, Fahad
    Monachini, Monica
    Pedersen, Bolette S.
    Declerck, Thierry
    Wissik, Tanja
    Bellandi, Andrea
    Pisani, Irene
    Troelsgard, Thomas
    Olsen, Sussi
    Krek, Simon
    Lipp, Veronika
    Varadi, Tamas
    Simon, Laszlo
    Gyorffy, Andras
    Tiberius, Carole
    Schoonheim, Tanneke
    Ben Moshe, Yifat
    Rudich, Maya
    Abu Ahmad, Raya
    Lonke, Dorielle
    Kovalenko, Kira
    Langemets, Margit
    Kallas, Jelena
    Dereza, Oksana
    Fransen, Theodorus
    Cillessen, David
    Lindemann, David
    Alonso, Mikel
    Salgado, Ana
    Sancho, Jose Luis
    Urena-Ruiz, Rafael-J
    Porta Zamorano, Jordi
    Simov, Kiril
    Osenova, Petya
    Kancheva, Zara
    Radev, Ivaylo
    Stankovic, Ranka
    Perdih, Andrej
    Gabrovsek, Dejan
    [J]. PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 3232 - 3242
  • [40] A new dataset for French and multilingual keyphrase generation
    Piedboeuf, Frederic
    Langlais, Philippe
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,