Selecting Documents Relevant for Chemistry as a Classification Problem

Authors
Zhu, Zhemin [1 ]
Akhondi, Saber A. [1 ]
Nandal, Umesh [1 ]
Doornenbal, Marius [1 ]
Gregory, Michelle [1 ]
Affiliation
[1] Elsevier, Radarweg 29, NL-1043 NX Amsterdam, Netherlands
Keywords
Natural language processing; Document classification; Machine learning; Cheminformatics; Information
DOI
10.1007/978-3-319-58694-6_31
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present a first version of a system for selecting chemical publications for inclusion in a chemistry information database. This database, Reaxys (https://www.elsevier.com/solutions/reaxys), is a portal for retrieving structured chemistry information from published journals and patents. The task poses three challenges: (i) training and input data are highly imbalanced; (ii) high recall (>= 95%) is desired; and (iii) the data offered for selection is massive in volume yet incomplete. Our system handles the imbalance with undersampling and achieves relatively high recall by using chemical named entities as features. Experiments on a real-world data set of 15,822 documents show that chemical named entity features boost recall by 8% over the n-gram features widely used in general document classification. To foster research on this challenging topic, part of the data set compiled for this paper is available on request.
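The abstract names undersampling as the remedy for class imbalance but gives no implementation details. The following is a minimal sketch of random undersampling for a binary task, not the authors' method: the function name `undersample`, the `ratio` and `seed` parameters, and the toy data are all assumptions made for illustration.

```python
import random
from collections import Counter


def undersample(docs, labels, ratio=1.0, seed=0):
    """Randomly drop majority-class items (hypothetical helper, binary labels).

    Keeps every minority-class item and at most `ratio` times as many
    majority-class items, chosen uniformly at random.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    # The minority class is the label with the fewest examples.
    minority_label = min(counts, key=counts.get)
    n_keep = int(counts[minority_label] * ratio)

    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    majority_idx = [i for i, y in enumerate(labels) if y != minority_label]

    # Sample without replacement; never ask for more items than exist.
    kept_majority = rng.sample(majority_idx, min(n_keep, len(majority_idx)))

    # Restore the original document order.
    keep = sorted(minority_idx + kept_majority)
    return [docs[i] for i in keep], [labels[i] for i in keep]


if __name__ == "__main__":
    # Toy data (assumed): 10 relevant documents among 100.
    toy_docs = [f"doc{i}" for i in range(100)]
    toy_labels = [1] * 10 + [0] * 90
    balanced_docs, balanced_labels = undersample(toy_docs, toy_labels)
    print(Counter(balanced_labels))
```

Undersampling is normally applied to the training split only; the evaluation data should keep its original class distribution so that the measured recall reflects the imbalance seen in practice.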
Pages: 198-201
Number of pages: 4