Selecting Documents Relevant for Chemistry as a Classification Problem

Authors
Zhu, Zhemin [1]
Akhondi, Saber A. [1]
Nandal, Umesh [1]
Doornenbal, Marius [1]
Gregory, Michelle [1]
Affiliations
[1] Elsevier, Radarweg 29, NL-1043 NX Amsterdam, Netherlands
Keywords
Natural language processing; Document classification; Machine learning; Cheminformatics; Information
DOI
10.1007/978-3-319-58694-6_31
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present a first version of a system for selecting chemical publications for inclusion in a chemistry information database. This database, Reaxys (https://www.elsevier.com/solutions/reaxys), is a portal for retrieving structured chemistry information from published journals and patents. The task poses three challenges: (i) training and input data are highly imbalanced; (ii) high recall (>= 95%) is desired; and (iii) the data offered for selection is numerically massive but, at the same time, incomplete. Our system handles the imbalance with undersampling and achieves relatively high recall by using chemical named entities as features. Experiments on a real-world data set of 15,822 documents show that chemical named entity features boost recall by 8% over the n-gram features widely used in general document classification. To foster research on this challenging topic, part of the data set compiled for this paper is available on request.
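The abstract names two ideas that a short example can make concrete: undersampling to balance the classes, and recall as the target metric. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation: it uses plain random undersampling (the abstract does not say which variant was used), and the function names `undersample` and `recall`, the label strings, and the toy documents are all hypothetical.

```python
import random
from collections import Counter


def undersample(docs, labels, seed=0):
    """Randomly drop examples until every class has as many
    examples as the smallest class (random undersampling)."""
    rng = random.Random(seed)
    by_label = {}
    for doc, label in zip(docs, labels):
        by_label.setdefault(label, []).append(doc)
    # Size of the minority class: every class is cut down to this.
    target = min(len(group) for group in by_label.values())
    balanced = []
    for label, group in by_label.items():
        for doc in rng.sample(group, target):
            balanced.append((doc, label))
    rng.shuffle(balanced)
    return balanced


def recall(y_true, y_pred, positive="relevant"):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data (hypothetical): 2 relevant documents among 10.
docs = [f"doc{i}" for i in range(10)]
labels = ["relevant"] * 2 + ["irrelevant"] * 8
balanced = undersample(docs, labels)
counts = Counter(label for _, label in balanced)
```

After undersampling, `counts` holds the same number of examples for each class, so a classifier trained on `balanced` is not dominated by the majority class. The `recall` helper can then be used to check a prediction run against the 95% target stated in the abstract.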
Pages: 198-201
Number of pages: 4