Automatic Metadata Information Extraction from Scientific Literature using Deep Neural Networks

Cited by: 0
Authors
Yang, Huichen [1]
Hsu, William [1]
Affiliations
[1] Kansas State Univ, Manhattan, KS 66506 USA
Keywords
Metadata Information Extraction; Document Layout Detection; Text Recognition; Transfer Learning; Deep Learning;
DOI
10.1117/12.2623554
Chinese Library Classification
O43 [Optics];
Discipline classification codes
070207; 0803;
Abstract
We present a novel computer vision-based deep learning approach for metadata extraction, intended as both a central component of and an ancillary aid to structured information extraction from scientific literature, which appears in a variety of formats. The number of scientific publications is growing rapidly, but existing methods cannot efficiently combine layout extraction and text recognition because of the varied formats used by scientific literature publishers. In this paper, we introduce an end-to-end trainable neural network that segments and labels the main regions of scientific documents while simultaneously recognizing text in the detected regions. The proposed framework combines object detection techniques based on Recurrent Convolutional Neural Network (RCNN) for scientific document layout detection with Convolutional Recurrent Neural Network (CRNN) for text recognition. We also contribute a novel data set of main-region annotations for scientific literature metadata extraction, to help address the limited availability of high-quality data sets. The final outputs of the network are the text content (payload) and the corresponding labels of the major regions. Our results show that our model outperforms state-of-the-field baselines.
Pages: 8
Related papers
50 in total
  • [1] CERMINE - automatic extraction of metadata and references from scientific literature
    Tkaczyk, Dominika
    Szostek, Pawel
    Dendek, Piotr Jan
    Fedoryszak, Mateusz
    Bolikowski, Lukasz
    [J]. 2014 11TH IAPR INTERNATIONAL WORKSHOP ON DOCUMENT ANALYSIS SYSTEMS (DAS 2014), 2014, : 217 - 221
  • [2] CERMINE: automatic extraction of structured metadata from scientific literature
    Dominika Tkaczyk
    Paweł Szostek
    Mateusz Fedoryszak
    Piotr Jan Dendek
    Łukasz Bolikowski
    [J]. International Journal on Document Analysis and Recognition (IJDAR), 2015, 18 : 317 - 335
  • [3] CERMINE: automatic extraction of structured metadata from scientific literature
    Tkaczyk, Dominika
    Szostek, Pawel
    Fedoryszak, Mateusz
    Dendek, Piotr Jan
    Bolikowski, Lukasz
    [J]. INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION, 2015, 18 (04) : 317 - 335
  • [4] Automatic Document Metadata Extraction Based on Deep Networks
    Liu, Runtao
    Gao, Liangcai
    An, Dong
    Jiang, Zhuoren
    Tang, Zhi
    [J]. NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, NLPCC 2017, 2018, 10619 : 305 - 317
  • [6] Automatic Subject Classification of Scientific Literature Using Citation Metadata
    Mahdi, Abdulhussain E.
    Joorabchi, Arash
    [J]. DIGITAL ENTERPRISE AND INFORMATION SYSTEMS, 2011, 194 : 545 - 559
  • [7] Automatic extraction of metadata from scientific publications for CRIS systems
    Kovacevic, Aleksandar
    Ivanovic, Dragan
    Milosavljevic, Branko
    Konjovic, Zora
    Surla, Dusan
    [J]. PROGRAM-ELECTRONIC LIBRARY AND INFORMATION SYSTEMS, 2011, 45 (04) : 376 - 396
  • [8] Automatic Extraction of Causal Relations from Text using Linguistically Informed Deep Neural Networks
    Dasgupta, Tirthankar
    Saha, Rupsa
    Dey, Lipika
    Naskar, Abir
    [J]. 19TH ANNUAL MEETING OF THE SPECIAL INTEREST GROUP ON DISCOURSE AND DIALOGUE (SIGDIAL 2018), 2018, : 306 - 316
  • [9] Deep neural networks for Arabic information extraction
    Saadi, Abdelhalim
    Belhadef, Hacene
    [J]. SMART AND SUSTAINABLE BUILT ENVIRONMENT, 2020, 9 (04) : 467 - 482
  • [10] Scientific Literature Metadata Extraction Based on HMM
    Cui, Binge
    [J]. COOPERATIVE DESIGN, VISUALIZATION, AND ENGINEERING, PROCEEDINGS, 2009, 5738 : 64 - 68