Extracting Figures and Captions from Scientific Publications

Cited by: 5
Authors
Li, Pengyuan [1]
Jiang, Xiangying [1]
Shatkay, Hagit [1]
Affiliations
[1] Univ Delaware, Dept Comp & Informat Sci, Newark, DE 19716 USA
Keywords
Data extraction; scientific document analysis; figure extraction; caption extraction; PDF parsing; CANCER;
DOI
10.1145/3269206.3269265
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Figures and captions convey essential information in scientific publications. As such, there is a growing interest in mining published figures and in utilizing their respective captions as a source of knowledge. There is also much interest in image captioning systems that can automatically generate captions for images, whose training requires large datasets of image-caption pairs. Notably, the first fundamental step of obtaining figures and captions from publications is neither well-studied nor yet well-addressed. In this paper, we introduce a new and effective system for figure and caption extraction, PDFigCapX. Unlike current methods that extract figures by handling raw encoded contents of PDF documents, we separate text from graphical contents and utilize layout information to detect and disambiguate figures and captions. Files containing the figures and their associated captions are then produced as output to the end-user. We test PDFigCapX on both a previously used generic dataset and on two new sets of publications within the biomedical domain. Our experiments and results show a significant improvement in performance compared to the state-of-the-art, and demonstrate the effectiveness of our approach. Our system will be available for use at: https://www.eecis.udel.edu/~compbio/PDFigCapX.
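The abstract's core idea, keeping text apart from graphical content and then using page layout to match the two, can be illustrated with a small sketch. This is not the PDFigCapX implementation: the `Box` and `TextBlock` types, the caption pattern, the `max_gap` threshold, and the nearest-region rule are all assumptions made for illustration. It supposes that some earlier step has already produced bounding boxes for graphical regions and for text blocks on a page.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class TextBlock:
    """A block of text with its position on the page (hypothetical type)."""
    box: Box
    text: str


# Assumed caption cue: a block that starts with "Figure N" or "Fig. N".
CAPTION_START = re.compile(r"^(figure|fig\.?)\s*\d+", re.IGNORECASE)


def horizontal_overlap(a: Box, b: Box) -> float:
    """Width shared by two boxes along the x axis (0 if none)."""
    return max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))


def vertical_gap(a: Box, b: Box) -> float:
    """Distance between two boxes along the y axis (0 if they overlap)."""
    return max(0.0, max(a.y0, b.y0) - min(a.y1, b.y1))


def pair_figures_with_captions(
    graphics: List[Box],
    text_blocks: List[TextBlock],
    max_gap: float = 40.0,  # assumed threshold, in page units
) -> List[Tuple[Box, TextBlock]]:
    """Match each caption-like text block to the closest unused graphic
    region that shares horizontal extent with it."""
    captions = [t for t in text_blocks if CAPTION_START.match(t.text.strip())]
    pairs: List[Tuple[Box, TextBlock]] = []
    used = set()
    for cap in captions:
        best, best_gap = None, max_gap
        for i, region in enumerate(graphics):
            if i in used or horizontal_overlap(region, cap.box) <= 0:
                continue
            gap = vertical_gap(region, cap.box)
            if gap <= best_gap:
                best, best_gap = i, gap
        if best is not None:
            used.add(best)
            pairs.append((graphics[best], cap))
    return pairs
```

In this sketch, body text that merely mentions a figure is ignored because it does not begin with the caption cue, and a caption is left unmatched when no graphic region lies within `max_gap` of it. A real system would need to handle multi-column layouts, captions placed beside figures, and figures split into several graphical objects.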
Pages: 1595-1598
Page count: 4
Related papers
50 in total
  • [1] SCICAP: Generating Captions for Scientific Figures
    Hsu, Ting-Yao
    Giles, C. Lee
    Huang, Ting-Hao 'Kenneth'
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 3258 - 3264
  • [2] Look, Read and Enrich Learning from Scientific Figures and their Captions
    Manuel Gomez-Perez, Jose
    Ortega, Raul
    [J]. PROCEEDINGS OF THE 10TH INTERNATIONAL CONFERENCE ON KNOWLEDGE CAPTURE (K-CAP '19), 2019, : 101 - 108
  • [3] FCENet: An Instance Segmentation Model for Extracting Figures and Captions From Material Documents
    Liu, Yingli
    Si, Changkai
    Jin, Kai
    Shen, Tao
    Hu, Meng
    [J]. IEEE ACCESS, 2021, 9 : 551 - 564
  • [4] An Automatic System for Extracting Figures and Captions in Biomedical PDF Documents
    Lopez, Luis D.
    Yu, Jingyi
    Arighi, Cecilia N.
    Huang, Hongzhan
    Shatkay, Hagit
    Wu, Cathy
    [J]. 2011 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM 2011), 2011, : 578 - 581
  • [5] Figures in Scientific Open Access Publications
    Sohmen, Lucia
    Charbonnier, Jean
    Bluemel, Ina
    Wartena, Christian
    Heller, Lambert
    [J]. DIGITAL LIBRARIES FOR OPEN KNOWLEDGE, TPDL 2018, 2018, 11057 : 220 - 226
  • [6] MACHINE-DRAWN FIGURES IN SCIENTIFIC PUBLICATIONS
    SEDLACEK, J
    [J]. CHEMICKE LISTY, 1985, 79 (06): : 649 - 657
  • [7] CAPTIONS FOR TABLES AND FIGURES
    HARTLEY, J
    [J]. BRITISH JOURNAL OF EDUCATIONAL TECHNOLOGY, 1991, 22 (02) : 149 - 150
  • [8] Extracting Scientific Figures with Distantly Supervised Neural Networks
    Siegel, Noah
    Lourie, Nicholas
    Power, Russell
    Ammar, Waleed
    [J]. JCDL'18: PROCEEDINGS OF THE 18TH ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES, 2018, : 223 - 232
  • [9] Automatic Extraction of Figures from Scientific Publications in High-Energy Physics
    Praczyk, Piotr Adam
    Nogueras-Iso, Javier
    Mele, Salvatore
    [J]. INFORMATION TECHNOLOGY AND LIBRARIES, 2013, 32 (04) : 25 - 52
  • [10] Machine Learning Techniques for Automatically Extracting Contextual Information from Scientific Publications
    Klampfl, Stefan
    Kern, Roman
    [J]. SEMANTIC WEB EVALUATION CHALLENGES, 2015, 548 : 105 - 116