Extracting Figures and Captions from Scientific Publications

Cited by: 5

Authors:

Li, Pengyuan [1]

Jiang, Xiangying [1]

Shatkay, Hagit [1]

Affiliations:

[1] Univ Delaware, Dept Comp & Informat Sci, Newark, DE 19716 USA

Source:

CIKM'18: PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT | 2018

Keywords:

Data extraction; scientific document analysis; figure extraction; caption extraction; PDF parsing; CANCER;

DOI:

10.1145/3269206.3269265

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Subject classification code:

0812 ;

Abstract:

Figures and captions convey essential information in scientific publications. As such, there is a growing interest in mining published figures and in utilizing their respective captions as a source of knowledge. There is also much interest in image captioning systems that can automatically generate captions for images, whose training requires large datasets of image-caption pairs. Notably, the first fundamental step of obtaining figures and captions from publications is neither well-studied nor yet well-addressed. In this paper, we introduce a new and effective system for figure and caption extraction, PDFigCapX. Unlike current methods that extract figures by handling raw encoded contents of PDF documents, we separate text from graphical contents and utilize layout information to detect and disambiguate figures and captions. Files containing the figures and their associated captions are then produced as output to the end-user. We test PDFigCapX on both a previously used generic dataset and on two new sets of publications within the biomedical domain. Our experiments and results show a significant improvement in performance compared to the state-of-the-art, and demonstrate the effectiveness of our approach. Our system will be available for use at: https://www.eecis.udel.edu/~compbio/PDFigCapX.
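
The abstract describes separating text from graphical content and using layout information to associate figures with captions, but this record does not give the details of PDFigCapX itself. The following is only a minimal sketch of that general idea, assuming the graphical regions and text blocks of a page have already been obtained as bounding boxes by some PDF parser. The `Box` type, the caption pattern, and the nearest-region rule are all illustrative assumptions, not the authors' method.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float


# Assumed caption cue: text that starts with "Fig", "Fig." or "Figure" plus a number.
CAPTION_START = re.compile(r"^fig(?:ure)?\.?\s*\d+", re.IGNORECASE)


def horizontal_overlap(a: Box, b: Box) -> float:
    """Width of the shared horizontal extent of two boxes (0 if none)."""
    return max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))


def vertical_gap(a: Box, b: Box) -> float:
    """Vertical distance between two boxes (0 if they overlap vertically)."""
    return max(0.0, max(a.y0, b.y0) - min(a.y1, b.y1))


def pair_figures_with_captions(graphics, text_blocks):
    """Pair each caption-like text block with the nearest graphical region.

    graphics: list of Box for graphical regions on one page.
    text_blocks: list of (Box, str) for text regions on the same page.
    Returns a list of (figure_box, caption_text) pairs.
    """
    captions = [(box, text) for box, text in text_blocks
                if CAPTION_START.match(text.strip())]
    unused = list(graphics)
    pairs = []
    for cap_box, cap_text in captions:
        # Only consider regions that share some horizontal extent with the caption.
        candidates = [g for g in unused if horizontal_overlap(g, cap_box) > 0]
        if not candidates:
            continue
        best = min(candidates, key=lambda g: vertical_gap(g, cap_box))
        unused.remove(best)  # each graphical region is used at most once
        pairs.append((best, cap_text))
    return pairs


# Made-up page layout for illustration.
graphics = [Box(50, 100, 300, 300), Box(50, 450, 300, 650)]
text_blocks = [
    (Box(50, 310, 300, 340), "Figure 1. Overview"),
    (Box(50, 660, 300, 690), "Fig. 2: Results"),
    (Box(50, 700, 300, 900), "Body text mentioning Figure 1."),
]
pairs = pair_figures_with_captions(graphics, text_blocks)
```

In this toy layout the body-text block is ignored because it does not start with a figure cue, and each caption is matched to the graphical region directly adjacent to it. A real system would also need to handle multi-column pages, captions placed beside figures, and figures composed of several graphical objects.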


Pages: 1595 - 1598

Page count: 4

50 items in total

[1] SCICAP: Generating Captions for Scientific Figures
Hsu, Ting-Yao
Giles, C. Lee
Huang, Ting-Hao 'Kenneth'
[J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 3258 - 3264
[2] Look, Read and Enrich Learning from Scientific Figures and their Captions
Manuel Gomez-Perez, Jose
Ortega, Raul
[J]. PROCEEDINGS OF THE 10TH INTERNATIONAL CONFERENCE ON KNOWLEDGE CAPTURE (K-CAP '19), 2019, : 101 - 108
[3] FCENet: An Instance Segmentation Model for Extracting Figures and Captions From Material Documents
Liu, Yingli
Si, Changkai
Jin, Kai
Shen, Tao
Hu, Meng
[J]. IEEE ACCESS, 2021, 9 : 551 - 564
[4] An Automatic System for Extracting Figures and Captions in Biomedical PDF Documents
Lopez, Luis D.
Yu, Jingyi
Arighi, Cecilia N.
Huang, Hongzhan
Shatkay, Hagit
Wu, Cathy
[J]. 2011 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM 2011), 2011, : 578 - 581
[5] Figures in Scientific Open Access Publications
Sohmen, Lucia
Charbonnier, Jean
Bluemel, Ina
Wartena, Christian
Heller, Lambert
[J]. DIGITAL LIBRARIES FOR OPEN KNOWLEDGE, TPDL 2018, 2018, 11057 : 220 - 226
[6] MACHINE-DRAWN FIGURES IN SCIENTIFIC PUBLICATIONS
SEDLACEK, J
[J]. CHEMICKE LISTY, 1985, 79 (06) : 649 - 657
[7] CAPTIONS FOR TABLES AND FIGURES
HARTLEY, J
[J]. BRITISH JOURNAL OF EDUCATIONAL TECHNOLOGY, 1991, 22 (02) : 149 - 150
[8] Extracting Scientific Figures with Distantly Supervised Neural Networks
Siegel, Noah
Lourie, Nicholas
Power, Russell
Ammar, Waleed
[J]. JCDL'18: PROCEEDINGS OF THE 18TH ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES, 2018, : 223 - 232
[9] Automatic Extraction of Figures from Scientific Publications in High-Energy Physics
Praczyk, Piotr Adam
Nogueras-Iso, Javier
Mele, Salvatore
[J]. INFORMATION TECHNOLOGY AND LIBRARIES, 2013, 32 (04) : 25 - 52
[10] Machine Learning Techniques for Automatically Extracting Contextual Information from Scientific Publications
Klampfl, Stefan
Kern, Roman
[J]. SEMANTIC WEB EVALUATION CHALLENGES, 2015, 548 : 105 - 116
