Machine extraction of polymer data from tables using XML versions of scientific articles

Cited by: 2
Authors
Oka, Hiroyuki [1]
Yoshizawa, Atsushi [1]
Shindo, Hiroyuki [2, 3]
Matsumoto, Yuji [3]
Ishii, Masashi [1]
Affiliations
[1] Natl Inst Mat Sci NIMS, Res & Serv Div Mat Data & Integrated Syst, 1-1 Namiki, Tsukuba, Ibaraki 3050044, Japan
[2] Nara Inst Sci & Technol NAIST, Div Informat Sci, Ikoma, Nara, Japan
[3] RIKEN, Ctr Adv Intelligence Project, Chuo Ku, Tokyo, Japan
Keywords
Machine extraction; polymer data; table; XML; informatics
DOI
10.1080/27660400.2021.1899456
Chinese Library Classification number
T [Industrial Technology]
Subject classification code
08
Abstract
In this study, we examined machine extraction of polymer data from tables in scientific articles. The extraction system consists of five processes: table extraction, data formatting, polymer name recognition, property specifier identification, and data extraction. Tables were first extracted in plain text. XML versions of scientific articles were used, and the tabular forms were accurately extracted, even for complicated tables, such as multi-column, multi-row, and merged tables. Polymer name recognition was performed using a named entity recognizer created by deep neural network learning of polymer names. The preparation cost of the training data was reduced using a rule-based algorithm. The target polymer properties in this study were glass transition temperature (T_g), melting temperature (T_m), and decomposition temperature (T_d), and the specifiers were identified using partial string matching. Through these five processes, 2,181 data points for T_g, 1,526 for T_m, and 2,316 for T_d were extracted from approximately 18,000 scientific articles published by Elsevier. Nearly half of them were extracted from complicated tables. The F-scores for the extraction were 0.871, 0.870, and 0.841, respectively. These results indicate that the extraction system created in this study can rapidly and accurately collect large amounts of polymer data from tables in scientific literature.
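To make the table-handling steps in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a simplified HTML-like table markup with `tr`, `th`/`td`, `rowspan`, and `colspan`, which may differ from the publisher's actual XML schema. The specifier substrings in `SPECIFIERS`, the function names, and the sample values are all hypothetical. The neural polymer name recognizer is not reproduced; the sketch simply assumes the first column holds the polymer name.

```python
import xml.etree.ElementTree as ET

# Hypothetical substrings for each property specifier (not the paper's actual list).
SPECIFIERS = {
    "Tg": ("tg", "t_g", "glass transition"),
    "Tm": ("tm", "t_m", "melting"),
    "Td": ("td", "t_d", "decomposition"),
}


def expand_table(xml_text):
    """Parse a table and copy each merged cell into every grid slot it spans."""
    root = ET.fromstring(xml_text)
    grid = {}
    for r, row in enumerate(root.iter("tr")):
        c = 0
        for cell in row:
            while (r, c) in grid:  # slot already filled by a rowspan from above
                c += 1
            text = "".join(cell.itertext()).strip()
            rowspan = int(cell.get("rowspan", 1))
            colspan = int(cell.get("colspan", 1))
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def identify_property(header):
    """Return the property whose specifier occurs in the header, else None."""
    lowered = header.lower()
    for prop, patterns in SPECIFIERS.items():
        if any(p in lowered for p in patterns):
            return prop
    return None


def extract(grid, header_row):
    """Collect (name, property, value) triples from the rows below header_row."""
    found = []
    for c, header in enumerate(grid[header_row]):
        prop = identify_property(header)
        if prop is None:
            continue
        for row in grid[header_row + 1:]:
            try:
                found.append((row[0], prop, float(row[c])))
            except ValueError:
                pass  # skip non-numeric cells such as "-" or "n.d."
    return found


# Made-up sample table with a multi-row header and merged cells.
SAMPLE = """<table>
<tr><th rowspan="2">Polymer</th><th colspan="2">Thermal properties</th></tr>
<tr><th>Tg (K)</th><th>Tm (K)</th></tr>
<tr><td>Polymer A</td><td>373</td><td>473</td></tr>
</table>"""

grid = expand_table(SAMPLE)
records = extract(grid, header_row=1)
```

Plain substring matching is deliberately loose: a header such as "Treatment" also contains "tm", so a practical system would need stricter patterns or additional checks on units and values.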
Pages: 12-23
Page count: 12