Machine extraction of polymer data from tables using XML versions of scientific articles

Cited by: 2
Authors
Oka, Hiroyuki [1 ]
Yoshizawa, Atsushi [1 ]
Shindo, Hiroyuki [2 ,3 ]
Matsumoto, Yuji [3 ]
Ishii, Masashi [1 ]
Affiliations
[1] Natl Inst Mat Sci NIMS, Res & Serv Div Mat Data & Integrated Syst, 1-1 Namiki, Tsukuba, Ibaraki 3050044, Japan
[2] Nara Inst Sci & Technol NAIST, Div Informat Sci, Ikoma, Nara, Japan
[3] RIKEN, Ctr Adv Intelligence Project, Chuo Ku, Tokyo, Japan
Keywords
Machine extraction; polymer data; table; XML; informatics;
DOI
10.1080/27660400.2021.1899456
Chinese Library Classification number
T [Industrial Technology];
Discipline classification code
08;
Abstract
In this study, we examined machine extraction of polymer data from tables in scientific articles. The extraction system consists of five processes: table extraction, data formatting, polymer name recognition, property specifier identification, and data extraction. Tables were first extracted in plain text. XML versions of scientific articles were used, and the tabular forms were accurately extracted, even for complicated tables, such as multi-column, multi-row, and merged tables. Polymer name recognition was performed using a named entity recognizer created by deep neural network learning of polymer names. The preparation cost of the training data was reduced using a rule-based algorithm. The target polymer properties in this study were glass transition temperature (Tg), melting temperature (Tm), and decomposition temperature (Td), and the specifiers were identified using partial string matching. Through these five processes, 2,181 data points for Tg, 1,526 for Tm, and 2,316 for Td were extracted from approximately 18,000 scientific articles published by Elsevier. Nearly half of them were extracted from complicated tables. The F-scores for the extraction were 0.871, 0.870, and 0.841, respectively. These results indicate that the extraction system created in this study can rapidly and accurately collect large amounts of polymer data from tables in scientific literature.
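As a rough illustration of the last two processes named in the abstract (property specifier identification by partial string matching, followed by data extraction), the sketch below maps table headers to Tg, Tm, or Td and pulls numeric values from the matching columns. It is not the authors' implementation. The pattern list, the function names `identify_specifier` and `extract_records`, the sample table, and the assumption that the polymer name sits in the first column are all hypothetical; the paper recognizes polymer names with a neural named entity recognizer, which is not reproduced here.

```python
import re

# Hypothetical specifier prefixes; the abstract does not list the strings actually used.
SPECIFIER_PREFIXES = {
    "Tg": ("tg", "glass"),
    "Tm": ("tm", "melt"),
    "Td": ("td", "decomp"),
}

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def identify_specifier(header):
    """Return 'Tg', 'Tm', 'Td', or None using prefix (partial) matching on header tokens."""
    text = header.lower().replace("_", "")      # "T_g" -> "tg"
    tokens = re.sub(r"[^a-z0-9]", " ", text).split()
    for prop, prefixes in SPECIFIER_PREFIXES.items():
        if any(tok.startswith(p) for tok in tokens for p in prefixes):
            return prop
    return None


def extract_records(table):
    """Extract (polymer, property, value) tuples from a table given as a list of rows.

    Assumption: the first row is the header and the first column holds the polymer name.
    """
    header, *rows = table
    props = [identify_specifier(h) for h in header]
    records = []
    for row in rows:
        for col, prop in enumerate(props):
            if col == 0 or prop is None:
                continue
            match = NUMBER.search(row[col])
            if match:                           # skip cells such as "n.d."
                records.append((row[0], prop, float(match.group())))
    return records


if __name__ == "__main__":
    sample = [
        ["Polymer", "T_g (°C)", "Tm/°C", "Yield (%)"],
        ["PET", "78", "255", "90"],
        ["PLA", "60.5", "n.d.", "85"],
    ]
    for record in extract_records(sample):
        print(record)
```

Prefix matching on tokens is used here instead of a raw substring test because a plain substring check for "tm" would also fire on words such as "treatment". Short prefixes can still misfire on abbreviations that begin with the same letters, so a real system would need a curated pattern list and unit checks.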
Pages: 12-23
Number of pages: 12
Related papers
50 in total
  • [31] Automated extraction of data from text using an XML parser: An earth science example using fossil descriptions
    Curry, Gordon B.
    Connor, Richard C. H.
    GEOSPHERE, 2008, 4 (01): : 159 - 169
  • [32] Improving Scientific Data Extraction using Metadata Classification
    Chang, Yue Shan
    Lai, Hsuan-Jen
    Cheng, Hsiang-Tai
    2009 10TH INTERNATIONAL SYMPOSIUM ON PERVASIVE SYSTEMS, ALGORITHMS, AND NETWORKS (ISPAN 2009), 2009, : 669 - +
  • [33] A scientific data extraction architecture using classified metadata
    Yue-Shan Chang
    Hsiang-Tai Cheng
    The Journal of Supercomputing, 2012, 60 : 338 - 359
  • [34] A scientific data extraction architecture using classified metadata
    Chang, Yue-Shan
    Cheng, Hsiang-Tai
    JOURNAL OF SUPERCOMPUTING, 2012, 60 (03): : 338 - 359
  • [35] Automating the extraction of data from HTML tables with unknown structure
    Embley, DW
    Tao, C
    Liddle, SW
    DATA & KNOWLEDGE ENGINEERING, 2005, 54 (01) : 3 - 28
  • [36] Development of a data-driven scientific methodology: From articles to chemometric data products
    Carballo-Meilan, Ara
    McDonald, Lewis
    Pragot, Wanawan
    Starnawski, Lukasz Michal
    Saleemi, Ali Nauman
    Afzal, Waheed
    CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, 2022, 225
  • [37] Extraction of meaningful tables from the Internet using decision trees
    Jung, SW
    Lee, WH
    Park, SK
    Kwon, HC
    DEVELOPMENTS IN APPLIED ARTIFICIAL INTELLIGENCE, 2003, 2718 : 176 - 186
  • [38] Road Extraction from Lidar Data Using Support Vector Machine Classification
    Matkan, Ali Akbar
    Hajeb, Mohammad
    Sadeghian, Saeed
    PHOTOGRAMMETRIC ENGINEERING AND REMOTE SENSING, 2014, 80 (05): : 409 - 422
  • [39] Automatic Data Extraction from Lists in Web Pages Based on XML
    Xin, Zhou
    Hao, Wang
    ADVANCED TECHNOLOGY IN TEACHING - PROCEEDINGS OF THE 2009 3RD INTERNATIONAL CONFERENCE ON TEACHING AND COMPUTATIONAL SCIENCE (WTCS 2009), VOL 2: EDUCATION, PSYCHOLOGY AND COMPUTER SCIENCE, 2012, 117 : 915 - 921
  • [40] Web Data Extraction from Scientific Publishers' Website Using Hidden Markov Model
    Huang, Jing
    Liu, Ziyu
    Wang, Beibei
    Duan, Mingyue
    Yang, Bo
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT (KSEM 2018), PT I, 2018, 11061 : 469 - 476