Machine extraction of polymer data from tables using XML versions of scientific articles

Cited by: 2
Authors
Oka, Hiroyuki [1]
Yoshizawa, Atsushi [1]
Shindo, Hiroyuki [2, 3]
Matsumoto, Yuji [3]
Ishii, Masashi [1]
Affiliations
[1] Natl Inst Mat Sci NIMS, Res & Serv Div Mat Data & Integrated Syst, 1-1 Namiki, Tsukuba, Ibaraki 3050044, Japan
[2] Nara Inst Sci & Technol NAIST, Div Informat Sci, Ikoma, Nara, Japan
[3] RIKEN, Ctr Adv Intelligence Project, Chuo Ku, Tokyo, Japan
Keywords
Machine extraction; polymer data; table; XML; informatics
DOI
10.1080/27660400.2021.1899456
Chinese Library Classification number
T [Industrial Technology]
Discipline classification code
08
Abstract
In this study, we examined machine extraction of polymer data from tables in scientific articles. The extraction system consists of five processes: table extraction, data formatting, polymer name recognition, property specifier identification, and data extraction. Tables were first extracted in plain text. XML versions of scientific articles were used, and the tabular forms were accurately extracted, even for complicated tables, such as multi-column, multi-row, and merged tables. Polymer name recognition was performed using a named entity recognizer created by deep neural network learning of polymer names. The preparation cost of the training data was reduced using a rule-based algorithm. The target polymer properties in this study were glass transition temperature (Tg), melting temperature (Tm), and decomposition temperature (Td), and the specifiers were identified using partial string matching. Through these five processes, 2,181 data points for Tg, 1,526 for Tm, and 2,316 for Td were extracted from approximately 18,000 scientific articles published by Elsevier. Nearly half of them were extracted from complicated tables. The F-scores for the extraction were 0.871, 0.870, and 0.841, respectively. These results indicate that the extraction system created in this study can rapidly and accurately collect large amounts of polymer data from tables in scientific literature.
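The property specifier identification step named in the abstract can be illustrated with a minimal Python sketch of partial string matching on table header cells. The substring table, the function names (normalize, identify_property, extract_rows), and the sample table are all hypothetical illustrations; the abstract does not give the matching rules the authors actually used.

```python
# Hypothetical specifier table: these substrings are illustrative
# assumptions, not the matching rules used in the study.
SPECIFIERS = {
    "Tg": ("tg", "glass transition"),
    "Tm": ("tm", "melting"),
    "Td": ("td", "decomposition"),
}


def normalize(header: str) -> str:
    """Lowercase a header cell and drop '_' and '-' so T_g and T-g become 'tg'."""
    return header.lower().replace("_", "").replace("-", "")


def identify_property(header: str):
    """Return the first property key whose substring occurs in the header, else None."""
    text = normalize(header)
    for prop, patterns in SPECIFIERS.items():
        if any(p in text for p in patterns):
            return prop
    return None


def extract_rows(headers, rows):
    """Pair the first-column name with values from recognised property columns."""
    columns = {i: identify_property(h) for i, h in enumerate(headers)}
    records = []
    for row in rows:
        for i, prop in columns.items():
            if prop is not None and i < len(row):
                records.append((row[0], prop, row[i]))
    return records


# Made-up sample table, for illustration only.
headers = ["Polymer", "T-g (°C)", "Melting temperature (°C)", "Yield (%)"]
rows = [["PMMA", "105", "160", "90"]]
records = extract_rows(headers, rows)
```

Plain substring matching is deliberately permissive: a short pattern such as "tm" would also match an unrelated header like "Heat treatment", so a practical system would likely need additional checks on units or token boundaries.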
Pages: 12-23
Page count: 12