Machine extraction of polymer data from tables using XML versions of scientific articles

Cited by: 4
Authors
Oka, Hiroyuki [1]
Yoshizawa, Atsushi [1]
Shindo, Hiroyuki [2, 3]
Matsumoto, Yuji [3]
Ishii, Masashi [1]
Affiliations
[1] Natl Inst Mat Sci NIMS, Res & Serv Div Mat Data & Integrated Syst, 1-1 Namiki, Tsukuba, Ibaraki 3050044, Japan
[2] Nara Inst Sci & Technol NAIST, Div Informat Sci, Ikoma, Nara, Japan
[3] RIKEN, Ctr Adv Intelligence Project, Chuo Ku, Tokyo, Japan
Keywords
Machine extraction; polymer data; table; XML; informatics
DOI
10.1080/27660400.2021.1899456
Chinese Library Classification number
T [Industrial Technology]
Discipline classification code
08
Abstract
In this study, we examined machine extraction of polymer data from tables in scientific articles. The extraction system consists of five processes: table extraction, data formatting, polymer name recognition, property specifier identification, and data extraction. Tables were first extracted in plain text. XML versions of scientific articles were used, and the tabular forms were accurately extracted, even for complicated tables, such as multi-column, multi-row, and merged tables. Polymer name recognition was performed using a named entity recognizer created by deep neural network learning of polymer names. The preparation cost of the training data was reduced using a rule-based algorithm. The target polymer properties in this study were glass transition temperature (Tg), melting temperature (Tm), and decomposition temperature (Td), and the specifiers were identified using partial string matching. Through these five processes, 2,181 data points for Tg, 1,526 for Tm, and 2,316 for Td were extracted from approximately 18,000 scientific articles published by Elsevier. Nearly half of them were extracted from complicated tables. The F-scores for the extraction were 0.871, 0.870, and 0.841, respectively. These results indicate that the extraction system created in this study can rapidly and accurately collect large amounts of polymer data from tables in scientific literature.
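The last two processes named in the abstract, property specifier identification by partial string matching and data extraction, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The substring lists in `SPECIFIERS`, the helper names `identify_property` and `extract`, the sample table values, and the assumption that the first column holds the polymer name are all hypothetical. In the paper, polymer names are recognized by a neural named entity recognizer, which this sketch does not attempt.

```python
import re

# Hypothetical specifier substrings (assumption): the record does not list
# the strings actually matched by the authors' system.
SPECIFIERS = {
    "Tg": ("tg", "glasstransition"),
    "Tm": ("tm", "melting"),
    "Td": ("td", "decomposition"),
}


def identify_property(header):
    """Return the property whose substring occurs in the header, else None."""
    # Normalize: lowercase and drop everything except letters and digits,
    # so "T_g (°C)" becomes "tgc".
    text = re.sub(r"[^a-z0-9]", "", header.lower())
    for prop, substrings in SPECIFIERS.items():
        if any(s in text for s in substrings):
            return prop
    return None


def extract(rows):
    """Collect (polymer, property, value) tuples from a table given as rows.

    Assumes the first row is the header and the first column is the
    polymer name (a simplification made for this sketch).
    """
    header, *body = rows
    columns = {i: identify_property(h) for i, h in enumerate(header)}
    points = []
    for row in body:
        name = row[0]
        for i, prop in columns.items():
            if i == 0 or prop is None or i >= len(row):
                continue
            try:
                value = float(row[i])
            except ValueError:
                continue  # skip cells that are not plain numbers
            points.append((name, prop, value))
    return points


# Illustrative table with made-up values (not data from the paper).
TABLE = [
    ["Polymer", "T_g (°C)", "T_m (°C)", "Yield (%)"],
    ["PET", "78", "255", "91"],
    ["PLA", "60", "170", "85"],
]

if __name__ == "__main__":
    for point in extract(TABLE):
        print(point)
```

Plain substring matching is permissive: with these hypothetical lists, a header such as "Treatment" would also match "tm", so a practical system would likely need stricter patterns or additional checks.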
Pages: 12-23
Page count: 12